Helen Davidson reports for the Guardian on a new study of Australian languages:

Most Indigenous languages in Australia likely originated from a remote spot in far north Queensland as recently as 4,000 years ago, before slowly spreading across the country, a new study has claimed.

The paper, published in the journal Nature on Tuesday, mapped the origins of the Pama-Nyungan family of languages, which encompasses about 90% of the continent. It traced the dominant family of languages back to an area near an isolated place known today as Burketown.

“All the languages from the Torres Strait to Bunbury, from the Pilbara to the Grampians, are descended from a single ancestor language that spread across the continent to all but the Kimberley and the Top End,” wrote co-author Claire Bowern, professor of linguistics at Yale University. “Where this language came from, how old it is, and how it spread, has been something of a puzzle.” […]

The researchers used an adapted computer model originally designed to map the spread of viruses, and built a family tree of “cognates” – identical or similar words across multiple languages. Pama-Nyungan is one of 28 Aboriginal or Torres Strait Islander language families. In contrast, Europe had four such families.

The results traced Pama-Nyungan back to a site south of the Gulf of Carpentaria, and indicate it emerged in the mid-Holocene period 4,000 to 6,000 years ago and rapidly replaced the existing languages.

It also aligned with archaeological discoveries in the region, including tool technologies which could explain the expansion across the continent, as could changes in ceremonies or marriage customs, the paper said. […]

“We’ve got a clearer picture now of when and where things started expanding, but there remains this question about exactly what drove it,” co-author Prof Quentin Atkinson of Auckland University’s school of psychology told Guardian Australia.

“What could possibly have happened to allow this group – initially one language – to spread across 90% of Australia and replace everyone else?”

Bowern discusses the study here, with a couple of useful maps; the paper itself, “The origin and expansion of Pama–Nyungan languages across Australia,” by Remco R. Bouckaert, Claire Bowern, and Quentin D. Atkinson, is behind a paywall. Thanks, Trevor!


  1. The prevailing view in Australia is that the Aboriginal peoples have inhabited their (respective) lands for 60 thousand years, and have been the custodians and traditional owners of that land for all that time. It would be interesting to see whether there has been any commentary on the paper in the light of that.

    In the article, they mention only 4 language families in Europe. Seems they are not counting any language families from the European part of the former USSR or Turkey.

  2. Dr. Atkinson was one of the authors of the horribly flawed work that concluded that Indo-European originated in Anatolia; the lead author of this paper, Dr. Bowern, is an expert in Australian languages, so this work is obviously much more worth paying attention to, but I do still wonder how reliable the whole computational-phylogenetics approach is. Does linguistic expertise fix the real problems, or merely eliminate the most telling mistakes?

    > In contrast, Europe had four such families.

    I wonder what time-point is identified by the “had”, and which four families they attribute to that time?

  3. January First-of-May says:

    As far as I understand, if you ignore Basque for not being in a family, and ignore the North Caucasus as not being in Europe (I think it’s enough to set the boundary by the Kuma-Manych depression*), you do end up with four European language families – Indo-European (duh), Uralic, Turkic, and Mongolic (Kalmyk).

    If you happen to believe in Altaic, that four goes down to three, leaving a space for either Basque or Semitic (Maltese).

    I might have missed some obscure minor language, however.

    *) I tried to check on the maps, and only got more confused. It does, however, appear that even if there are any places north of the Kuma-Manych depression where people speak any kind of North Caucasian language, said places are probably either immigrant communities, or single villages within a few miles of the boundary. (Possibly both.)

  4. Oh, oops, I didn’t see your link to Dr. Bowern’s discussion, which answers my question immediately: “Europe has just four language families, Indo-European, Basque, Finno-Ugric and Semitic”.

  5. January First-of-May says:

    That’s a weird list – I’d have expected Turkic instead of Semitic. Which part of Europe is the “Semitic” intended for – Malta?

  6. How about

    France 6,000,000[1]
    Spain 1,600,000–1,800,000[2]
    Italy 680,000[3]
    Germany 1,000,000+[4]
    United Kingdom 500,000[5]
    Netherlands 480,000–613,800[6]
    Belgium 500,000
    Sweden 424,981[7]

  7. January First-of-May says:

    I know that there are lots of Jews in Europe, but I was pretty sure all of them (or effectively all of them) spoke either whatever the local language was, a slightly different version of whatever the local language was, or Yiddish.

    I guess they do use Hebrew in some contexts, however.

  8. Hehe, that was the list of European countries with major Arabic-speaking populations circa 2018.

  9. There are six million Arabic-speakers (and half a million Maltese-speakers) in Europe, but then again there are two and a half million Chinese speakers, and nobody says that Sino-Tibetan is a language family of Europe. (In both cases, all topolects are lumped.)

    In plain fact, Europe is 94% Indo-European-speaking, roughly divided equally between Germanic, Slavic, and Romance speakers. The remaining 6% are mostly evenly divided between Uralic and Turkic speakers. The rest is basically noise.

    (Isaac Asimov once described the Solar System as “four planets plus rubble.”)

  10. It is what it is… Like Bouckaert et al.’s 2012 IE paper, this paper uses as its inputs 1. the documented locations of the attested daughter languages, 2. a cognate matrix, and 3. a model of language spread.
    Since the first-branching subgroups of PN are all near the bottom of the Gulf of Carpentaria, it is not surprising that the PN homeland is placed there (though the over-precise placement at Burketown, in the Guardian article, is not warranted).

    In the case of IE, some of the model’s conclusions clearly contradicted what is known from better evidence: it placed the most likely locations of proto-Celtic in Wales and Ireland, or at most as far east as Luxembourg; and it placed the range of likely locations of Proto-Indo-Iranian in a band reaching from NE India to Anatolia, reaching no further north than Azerbaijan. Likewise, it placed PIE in Anatolia, which has few adherents. Fundamentally, the model is limited by what data it incorporates. In the case of Europe, the observed locations of daughter languages gave a distorted idea of where their ancestors were, because of language replacement, which Bouckaert’s model does not and perhaps cannot account for.

    Pama-Nyungan languages are much less studied than IE ones, and beggars can’t be choosers. This may be better than any other PN model out there, but its conclusions are no more reliable than the IE ones. I suppose if you’re willing to be a few hundred kilometers off, it’s good enough.

    Another interesting (or bothersome) point is that the new phylogeny in the paper (superseding that in Bowern and Atkinson’s older paper) brings Tangkic into Pama-Nyungan, and not even as a top-level outgroup. Tangkic is usually considered non-PN, or perhaps a sister family. In the tree it’s presented as a sister of Yolngu, and they say “Tangkic is among the first groups to separate, although there is also some signal in the data placing the Tangkic branch as a remote sister to the Yolngu languages.” That’s something I’d like to see a more definite statement about.
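For readers unfamiliar with the "cognate matrix" listed among the model's inputs in the comment above, here is a minimal sketch of how cognate judgments are commonly turned into binary presence/absence characters. The language names, meanings, and cognate class labels are hypothetical placeholders, not data from the paper, and coding a missing word as 0 is a simplification made only for this sketch.

```python
# Hypothetical toy data: for each meaning, the cognate class that each
# language's word belongs to. None of this is real Pama-Nyungan data.
cognate_classes = {
    "hand": {"lang_A": 1, "lang_B": 1, "lang_C": 2},
    "water": {"lang_A": 1, "lang_B": 2, "lang_C": 2},
}


def binary_cognate_matrix(data):
    """Return (columns, matrix): one binary column per (meaning, cognate class)."""
    languages = sorted({lang for by_lang in data.values() for lang in by_lang})
    columns = sorted(
        {(meaning, cls) for meaning, by_lang in data.items() for cls in by_lang.values()}
    )
    matrix = {
        # 1 if the language has a word in that cognate class, else 0.
        # A language with no recorded word for a meaning also gets 0 here
        # (a simplification assumed for this sketch).
        lang: [1 if data[meaning].get(lang) == cls else 0 for meaning, cls in columns]
        for lang in languages
    }
    return columns, matrix


columns, matrix = binary_cognate_matrix(cognate_classes)
for lang, row in matrix.items():
    print(lang, row)
```

Each row is then treated as a string of binary characters for one language, which is the kind of input a phylogenetic method works from.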

  11. > Fundamentally, the model is limited by what data it incorporates.

    Is there any model of which that is not true? An apparently opposing claim at this contentiously vague level is that a model limits the data it can incorporate. Maybe each limits the other, that would be nice. Or is it all écriture automatique?

  12. Trond Engen says:

    I’m interested to know if the model has been improved since it failed so spectacularly on IE.

    I’ll admit that one of the more glaring issues of the IE paper is of little relevance for Pama-Nyungan. There’s practically no external evidence to fail to calibrate against — no attested ancient languages and no previous stages and locations recovered from adstrates. Not that this makes the results more precise, but now it’s just uncertainty, not a flaw.

    Another issue, that it didn’t take into account variation in speed in different times and landscapes, could also be less relevant for a continent where all travel was by foot and military organization above the local level was unknown. But different situations would still lead to different forces of migration. On that note, it’s at least interesting that the model seems to yield periods of stability in fertile regions and fast movement through deserts.

    Did anybody get the animated gif to run?

  13. Which is a bit unexpected, since in deserts the heat imposes a slow pace. And in fertile regions the men must move fast to fertilize as much as possible before they die.

    On the other hand, anything that passes through a desert must move fast to survive, with or without a model – unless the model is equipped with air conditioning.

  14. J.W. Brewer says:

    Surely the most parsimonious explanation for the list of the 4 language families attested in “Europe” is that territory outside the EU != “Europe.” It’s like the reverse of Metternich’s old quip that Italy was (pre-unification) purely a geographical expression – Europe is now purely a political expression. Of course if you’d used the Council of Europe rather than the EU you would have gotten different boundaries and a larger list of language families.

  15. Trond Engen says:

    Even with Europe := EU and languages := non-diaspora languages you’ll have thousands of Turkic speakers in the Balkan countries and Cyprus. The more so if we exclude the effects of modern-era ethnic cleansing.

  16. J.W. Brewer says:

    Well, whether those Turkic speakers are “diaspora” speakers depends on the timeframe you’re using, I suppose. If one treats languages arriving in the Americas post-1492 as not-really-indigenous (and similar for Australia with a rather later cut-off date), might one not do the same for languages arriving in Europe post-1453 as an artifact of imperialism and colonialism coming from elsewhere? Not that this is necessarily how Profs Atkinson and Bowern were thinking about it …

  17. Not strictly material, but the Ottomans entered Europe in 1354.

  18. J.W. Brewer says:

    Sure. I guess the question is how long after 1354 you had Turkic-speaking populations on the ground that were not obviously non-indigenous if you didn’t happen to know the history. Notions of indigenousness are in general easier to administer if your area under discussion has both clear boundaries and a clear date upon which its prior pristine state of isolation was suddenly intruded upon by the non-indigenous. Europe/Asia/(North) Africa don’t work very well for that model vis a vis each other (is Arabic an African language? Was Punic?). Of course Turkic entered the present-day EU much earlier with the Bulgars but they were eventually linguistically and (largely?) culturally assimilated by the indigenes (or at least, the folks who had gotten there a little bit earlier).

  19. Trond Engen says:

    I was also thinking about the earlier intrusions of Turks from the Steppe into the eastern Balkans, but I suppose there aren’t many Gagauz speakers left in Romania and Bulgaria.

  20. ə de vivre says:

    I don’t think diasporic is the opposite of indigenous. It would sound strange to me to call NA English a ‘diasporic language.’ In any event, the context in which Turkish-language communities appeared in Europe (however you define Europe) is different enough from the European colonization of the Americas that it would be misleading at best to call Turkish a ‘colonial’ language. I think there has to be a larger power-imbalance and more extreme cultural othering for ‘colonial’ to be appropriate (but, cards on the table here, I’m kind of tired of the ‘everything everywhere is colonial’ trend that never seems to quite die out). If Turkish in the Balkans is diasporic, then I think you’d have to say that, in most of France, the French language is diasporic.

  21. SFReader says:

    Of course it’s diasporic. It was brought there by Italic speaking immigrants in 1st century AD.

  22. Greg Pandatshang says:

    Hmm, so nothing in this article about the shipwrecked-South Indians model of Pama-Nyungan origins?

  23. J.W. Brewer says:

    Assuming arguendo that “indigenous” is a useful and coherent concept, is there a word other than just “non-indigenous” that usefully describes languages presently spoken in area X (and possibly spoken there for some centuries in the past) that are not “indigenous” to X? I agree that “diasporic” isn’t broad enough.

    Obviously one advantage of using a higher level of generality, like “Europe” rather than “France” (or “Gaul,” to use an arguably more “indigenous” toponym) is that you get fewer questions you need to answer because intra-continental migrations, conquests, and language shifts don’t need to be accounted for. And you have similar issues in the Americas where specific language communities that were certainly indigenous to the continent moved around quite a bit within the continent at dates within the last millennium.

  24. ə de vivre says:

    I think ‘indigenous’ is most useful in the context of settler colonialism. The farther away from that you get, the more massaging you have to do for it to apply.

    As for what counts as a ‘European’ language, I think it’s more a question of state ideology about self and other rather than anything about the languages’ history. The existence of French in Provence is consonant with the state’s idea of French identity, whereas the existence of Arabic in France goes against the state’s idea of French identity. Heck, whether Spanish, Italian, and Hungarian are European languages has changed over the years without the language communities themselves moving at all.

    That is, I think it’s current ideology rather than any formally definable history that determines what makes sense to call a properly European language.

  25. Trond Engen says:

    I used diaspora for “cultural (religious and/or ethno-linguistic) community formed by migration into minority”. Languages brought by new ruling elites or by settlers forming regional majorities are not diaspora languages in this sense.

  26. J.W. Brewer says:

    ə de vivre, I take your general point, but there are lots of parts of Africa where there was never any significant amount of intended-to-be-permanent settler colonization (i.e. the only Europeans on the ground were generally military or colonial administrator types who expected to rotate back home after their tour of duty was up, and maybe a few missionaries there on a more open ended basis) but where the one-time imperialist’s language (whether English or French or Portuguese etc.) is now widely spoken and may in fact be the language of education and government. “Indigenous” or something like it still seems a useful concept in contrasting the pre-existing languages to those more recent arrivals, even if the more recent arrivals have outlived the imperial era and seem to have settled in to stay for the indefinite future.

  27. ə de vivre says:

    I probably wouldn’t protest to someone using ‘indigenous’ in a sub-Saharan Africa context, but a colonial/non-colonial distinction seems like a more useful frame. To me, calling something an indigenous language implies that the people who speak it are indigenous. But without a non-indigenous community, I’m not sure it makes sense to call some groups indigenous.

    In Nigeria, for example, it makes sense to say that English is a colonial language, and the others (whatever the power relations are between them) are not. There are white people in Nigeria, their presence is tied to colonial history, but I don’t think indigenousness is what sets them apart. On a more impressionistic level, it would sound weird to me to say that the Igbo are indigenous to Nigeria. It sounds like a swipe at the Hausa, or an allusion to the Biafra Civil War.

    When I hear ‘indigenous’ used in an Indian context, it’s in opposition to the larger non-colonial languages rather than English specifically. If we accept indigenous as applicable, you’d have indigenous languages, non-colonial languages, and colonial languages—and that’s without touching how to talk about the effects of the Partition.

  28. Y’s first point (“the documented locations of the attested daughter languages”) is a serious flaw, because the assumption of this research seems to be (if I understand correctly) that the location and structure of present-day Pama-Nyungan languages must correspond to the initial spread of Pama-Nyungan. But this assumption flies in the face of what we can observe in better-known parts of the world: French and English are both Indo-European, but both spread on territories where another Indo-European language (Celtic) was already dominant, and in turn some clues point to Celtic having itself spread at the expense of other Indo-European languages. And of the Celtic languages spoken today, in turn, we know that two (Breton and Scottish Gaelic) expanded at the expense of Romance and Germanic, respectively.

    A similar situation is found elsewhere: Arabic is Semitic and has spread at the expense of other Semitic languages (Punic, Aramaic, South Arabian varieties…), and indeed Aramaic in turn had earlier spread at the expense of yet other Semitic languages (Canaanite and East Semitic). Any attempt, thus, to analyze the spread of the Semitic languages in the Middle East + North Africa on the basis of the present-day distribution of Semitic languages would probably yield results which would prove…more entertaining than enlightening, let’s say.

    (Of course, if you consider Arabic and Maltese speakers on the one hand, or Turkish speakers on the other, within Europe today, you would probably come to even more entertaining conclusions…hmm, it would make for excellent satire: in a future world where Europe has been colonized by non-Europeans much in the same fashion that Australia was colonized by Europeans, an academic conference on European linguistics is weighing the various proposals regarding the Urheimat of Proto-European Semitic and Proto-European Turkish..).

    And finally, let’s consider two possibilities: (1) that the Proto-Pama-Nyungan homeland (PPNH) was located outside Australia, in present-day Indonesia or New Guinea (and that its closest relatives disappeared as a result of the spread of Austronesian), or (2) that the PPNH was located in the non-Pama-Nyungan-speaking parts of Australia (there is no law which prohibits speakers of a Pama-Nyungan language from shifting to a non-Pama-Nyungan language). If either possibility is true, I wonder: Would their model, COULD their model have detected either possibility?

  29. January First-of-May says:

    It would be interesting to try working out a similar model for Uralic; as far as I understand, the Proto-Uralic homeland was probably located somewhere in the vicinity of modern Tatarstan, Bashkortostan, or Samara Oblast – areas that are today mostly occupied by speakers of Russian and of several Turkic languages (plus, admittedly, some remnants of Eastern Mari).

  30. J.W. Brewer says:

    Perhaps in most parts of the world with even a moderately complex history, “indigenous” is a slippery concept because there are often more than two sorts of groups and various groups have arrived at various points in time with various results in terms of present day status, power, and numerosity. So e.g. in the mainland part of Malaysia (f/k/a Malaya) the ethnic-Malays are a lot more “indigenous” in some meaningful sense (a sense which has current political resonance and relevance, regardless of what you think of how the politics of ethnicity have ended up playing out there) than the ethnic-Chinese, the ethnic-Indians, etc., but a lot less “indigenous” than the Orang Asal, and parallels to the Orang Asal can be found in plenty of other Asian countries. Is there an undercurrent that regardless of how long your ancestors have lived in the relevant territory, if you’ve held your own reasonably well vis a vis more recent arrivals you don’t really count as “indigenous” because a certain degree of marginalization and loss of status/power is inherent in the concept?

  31. aboriginal, autochthonic, indigenous, native – which is in style or “politically correct” today?

  32. David Marjanović says:

    *eyeroll* Viruses. Phylogeography is a Google search suggestion, gets 1.550.000 hits, has Wikipedia articles in English, German and 11 other languages that don’t make it to the first page of Google results shown to me, and there are no viruses there to be seen.

    > In the case of IE, some of the model’s conclusions clearly contradicted what is known from better evidence: it placed the most likely locations of proto-Celtic in Wales and Ireland

    Wasn’t it limited to extant languages, in which case it obviously didn’t say anything about Proto-Celtic, only about Proto-Island Celtic?

  33. Trond Engen says:

    But back to Pama-Nyungan. With Indo-European they started with known branches. That’s hardly an option in Pama-Nyungan. I’m not sure exactly what they told the model to do, but if they used it to suggest a parsimonious sub-division of Pama-Nyungan based on the lowest number of linguistic steps and then reconstruct a trajectory of past migrations, which then turned out to look reasonable on the map, I’m willing to listen. Even if the actual path may be off by half a continent.

  34. David Marjanović says:

    The abstract certainly sounds like they did a phylogenetic analysis. I’ll read the paper tomorrow if I don’t forget.

  35. J.W. Brewer says:

    To Trond’s point, one problem with historical reconstructions like this where the data is thin is how plausible something needs to be to be taken seriously. The best available model (i.e. better than any rival proposal, considered one at a time) does not necessarily mean a model whose output is >50% likely to be true. It may well be that the best available model is the one that’s 10% likely to be true, as compared to 25 or 30 other rival proposals that are all individually no more than 5% likely to be true. But how interested should we actually be in the leading candidate when the real leading candidate percentagewise is “probably something quite different from the currently leading model’s output, but we really don’t know what”?

  36. “aboriginal, autochthonic, indigenous, native – which is in style or “politically correct” today?”


  37. Land bridges.

  38. David Marjanović says:

    I didn’t forget, but only just got around to reading the paper. Rather than copying & pasting half of it here, I recommend reading the whole thing. It represents a huge amount of work; as in other Nature-group papers, the “paper” is really just an extended abstract, while the “supplementary information” is where the real paper is. The “paper” makes clear that the authors considered a long list of potential sources of error, tested their potential impact and found their results stand unchallenged.

    The dataset is purely lexical, though, and it makes the same error as the infamous IE work in coding the presence or absence of each cognate set as a binary character. However, that should make the branches too long, not too short. Evidently the archeological calibration date, “constrained […] using a gamma distribution with a 95% highest posterior density interval between 5 ka and 3 ka and a probability mass skewed towards younger ages within this range (modelled as a gamma distribution with a 3,000 year offset, α = 2 and β = 359)”, compensates for that.

    > Hmm, so nothing in this article about the shipwrecked-South Indians model of Pama-Nyungan origins?

    Oh yes, right on the first page: “An expansion at this time is consistent with evidence from early genetic studies for gene flow into Australia from India ~4–5 ka^21, but more recent work has called these findings into question^3,22–24.”

    > Another interesting (or bothersome) point is that the new phylogeny in the paper (superseding that in Bowern and Atkinson’s older paper) brings Tangkic into Pama-Nyungan, and not even as a top-level outgroup. Tangkic is usually considered non-PN, or perhaps a sister family. In the tree it’s presented as a sister of Yolngu, and they say “Tangkic is among the first groups to separate, although there is also some signal in the data placing the Tangkic branch as a remote sister to the Yolngu languages.” That’s something I’d like to see a more definite statement about.

    A more definite statement is indeed lacking, but the next sentence offers some encouragement: “The two groups are separated by several non-Pama–Nyungan groups, which makes it less likely that the signal we observe is a result of recent loans.” Also, there’s this in the methods: “We did not assume a known outgroup. Instead, the appropriate root point on the tree was inferred under the assumption of a relaxed clock (see below). While we found considerable uncertainty in the root point and basal branches of the tree, our estimates for the location and timing of Pama–Nyungan origin were made across the posterior sample of trees, and hence all our inferences integrate over this phylogenetic uncertainty.”
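The calibration prior quoted in the comment above (a gamma distribution with a 3,000-year offset, α = 2 and β = 359) can be checked with a few lines of arithmetic. This is only a sketch: treating β as a scale parameter in years is an assumption, made because as a rate it would put essentially all the mass at the offset. The closed-form CDF used here holds only for shape 2.

```python
import math

OFFSET = 3000.0  # years, from the quoted passage
ALPHA = 2.0      # shape, from the quoted passage
BETA = 359.0     # ASSUMED to be a scale parameter in years


def prior_cdf(age):
    """P(root age <= age) under the offset gamma prior (closed form, shape 2 only)."""
    if age <= OFFSET:
        return 0.0
    x = (age - OFFSET) / BETA
    return 1.0 - math.exp(-x) * (1.0 + x)


prior_mean = OFFSET + ALPHA * BETA          # mean of an offset gamma
prior_mode = OFFSET + (ALPHA - 1.0) * BETA  # mode, valid for shape >= 1
mass_below_5ka = prior_cdf(5000.0)

print(f"mean = {prior_mean:.0f} years, mode = {prior_mode:.0f} years")
print(f"P(age <= 5000) = {mass_below_5ka:.3f}")
```

Under that assumption the prior has its mode near 3,360 years and its mean near 3,720 years, with roughly 97.5% of its mass below 5,000 years, which fits the quoted description of a probability mass skewed towards younger ages.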

  39. Thanks! With Claire Bowern as an author, I was pretty sure they weren’t making any rookie mistakes.

  40. > Thanks! With Claire Bowern as an author, I was pretty sure they weren’t making any rookie mistakes.

    The input data, as you say, is meticulously put together. There were some errors in the data set which Bowern had used in an earlier study, but they were relatively few. The paper takes pains to quantify the extent of those errors and to show that they did not significantly alter the final results, and that by extension nor would any remaining errors. (Note to Greenberg, wherever you may be: this is how lexical data are properly collected.)

    > The two groups are separated by several non-Pama–Nyungan groups, which makes it less likely that the signal we observe is a result of recent loans.

    That’s a geographical argument, and not much of one. Who knows who was where 1,000 years ago?

    As with most linguistic phylogenetic work, what bothers me the most is that the algorithm doesn’t show its work. I can’t tell which lexical characters are essential to placing Tangkic within Pama-Nyungan and together with Yolngu. If those were available, one could agree with them or argue that they are loanwords based on some other arguments. As it is, we have a black box, and all we can do is accept it or hope for a better but equally black box in the future.

    David, does that issue come up in biological phylogenetics? Haplotype networks often explicitly mark changed characters, but otherwise, as far as I can tell, the field relies on black boxes everywhere.

  41. David Marjanović says:

    While you can’t directly read from a Bayesian tree which characters change states where, you can optimize the characters on the tree – using the same model of evolution that was used to calculate the tree – and then get the probabilities that they change states at any given node. That’s more work than on a most parsimonious tree.

    And then you want to go further and find out “which lexical characters are essential to” getting this tree and not any other. Here, regardless of method, the only way is to change those scores in the matrix and repeat the whole calculation. The contribution of any single score is almost always impossible to predict, and I’ve run into some very large surprises over the last 10 or 11 years – like changing one score of one species and watching another species at the other end of the tree change place instead of the changed one.

    > That’s a geographical argument, and not much of one. Who knows who was where 1,000 years ago?

    Well, why would they have moved? And how fast? For the spread of Pama-Nyungan, the paper finds an average speed of just 140 m/year, much slower than IE with its horses and wagons.
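To put the 140 m/year figure in perspective, here is a small arithmetic sketch. The 1,000-year span comes from the question quoted above; the 300 km distance is a purely illustrative number, not a figure from the paper. An average expansion rate for a language family also says nothing about how fast any particular group could move.

```python
SPEED_M_PER_YEAR = 140.0  # average spread rate reported for Pama-Nyungan


def distance_km(years, speed=SPEED_M_PER_YEAR):
    """Distance covered in `years` at a constant average rate, in kilometres."""
    return speed * years / 1000.0


def years_to_cover(km, speed=SPEED_M_PER_YEAR):
    """Years needed to cover `km` kilometres at a constant average rate."""
    return km * 1000.0 / speed


print(f"{distance_km(1000):.0f} km in 1,000 years")
print(f"{years_to_cover(300):.0f} years to cover an illustrative 300 km")
```

At that average rate, 1,000 years corresponds to 140 km, and the illustrative 300 km would take a little over two millennia.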

  42. Bayesian, etc.: This is what I was afraid of… that’s why I phrased the question as I did. So is that something that your crowd grumbles about? Traditional linguists grumble at phylogenetics in general, sure, but biological taxonomists are the ones who own these techniques.

    > For the spread of Pama-Nyungan, the paper finds an average speed of just 140 m/year, much slower than IE with its horses and wagons.

    The Yolngu languages are clearly separated by a few hundred kilometers from the rest of PN, whereas Tangkic is adjacent to the putative source of Proto-Pama-Nyungan, so some movement must have occurred.

