Automated Reconstruction of Ancient Languages.

This BBC Science story by Rebecca Morelle… well, really all I need do is point out that the affiliations of the authors of the paper it’s based on, “Automated reconstruction of ancient languages using probabilistic models of sound change,” are one Department of Statistics, one Department of Psychology, and two Computer Science Divisions. My basic response to any paper making claims about language that does not have at least one actual linguist on board is to toss it in the circular file. But I guess it’s possible that this system that “automatically and accurately reconstructs protolanguages from modern languages” (it slices! it dices!) might be useful in chewing through large quantities of data and spitting out correlations that could save linguists some time; at any rate, there it is for those who might find it of interest.


  1. The abstract says: “Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages.” “Character” seems to mean phoneme here, plus extra ones for word boundaries. Although this is yet another PeNAS direct submission, at least it’s not the usual methods by the usual suspects. Someone on the project understands faithfulness and markedness, since they are explicitly mentioned by name and defined correctly.
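For readers wondering what "within one character" could mean in practice, here is a minimal sketch of the usual way to score it, assuming the comparison is a plain Levenshtein (edit) distance over phoneme symbols. That assumption, the function names `edit_distance` and `share_within`, and the word forms are all mine; the forms are invented and are not real reconstructions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x -> y, or match
        prev = curr
    return prev[-1]


def share_within(pairs, k=1):
    """Fraction of (automatic, manual) pairs at edit distance <= k."""
    hits = sum(1 for auto, manual in pairs if edit_distance(auto, manual) <= k)
    return hits / len(pairs)


# Invented (automatic, manual) pairs, one character per phoneme.
toy_pairs = [("mata", "mata"), ("lima", "rima"), ("batu", "watuh")]
print(share_within(toy_pairs, k=1))
```

Strings work here only because each toy phoneme is a single character; lists of symbols would be needed for anything written with digraphs or diacritics.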

    A particularly interesting result was the confirmation of what they call the functional-load hypothesis, which says that phones with low or zero functional load (meaning that there are few or no minimal pairs distinguished by them) can easily merge. An earlier test with only four languages could not reject the null hypothesis, but their 736-language set was able to do so easily. Nonetheless, I hardly expect English /h/ and /ŋ/ to merge any time soon.
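The minimal-pair idea behind functional load can be sketched by simply counting the pairs in a word list that differ only in the two phones of interest. This is a crude stand-in, not the measure used in the paper, and the tiny English-like lexicon and the name `minimal_pair_count` are assumptions made purely for illustration.

```python
def minimal_pair_count(lexicon, p, q):
    """Count word pairs that differ only by having p where the other has q."""
    words = set(lexicon)
    count = 0
    for w in words:
        for i, seg in enumerate(w):
            if seg == p:
                swapped = w[:i] + (q,) + w[i + 1:]   # replace this one p with q
                if swapped in words:
                    count += 1
    return count


# Toy lexicon: each word is a tuple of phoneme symbols (invented sample).
toy_lexicon = [
    ("p", "ɪ", "n"), ("b", "ɪ", "n"),
    ("p", "æ", "t"), ("b", "æ", "t"), ("h", "æ", "t"),
    ("s", "ɪ", "ŋ"), ("s", "ɪ", "n"),
]

print(minimal_pair_count(toy_lexicon, "p", "b"))
print(minimal_pair_count(toy_lexicon, "h", "ŋ"))
```

In this sample /p/ and /b/ distinguish two pairs, while /h/ and /ŋ/ distinguish none, because they never occur in the same position.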

  2. David Marjanović says

    The paper is three years old, and the news article two. The method has since then been used by linguists to discover (to their own surprise!) the hypothesis that the Chitimacha language of Louisiana is related to the Toto-Zoque family of central Mexico; I think you blogged about that. I’ve recently seen criticism of that hypothesis, but the author of the criticism doesn’t claim the similarities are artefactual or coincidental – instead he postulates contact.

    Nonetheless, I hardly expect English /h/ and /ŋ/ to merge any time soon.

    They’re nearly in complementary distribution, so at some point it becomes a matter of definition. The claim that [h] and [ŋ] are positional allophones of a single phoneme in German has indeed been made.

    (In German, though, you can get away with claiming that [ŋ] between vowels/“sonorants” is still simply /ng/ – that doesn’t work in most Englishes.)

  3. David Marjanović says

    Anyway, from reading the paper I got the impression that the method is designed to do the easier parts of what historical linguists do, and that it does them pretty well.

  4. The paper is three years old, and the news article two.

    Sigh. Ah well, maybe some haven’t seen it/them…

  5. Anyway, what’s three years to a historical linguist, who routinely cites work from the 19C? Even synchronic linguists routinely cite (unpublished!) works older than they are.

  6. Probabilistic models of sound change

    “Probability” in this case must refer to “likelihood of occurrence based on changes known to have occurred in the history of actual languages”. For language families with a known history such as the descendants of Latin, some changes are firmly established and a few have been shown to have occurred in a vast number of languages (for instance intervocalic lenition or weakening), but with families only recently recorded, and for which proto-language reconstruction is still in its infancy, “probable changes” are not necessarily obvious, and changes that seemed likely to have occurred can later be shown to be the results of errors. One thing that does happen is that given a correspondence between “x” in language A and “y” in language B, it is not obvious whether “x” has changed to “y”, or “y” to “x”, or whether “x” and “y” both result from changes from an earlier “*z”. And if “*z” has changed to “w” in language C but disappeared altogether in language D (after one or more weakenings), while other sounds have undergone unusual changes in D as well, it may be very difficult to consider D as related to A and B. These are only some of the difficulties. It seems therefore that an automated approach to reconstruction must be possible only after most of the rough work has been done by competent historical linguists, yielding data that can provide a model or template for an automatic process to be set in motion, yielding another set of data that can help linguistic work along but not replace the training and instincts of historical linguists.

  7. David Marjanović says

    “Probability” in this case must refer to “likelihood of occurrence based on changes known to have occurred in the history of actual languages”.

    Of course.

  8. marie-lucie says

    David: My point is that what is known may not be all there is to know, or all that is correctly identified.

  9. “Probability” in this case must refer to “likelihood of occurrence based on changes known to have occurred in the history of actual languages”…

    I don’t think so; from what I can gather from a quick skim of the paper, the system generates sets of sound changes as part of the whole process. That is, it tries to “learn” the (probabilistic) rules for sound changes that best reproduce the input cognates, given the known phylogeny of the languages. (Using predetermined sound-change rules might be what one of the previous “deterministic” systems that they contrast their system with would do, I’m guessing.) Their system also allows for different sets of dominant sound changes for different branches of a language tree.

    So they’re not starting from any set of firmly (or less-firmly) established, known historical changes. I think the idea is that their program starts off with all possible sound transformations (within the basic set of deletion, substitution, and insertion of single phonemes) being equally likely, and then adjusts the probabilities for each possible change based on how well or poorly this reproduces the final sets of cognates. (There’s an interesting variation where they drop the known cognate sets and just start off with sets of words having similar meanings, adding the possibility of brand new words replacing old words with some frequency, which I suppose is a crude way to account for things like borrowing, and see how well this reproduces the cognate sets.)

    They do discuss some of the (highest-likelihood) resulting automatically-derived sound changes and argue that they agree with commonly accepted historical sound changes. (E.g., “sonorizations /p/ to /b/ and /t/ to /d/, voicing changes, debuccalizations /f/ to /h/ and /s/ to /h/, spirantizations /b/ to /v/ and /p/ to /f/”, and so forth.)
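The "start with every change equally likely, then adjust" idea described in this comment can be illustrated with a heavily simplified sketch. It assumes the parent forms are already known and aligned segment by segment with their descendants, which the actual system does not assume, since it has to infer the ancestral forms. It also ignores insertions and the phylogeny. The name `estimate_changes`, the alphabet, and the word pairs are invented for illustration only.

```python
DELETE = "-"  # marks a parent segment with no reflex in the child (assumed notation)


def estimate_changes(aligned_pairs, alphabet, pseudo=1.0):
    """Estimate P(child segment | parent segment) from aligned word pairs.

    Every outcome starts with the same pseudo-count, so before any data is
    seen all substitutions and deletion are equally likely; observed
    correspondences then shift the probabilities.
    """
    outcomes = list(alphabet) + [DELETE]
    counts = {p: {c: pseudo for c in outcomes} for p in alphabet}
    for parent, child in aligned_pairs:
        if len(parent) != len(child):
            raise ValueError("forms must be pre-aligned to equal length")
        for p, c in zip(parent, child):
            counts[p][c] += 1          # KeyError if a symbol is outside the alphabet
    probs = {}
    for p, row in counts.items():
        total = sum(row.values())
        probs[p] = {c: n / total for c, n in row.items()}
    return probs


# Invented (parent, child) pairs in which /p/ tends to become /b/ and /t/ becomes /d/.
toy_aligned = [("pata", "bada"), ("tapa", "daba"), ("apat", "aba-")]
toy_probs = estimate_changes(toy_aligned, alphabet="ptbda")
print(max(toy_probs["p"], key=toy_probs["p"].get))
```

With these toy pairs the most probable outcome for /p/ is /b/, a voicing change of the kind the paper's quoted list mentions.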

  10. One of the co-authors, David Hall, also worked on this IE phylogenetics study by Chang et al., which I think was discussed here at some point, though I’m not finding the relevant thread.

  11. marie-lucie says

    How do you access the full article?

  12. marie-lucie —

    Click on the link to the article (not the BBC story); then, click on the “Full Text” tab, next to the “Abstract” tab.

    Or, if you want all the gory details (in PDF form), click on the “PDF + SI” tab (same row as the other tabs).

  13. TR’s link will take you directly to the PDF version of the main article; the equivalent HTML version is:

    and the PDF version with supplementary info (“all the gory details”) is:

  14. marie-lucie says

    Thank you both. I had clicked on the link but could not get past the site.

    I do want all the gory details!

  15. Following article ends with “As is so often the case, specialists need to talk to each other across the boundaries of their specialties.”

  16. Right, as opposed to “talk to each other about somebody else’s specialty.”

  17. David Marjanović says

    Oh, so the program starts by deriving a model of evolution from the data, just like how it’s done in molecular phylogenetics! That’s good. 🙂

  18. David Marjanović says

    this IE phylogenetics study by Chang et al.

    I just finished reading it. It’s awesome, including the references section. Highly recommended!

    (Yeah, OK, I skipped over the math. 🙂 )

  19. Trond Engen says

    I’ve finished reading it too. Like you I like it, and like you I didn’t crush the numbers of the BEAST.

    Unlike last year’s attempt by Bouckaert et al., this paper does not try to pin the inferred history onto a map. That’s all well since it would be folly without enough external evidence to derive the path from geographical data — but I’d still like to see another try. External evidence might be archaeologically documented genetic or cultural changes, but genes aren’t everything, and cultural movements are probably too impressionistic to be useful alone and would have to be calibrated against linguistic data again.

    I could believe that with a similar tree for Uralic, and the interwoven IE and Uralic trees recalibrated to each other, some geographical inferences from linguistics alone might be lifted from impressionistic to quantifiable. If those could be tied to genetics or archaeology, then something could be learned about what sort of changes carry language. I don’t think the recalibration would mean much for early IE, though, since there’s so little Uralic vocabulary in IE that all early contact could have happened in dead ends.

  20. David Marjanović says

    That’s an interesting idea. I agree someone should try it.

    BEAST, I should mention, is very widely used in molecular phylogenetics and dating today. And I don’t like the interpretation of the PIE plain velars as uvular, but that can’t have had any influence on the results.


