Automated Reconstruction of Ancient Languages.

November 24, 2015 by languagehat 22 Comments

This BBC Science story by Rebecca Morelle… well, really all I need do is point out that the affiliations of the authors of the paper it’s based on, “Automated reconstruction of ancient languages using probabilistic models of sound change,” are one Department of Statistics, one Department of Psychology, and two Computer Science Divisions. My basic response to any paper making claims about language that does not have at least one actual linguist on board is to toss it in the circular file. But I guess it’s possible that this system that “automatically and accurately reconstructs protolanguages from modern languages” (it slices! it dices!) might be useful in chewing through large quantities of data and spitting out correlations that could save linguists some time; at any rate, there it is for those who might find it of interest.

Comments

John Cowan says

November 24, 2015 at 8:41 pm

The abstract says: “Over 85% of the system’s reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages.” “Character” seems to mean phoneme here, plus extra ones for word boundaries. Although this is yet another PeNAS direct submission, at least it’s not the usual methods by the usual suspects. Someone on the project understands faithfulness and markedness, since they are explicitly mentioned by name and defined correctly.
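For anyone wondering what "within one character" could mean in practice, here is a minimal sketch. It assumes the comparison is a plain Levenshtein (edit) distance over phoneme symbols, which the abstract does not spell out; the word forms are invented for illustration and are not taken from the paper's data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def share_within_one(pairs):
    """Fraction of (automatic, manual) reconstruction pairs at distance <= 1."""
    return sum(edit_distance(a, b) <= 1 for a, b in pairs) / len(pairs)


# Hypothetical (automatic, manual) reconstructions, one symbol per phoneme.
toy_pairs = [("mata", "maka"), ("lima", "lima"), ("telu", "tolo")]
toy_share = share_within_one(toy_pairs)
```

With these made-up forms, two of the three pairs fall within one symbol of each other, so `toy_share` is two thirds.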

A particularly interesting result was the confirmation of what they call the functional-load hypothesis, which says that phones with low or zero functional load (meaning that there are few or no minimal pairs distinguished by them) can easily merge. An earlier test with only four languages could not reject the null hypothesis, but their 736-language set was able to do so easily. Nonetheless, I hardly expect English /h/ and /ŋ/ to merge any time soon.
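To make "few or no minimal pairs" concrete, here is a rough sketch that counts minimal pairs for a given pair of phones in a tiny lexicon. The lexicon and the function name `minimal_pairs` are made up for illustration, and counting minimal pairs is only a crude proxy; the paper may well measure functional load differently.

```python
from itertools import combinations


def minimal_pairs(lexicon, p, q):
    """Return word pairs that differ in exactly one position, by p versus q."""
    found = []
    for w1, w2 in combinations(lexicon, 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1 and set(diffs[0]) == {p, q}:
            found.append((w1, w2))
    return found


# Toy English-like lexicon, each word a tuple of phoneme symbols (illustrative only).
lexicon = [
    ("p", "ɪ", "n"), ("b", "ɪ", "n"),
    ("p", "æ", "t"), ("b", "æ", "t"), ("h", "æ", "t"),
    ("s", "ɪ", "ŋ"), ("s", "ɪ", "n"),
]

pb_pairs = minimal_pairs(lexicon, "p", "b")   # pin/bin and pat/bat
h_ng_pairs = minimal_pairs(lexicon, "h", "ŋ")  # none in this lexicon
```

In this toy lexicon /p/ and /b/ distinguish two pairs while /h/ and /ŋ/ distinguish none, which is the low-functional-load situation described above.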
David Marjanović says

November 24, 2015 at 9:01 pm

The paper is three years old, and the news article two. The method has since then been used by linguists to discover (to their own surprise!) the hypothesis that the Chitimacha language of Louisiana is related to the Toto-Zoque family of central Mexico; I think you blogged about that. I’ve recently seen criticism of that hypothesis, but the author of the criticism doesn’t claim the similarities are artefactual or coincidental – instead he postulates contact.

Nonetheless, I hardly expect English /h/ and /ŋ/ to merge any time soon.

They’re nearly in complementary distribution, so at some point it becomes a matter of definition. The claim that [h] and [ŋ] are positional allophones of a single phoneme in German has indeed been made.

(In German, though, you can get away with claiming that [ŋ] between vowels/“sonorants” is still simply /ng/ – that doesn’t work in most Englishes.)
David Marjanović says

November 24, 2015 at 9:03 pm

Anyway, from reading the paper I got the impression that the method is designed to do the easier parts of what historical linguists do, and that it does them pretty well.
languagehat says

November 24, 2015 at 9:22 pm

The paper is three years old, and the news article two.

Sigh. Ah well, maybe some haven’t seen it/them…
John Cowan says

November 24, 2015 at 10:07 pm

Anyway, what’s three years to a historical linguist, who routinely cites work from the 19C? Even synchronic linguists routinely cite (unpublished!) works older than they are.
marie-lucie says

November 24, 2015 at 10:51 pm

Probabilistic models of sound change

“Probability” in this case must refer to “likelihood of occurrence based on changes known to have occurred in the history of actual languages”. For language families with a known history such as the descendants of Latin, some changes are firmly established and a few have been shown to have occurred in a vast number of languages (for instance intervocalic lenition or weakening), but with families only recently recorded, and for which proto-language reconstruction is still in its infancy, “probable changes” are not necessarily obvious, and changes that seemed likely to have occurred can later be shown to be the results of errors. One thing that does happen is that given a correspondence between “x” in language A and “y” in language B, it is not obvious whether “x” has changed to “y”, or “y” to “x”, or whether “x” and “y” both result from changes from an earlier “*z”. And if “*z” has changed to “w” in language C but disappeared altogether in language D (after one or more weakenings), while other sounds have undergone unusual changes in D as well, it may be very difficult to consider D as related to A and B. These are only some of the difficulties. It seems therefore that an automated approach to reconstruction must be possible only after most of the rough work has been done by competent historical linguists, yielding data that can provide a model or template for an automatic process to be set in motion, yielding another set of data that can help linguistic work along but not replace the training and instincts of historical linguists.
David Marjanović says

November 25, 2015 at 7:51 am

“Probability” in this case must refer to “likelihood of occurrence based on changes known to have occurred in the history of actual languages”.

Of course.
marie-lucie says

November 25, 2015 at 8:26 am

David: My point is that what is known may not be all there is to know, or all that is correctly identified.
Peter Erwin says

November 25, 2015 at 12:28 pm

“Probability” in this case must refer to “likelihood of occurrence based on changes known to have occurred in the history of actual languages”…

I don’t think so; from what I can gather from a quick skim of the paper, the system generates sets of sound changes as part of the whole process. That is, it tries to “learn” the (probabilistic) rules for sound changes that best reproduce the input cognates, given the known phylogeny of the languages. (Using predetermined sound-change rules might be what one of the previous “deterministic” systems that they contrast their system with would do, I’m guessing.) Their system also allows for different sets of dominant sound changes for different branches of a language tree.

So they’re not starting from any set of firmly (or less-firmly) established, known historical changes. I think the idea is that their program starts off with all possible sound transformations (within the basic set of deletion, substitution, and insertion of single phonemes) being equally likely, and then adjusts the probabilities for each possible change based on how well or poorly this reproduces the final sets of cognates. (There’s an interesting variation where they drop the known cognate sets and just start off with sets of words having similar meanings, adding the possibility of brand new words replacing old words with some frequency — a crude way to account for things like borrowing, I suppose — and see how well this reproduces the cognate sets.)

They do discuss some of the (highest-likelihood) resulting automatically-derived sound changes and argue that they agree with commonly accepted historical sound changes. (E.g., “sonorizations /p/ to /b/ and /t/ to /d/, voicing changes, debuccalizations /f/ to /h/ and /s/ to /h/, spirantizations /b/ to /v/ and /p/ to /f/”, and so forth.)
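A very rough sketch of the "start uniform, then adjust" idea, heavily simplified: it assumes the parent forms and the phoneme-by-phoneme alignments are already known, which the real system has to infer, and it handles only substitution and deletion (no insertion). The inventory, the word pairs, and the names `uniform_model` and `reestimate` are all hypothetical; this is not the paper's algorithm, just count-based re-estimation from a uniform starting point.

```python
from collections import Counter, defaultdict


def uniform_model(inventory):
    """Each parent phoneme is equally likely to become any phoneme or '' (deletion)."""
    outcomes = list(inventory) + [""]
    return {p: {o: 1 / len(outcomes) for o in outcomes} for p in inventory}


def reestimate(model, aligned_pairs, pseudo=1.0):
    """Update P(child | parent) from aligned pairs, keeping `pseudo` weight on the old model."""
    counts = defaultdict(Counter)
    for parent, child in aligned_pairs:
        for p, c in zip(parent, child):
            counts[p][c] += 1
    updated = {}
    for p, dist in model.items():
        total = sum(counts[p][o] for o in dist) + pseudo
        updated[p] = {o: (counts[p][o] + pseudo * prob) / total
                      for o, prob in dist.items()}
    return updated


# Hypothetical inventory and pre-aligned (parent, child) forms; '' would mark a deletion.
inventory = ["p", "b", "t", "d", "a"]
aligned = [
    (["p", "a", "t", "a"], ["b", "a", "d", "a"]),
    (["t", "a", "p", "a"], ["t", "a", "b", "a"]),
    (["p", "a", "p", "a"], ["p", "a", "b", "a"]),
]

model = reestimate(uniform_model(inventory), aligned)
most_likely_for_p = max(model["p"], key=model["p"].get)
```

On this toy data the most likely outcome for /p/ comes out as /b/, i.e. the made-up daughter language "learns" a voicing change without being told about it in advance.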
TR says

November 25, 2015 at 1:19 pm

One of the co-authors, David Hall, also worked on this IE phylogenetics study by Chang et al., which I think was discussed here at some point, though I’m not finding the relevant thread.
marie-lucie says

November 25, 2015 at 1:46 pm

How do you access the full article?
TR says

November 25, 2015 at 1:58 pm

http://www.pnas.org/content/110/11/4224.full.pdf
Peter Erwin says

November 25, 2015 at 2:00 pm

marie-lucie —

Click on the link to the article (not the BBC story); then, click on the “Full Text” tab, next to the “Abstract” tab.

Or, if you want all the gory details (in PDF form), click on the “PDF + SI” tab (same row as the other tabs).
Peter Erwin says

November 25, 2015 at 2:03 pm

TR’s link will take you directly to the PDF version of the main article; the equivalent HTML version is:

http://www.pnas.org/content/110/11/4224.full

and the PDF version with supplementary info (“all the gory details”) is:

http://www.pnas.org/content/110/11/4224.full.pdf?with-ds=yes
marie-lucie says

November 25, 2015 at 2:14 pm

Thank you both. I had clicked on the link but could not get past the site.

I do want all the gory details!
Alan says

November 25, 2015 at 3:51 pm

Following article ends with “As is so often the case, specialists need to talk to each other across the boundaries of their specialties.”
languagehat says

November 25, 2015 at 4:57 pm

Right, as opposed to “talk to each other about somebody else’s specialty.”
David Marjanović says

November 25, 2015 at 6:33 pm

Oh, so the program starts by deriving a model of evolution from the data, just like how it’s done in molecular phylogenetics! That’s good. 🙂
David Marjanović says

November 27, 2015 at 6:19 pm

this IE phylogenetics study by Chang et al.

I just finished reading it. It’s awesome, including the references section. Highly recommended!

(Yeah, OK, I skipped over the math. 🙂 )
Trond Engen says

November 29, 2015 at 7:53 pm

I’ve finished reading it too. Like you I like it, and like you I didn’t crunch the numbers of the BEAST.

Unlike last year’s attempt by Bouckaert et al., this paper does not try to pin the inferred history onto a map. That’s all well since it would be folly without enough external evidence to derive the path from geographical data — but I’d still like to see another try. External evidence might be archaeologically documented genetic or cultural changes, but genes aren’t everything, and cultural movements are probably too impressionistic to be useful alone and would have to be calibrated against linguistic data again.

I could believe that with a similar tree for Uralic, and the interwoven IE and Uralic trees recalibrated to each other, some geographical inferences from linguistics alone might be lifted from impressionistic to quantifiable. If those could be tied to genetics or archaeology, then something could be learned about what sort of changes carry language. I don’t think the recalibration would mean much for early IE, though, since there’s so little Uralic vocabulary in IE that all early contact could have happened in dead ends.
David Marjanović says

November 30, 2015 at 7:39 am

That’s an interesting idea. I agree someone should try it.

BEAST, I should mention, is very widely used in molecular phylogenetics and dating today. And I don’t like the interpretation of the PIE plain velars as uvular, but that can’t have had any influence on the results.

E-mail:
languagehat AT gmail DOT com

My name is Steve Dodson; I’m a retired copyeditor currently living in western Massachusetts after many years in New York City.

If your comment goes into moderation (which can happen if it has too many links or if the software just takes it into its head to be suspicious), I will usually set it free reasonably quickly… unless it happens during the night, say between 10 PM and 8 AM Eastern Time (US), in which case you’ll have to wait. And occasionally the software will decide a comment is spam and it won’t even go into moderation; if a comment disappears on you, send me an e-mail and I’ll try to rescue it. You have my apologies in advance. Also, my posts should be taken as conversation-starters; there is no expectation of “staying on topic,” and some of the best threads have gone in entirely unexpected directions. I have strong opinions and sometimes express myself more sharply than an ideal interlocutor might, but I try to avoid personal attacks, and I hope you will do the same.

All comments are copyright their original posters. Only messages signed “languagehat” are property of and attributable to languagehat.com. All other messages and opinions expressed herein are those of the author and do not necessarily state or reflect those of languagehat.com. Languagehat.com does not endorse any potential defamatory opinions of readers, and readers should post opinions regarding third parties at their own risk. Languagehat.com reserves the right to alter or delete any questionable material posted on this site.
