Søren Wichmann, a Danish linguist, discusses a perennial problem and presents a promising solution; after describing the many ways in which it’s hard to pinpoint the difference between a language and a dialect, he continues:
Recently, two major obstacles in distinguishing language from dialect have been overcome. The first is how to measure differences between speech varieties – finding a value for D. In 2008, a number of linguists came together to form the Automated Similarity Judgment Program (ASJP), of which I am the daily curator and a founder. The ASJP painstakingly assembled a systematic, comparative dataset of languages that now contains 7,655 wordlists from what would be two-thirds of the world’s languages, if we assume for our purposes that languages are defined as in the ISO 639-3 code standard. Since each wordlist contains a fixed set of 40 concepts and is transcribed in a uniform manner, the lists can easily be compared, and a measure of difference can be obtained. The measure of difference between two words that has become most used is a version of the Levenshtein distance, named after Vladimir Levenshtein, a Soviet computer scientist who in 1965 devised an algorithm to compare two strings of symbols. He defined ‘distance’ as the number of substitutions, insertions and deletions needed to turn one string into the other. The Levenshtein distance can usefully be divided by the length of the longer of the two strings, because this puts all the distances on a scale from 0 to 1. This has become known as the normalised Levenshtein distance, or LDN.
The second obstacle is that perhaps ‘language’ and ‘dialect’ are concepts that can be defined only arbitrarily. Here, there is some more promising news. If we look at all the language families in the ASJP database for which database contributors have included a healthy portion of close varieties, we can begin to look for different behaviours of languages and dialects. An intriguing picture emerges: the distances tend to hover around either a relatively small value or a relatively large one, with a valley in between. As it turns out, the valley tends to lie in a narrow range around a mean of 0.48 LDN. Without losing significant precision, we can say that speech varieties tend not to be halfway similar in their basic vocabulary. Either they will tend to be more similar, in which case they can be defined as different dialects, or less similar, in which case they can be defined as different languages. Herein lies the distinction between language and dialect.
The phenomenon is probably a result of social circumstance. Dialects will drift apart as people settle in new places and shape new identities but, if there is still some contact, convergence can also be present so that speech varieties remain less than halfway different (and therefore the same language). A small push in the direction of divergence, however, might cause the varieties to drift apart relatively rapidly, raising their Levenshtein distance, thereby qualifying them as distinct languages. Possibly there is a connection between the cut-off for distances between words on the standard list used by ASJP and corresponding distances in other parts of language structure that make for a point of serious loss of mutual intelligibility. In other words, the threshold for mutual intelligibility might correlate with the threshold between languages and dialects. We don’t know that yet, but it’s something to look into. […]
Finally, a technique derived from the datasets, called ASJP chronology, can be applied to establish the amount of time it takes for dialects to drift far enough apart to qualify as separate languages. The answer we have found, ignoring some margin of error, is 1,059 years. These findings can be corroborated by looking at how long it typically takes for an ancestral language of a language family to break up into daughters that subsequently become ancestors of subfamilies. This requires other techniques, but the results are similar: it takes about a millennium for dialects to become languages. We know this because we can now distinguish the two.
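(To make the LDN and the cut-off concrete, here is a minimal Python sketch of my own; the toy word pair is invented rather than taken from the ASJP lists, and only the 0.48 cut-off is Wichmann’s.)

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn string a into string b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ldn(a: str, b: str) -> float:
    """Normalised Levenshtein distance: divide by the length of the
    longer string, putting every distance on a scale from 0 to 1."""
    return levenshtein(a, b) / max(len(a), len(b)) if (a or b) else 0.0

def classify(distance: float, cutoff: float = 0.48) -> str:
    """The proposed rule: below the valley of the bimodal distribution,
    dialects of one language; above it, distinct languages."""
    return "dialects" if distance < cutoff else "languages"

print(ldn("voda", "woda"), classify(ldn("voda", "woda")))  # 0.25 dialects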
Thanks, jack!
One would have thought that “a Danish linguist affiliated with Leiden University in the Netherlands, Kazan Federal University in Russia, and Beijing Language University in China” would have more to say about Chinese.
Or one might surmise that he doesn’t want to jeopardize his affiliation with Beijing Language University in China.
Well, by the criterion that “if it’s more than 1059 years old, it’s more than one language”, the Sinitic language family is definitely not a single language – it’s about twice that old.
Edit: the article, which I should have read first, actually says Chinese and Arabic are several languages each – and links to this manuscript which I’m reading now.
Check out Table 2 in the manuscript (p. 7)!
I’m a little over my skill set here but the Kullback-Leibler divergence, a measure of how different one probability distribution is from another, seems like it should enter the discussion. Or is the Levenshtein distance a more practical approximation of the KL divergence?
Something is very strong with this table.
How on Earth could the breakup of East Slavic be earlier than the breakup of Slavic itself?
Levenshtein distance is a weird way of measuring divergence between actual language words because it seems to depend heavily on orthography.
And even if it’s based on phonemes (which this version seems to be), a single wide-ranging vowel shift or consonant shift would probably affect it disproportionately.
(Someone should probably try to enter Late Middle English into it and check if it comes out as the same language as Modern English…)
To comment on articles such as this, one should be in a generous mood and I am currently not. But I will comment anyway.
1) This seems to be an idiotic way to present results. Suppose you’ve discovered a way to measure distances, like with a meter stick. And went ahead and measured a bunch of distances. And then proudly announced that now you can exactly tell what it means to be far apart or near, because before people could only tell it vaguely. Stupid, right?
2) 1059 years? What about months and days?
3) Any word about transitivity? If their measure between varieties A and B is small and between B and C is small, but between A and C is large, what do they conclude? It’s not like the clustering problem is anything new, but “there are things that are close and there are things that are far” is not exactly solving it.
1059 years? What about months and days?
I was also struck by this. 1059 strikes me as absurdly precise, the sort of number people with no understanding of statistical ideas give, saying “that’s what the computer gives”. Why not just “about a thousand years”?
Levenshtein distance is a weird way of measuring divergence between actual language words because it seems to depend heavily on orthography.
All the words are transcribed into a common phonemic orthography, so I don’t think that’s an issue. (One could still quibble about the validity of their particular phonemic scheme.)
One would have thought that “a Danish linguist affiliated with Leiden University in the Netherlands, Kazan Federal University in Russia, and Beijing Language University in China” would have more to say about Chinese.
From the linked article: “Some pairs of speech varieties that are considered national languages, such as Bosnian and Croatian, fall way below the cut-off of LDN = 0.48 (the same language, regardless of Yugoslavia’s existence). Some fall not far below it, such as Hindi and Urdu (different languages, barely). And varieties of Arabic and Chinese, both of which are often thought of as single languages, soar above LDN = 0.48 (the varieties are themselves different languages).” (And, as David Marjanović points out, Table 2 in the manuscript version of the paper lists a pair of Chinese dialects/languages, which have an average Levenshtein distance of 0.60.)
I’m a little over my skill set here but the Kullback-Leibler divergence, a measure of how different one probability distribution is from another, seems like it should enter the discussion. Or is the Levenshtein distance a more practical approximation of the KL divergence?
I’m not sure how it would. What is the “probability distribution” for a list of phonemic spellings of words from a single dialect/language?
You could, I suppose, treat the set of computed Levenshtein distances for a pair of languages/dialects as a probability distribution, but then you’d need another pair to form the second probability distribution in order to compute the KL divergence. (So, no, I don’t think the Levenshtein distance is an approximation of the KL divergence.)
A relevant question would also be, 1059 years starting from what exactly? The first phonological isogloss, the first lexical isogloss, initial areal separation, initial administrative separation…? Mountains have very different heights depending on if we measure their height from the base terrain, from sea level, or from the center of the Earth, and at least a convention will be required before we can just state that “Kilimanjaro is 5895 meters tall”.
Of course, this suggests also a logical next step: the best way to demonstrate that some metric is arbitrary is to pick another metric from the same space and show that it gives different results (A–B >₁ B–C, but A–B <₂ B–C, etc.)
Something is very strong with this table.
How on Earth could the breakup of East Slavic be earlier than the breakup of Slavic itself?
Are you referring to the dates in the “Calibration Points” section of the dating paper? The confusing point is that their dates are “years BP [before present]”, so the date of “1450” for the breakup of Slavic means ~ 550 AD, while “760” for East Slavic means ~ 1250 AD.
Are you referring to the dates in the “Calibration Points” section of the dating paper? The confusing point is that their dates are “years BP [before present]”, so the date of “1450” for the breakup of Slavic means ~ 550 AD, while “760” for East Slavic means ~ 1250 AD.
I thought so too, and almost mentioned it, but I think they meant the actual table (either Table 1 or, more likely, Table 4), which gives the computed estimates of the breakup, and does give a larger number (= an earlier date) for the breakup of East Slavic than the breakup of Common Slavic.
1059 years starting from what exactly?
Considering that the calibration dates appear to be mixes of all of those four options (plus some that may be neither), there’s probably no single answer.
(“Epigraphic” is a mix of the two isogloss versions, “historical” is a mix of the two separation versions, “archeological” is mostly a best guess at areal separation.)
Something is very strong with this table.
Of course, I meant to write “wrong”, not “strong”!
Though some strong substances surely must have been consumed by the authors if they managed to make East Slavic the progenitor of Common Slavic….
@D.O.
Wikipedia confirms that the Levenshtein function is indeed a metric:
https://en.wikipedia.org/wiki/Metric_(mathematics)
so, it’s going to have all the properties one expects.
@ Peter Erwin Thank you, that helps. I think the KL divergence idea would apply to the frequency of use of the phonemes in the two topolects. I agree this is not very close to the Levenshtein distance.
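For what it’s worth, a minimal sketch of that phoneme-frequency reading (the mini-wordlists and the pseudo-count smoothing are my own, purely to show the mechanics):

```python
from collections import Counter
from math import log

def phoneme_freqs(wordlist, alphabet):
    """Relative frequency of each alphabet symbol across a wordlist.
    A pseudo-count of 1 per symbol keeps the divergence finite when a
    phoneme happens not to occur in one of the lists."""
    counts = Counter(ch for word in wordlist for ch in word)
    total = sum(counts[ch] + 1 for ch in alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)).
    Note the asymmetry: it is a divergence, not a distance."""
    return sum(p[x] * log(p[x] / q[x]) for x in p)

# made-up mini-wordlists, purely to show the mechanics
alphabet = "adehnotvw"
p = phoneme_freqs(["voda", "hand"], alphabet)
q = phoneme_freqs(["woda", "hant"], alphabet)
print(kl_divergence(p, q), kl_divergence(q, p))
```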
@D.O.
Wikipedia confirms that the Levenshtein function is a metric:
https://en.wikipedia.org/wiki/Metric_(mathematics)
so it’s going to have all the nice distance-like properties one hopes for
I don’t dispute that Levenshtein distance is a metric (I don’t know about normalized distance, but that’s a small point), I am saying that if you want to go from a distance measure to a classifier you have to solve the clustering problem, which goes beyond finding a threshold of similarity.
SFReader, I understood your “strong” remark as a witticism, and a good one.
I had assumed that he meant to write “stronk”
The wordlist on which the dataset is based has only 40 words, or “concepts” according to the article.
For what it’s worth, I didn’t even realize that anything was off in SFReader’s remark until it was pointed out directly. I just read it as “wrong”.
The wordlist on which the dataset is based has only 40 words, or “concepts” according to the article.
I noticed that too. I’m guessing they couldn’t find data in so many languages for more words than that?
Wonder if the conclusions are still going to hold with a 100-word or 200-word dataset…
How about a 2000-word dataset?
That’s the number of reconstructed Proto-Indo-European roots, I believe.
Entire language in a condensed form, so to speak.
I am pretty sure enough data exists for almost all languages in that damn table.
And surely computer technology their university can afford will be able to process 2000-column spreadsheets or whatever they use nowadays
I am pretty sure enough data exists for almost all languages in that damn table.
Extant ones, maybe. And even then it’s far from a given for anything that isn’t in Europe.
Also, any significant loanword layers would seriously mess the data up (because they’d show up as closely related), and we’re going to get a lot of those if the list includes any significant amount of culture-specific vocabulary.
I think the basic problem is not that Levenshtein distance is an obviously stupid metric, but that there is no a priori reason for adopting it; it’s essentially an arbitrary choice (i.e. I agree with D.O.) The only reason for using it as a criterion for dialect-versus-language-hood is its correlation (or not) with distinctions already made by other criteria.
Apart from the smallness of the dataset, I would also object to the notion that lexicon alone is actually a good enough criterion for separating language from dialect at all. Languages are not bags of words, and languages with very much shared vocabulary may differ greatly in other respects. Most individual Nigerian Pidgin lexemes are pretty easily recognisable to an L1 English speaker, for example, but anything apart from unusually acrolectal Nigerian Pidgin is pretty much incomprehensible to an untaught native English speaker. (Believe me. Mi fa, a no nak am at ɔl.) There are also a good many “false friends” in the syntax. A go fɔ haws for example, does not mean “I go home”, but “I went home.”
Apart from the smallness of the dataset, I would also object to the notion that lexicon alone is actually a good enough criterion for separating language from dialect at all. Languages are not bags of words, and languages with very much shared vocabulary may differ greatly in other respects.
I’m pretty sure Wichmann is aware of that; do you have any quarrel with the results provided by this insufficient criterion? If not, it would appear to be reasonably sufficient after all.
Admittedly W’s metric produces figures that are not counterintuitive; but how does that get us any further? He’s selected cases which are fairly well-behaved cross-linguistically in the sense that the only real question is the degree to which two languages, uncontroversially closely related and grammatically very similar, are “the same”, not cases which vary along unusual grammatical axes. I suppose the proper response is to take his results as (as it were) proof of concept, and then go on to look at cases where his methodology doesn’t agree with accepted wisdom, and see what we can learn from the mismatch. It seems unlikely that there would never be a significant mismatch.
Creoles in general would be an interesting place to look. There’s also the kinda-opposite situation of things like Anglo-Romani, where a group uses a distinctive vocabulary embedded in pretty much the same grammar as everyone else.
Further to the theme that closely related languages vary significantly in ways that have nothing to do with lexicon, there is the whole empire of phonology. Standard Scots (as opposed to Lallans), for example, is surely a dialect of English (or vice versa, as I would say) but the vowel systems are quite radically different.
I must admit that I have an ideological problem with the idea that language similarity could even in principle be reduced to a single number. It’s like the untenable notion that intelligence can be measured by a single number, outside some very specific domain where such a number might possibly be of immediate use (like army recruitment.) There are too many dimensions in reality for a flattening of them all into just one to be possible without severe loss of information.
The West/East Greenlandic example is interesting in this regard. I am far indeed from an expert in these matters, but I have read that there is great lexical turnover in Eskimo languages driven by the fact that personal names are taken from common nouns, and are tabooed when the bearer dies until a child is born to take over the name*, by which time everybody has become used to a neologism (so that – to take real examples – “dog” gets replaced by “puller” and “elbow” by “pusher.”)
*So that a man may be called “Woman”, without this implying anything about him personally at all.
@MattF: The mathematical definition of a metric only requires one nontrivial property, the Triangle Inequality: the distance d(A,B) is less than or equal to d(A,C) + d(C,B). The existence of a metric is a very strong condition if you are interested in topology; however, if you want to do analysis (making use of the actual numerical values of the distance function), it is extremely weak. There is lots of room for mathematically well-defined metrics to have pathological properties if we try to interpret them as real distances.
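A brute-force check makes this concrete, and also speaks to D.O.’s parenthetical about the normalized distance above. This sketch (mine) tests the triangle inequality exhaustively over short strings: the raw Levenshtein distance passes, while the length-normalised LDN turns up counterexamples such as d(“ab”, “ba”) = 1 > d(“ab”, “aba”) + d(“aba”, “ba”) = 1/3 + 1/3.

```python
import itertools
from functools import lru_cache

def levenshtein(a, b):
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def ldn(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

# test d(A,B) <= d(A,C) + d(C,B) over all strings of length 1-3 on {a, b}
strings = ["".join(s) for n in range(1, 4)
           for s in itertools.product("ab", repeat=n)]
for dist in (levenshtein, ldn):
    bad = [(x, y, z) for x in strings for y in strings for z in strings
           if dist(x, y) > dist(x, z) + dist(z, y) + 1e-9]
    print(dist.__name__, "violations:", len(bad), bad[:1])
```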
People have lovely examples to tease apart lexicon difference from other linguistic differences! Those make the point in a relatively dramatic fashion. It’s also important in valuing this metric to know: does it artificially inflate distances 5% in this family of languages, and contract them 5% in that other? Or +10% and -10% of who knows what? These are not values where you can point to a pair example and say “so it’s wrong”, but at a certain level of fuzz in the metric, what’s the point? Is it better than polling linguists on what their gut says?
If it’s Linguistics Math Fun Time, how about using conditional Kolmogorov complexity on the lexicons? This can subsume the Levenshtein distance metric, but it also has the ability to see things like “this is a systemic vowel shift” so it doesn’t overrate the distance of pure accents.
Kolmogorov complexity asks how long a custom program needs to be, to take the one lexicon as input and generate the other. So if there’s a pure phone substitution rule to be had, it’ll use that. Or if a brute-force list of unrelated edits is needed, then it becomes Levenshtein. In a rough sense it’s asking “how hard is it to learn this language’s lexicon if you know that one’s?”
(Yes, there are certain practical issues with computing concrete values for Kolmogorov complexity.)
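(The usual practical workaround is to substitute a real compressor for the uncomputable complexity, as in Cilibrasi and Vitányi’s normalised compression distance. A toy sketch with made-up lexicons; note that a gzip-style compressor only exploits repeated substrings, so it is a rough stand-in for a genuine substitution rule:)

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length as a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalised compression distance: roughly, how much of the longer
    description remains to be supplied once the shorter one is known."""
    return (c(a + b) - min(c(a), c(b))) / max(c(a), c(b))

# made-up lexicons: the second is the first under a crude 'vowel shift',
# the third is unrelated; the compressor ought to place the first pair closer
lex_a = b"hand water stone name night mountain tongue"
lex_b = b"hend weter stene neme neght meuntein tengue"
lex_c = b"kapa luvo tiri moko sensu quala brint"
print(ncd(lex_a, lex_b), ncd(lex_a, lex_c))
```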
Surely it would be simpler to just use military spending as a metric (breaking out land and naval forces).
Heh.
I’m surprised that Catalan/Castilian were rated 0.66 in the manuscript, well above the proposed cutoff line.
ASJP actually doesn’t use the raw Levenshtein distance. First it’s normalized by dividing it by the word length, which is a fairly standard thing to do. Then this value is taken for all pairs of words between two languages, and the final distance between the languages is the mean distance between semantically concordant pairs divided by the mean distance between discordant pairs. That is supposed to eliminate the effects of chance resemblances between words with different meanings.
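In code, that procedure looks something like the following sketch (mine; the three-concept lists are invented transcriptions):

```python
from itertools import product

def ldn(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1] / max(len(a), len(b))

def ldnd(list1, list2):
    """'Divided' distance between two wordlists covering the same concepts
    in the same order: mean LDN over same-concept (concordant) pairs,
    divided by mean LDN over different-concept (discordant) pairs, so
    that across-the-board chance resemblance cancels out."""
    same = [ldn(a, b) for a, b in zip(list1, list2)]
    diff = [ldn(a, b)
            for (i, a), (j, b) in product(enumerate(list1),
                                          enumerate(list2)) if i != j]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))

# invented transcriptions for the concepts eye, water, stone
print(ldnd(["oko", "voda", "kamen"], ["oko", "woda", "kamin"]))
```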
The ASJP alphabet is superphonemic, blurring many distinctions. I find the WP table hard to grasp, so I’ve made my own:
Front vowels: high, mid, low
Central vowels: low unrounded (rounded counts as front), all others
Back vowels: high, mid/low
Bilabials: voiced stop/fricative, voiceless stop/fricative, nasal
Labiodentals: voiced fricative, unvoiced fricative
(Inter)dentals: fricative, nasal
Alveolars: voiced stop, voiceless stop, voiced fricative, voiceless fricative, affricate, nasal
Postalveolars: voiceless fricative, voiced fricative
Palato-alveolars: voiced affricate, voiceless affricate
Palatals: stop, nasal, approximant
Velars: voiced stop, voiceless stop, fricative, nasal, approximant
Uvulars: voiced stop, voiceless stop, fricative (including pharyngeals)
Glottals: stop, fricative
Laterals: voiced alveolar, all others
Rhotic sounds: all
Clicks: all
All suprasegmental information is lost. All other phonemes are assimilated to one of the above.
Finally, here are the meanings of the 40 words:
Body parts: eye, ear, nose, tongue, tooth, hand, knee, blood, bone, breast (woman’s), liver, skin
Animals and plants: louse, dog, fish (noun), horn (animal part), tree, leaf
People: person, name (noun)
Nature: sun, star, water, fire, stone, path, mountain, night (dark time)
Verbs and adjectives: drink (verb), die, see, hear, come, new, full
Numerals and pronouns: one, two, I, you, we
And the amazing thing is that with all this approximation and only a few words to play with, it still does as well as it does (see: dog walking on its hind legs).
I wanted to post this a year ago in another thread; I will write it here instead. Two basic notions in set theory (basic as in “simpler than elementary school arithmetic”, but I don’t think they teach them) are “equivalence classes” and “equivalence relation”. My first idea is that they can be useful here.
Equivalence classes are just subclasses that do not overlap – like dialects in some people’s minds.
An equivalence relation is a binary relation that is reflexive, symmetric and transitive.
A relation is “A loves B” or “1 < 2” (binary means exactly two arguments are involved).
Consider a relation “a and b are pages of the same book”. Let’s designate it as “a ~ b”
reflexive: a~a
In English: every page is a page of the same book as itself (true).
symmetric: a~b => b~a
In English: for every pair of pages a and b, if a and b are pages of the same book, then b and a are also pages of the same book (true).
transitive: a~b, b~c => a~c
In English: for every three pages a, b and c, if a and b are pages of the same book and b and c are pages of the same book, then a and c are pages of the same book as well.
When you have such a relation, you can partition the class of all pages of the world into equivalence classes. When you have partitioned something into equivalence classes, you can introduce a relation “belong to the same class”.
Consider now a relation “a and b are friends”.
Reflexive? Is everyone a friend to herself? Well, we can define “friend” so, at least.
Symmetric? Yes, normally friendship is mutual.
Transitive? No. A friend of my friend is not always my friend – else the whole world except the Sentinelese people would be our friends.
What follows is that if we want to describe social structures based on friendships (and such structures play an important role!), notions similar to “tribes”, “nations”, “parties” are useless for us.
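To make the machinery concrete, here is a small sketch (mine) that partitions items by the transitive closure of a relation, and shows why non-transitivity bites: on a dialect chain A–B–C–D where only neighbours count as “close”, the closure lumps the whole chain into one class anyway.

```python
def equivalence_classes(items, related):
    """Partition items by the transitive closure of a reflexive, symmetric
    relation (i.e., its connected components, via union-find).  If the
    relation is already transitive, these are exactly its equivalence
    classes; if not, the closure may chain almost everything together."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a in items:
        for b in items:
            if related(a, b):
                parent[find(a)] = find(b)

    classes = {}
    for x in items:
        classes.setdefault(find(x), []).append(x)
    return list(classes.values())

# toy dialect chain: only neighbours count as 'close', yet the closure
# lumps the whole chain into a single class
items = ["A", "B", "C", "D"]
close = {("A", "B"), ("B", "C"), ("C", "D")}
related = lambda a, b: a == b or (a, b) in close or (b, a) in close
print(equivalence_classes(items, related))  # [['A', 'B', 'C', 'D']]
```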