Søren Wichmann, a Danish linguist, discusses a perennial problem and presents a promising solution; after describing the many ways in which it’s hard to pinpoint the difference between a language and a dialect, he continues:
Recently, two major obstacles in distinguishing language from dialect have been overcome. The first is how to measure differences between speech varieties – finding a value for D. In 2008, a number of linguists came together to form the Automated Similarity Judgment Program (ASJP), of which I am the daily curator and a founder. The ASJP painstakingly assembled a systematic, comparative dataset of languages that now contains 7,655 wordlists from what would be two-thirds of the world’s languages, if we assume for our purposes that languages are defined as in the ISO 639-3 code standard. Since each wordlist contains a fixed set of 40 concepts and is transcribed in a uniform manner, they can easily be compared, and a measure of difference can be obtained. The measure of difference between two words that has become most used is a version of the Levenshtein distance, named after Vladimir Levenshtein, a Soviet computer scientist who in 1965 devised an algorithm to compare two strings of symbols. He defined ‘distance’ as the number of substitutions, insertions and deletions needed to turn one string into the other. The Levenshtein distance can usefully be divided by the length of the longest of the two strings, because this puts all the distances on a scale from 0 to 1. This has become known as the normalised Levenshtein distance, or LDN.
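For readers who want to see the arithmetic, here is a minimal Python sketch of the distance just described: count the substitutions, insertions and deletions, then divide by the length of the longer string. The function names `levenshtein` and `ldn` are my own, and the inputs are ordinary spellings chosen for illustration rather than ASJP transcriptions; this is not the ASJP's actual code.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the substitutions, insertions and deletions needed to turn a into b."""
    # previous[j] is the distance between the symbols of a processed so far
    # (initially none) and the first j symbols of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i symbols of a into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching symbol
            ))
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """Normalised Levenshtein distance: raw distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings count as identical
    return levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(ldn("hand", "hant"))               # 0.25
```

Identical strings come out at 0 and strings with no symbols in common at 1, which is the 0-to-1 scale mentioned above.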
The second obstacle is that perhaps ‘language’ and ‘dialect’ are concepts that can be defined only arbitrarily. Here, there is some more promising news. If we look at all the language families in the ASJP database for which database contributors have included a healthy portion of close varieties, we can begin to look for different behaviours of languages and dialects. An intriguing picture emerges: the distances tend to hover around either a relatively small value or a relatively large one, with a valley in between. As it turns out, the valley tends to lie in a narrow range around a mean of 0.48 LDN. Without losing significant precision, we can say that speech varieties tend to not be halfway similar in their basic vocabulary. Either they will tend to be more similar, in which case they can be defined as different dialects, or less similar, in which case they can be defined as different languages. Herein lies the distinction between language and dialect.
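The cut-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-concept LDN values are taken as already computed, the distance between two varieties is assumed to be the plain average over the concepts on the list, and a pair sitting exactly on 0.48 is arbitrarily counted as different languages. The name `classify_pair` is hypothetical, and the numbers in the example are invented, not ASJP data.

```python
from statistics import mean

VALLEY = 0.48  # mean position of the valley reported in the passage above


def classify_pair(per_concept_ldn, threshold=VALLEY):
    """Label two speech varieties from their per-concept LDN values.

    Assumption: the overall distance is the simple average across concepts.
    """
    distance = mean(per_concept_ldn)
    if distance < threshold:
        return "dialects of one language"
    return "different languages"


# Invented values for illustration only.
print(classify_pair([0.0, 0.25, 0.2, 0.5]))   # dialects of one language
print(classify_pair([0.8, 1.0, 0.6, 0.75]))   # different languages
```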
The phenomenon is probably a result of social circumstance. Dialects will drift apart as people settle in new places and shape new identities but, if there is still some contact, convergence can also be present so that speech varieties remain less than halfway similar (and therefore the same language). A small push in the direction of divergence, however, might cause the varieties to drift apart relatively rapidly, raising their Levenshtein distance, thereby qualifying them as distinct languages. Possibly there is a connection between the cut-off for distances between words on the standard list used by ASJP and corresponding distances in other parts of language structure that make for a point of serious loss of mutual intelligibility. In other words, the threshold for mutual intelligibility might correlate with the threshold between languages and dialects. We don’t know that yet, but it’s something to look into. […]
Finally, a technique derived from the datasets, called ASJP chronology, can be applied to establish the amount of time it takes for dialects to drift far enough apart to qualify as separate languages. The answer we have found, ignoring some margin of error, is 1,059 years. These findings can be corroborated by looking at how long it typically takes for an ancestral language of a language family to break up into daughters that subsequently become ancestors of subfamilies. This requires other techniques, but the results are similar: it takes about a millennium for dialects to become languages. We know this because we can now distinguish the two.
Thanks, jack!