Daniel Ford and Josh Batson have written a fascinating post on the Google Research Blog describing the connections between languages on the web:
Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia. The language linkages invite explanations around geopolitics, linguistics, and historical associations.
The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggests geographic rather than purely linguistic associations.
Examining links between other languages, it seems that many are explained by people and communities which speak both languages.
The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.
The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France. …
What’s happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked.
By all means click on the maps, and go to Stæfcræft & Vyākaraṇa for “some ponderings”; I was particularly interested in the author’s caveat at the end:
…But a Nepali-Marathi link doesn’t make sense, at least in the absence of other intra-Indo-Aryan linkages.
There is one property which I can think of which does link Nepali and Marathi, namely the fact that they both are written in Devanagari script (also used for Hindi). Gujarati, Punjabi, and Bengali, on the other hand, are each written in their own scripts (distinct from Devanagari). So I wonder if there is any possibility that the script is creating “false hits” when the off-site link connections for Nepali and Marathi are being computed.
That also makes me worry about the other surprising inter-language linkages, such as Bengali-Swahili and Swahili-Tagalog. The worry is not, obviously, that these languages share a common script, but that some of the apparent connections may be artefacts of the algorithm, whether due to use of a common script or some other factor. If they’re not simply artefacts, then it certainly would be interesting to find out why, for instance, Bengali-language and Swahili-language webpages are linking to each other.
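The script worry is easy to make concrete. What follows is only a minimal sketch, not the method behind the maps (which isn't described in the passages quoted here): it assumes a hypothetical classifier, `dominant_script`, that looks at nothing but the Unicode names of the characters on a page. The sample words are just the languages' own names, picked for illustration.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return the most common script label among letters and marks in text.

    Hypothetical helper: the label is simply the first word of each
    character's Unicode name, e.g. 'DEVANAGARI' from 'DEVANAGARI LETTER NA'.
    """
    counts = Counter()
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalpha() or is_mark:
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

# Each language's name, written in its usual script (illustrative samples).
samples = {
    "Nepali": "नेपाली",
    "Marathi": "मराठी",
    "Hindi": "हिन्दी",
    "Bengali": "বাংলা",
    "Gujarati": "ગુજરાતી",
    "Punjabi": "ਪੰਜਾਬੀ",
}

scripts = {lang: dominant_script(word) for lang, word in samples.items()}
for lang, script in scripts.items():
    print(f"{lang}: {script}")
```

A classifier this crude cannot tell Nepali, Marathi and Hindi apart, since all three come out as Devanagari, while Bengali, Gujarati and Punjabi (Gurmukhi) each get a label of their own. A real language detector would presumably use more than the script, but any step that leans on it could blur exactly the pair of languages the caveat singles out.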