LANGUAGE CONNECTIONS ON THE WEB.

Daniel Ford and Josh Batson have made a fascinating post on the Google Research Blog describing the connections between languages on the web:

Looking at the language web in 2008, we see a surprisingly clear map of Europe and Asia. The language linkages invite explanations around geopolitics, linguistics, and historical associations.
The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggest geographic rather than purely linguistic associations.
Examining links between other languages, it seems that many are explained by people and communities which speak both languages.
The language webs of many former Soviet republics link back to the Russian web, with the strongest link from Ukrainian. While Russia is the major importer of Ukrainian products, the bilingual nature of Ukraine is a more plausible explanation. Most Ukrainians speak both languages, and Russian is even the dominant language in large parts of the country.
The link from Arabic to French speaks to the long connection between France and its former colonies. In many of these countries Arabic and French are now commonly spoken together, and there has been significant emigration from these countries to France. …
What’s happened since 2008? The languages of the web have become more densely connected. There is now significant content in even more languages, and these languages are more closely linked.

By all means click on the maps, and go to Stæfcræft & Vyākaraṇa for “some ponderings”; I was particularly interested in his caveat at the end:

…But a Nepali-Marathi link doesn’t make sense, at least in absence of other intra-Indo-Aryan linkages.
There is one property which I can think of which does link Nepali and Marathi, namely the fact that they both are written in Devanagari script (also used for Hindi). Gujarati, Punjabi, and Bengali, on the other hand, are each written in their own scripts (distinct from Devanagari). So I wonder if there is any possibility that the script is creating “false hits” when the off-site link connections for Nepali and Marathi are being computed.
That also makes me worry about the other surprising inter-language linkages, such as Bengali-Swahili, Swahili-Tagalog. Not, obviously, that these languages share a common script, but whether some of the apparent connections are artefacts of the algorithm, whether due to use of a common script or some other factor. If they’re not simply artefacts, then it certainly would be interesting to find out why, for instance, Bengali-language and Swahili-language webpages are linking to each other.

Comments

  1. David L says:

    Fascinating data, but interpretation is difficult. The Arabic-French connection is there, to be sure, but it’s barely above their threshold, and the connection between Arabic and German is almost as strong. Is that a lingering remnant of German colonization in N. Africa, or something else?

  2. Nathanial says:

    I’d be a bit cautious about taking the data at face value, as Stæfcræft & Vyākaraṇa have wisely done. Below is a comment of mine on the Google Research Blog (pending approval) that I thought worth cross-posting here, as this meme is likely to take off, rocketing around the interwebs…
    ——-
    While I think that this work is quite interesting and surely measuring something related to what you’re shooting for, in the following quote the data doesn’t quite pass the sniff test:
    “However, only 45 percent of off-site links from English pages are to other English pages, making English the most extroverted web language given it’s size”
    What this is saying (and what I’m not quite buying) is that more than half (55%) of ALL external links from all important (page-rank-wise) websites in English go to non-English sites. Here’s where publication of the algorithm you’re using matters deeply. Could you give us, if not the algorithm, a representative sample of English-language websites where the vast majority of external links are to non-English sites?
    Again, I really like this line of inquiry, and the results probably are measuring something related to inter-language linkage proportions, but 55%, really? Strong claims need strong evidence.
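    The arithmetic behind the quoted figure can be illustrated with a minimal Python sketch. It assumes each off-site link is recorded as a (source language, target language) pair; the link list is invented toy data, and `target_shares` is a hypothetical helper, not the method Ford and Batson used.

```python
from collections import Counter


def target_shares(links, source_lang):
    """Return the fraction of off-site links from source_lang pages per target language."""
    counts = Counter(dst for src, dst in links if src == source_lang)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Toy data (made up for illustration): 20 off-site links from English pages.
toy_links = (
    [("en", "en")] * 9
    + [("en", "es")] * 6
    + [("en", "de")] * 5
    + [("fr", "en")] * 3  # links from French pages are ignored for "en"
)

shares = target_shares(toy_links, "en")
same_language = shares["en"]          # share staying within English
other_languages = 1 - same_language   # share leaving English
```

    With this toy list, 9 of the 20 links from English pages stay in English, so the same-language share is 45 percent and the remaining 55 percent go elsewhere. Whether real crawl data looks anything like this is exactly what the comment above is asking.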

  3. anyone know what the Telugu-German link is about?

  4. I wonder what Armenian to Belorussian and back means?

  5. Just below Figure 1 in Ford/Batson, there is this suspicious statement:

    The outlines of the Iberian and Scandinavian Peninsulas are clearly visible, which suggest geographic rather than purely linguistic associations.

    Now this is either the dummest sentence I’ve read today, or the authors are indulging in a bit of (unconscious ?) humor. –>Here <– is the graph enlarged. The nodes for Spanish and Portuguese are cheek by jowl, and near them languishes the Galician node. Now, either this proximity is due to manual intervention, or to a graph-drawing algorithm, or it is the result of an algorithm followed by manual intervention.
    Let’s consider the possibility that an algorithm was used. Can we identify any algorithmic features ? The links pointing from Galician to Portuguese, and from Portuguese to Spanish, are both labelled with “0,03”. Is the length of a link proportional to the size of the labelled fraction ? It seems not: to take only one example, the link from Arabic to French is much longer than the Galician->Portuguese->Spanish ones, yet is also labelled with “0,03”.
    So visual proximity is not being used as an aid to recognition of strong links (“edge weights” in graph terminology). But such a correlation is not an output feature of all graph-drawing algorithms, nor is it always important – for instance in some “avoid intersecting graph edges” algorithms. The “strength” of the links doesn’t accord with node placement in any obviously systematic way. It may be that no algorithm is involved.
    So let’s consider the other possibility: that the authors are kidding themselves, or us. They say about the visual proximity of the Galician->Portuguese->Spanish nodes that they take the “clearly visible” form of the Iberian peninsula, “suggest[ing] geographic rather than purely linguistic associations”. Is this a coincidence ? Perhaps the authors drew the nodes in proximity because of the geographical proximity of the respective linguistic populations ! So it would be no surprise that the nodes suggest geographical proximity if they were drawn near each other in order to mimic their geographical proximity.
    In fact, when one looks at the graph as a whole, one is struck by the circumstance that the “big-language” nodes are located suspiciously near where they would appear in a standard Mercator projection of the world minus the Americas – that is, if they were shown collocated over the corresponding linguistic populations.
    I suspect that the authors deliberately laid the nodes out as if over a world map, to help readers orient themselves. But they don’t say straight-out that they did this, and seem to have forgotten it themselves when the opportunity arose for displaying analytical penetration by identifying “the Iberian peninsula” in what purports to be purely statistical data.
    anyone know what the Telugu-German link is about?
    For what it’s worth, I note that there are many Southern Indian IT workers in Germany, and the fluctuation is high. Large German corporations such as the Deutsche Bahn have been trying for years to outsource programming work to India, with little success – but they persevere.

  6. komfo,amonan says:

    I will be curious to hear if anything like an explanation (akin to what GS has provided) surfaces for some of these unexpected connections. Maybe I’m overreacting, but Armenian-Belarussian e.g. seems so unlikely that I would be inclined to look at the data more closely & attempt to come up with an explanation before releasing the data.

  7. Athel Cornish-Bowden says:

    It’s certainly interesting, but when one reads things like “Both the Philippines and Pakistan are former British colonies”, one wonders how much confidence to put in anything else that is said.

  8. I had a similar reaction to the claim that “the Iberian peninsula is clearly visible”, although I didn’t say so, since I am trying out a no-exaggeration diet. Slight mistakes are no justification for huffing and puffing, but the “you can see the peninsula” claim seemed to have something seriously wrong with it. I figured out what the problem was much faster than my analysis above might suggest.
    I had to force myself to read Ford/Batson through to the end. My usual response to encountering blatant nonsense in an article, particularly of the statistical kind, is to immediately cast the article aside in disgust. I think of it in terms of learning efficiency. There are relatively few writers who deserve a close reading, so I avoid spending time on those who are no better than they should be.

  9. On the subject of my last-duchess attitudes towards other writers … I wonder to what extent in each case my willingness to read a book, and my possible appreciation of it, are promoted or demoted by the opinions of others as to its merits. Notice I say “to what extent in each case”, because obviously my opinions are never “just my own”. I wonder this every time I pass the groaning shelves of novels in a bookstore.
    At present I am fretting over this question while rereading Lolita, which 40 years on I now find rather irritating (even leaving aside the judicious, on-the-one-and-the-other-hand foreword by somebody or other). I’m not sure how I could set up an experiment for myself to throw some light on the matter.

  10. Philippines as a former British colony: here. (Well, Manila briefly occupied.)

  11. chris y says:

    Is that a lingering remnant of German colonization in N. Africa, or something else?
    Eh? The last significant German colonisation in North Africa I can think of involved the Vandals (and a few Suevi). I think it’s more likely that this reflects the point that N.Africa is a major trading partner of the EU, and that German is a major commercial language in the EU.

  12. I’m not sure how I could set up an experiment for myself to throw some light on the matter.
    Why not make a list of books you’ve read recently, with the names of any reviewers you can remember & the gist of their judgement? Then let us take a look and we’ll tell you what to think.

  13. Then let us take a look and we’ll tell you what to think.
    Ha, ha. I was thinking more along the lines of asking someone to prepare me a mix of “good” and “bad” novels bound in plain brown paper, with all the blurbs, forewords, afterwords and the author’s name ripped out or effaced.

  14. If I weren’t so prejudiced and supercilious, who knows whether I just might become a fan of Barbara Cartland.

  15. David Marjanović says:

    Is that a lingering remnant of German colonization in N. Africa

    There wasn’t any north of Togo and Cameroon.

  16. dearieme says:

    “Is that a lingering remnant of German colonization in N. Africa
    There wasn’t any north of Togo and Cameroon.”
    You overlook the towels on the sunloungers.

  17. Grumbly, are you having a mid-life crisis? Life’s too short to start reading anonymous books with brown paper covers. Around Christmas, the papers have quizzes where you have to guess who wrote a paragraph of text; why not wait for that? I bet you’d do very well.
    Do you even know any fans of Barbara Cartland? I don’t think I do. I bet no one even reads her books any more, she’s probably out of print.
    You could read every book you come across just to avoid being influenced by others. That’s one possibility. Another would be to read books recommended by people whose judgement you respect, and then see how often you agree with their opinion. Write it down, keep track. Then you can confront them: “I find I only concur with 29% of your opinions so, sadly, we can no longer be friends”, that kind of thing.

  18. I guess you’re right, Crown. It’s hard to change your tune when you can only play a kazoo. If I knew more about my reading prejudices, I might not look down on certain writers any more – but then I would look down on those who are not yet aware of their reading prejudices. Condescension is its own reward.
    I just read that Trollope worried for a while about his own reputation as a writer:

    I had so far progressed that that which I wrote was received with too much favour. … I felt that aspirants coming up below me might do work as good as mine, and probably much better work, and yet fail to have it appreciated. In order to test this, I determined to be such an aspirant myself, and to begin a course of novels anonymously, in order that I might see whether I could succeed in obtaining a second identity, – whether as I had made one mark by such literary ability as I possessed, I might succeed in doing so again.

    Two results of this were the “short tales” Nina Balatka and Linda Tressel, which I had never heard of. I just received copies from my sister. Twenty years ago I introduced her to Trollope, now she has sped far past me.
    The novels didn’t sell that well, even though several reviewers immediately recognized Trollope as the author. My sister describes them as “grim, grim, grim”. I’ve started with Nina Balatka, and found a cute expression that would have fitted in the recent Hat thread about “pupils”. A child says:

    ‘Anton likes fair hair–such as yours–and bright grey eyes such as you have got. I said they were green, and he pulled my ears. But now I look, Nina, I think they are green. And so bright! I can see my own in them, though it is so dark. That is what they call looking babies.’

    A note at the back explains this:

    looking babies: staring at the small image of oneself in another person’s eye. In a letter (7 February 1833), Tennyson reminds James Spedding ‘of the many intellectual, spirituous, and spiritual evenings we have spent together … while we sat … looking smoky babies in each other’s eyes (for you know, James, you were ever fond of a pipe)’.

  19. What is “Chinese_t”? Traditional? If so, we’re separating by writing system? In any case it’s interesting that the link from Chinese to Japanese goes through it.

  20. I assumed Taiwan.

  21. “I assumed Taiwan.”
    That seems to be correct:
    // Given a Language, return its standard code. There are Google-specific codes:
    // For CHINESE_T, return "zh-TW".
    // For TG_UNKNOWN_LANGUAGE, return "ut".
    // For UNKNOWN_LANGUAGE, return "un".
    // For PORTUGUESE_P, return "pt-PT".
    // For PORTUGUESE_B, return "pt-BR".
    // For LIMBU, return "sit-NP".
    // For CHEROKEE, return "chr".
    // For SYRIAC, return "syr".
    // Otherwise return the ISO 639-1 two-letter language code for lang.
    // If lang is invalid, return invalid_language_code().
    //
    // NOTE: See the note below about the codes for Chinese languages.
    from: http://src.chromium.org/svn/trunk/src/third_party/cld/languages/public/languages.h
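    For anyone who wants to see the lookup that header comment describes spelled out, here is a rough Python sketch. Only the special cases come from the quoted comment; the small ISO 639-1 table, the `language_code` name, and the `"invalid"` placeholder are assumptions for illustration, since the header excerpt shows neither the real table nor what invalid_language_code() returns.

```python
# Special cases copied from the quoted header comment.
SPECIAL_CODES = {
    "CHINESE_T": "zh-TW",
    "TG_UNKNOWN_LANGUAGE": "ut",
    "UNKNOWN_LANGUAGE": "un",
    "PORTUGUESE_P": "pt-PT",
    "PORTUGUESE_B": "pt-BR",
    "LIMBU": "sit-NP",
    "CHEROKEE": "chr",
    "SYRIAC": "syr",
}

# Tiny illustrative subset of ISO 639-1 codes (assumed, not the library's table).
ISO_639_1 = {"ENGLISH": "en", "FRENCH": "fr", "GERMAN": "de", "JAPANESE": "ja"}

# Placeholder: the excerpt does not show the real invalid-code value.
INVALID_LANGUAGE_CODE = "invalid"


def language_code(lang: str) -> str:
    """Return the special-case code if listed, else the ISO 639-1 code, else a placeholder."""
    if lang in SPECIAL_CODES:
        return SPECIAL_CODES[lang]
    return ISO_639_1.get(lang, INVALID_LANGUAGE_CODE)
```

    Under these assumptions, language_code("CHINESE_T") gives "zh-TW", which is consistent with reading “Chinese_t” on the map as the Taiwan tag.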

  22. Funny how people like Trollope, who publish work under an assumed name to see if it will sell (I think Paul McCartney did it too), must believe there’s a causal connection between quality and popularity. But a quick look round any bookshop or amazon would show them it’s not so. It’s something else, an insecurity: they’re really wondering if they would be able to get their life back if one day they wake up poor and anonymous.
    I like the sound of your sister. Thanks for that “looking babies”. I’d never heard it.

  23. Doris Lessing performed that sort of experiment, too.

  24. anyone know what the Telugu-German link is about?
    When last year I visited Vienna – for the first time in this millennium – I was surprised at hearing Tamil and Telugu in neighborhoods where Serbian (et al.) were spoken in 1999.
    Armenian and Polish/Belarussian is weird. There is an Armenian minority in Belarus, but not large enough. Weird.
    And no connections between Slovak and Hungarian. Figures. Same cities, but might as well be worlds apart. But no connection between Czech and Polish, that’s surprising.

  25. Crown: they’re really wondering if they would be able to get their life back if one day they wake up poor and anonymous.
    Yet more wit worthy of Wilde. I wish I had said that.

  26. You will, Grumbly, you will.

  27. Haha.

  28. The language tags “zh-CN” and “zh-TW” formally mean ‘Chinese as used in China’ and ‘Chinese as used in Taiwan’ respectively, but by a long-standing abuse of the tagging system they are often used for ‘Chinese in simplified Han script’ and ‘Chinese in traditional Han script’ respectively. Newer versions of the system provide “zh-Hant” for traditional script and “zh-Hans” for simplified script: thus Mao Tse-tung’s poetry is published in “zh-Hant-CN”.
    Google can reliably sort traditional from simplified script in a document of reasonable length, but cannot reliably distinguish mainland Mandarin from Taiwan Mandarin, so the confusion subsists.
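    To make the difference between region and script subtags concrete, here is a deliberately simplified Python sketch. It assumes a tag made only of a language subtag followed by an optional four-letter script and an optional region (two letters or three digits); extended language subtags, variants, extensions, and private-use parts are ignored, and `split_tag` is a hypothetical helper, not part of any tagging library.

```python
def split_tag(tag: str) -> dict:
    """Split a simple language tag into language, script, and region subtags."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and result["script"] is None:
            result["script"] = part.title()   # conventional casing, e.g. "Hant"
        elif is_region and result["region"] is None:
            result["region"] = part.upper()   # conventional casing, e.g. "CN"
    return result
```

    On this reading, "zh-TW" carries a region but no script, which is why using it to mean “traditional script” is an abuse of the tag, while "zh-Hant-CN" states both the script and the region explicitly.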

  29. Funny how people like Trollope,…,must believe there’s a causal connection between quality and popularity.
    Even funnier to hear someone assume to know what one of the 19th centuries greatest novelists believed!
    It’s something else, an insecurity: they’re really wondering if they would be able to get their life back if one day they wake up poor and anonymous.
    Pity old Anthony, he obviously didn’t know any better, what with being paralyzed by writer’s block, needing the hoi polloi’s approval to assuage his feelings of insufficience, searching the Yellow Pages for analysts, chatting on Oprah’s couch, publishing a tell-all, then going through rehab before becoming a vegetarian Buddhist. Please, let’s keep those insecurities in the fragile 20th century where they originated!

  30. Even funnier to hear someone assume to know what one of the 19th centuries greatest novelists believed!
    If I’m going to be patronised, it’s not going to be by someone who can’t spell “Century’s”.

  31. Hozo, you seem pretty consistently belligerent here. Could you dial it down a bit?

  32. If I’m going to be patronised, it’s not going to be by someone who can’t spell “Century’s”.
    Just getting in touch with my inner descriptivist, what ho!
    Point well taken Mr. Hat. Appreciate your fine work here.
