LANGUAGE GUESSER.

Maciej Cegłowski of Idle Words has created something called Languid (langu-ID, get it?):

I’ve set up a little web service for identifying language. If you paste in some text (the more the better), it will tell you what language it’s in. Not rocket science, but perhaps useful to somebody.
There’s an API for people who like to do things programmatically.
Note that I’m logging all the queries, so you don’t have to email me and say “I pasted BLAH and it gave me the wrong answer”. But any other feedback is welcome.

Me, I pasted Inuit (the text string from my Last Samurai post) and it told me it was Cebuano; this perplexed me less when I saw that on the right of the Languid page is a vertical list of all the languages he’s programmed into it, which includes Cebuano but not Inuit. Anyway, it’s a lot of fun, and I thank Margaret Marks (of Transblawg) for alerting me to it (via this Blethers post).
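The Cebuano result is what you would expect from any guesser that always returns its nearest match among the languages it has been trained on. Here is a minimal sketch of that idea, assuming character-trigram profiles compared by cosine similarity. This is one common approach, not necessarily what Languid does, and the function names and toy training sentences are invented for illustration (a real trainer would want far more text).

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-normalized text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess(text, profiles):
    """Return the trained language whose profile is closest to the sample.

    There is no "none of the above" option, so an untrained language is
    still assigned to whichever known profile happens to be nearest.
    """
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Toy training data (hypothetical; far too small for real use).
profiles = {
    "english": trigram_profile(
        "the quick brown fox jumps over the lazy dog and then the other dog"
    ),
    "spanish": trigram_profile(
        "el rápido zorro marrón salta sobre el perro perezoso y luego el otro perro"
    ),
}

print(guess("the dog and the fox", profiles))
```

Because `guess` can only answer with a key of `profiles`, text in a language outside the training set still comes back labeled as one of the trained languages.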

Comments

  1. If you can point me to some training text for Inuit (and other languages), I’d be happy to add support. The trainer needs about 10K of source text to do a respectable job.
    Right now the guesser gives the nearest possible match, which can be quite far off for more exotic languages (Maori, for example).

  2. It, er, doesn’t seem to be working right now. I fed in one of the Tzotzil dialogues from Sk’op Sotz’leb, and it never gave me an answer.
    The problem (well, one problem) with Inuktitut[1] is that there have been a number of different orthographies, both Roman and syllabic, which will have different statistical characteristics. The text in The Last Samurai is quite possibly in an antiquated form, given its origin in the book. You’d want to be careful to choose a trainer text written in a system in common use. (Similarly for other languages without an extensive written tradition, I suppose). An archived version of a set of Inuktitut texts, in Unicode, might be enough to train it on syllabics. Also the website of the government of Nunavut has pages in Inuktitut (syllabics) and Inuinaqtun (Roman). Scott Martens might be able to provide further information – I know he has some familiarity with the language.
    It’s hardly a representative text, but you might consider the various translations of the Universal Declaration of Human Rights as a source for a lot of exotic languages. (No good for those in non-Roman scripts, though, as it uses graphical pdfs.)
    [1] I’m being pedantic, perhaps, but the preferred name for the Inuit language is Inuktitut. Inuit is the plural of inuk, “human being”, and inuktitut means something like “in the manner of an inuk”. Of course, since the Eskimoan languages form a dialect continuum from the Bering Strait to Greenland, that’s a simplified view of the situation[2]. I’m just pointing out that a distinction is made between the Inuit people and the language Inuktitut – see e.g. Ethnologue.
    [2] The Alaskan dialects are referred to as Iñupiaq or Inupiaq; the Canadian dialects, spoken by the Inuit, as Inuvialuktun, Inuktitut and Inuttut (Labrador); and the Greenlandic dialects as Kalaallisut.
    –Marianne Mithun, The Languages of Native North America

  3. You are indeed being pedantic, and while I’m a big fan of pedantry in its place, I try to write for the general reader, who would have no idea what “Inuktitut” means. If I were writing a post about the language, of course I would use the correct term and explain what it means, but as a casual reference I’m not going to go to the trouble. When I’m chatting with someone and want to refer to Romanes, I call it “Gypsy” for the same reason — it would be appallingly discourteous to use a name certain to go over their head, and I would find it unbearably pedantic to be saying things like “now, in Romanes, which is what you call ‘Gypsy’…” To every thing there is a season.

  4. i have some information on resources related to the language, as my father was a social worker up north for a decade or so.
    if you email me, i can get you the resources, and also some information on cree and dene.

  5. Hat, this non-linguist has always very much appreciated all your efforts, seen and unseen, known and unknown, to make this blog accessible to those of us who are interested and want to become more knowledgeable. Or, more plainly, thank you for your clarity and accessibility.

  6. Michael Farris says:

    Tim: I think your heart’s in the right place and I’m as naturally politically correct as it gets (see my comment vs. structural sexism in Slavic languages) but expecting English speakers (English is still just a national language after all) to recognize each and every ethnic self designation in the world is a little misplaced.
    I have a long-standing trivial interest in Greenlandic (including a few textbooks such as Qaagit! [in Danish mange tak] and a long-standing wish that KNR will put their broadcasts on line already) but I can’t see anything wrong with ‘Eskimo’ for Eskimos (or Gypsy for Romanis or Hungarian for Magyars, Finnish for Suomis etc. etc. etc.).
    I’ll believe that Eskimos should be referred to as Inuit when Poles stop referring to angielski and ask Czy Pan mówi po English? (or Russians ask Govorite li po-English?)

  7. LH: Accessibility to a general readership is a most worthy goal, and one in which you have always succeeded admirably, in my opinion. You have used both “Romanes” and “Inuktitut” without explanation in the past, though.
    Michael: I dare say you’re right in general principle – or, rather, I don’t think any reliable general principle can be established. Certainly, to require all such terms in English to be based on native forms would be at best quixotic. But I think the word Inuit is pretty widely recognised, now. In any case, I don’t have any strong feelings on the word Eskimo – for one thing, I don’t know any better term for the whole group containing the Inuit and Yup’ik.
    Anyway, I wasn’t talking about Eskimo vs Inuit, but Inuit vs Inuktitut. Which is perhaps an unfair distinction to expect English to make, since it doesn’t normally distinguish between language and nationality. I can’t think of any examples among those words that can clearly be considered as English names for other languages, rather than foreign words used in English. (Not counting those cases where there is no clear notional 1-1 mapping of language to nation, like “Swiss” or “Latin”).

  8. You have used both “Romanes” and “Inuktitut” without explanation in the past, though.
    Got me! What can I tell you? Sometimes I’m feeling populist, sometimes I’m feeling linguistical…
    Good point about words for languages vs nationalities.

  9. Language vs. nationality:
    can anybody explain to me, please, why the country is called the Netherlands and the language (AND nationality, as far as I understand – but maybe there is a nuance I don’t see) – Dutch?
    It is so much more logical in Russian – *Gollandia and *gollandskij yasyk. And * datskij is the language of Dania, pure and simple!

  10. From my trial of the program it’s a “good guesser”, but not perfect. I put in a sample of Kurmanci Kurdish, in Latin script, and the response it gave was “Turkish”. Or maybe the program is simply taking the traditional Turkish approach of denying the Kurdish language and pretending it’s “mountain Turkish”…

  11. Tatyana: I think the crucial factor is that the Netherlands became an independent country long before Germany, so that there was an established name for it as early as the 17th century. There was no separate language and people, however — they were just a variety of “Dutch” (= German), like Saxons and Bavarians. Then the word got specialized; as the OED says:
    “In the 15th and 16th c. ‘Dutch’ was used in England in the general sense in which we now use ‘German’, and in this sense it included the language and people of the Netherlands as part of the ‘Low Dutch’ or Low German domain. After the United Provinces became an independent state, using the ‘Nederduytsch’ or Low German of Holland as the national language, the term ‘Dutch’ was gradually restricted in England to the Netherlanders, as being the particular division of the ‘Dutch’ or Germans with whom the English came in contact in the 17th c.; while in Holland itself duitsch, and in Germany deutsch, are, in their ordinary use, restricted to the language and dialects of Germany and of adjacent regions, exclusive of the Netherlands and Friesland; though in a wider sense ‘deutsch’ includes these also, and may even be used as widely as ‘Germanic’ or ‘Teutonic’. Thus the English use of Dutch has diverged from the German and Netherlandish use since 1600.”
    (Note that the OED, back in the 19th century, uses “Netherlandish” to refer to the language; I have no idea how widespread this was, but it’s certainly not been used for a long time now. I suspect they were using a rare/archaic term to avoid confusing people by using “Dutch” in an article about that word.)

  12. Oh, I see – it’s another instance of English speakers correcting the foreigners in matters of their own language and nationality. Like that guy who insisted on calling me Titania and said I have to adjust to please the English ear.
    I concede, though, that calling the language “Hollandish” wouldn’t sound terribly pleasing in English.
    And what do the natives of the Netherlands call their official language and nationality?

  13. The adjective for both nationality and language is nederlands; a Dutch person is a nederlander.

  14. Thanx, LH. This is how I will refer to them both- language and persons- from now on.

  15. Michael Farris says:

    If you’re talking with a Netherlander, just don’t refer to the country as Holland, unless you want a mini lecture on geography.

  16. I’ll make sure I’m not talking to a Brabant[er].
    Still, some Nederlanders don’t seem to be offended, since they set up an official Netherlands tourist site called holland.com

  17. So for what it’s worth, I put in Yiddish and was told it’s Hebrew. Is that because Yiddish is not among the “target languages” on the right side, or did the program just get it wrong?

  18. The former — if he hasn’t programmed it in yet, there’s no way it could be recognized.

  19. It is able to refuse to recognise a language entirely, though. Entering Basque (a paragraph of the UDHR) gave the result “gibberish”. Likewise Greenlandic. Presumably this would be the ideal response for an unknown language, although it’s probably impossible to guarantee that.
    Some other notes –
    Tagalog seems to be pretty consistently misidentified as Cebuano, although both are listed.
    Entering UTF-8 text in scripts not used by any of the listed languages produces responses like Unknown script: ‘Khmer’
    The list features “gurumkhi”. This I take to be a typo for Gurmukhi. Even spelt correctly, it doesn’t belong on the list, as Gurmukhi is not a language but a script, mainly used to write Panjabi. Finally, if there is code to recognize text in Gurmukhi, it doesn’t seem to work – it returns Unknown script: ‘Gurmukhi’.
    Oriya is listed. I’d be interested to know what text was used to train it – I’ve never been able to find a Unicode Oriya text online. I could use one to check my system’s rendering capabilities.
    Sinhala (sample text) is correctly identified, although it isn’t listed.
    Quite impressive overall.

  20. Ha! I entered five English dialect words and it told me, very seriously, that they were Indonesian. I suppose to an outsider they may as well be.
    But a fun site anyway.

  21. This ‘guesser’ went 0-10 in basic Armenian sentences.
    Needs lots of work. I mean lots.
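
The “Unknown script” responses reported in comment 19 suggest that a script check runs before any language guessing. How Languid actually does this isn’t stated anywhere above, so what follows is only a rough sketch of one way such a check could work: take the first word of each letter’s Unicode character name as a stand-in for its script and report the most common one. The `KNOWN_SCRIPTS` set and both function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical: scripts used by at least one trained language.
KNOWN_SCRIPTS = {"LATIN", "CYRILLIC", "GREEK"}


def dominant_script(text):
    """Return the most common script among the letters of text, or None.

    Heuristic: the first word of a character's Unicode name
    (e.g. "KHMER LETTER KA" -> "KHMER") is treated as its script.
    """
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return scripts.most_common(1)[0][0] if scripts else None


def check_script(text):
    """Return an 'Unknown script' message, or None if guessing can proceed."""
    script = dominant_script(text)
    if script is not None and script not in KNOWN_SCRIPTS:
        return f"Unknown script: '{script.title()}'"
    return None


print(check_script("\u1780\u1781\u1782"))  # three Khmer letters
print(check_script("hello world"))
```

A check like this would also explain the Gurmukhi report: if a script is listed but missing from the set of scripts the guesser accepts, the text is rejected before any language profile is consulted.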
