The World Loanword Database (WOLD) is the most amazing thing I’ve seen in a while, linguistically speaking. Lameen Souag took time off from thesis-writing to share it, and I’m glad I have neither a thesis to write nor (at the moment) work to do, so I can splash around in it to my heart’s content. Here’s their description:

It provides vocabularies (mini-dictionaries of about 1000-2000 entries) of 41 languages from around the world, with comprehensive information about the loanword status of each word. It allows users to find loanwords, source words and donor languages in each of the 41 languages, but also makes it easy to compare loanwords across languages.
Each vocabulary was contributed by an expert on the language and its history. An accompanying book is being published by Mouton de Gruyter (Loanwords in the World’s Languages: A Comparative Handbook, edited by Martin Haspelmath & Uri Tadmor)….
The database can be accessed by language, by meaning, by author, or by reference.

Here’s the “Languages” page (with a nifty map: recipient languages are shown by a red symbol, donor languages by blue) and here’s the “Vocabularies” one, with a percentage of loanwords for each language (ranging from Old High German at 6% to Tarifiyt Berber at 53%). I’ll give a random example of the kind of information you get when you dig down. Bezhta (Affiliation: Nakh-Daghestanian, Avar-Andic-Tsezic; the section is by Bernard Comrie and Madzhid Khalilov) has 32% loanwords; one of them is čarx ‘whetstone,’ the page for which tells us that it is from Avar čarx ‘whetstone’, from Georgian čarxi ‘lathe.’ It goes on to say:

Other comments: Georgian may be the ultimate source – cf. also the related verb čarxva ‘to grind (knife)’ – in which case the details of the direction of derivation are unclear

And if you click on “Contact situation: Avar as local lingua franca,” you get:

Avar is the single largest immediate source for loans into Bezhta, and contact with Avar has been intense for at least some centuries, with Avar serving as the main means of communication with the outside world, in addition to personal meetings between speakers of Bezhta and Avar. It is difficult to justify the assignment of a particular date to the beginning of this process, but as a rule of thumb we have taken the beginning of the eighteenth century, as it was during the eighteenth century that the Bezhta speaking area was incorporated into a larger, religiously Muslim community under Avar leadership. In addition to loans of indigenous Avar origin, Avar has also provided the major conduit for the introduction into Bezhta of words of ultimate Arabic, Persian, or Turkic origin.

I could spend weeks rooting around in here, and probably will as work and other interests allow. Thanks, Lameen, and good luck with your thesis! (There’s already discussion of possible errors at Lameen’s post.)
I have to say, I’m not thrilled about “languoid,” which they call “a (relatively new) cover term for ‘language’ and ‘language family,’” but I suppose I can get used to it.


  1. Wow! This is awesome. Haspelmath and Tadmor gave a talk citing this at the 2009 LSA meeting and I’ve been wondering when they’d get around to making the data public. Thanks for promoting the link!

  2. John Emerson says:

    One of the things that struck me in Wixman’s “Language Aspects of Ethnic Patterns and Processes in the North Caucasus” was that a language as obscure as Avar was a literary language and lingua franca. So you have Russian > Azeri Turkish > Avar > a lot of even more obscure languages. Though perhaps Avar is on a par with Azeri.
    It’s very unlikely that the Avar language of the Caucasus has anything to do with the Avars defeated by Charlemagne, whose language was probably Oghur Turkish (and thus related to Volga Bulgar, Khazar, contemporary Chuvash, and probably Hunnic). Likewise the Avars of European history are probably unrelated to the Avars (Juan-Juan, etc.) of Chinese history.
    Come on, guys: one name, one thing! Confucius already told you!

  3. I guess this is a good point to mention macrolanguage, a term used in language tagging to identify a group of closely related languages which are for some purposes treated as a single language. Chinese, Arabic, Quechua, Zapotec, Kurdish, and Malay are prototypical examples, though ISO 639-3 identifies almost sixty of them.
    Sometimes there is a single dominant form which is partly identified with the whole macrolanguage (thus “Chinese” usually means “Mandarin”), sometimes there isn’t (no one Quechuan language is really dominant), sometimes it’s impossible to say (“Persian” encompasses both Farsi and Dari varieties, but is Farsi dominant? It depends where you stand and where you sit, as the saying goes).
    John E.: The names used by barbarians for themselves and each other are of no concern to the Han.

  4. The database is fascinating, but several things about it are puzzling. For example, some English pidgins/creoles (Tok Pisin, Sranan, Saramaccan) are classified as “Other/Pidgin/Creole”: others (Aluku, Kriol, Pacific Pidgin English) are considered “Indo-European/Germanic”. This is quite inconsistent (Aluku, Sranan and Saramaccan aren’t just creoles, they clearly have a common ancestor: likewise, it is accepted that Tok Pisin and Pacific Pidgin English are related): the Romance Creoles are considered “Indo-European/Romance”, except for Seychelles French Creole, which is “Other/Pidgin/Creole”.
    This means that any attempt to use the database to determine whether (for example) pidgins and creoles are more or less liable to borrow elements is bound not to yield usable results if one follows the database as to what languages are pidgins and creoles. CAVEAT LECTOR indeed…

  5. John E,
    Avar is what the Russians et al. call them, their own name for themselves is магӀарулал (гӀ being the voiced pharyngeal fricative as in Arabic ʕayn) = “highlanders”. According to Alekseev & Ataev 1997, the ethnonym ‘Avar’ is commonly believed to be derived from a word “found in many Eastern languages” meaning ‘those who roam, nomads.’
    Come on, guys: one name, one thing! Confucius already told you!
    No fair! He could just invent a new character for any homonym he wanted…

  6. I noted that some contributors seem to be using the data to push their own personal theories. Japanese is a particularly egregious example, with several claims of Proto-Malayo-Polynesian sources. This is not only flawed in that only vanishingly few historical linguists accept such a possibility, but also since PMP was *south* of Taiwan we’d instead expect the poorly documented northern Formosan languages to be interacting with the Japonic languages. (I am relieved there are no Altaic entries.)
    There are other things in the Japanese data that are very suspicious too, like sutōbu supposedly borrowed from Old High German, and karendā from Latin! (BTW, sutōbu doesn’t mean “stove”, it means “portable heater”, so the gloss is wrong too.) I strongly question how words like būmeran “boomerang” or opossamu “opossum” could seriously be considered as borrowings from Dharuk and Powhatan, respectively. It’s stupidly obvious that they came by way of English, and I doubt that any Japanese speaker has ever met a Powhatan speaker in real life. Attributing borrowings to a “root” language is a perilous road to go down, because it’s easy to end up with nonsense. (How many borrowings are there in English from Nostratic or Dene-Caucasian?) I believe this is a point that Sally Thomason has made many times in the past. The person that proposed these has made some seriously flawed judgements, and is very much at odds with the generally accepted analyses of historical and language contact.
    In sum, it looks suspiciously like data has been accepted without any critical review at all. Until there’s some semblance of anonymous review, I’m going to treat this site as no more reliable or citable than Wikipedia.

  7. While not errors per se, I’ve noticed some inconsistencies in Hungarian sources for Selice Romani. For example, árnyík and onoka are both archaic/non-standard spellings of árnyék (shadow) and unoka (grandchild). The same, I believe, goes for özved, which most speakers of Hungarian would know as özvegy (widow/widower).
    Leaving aside the question of why go so far as Bulgarian or Croatian for praho and nebo when Slovak or Czech would be a much more likely source (perhaps there’s data on the history of Selice Romani I am not familiar with), there are a few questionable etymologies. Chief among them is mama, which the list traces to Hungarian. Surely if it’s not one of those childspeak words that crop up all over the linguistic place, then any Slavic language in contact with SR is a much more likely source.

  8. Etienne, James C.,
    in case you haven’t seen it, Uri Tadmor was kind enough to respond to some of these questions over at LL after a post on the very talk Chris mentioned.

  9. Antonio Cheung says:

    Thanks so much for making this available, but there are also some things that I have noticed:
    1: Cantonese seems to be confused with Mandarin.
    The romanization given for the source words of the Vietnamese loans is that of Mandarin
    fall/pour is dou2 倒
    preserved sausage is laap6coeng4 臘腸
    stir-fry is caau2 炒
    Thai: chicken in Cantonese is gai1 雞, horse is maa5 馬
    Hawaiian: baak3 伯
    Swahili: typhoon in Cantonese is toi4fung1 颱風
    2: Japanese borrowings from Chinese
    As a native speaker of Cantonese and a learner of Japanese, I observe that many cases of borrowing would be much clearer if the donor language were considered to be Middle Chinese instead of Mandarin. The borrowing took place when Mandarin “wasn’t around”, and Cantonese preserves much more of the pronunciation of Middle Chinese than Mandarin does (Mandarin underwent some major changes/simplifications, e.g. the loss of many codas), so it would be much clearer if Cantonese data were also considered.
    e.g. 2.76 widow: both words exist in Cantonese and the pronunciations are closer too. 未亡人 is mei6mong4jan4. The Mandarin would probably be wei4wang2ren2
    Also, doesn’t onyomi refer to the “Chinese reading”? Words using onyomi could be evidence for borrowing.
    Go-on (呉音, “Wu sound”) readings are from the pronunciation during the Southern and Northern Dynasties or Baekje, an ancient state on the Korean Peninsula, during the 5th and 6th centuries. Go may refer to the Wu region (in the vicinity of modern Shanghai), but does not appear to have this meaning in Go-on.
    The following are copy-and-pasted from Wikipedia entry “Kanji”
    “Kan-on (漢音, “Han sound”) readings are from the pronunciation during the Tang Dynasty in the 7th to 9th centuries, primarily from the standard speech of the capital, Chang’an (長安 or 长安, modern Xi’an). Here, Kan is used in the sense of China.
    Tō-on (唐音, “Tang sound”) readings are from the pronunciations of later dynasties, such as the Song (宋) and Ming (明). They cover all readings adopted from the Heian era (平安) to the Edo period (江戸). This is also known as Tōsō-on (唐宋音).”
    I haven’t checked English borrowings from Chinese (Mandarin or Cantonese), but from history it seems that most contact occur at ports like Hong Kong, so probably it’s not Mandarin that is the source language, but local languages/lects.

  10. There are other things in the Japanese data that are very suspicious too, like sutōbu supposedly borrowed from Old High German, and karendā from Latin!… I strongly question how words like būmeran “boomerang” or opossamu “opossum” could seriously be considered as borrowings from Dharuk and Powhatan, respectively. It’s stupidly obvious that they came by way of English
    You’ve missed their distinction between “immediate” and “earlier” sources; the latter include the ones you mention (“mediated” might have been a better term).

  11. John Emerson says:

    I doubt that any Japanese speaker has ever met a Powhatan speaker in real life.
    It’s because of people like you that Tinkerbell died. I believe! I believe!
    And, no kidding, the Dravidian origins theory of Japanese is apparently alive and well. It’s probably where I got my joke, in fact.

  12. marie-lucie says:

    JE, you mean it was a joke? I am disappointed.

  13. I see Malay is there, but only as a donor language. I suspect showing it as a recipient language would break the database.

  14. John Emerson says:

    I almost had you convinced, didn’t I, M-L.

  15. I had a quick look at the Chinese section. There seem to be quite a few very tentative etymologies in there. I was intrigued by the hypothesis that 站 zhàn “station” is a loan from Mongolian (source word Mongolian Ĵam ‘road > post station’). This is accompanied by the comment that “During the Southern Song dynasty, zhàn 站 replaced the earlier yì 驛 for ‘station’. Officially forbidden in the following Ming dynasty, the word was still in colloquial usage and was restored during the Manzhu rule after 1644 for military stations. The verbal usage is first attested during the 16th century (Qi Xuguang 1528-87).”
    But the entry also notes “5. no evidence for borrowing”. So how far are we supposed to believe that this etymology is correct?
    Similarly 农 nóng “perhaps borrowed (Starostin)” from *niàŋu ‘field’ (Proto-Altaic), also “5. no evidence for borrowing”.
    Do we really have a decent basis for asserting that 站 and 农 are borrowed words? It doesn’t look like it.
    I also agree with Etienne that it is sloppy to trace Japanese loanwords to their ultimate ancestor.

  16. Trond Engen says:

    Good point, Alexa. That is obviously what is lacking in the database.
    Generally, if scientists went online and bought undergraduate essays from sites marketing themselves through internet spam much painstaking research could be avoided.

  17. There’s other really elementary mistakes in the Japanese list, like this:
    “湿気る(become wet)、 時化る (become rough (2)) The latter is a case of ateji.”
    First and foremost, all kanji are ateji in Japanese. This is lost on a lot of Japanese speakers, but it’s true.[1]
    Second, しける (shikeru) originally meant “to become moist” whence “to become inedible” and now is even used for food which has dried out and become inedible (ie., has become stale). It very clearly comes from 湿気 (shikke), which is actually a Chinese borrowing meaning humidity. 時化る (shikeru) also looks suspiciously like a borrowing, since neither shi for ji nor ke for ka (the canonical on-yomi for that combination being ‘jika’) would be very surprising. That’s just a guess, though. S- does exist in rain/storm words, mainly SHigure and haruSame.
    So, nit-picking, but these are things which wouldn’t escape someone who knew something about Japanese.
    Also, nohara is itself a compound, so either no or hara by themselves would make more sense in the context of a Polynesian source.
    And is Chinese 空気 (air) a borrowing from Dutch!?!
    [1] For example, かう is a Japanese word whose lexical coverage is very broad, meaning some kind of exchange. Its various kanji manifestations narrow the meaning according to the context (eg, 買う – to buy, (行き/飛び)交う – many elements of a group moving in different directions among one another, 代える – to substitute, etc., etc.) but they’re all only really just かう and morphological variations thereof (かわる、かえる).

  18. Since the languages were done by different people, I would expect a considerable range in quality. Sounds like Japanese suffered.

  19. “John E.: The names used by barbarians for themselves and each other are of no concern to the Han.”
    That’s not strictly true, John:
    England – Ying1 Guo2
    France – Fa3 Guo2
    Germany – De2 Guo2
    Italy – Yi4 Da4 Li2
    “I haven’t checked English borrowings from Chinese (Mandarin or Cantonese), but from history it seems that most contact occur at ports like Hong Kong, so probably it’s not Mandarin that is the source language, but local languages/lects.”
    Well, yes, even when one of those lects happens to be Mandarin, so:
    Tycoon, typhoon, goon (as in goon squad)

  20. John Emerson says:

    Mair and others have done a lot of work on very early (before 1000 BC, maybe before 2000 BC) Chinese borrowings from an Indo-European language, probably Tokharian. There’s archeological evidence giving plausibility to this, because many technologies seem to have come to China from the West, and the Tokharians were a major factor in the Chinese Northwest until late in the 1st millennium BC when the Xiungnu drove most of them to the SW.

  21. Trond Engen says:

    Most seem to think that they were absorbed into the (conquering) Uyghur society. I’m intrigued by the suggestion that they were driven southwest. What records are there? Do you have any idea where they ended?

  22. John Emerson says:

    Some remained in Xinjiang and survived until about 800 AD IIRC. They are the source of our written documents in Tokharian languages (2 or 3 languages or dialects). Before 200 BC or so the Tokharians had controlled a much larger area for thousands of years. When defeated by the Xiung-nu, the bulk of them retreated SW to the area of Afghanistan and Pakistan, establishing the Kushan Empire. (Kushan history is extraordinarily obscure and some think that the Kushans were Scythians with a Tokharian element, whereas others think they were primarily Tokharians.)
    The Tarim mummies Mair writes about are thought to be Tokharians, who are thought to have brought a lot of technology west sometime before 2000 BC.
    This is all controversial but it’s much better grounded than anything was even 15 years ago. Sources: Craig Benjamin: The Yuezhi. Mallory and Mair: The Tarim Mummies. Anthony: The Horse, the Wheel, and Language. Benjamin has a Volume Two coming out which should be about the Kushans (or not, depending on his judgement).

  23. John Emerson says:
  24. My main complaint is that many etymologies seem to be speculative. Somebody *thinks* that a word was a borrowing and it’s thrown in. I agree that 站 zhàn as a borrowing from Mongolian makes a lot of sense (modern Japanese for ‘railway station’ is 駅 (from 驛), modern Chinese is 火車站), so an outside source for 站 is quite plausible. But the list doesn’t provide any means of judging whether an etymology is well-founded or just someone’s wild guess.
    In the Japanese section, I notice that 自動車 jidōsha (automobile, motor car) is listed as a borrowing from Chinese, but the explanation given (e.g. Chinese uses 汽車 qìchē) suggests that it is not. I mean, unless someone has decisively proven that 自動車 is from Chinese and not simply a calque on “automobile”, why list it at all?

  25. John Emerson says:

    I think that the specific zhàn / yam relationship is well-attested. You frequently see it referenced, and the yam system was an early innovation of Ogedei in N. China and elsewhere, and later in S. China.

  26. The only reason I can guess they would give 自動車- jidousha – automobile a Chinese origin is because the components are transparently Chinese. It would be like me making up a word using a Latin root and Latin affixes — is the word a borrowing? In other words, if the components are all borrowed, does that qualify the word as a borrowing, even if the source language of the “borrowing” doesn’t have the word itself?
    That’s an interesting gray area.

  27. John Emerson says:

    I’ve been told that a fair number of Chinese neologisms have been adopted from Japanese, the only example I can think of being kexue “science”. This word leaves no sign of its origin, since the Japanese wrote it down in kanji and each nation pronounced it their own way. But IIRC it can be shown to have originated in Japan and been adopted into Chinese.

  28. Jordan,
    No kidding. Malay is as loanword-friendly as English, overflowing with them. (Tadmor is a Malay dialect specialist, but probably too busy to tackle all those loans.)
    Here’s one set I recently ran into, if Wikipedia is reliable on this topic.
    Sanskrit Maharddhika ‘great and mighty man’
    adapted to socially classify emancipated slaves in East Indian Dutch as Mardijker ‘freeman’, then adapted by the independence movement in Indonesia as Merdeka ‘freedom’.
    There was an interesting exchange last month on the AN-Lang (Austronesian langs) list about grammatically gendered loanwords into Malay, starting from Sanskrit. I copyedited and blogged a bit of it.

  29. I agree that Sino-Japanese is a grey area, but not as grey as that.
    The word “telescope” should surely be regarded as a modern word created from Greek elements, not a loanword from Greek. Similarly for words like 自動車.
    There is, I grant, a difference from modern European languages and ancient Greek, in that there was continuing exchange between Chinese and Japanese during the 19th century. The Japanese were up on Chinese attempts to translate foreign words and eagerly devoured Chinese books introducing Western knowledge. So it was a two-way street within what should be regarded as one cultural zone.
    Moreover, in some cases the Japanese didn’t create new words; they took old words (like 經濟) and gave them new meanings. Thus the boundaries between borrowing and coinage can be fuzzy.
    Still, it is my understanding that in most cases there is documentary evidence of where modern Sino-Japanese compounds were created (in Japan or in China). Thus, words like 社會, 科學, 化學 etc. were (if I remember rightly) clearly created in Japan and then taken into Chinese. It is as though “telescope” were created in English and then borrowed into Greek (I have no idea if this is the actuality, it’s merely a hypothetical example). This can’t by any stretch of the imagination be regarded as a borrowing from Greek into English.

  30. marie-lucie says:

    It is as though “telescope” were created in English and then borrowed into Greek (I have no idea if this is the actuality, it’s merely a hypothetical example)
    I have read that this type of thing has happened a number of times in Modern Greek, as scientific terms were coined from Greek elements in other European countries and then adopted into Modern Greek.

  31. I agree. I wouldn’t say it’s a borrowing, but still, how do you classify words created out of borrowed parts?
    Interestingly (and perhaps apocryphally), I’ve heard that the Japanese (namely Yukichi Fukuzawa) invented the word 自由 jiyuu – freedom/liberty, because there was no such term. Does anyone know if that’s true?

  32. John Emerson says:

    On Malay: in Taiwan I met an Anglo-Dutch woman who studied Malay but could understand Indonesian, because she knew the Dutch loan-words in Indonesian. The two languages were natively more or less the same, but diverged over the last 200 years or so.

  33. Does anyone know if that’s true?
    Not exactly. He only settled on 自由 jiyū among several existing alternatives like 自主 jishu, 自在 jizai and 不羈 fuki; in, for instance, his translation of the Declaration of Independence. See Douglas Howland’s Translating the West or the specific “Translating Liberty in Nineteenth-Century Japan” in JSTOR.
    One of the amusing aspects of the history he tells is an early treaty with the Dutch translating vrijheid as 我が儘 wagamama. Of course, now that’s the name of a post-modern 蕎麦屋, perhaps due to some misunderstanding.

  34. I think it was Herbert Read (or maybe I just have him on the mind because of Zinn) who claimed that French and German were impoverished in only having liberté and Freiheit respectively and not both like English.

  35. marie-lucie says:

    And English vocabulary is bloated by having borrowed synonyms of its own words from other languages, those synonyms then acquiring slight differences in meaning and use.

  36. John Emerson says:

    The Dutch dictionary is the largest in the world, which makes sense since Dutch was the language of the Garden of Eden.

  37. John means the Garden of Edam, of course.
    I’m reading (and blogging a bit of) a book about Creoles of all kinds in the Dutch East Indies, a book that has reminded me that the Dutch controlled both Ceylon and Malacca until as late as the 1790s, just a few decades before Raffles turned Singapore into something more than a fishing village.
    I’ve just come to a passage about how the creolized Burghers of Colombo were marginalized by their new British overlords the same way they had marginalized the creolized Portuguese settlers there 150 years before. The “Burghers” then went on to play key roles in the Ceylonese independence movement. I’ll blog it.
    BTW, I tend to use “Malay” in the broadest sense of the Malay world and its farflung outposts, not in the sense of Bahasa Melayu vs. Bahasa Indonesia (or Brunei Malay, which I understand has considerably enriched the intricacies of its honorifics in formal communication to reflect the many levels of its oil-fueled royal bureaucracies).

  38. Another Chinese neologism adopted from Japanese is 電話 telephone.

  39. Well, according to the World Loanword Database, Japanese 電話 denwa was borrowed from Chinese 電話 diànhuà. Frankly, I don’t know which way the borrowing went, but I am highly suspicious of the accuracy of the World Loanword Database. My feeling is that it’s sloppy and unprofessional when it comes to Japanese.

  40. The list is not even consistent within itself.
    At 煉瓦 renga ‘brick’ it says: “Brick was introduced to Japan towards the end of the Edo period. Probably a neologism, as this term has no currency in Chinese. Modern Mandarin uses zhuān 磚.” “Most likely a Japanese neologism involving SJ elements.”
    At アドービ煉瓦 adōbi-renga ‘adobe brick’ it says: “adōbi is from English, renga from Chinese”.

  41. Wow, that’s pretty embarrassing. They should have given Japanese to somebody else, obviously.

  42. It is rather embarrassing, isn’t it. The Japanese section in the book based on the data (Loanwords in the World’s Languages. A Comparative Handbook. Ed. by Haspelmath, Martin / Tadmor, Uri. 2009, about €200) is also quite embarrassing. Some of the other bits are rather good, though, but I think I will wait until the library gets it.

