Multilingual Parallel Bible Corpus.

Here you can find a multilingual parallel corpus created from translations of the Bible. This is an effort to create a parallel corpus containing as many languages as possible that could be used for a number of NLP tasks. Using the Book, Chapter and Verse indices the corpus is aligned (almost) at a sentence level. (There are cases where two verses in one language are translated as one in another.)

Following a similar effort by Philip Resnik and Mari Broman Olsen at the University of Maryland, I have encoded the text of each language in XML files using the Corpus Encoding Standard.

The following table contains the XML Bibles in 100 languages (all the languages that an electronic version was freely available online) along with information about each language from Ethnologue.
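The verse-level alignment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionaries, the `(book, chapter, verse)` key layout, and the placeholder strings are hypothetical and are not taken from the corpus files themselves.

```python
def align_verses(source, target):
    """Pair up verses that share the same (book, chapter, verse) key.

    Keys present on only one side are returned separately; this is where
    merged verses (e.g. a combined "3-4") would show up.
    """
    pairs = [(key, source[key], target[key]) for key in source if key in target]
    only_in_source = [key for key in source if key not in target]
    only_in_target = [key for key in target if key not in source]
    return pairs, only_in_source, only_in_target


# Hypothetical toy data: placeholder strings stand in for the verse text.
lang_a = {
    ("LUK", 1, "1"): "a: verse 1",
    ("LUK", 1, "2"): "a: verse 2",
    ("LUK", 1, "3"): "a: verse 3",
    ("LUK", 1, "4"): "a: verse 4",
}
lang_b = {
    ("LUK", 1, "1"): "b: verse 1",
    ("LUK", 1, "2"): "b: verse 2",
    ("LUK", 1, "3-4"): "b: verses 3 and 4 translated as one",
}

pairs, only_a, only_b = align_verses(lang_a, lang_b)
print(len(pairs), only_a, only_b)
```

Exact key matching explains the "(almost)": wherever one translation merges two verses, the keys no longer line up and the leftover entries have to be reconciled by hand or by a range-aware rule.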

  1. David Eddyshaw says

    all the languages that an electronic version was freely available online

    There are a great many more than 100 of those (inevitably I looked for Kusaal straight away …)
    I expect they mean “freely available” in the sense of being freely usable for any purpose (i.e. not copyright.)

    Though Bible translations generally are copyright (for good reasons.) Freely available for the purpose in question, then, I suppose.

  2. The author of the alignment is Christos Christodoulopoulos. Wow, “Christ, son of the servant of Christ”. There is some “his own grandfather” feel to it, but it is much simpler — the gentleman’s father is his servant; it happens. But was Joseph the Carpenter his son’s servant? Or maybe god-the-father is Christ’s servant? There is some serious heresy hidden in this name…

  3. I’m back.

    But perhaps I should explain. I was innocently googling when i slipped across the wrong side of a tesseract and fell into the 2015 languagehat German Dzwiebel AmerIndoEuropean Zombie Apocalypse.

    Complete with the foreshadowing character of Gary Moore and the echo, or possibly cat’s paw, “Kaledon Naddair.” Though if Dzwiebel did pull the strings of Kaledon, he did it with verisimilitude and panache – Kaledon made a brief first foray in the thread, then opened his next post a day and a half later with “having taken the time to read all the entries in this thread …” By then there were some 900, with the Dzwiebel harangues and ripostes by David Marjanović running the length of 70’s New Yorker feature stories. The thread was still young and would see more than double that before it ran its course.

    Wow! My jaw was hanging so deeply into the Appall range for so long that I feared my accent would change. The blog version of a Netflix binge in a series that the sunk cost fallacy should have warned me off of long since. Or at least at the point the thread reached its Naddair.

    Is there a place on Dzwiebel’s now moribund blog where I can apply for reparations?

    That guy deserved all the hits he took. And consider that in my dialect “hits” is derived via the concept of *s-mobile.

    Sorry for the interruption. But you all lived that in real time! A Sarajevo of the linguist’s soul.

  4. @ Ryan:

    Did the thread use “co-ed”?

    @ The Bible:

    Disappointing. Limited range of languages. No Mongolian. Couldn’t open things (internal server error) and maps apparently didn’t open properly (although it didn’t seem to matter).

    There are plenty of Bibles available on the Internet. The trouble is not in finding them but in finding ones you want in a useful format.

    And the more versions the better. If the only English-language version you had access to was King James, it would be of limited usefulness, wouldn’t it. Same goes for other languages. A broad range allows you to compare style, register, vocabulary, etc. One version is just one version.

  5. >Did the thread use co-ed

    If it had, Dzwiebel would have given 23 graphs on why it’s cognate with the Welsh word, since co-eds wander in the grove of academe.

  6. only 100 languages? has bibles in 1289 languages.

  7. My favorite is Ulster Scots.

    1 Maist warthie Theophilus, A hale lock o fowk haes taen ït ïn han tae draa up an accoont o aa tha thïngs that haes cum aboot amang iz.
    2 The’ hae brocht thegither whut wus hannit doon tae iz frae yins that saa ït aa wi thair ain een richt frae tha stairt, an that becum sarvints o tha Wurd o God.
    3-4 Sae, haein lukt ïntae ït aa masel richt frae whan tha hale storie begun, A thocht ït wud be a guid thïng fer me tae pit doon a trig accoont fer ye forbye, Theophilus, tae mak ye shair o tha truith o whut ye hae bin lairnt.

  8. David Marjanović says

    Dziebel is not an onion (Zwiebel).

    Whether he is The Onion is a separate question.

  9. Dziebel is not an onion (Zwiebel).

    Well, they’re clearly cognate — just look at them!

  10. Cognate and magnate are too, by the same type of reasoning – direct inspection, aka grokking the sense data.

  11. T. Herman Zweibel (with “ei”) is the editor of The Onion (fictionally).

  12. David Marjanović says

    I have Zweifel (“doubt(s)”) about that. 🙂

  13. John Woldemar Cowan says

    On being named Twivvle (although there are in fact many people named Zweifel).

  14. Does anyone know whether Vladimir Diakoff, who entered the thread just as Kaledon departed, calmly tried to defend Dzwiebel from a Russian academic context, and has the kind of detailed knowledge of GD’s theories and background that you’d expect from his biographer, was real, or another layer of the Zwiebel? It’s only now I’m realizing how Diakoff could be pronounced. In light of the revelation of the Onion editor’s name I’m starting to think German really might be an elaborate hoax. Which would make his angry protests about anonymous posters really funny.

    LinkedIn says he’s head of brand strategy and marketing for a “cultural consultancy”. I wonder how he approaches the task of getting people to like the company. Does he handle their social media with the same gusto that he brings to his personal online interactions?

  15. Checked his LinkedIn.

    OK, it turns out I know personally people who know him – one degree of separation.

    In fact, we closely missed a chance to meet personally at a certain university I visited (he graduated from there a few months earlier).

  16. Capra Internetensis says

    Many people, including me, have suspected Dziebel of being a long-running hoax, but he has never to my knowledge broken character. He has published (and often shills for) a book, called The Genius of Kinship, through a genuine if not prestigious academic press. The author bio notes that he “has conducted field research among the Karelians and the Mordvinians in Russia and among the urban reenactors of Native American cultures in Eastern and Western Europe.” One of the reviews mentions that “the author has brought up himself as a ‘native American’ in Europe”! I’m pretty sure he is a genuine academic crackpot (and Diakoff is almost certainly his sock-puppet).

  17. @SFReader: That’s two degrees of separation, actually. You start counting at zero degrees of separation from yourself. Incidentally, what the famous “six degrees of separation” experiment* actually showed was not that any two people in the U. S. were connected by six degrees. Rather, it showed that people could find a path of connections to anyone in the country (assuming a certain stockbroker in Medford, Massachusetts was “anybody”) in an average of six steps. (Taking into account that the chain had a fixed chance to break at each link, with a person not continuing the process, the actual average was closer to seven than six, since longer chains were more likely to get broken than short ones.) A statistical analysis of the results of that experiment and later ones suggested that everyone in the U. S. at the time was within three degrees of separation of a majority of the country’s population.

    * The experiment was conducted by Stanley Milgram, who is much more famous for having studied obedience to authority.

    @David Marjanović: I have always assumed that Zweifel was “two feelings,” although the final vowel does not seem regular for German. I’m afraid to look it up now, lest my private folk etymology be cooked (to use a chess term).

  18. David Marjanović says

    The final vowel doesn’t even exist, it’s an orthographic lie. 🙂 I just looked it up. The Gothic version is tweifl(s) without it.

    There is “two” in it, but the rest isn’t “feel”, it’s “fold”.

  19. Trond Engen says


    Danish tvivl, Norwegian tvil, Swedish (rare) tvivel. I think the Norwegian form must be a borrowing from Danish with simplification of the final cluster.

  20. David Marjanović says

    “Schwed. tvivel und dän. tvivl sind Entlehnungen aus dem M[ittel]n[ieder]d[eutschen].”

  21. Oh jesus christ. I just reached the point where Pyysalo showed up to explain that he had a computer program that was accurately generating IE languages from their ancestors. In 2015.
    Surely by now his program has linked Nostratic to Khoisan, and he’s just waiting for it to clear peer review before announcing he’s Solved Linguistics.
    I’m thinking of self-publishing a book on that thread. Or writing a masters thesis on it, then never publishing it, but constantly citing it.

  22. I’ve been marveling (and snarking,) but i want to offer an insight that the thread prompted in me. Maybe someone will find it useful.

    A year ago I posted about someone’s attempt to imitate an accent, and Piotr Gasiorowski gave me the term for something I’ve long felt is important. “Articulatory setting.”

    All the talk in linguistics about sound rules and how palatovelars can change is really either poor proxy or at best synecdoche for what i believe is really going on, which is that differences in articulatory setting make a host of sound changes obligatory, acting universally and simultaneously across all consonants and vowels.

    It would be interesting, i think, to give someone with a good ear a practice thumb drive with a vocabulary in, say, Polish, but limited to half the phonemes. Let them listen and repeat to perfect the articulatory setting, and then unleash them on the full vocabulary. Would it sound right?

    If so, that could have ramifications for how one attempts to reconstruct an etymon or a sound law.

  23. Diakoff is almost certainly his sock-puppet

    This allegation has been made on several forums independently (both in Russian and English).

    I have no idea whether it’s true or not, however, something about Diakoff is fishy. He purports to be a Russian from Russia, but spelling his surname with ending in -off is not possible in modern Russia (it violates regulations of the Russian Ministry of Foreign Affairs on English spelling of Russian surnames).

    So all Diakoffs today have to be descendants of Russian émigrés, Russian citizens with this surname would spell it Diakov.

  24. David Marjanović says

    differences in articulatory setting make a host of sound changes obligatory, acting universally and simultaneously across all consonants and vowels.

    I don’t think such simultaneous groups of sound changes are common, though.

  25. David Eddyshaw says

    Articulatory setting

    “Default position of a speaker’s organs of articulation when preparing to speak”, according to Wikipedia. Knowing nothing about it, it sounds a real enough thing but also something that must be hard to pin down rigorously.

    Languages (to be yet less rigorous) vary in other hard-to-pin-down dimensions too. I recall a fairly eminent scholar of Gur languages telling me that he felt Kusaal was a “swallowed” language (like Danish) in contrast to say Mooré, where it is fairly easy to pick out the words even if you don’t actually know what they mean. Hausa is even further toward the “unswallowed” end of this scale, helped along by its love of long vowels and word-demarcating glottal stops.

  26. David Eddyshaw says

    Differences in articulatory setting wouldn’t in themselves affect the mutual relationships between phonemes, but by altering phonetic differences they might be one explanation for (part of) Sapir’s “drift”, the phenomenon that languages of a common origin which are no longer in contact nevertheless may continue on a similar path of development to one another in ways which can’t just be attributed to universal tendencies.

    For example, if your articulatory setting includes a Japanese-like distaste for lip rounding, it might help drive the loss of rounding in front vowels.

  27. John Woldemar Cowan says

    differences in articulatory setting make a host of sound changes obligatory, acting universally and simultaneously across all consonants and vowels

    If we look at chain shifts, though, we don’t see this happening. For example, let’s look at a very clear case of a pull chain shift in a subset of Proto-Polynesian consonants: *t, *k, *ʔ, *n, *ŋ. Of the other consonants, *p, *m are stable; *w materializes as /w/ or /v/ (which may just be different notations for /ʋ/); *r is lost and *l preserved in Tongan and Niuean (the Tongic subfamily, which uncontroversially branched off first) and merge either as /r/ or as /l/ everywhere else, except in the Marquesan languages where they become /ʔ/; *h is lost everywhere (very common in the world’s languages) except in Tongic; *s merges with /h/ in Tongic, is preserved in Samoan and its closest relatives (probably the second to branch off), and is lost everywhere else.

    So what happens with our four “interesting” phonemes? Well, *ʔ is lost everywhere (also very common worldwide) except in Tongan. That leaves a gap, which fills as follows: *k retreats to /ʔ/ in Samoan (but not its relatives), Hawaiian, Tahitian, and South Marquesan; *t retreats to /k/ in Hawaiian and L Samoan, but not H Samoan or any other languages; *n retreats to /ŋ/ in L Samoan only; /ŋ/ denasalizes and retreats to /ʔ/ in Tahitian only (it just denasalizes to /k/ in North Marquesan). It’s also the case that what is dental /l/ in H Samoan (from either *l or *r) retreats slightly to alveolar tap /r/ in L Samoan. So far so good.

    But meanwhile, although both L Samoan and Hawaiian show the backing trend strongly in the oral stops, in Hawaiian /ŋ/ merges forward to /n/, exactly the opposite of what happens in L Samoan! What is more, the vowels /a/, /e/, /i/, /o/, and /u/ (short and long) remain absolutely stable in all Polynesian languages, completely unaffected by any of this. So while your explanation may explain some things, it is definitely not a universal explanation, because the articulatory setting can’t move in different directions and not at all simultaneously.
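The Hawaiian part of the chain shift described above can be written as a lookup table. This is a minimal sketch, with two assumptions flagged: the table covers only the consonant reflexes listed in the comment, and the test word *taŋata ('person', Hawaiian kanaka) is a standard reconstruction that does not appear in the thread itself.

```python
# Hawaiian reflexes of the "interesting" Proto-Polynesian consonants,
# restricted to the ones listed above: *t > k, *k > ʔ, *ʔ > lost, *ŋ > n.
HAWAIIAN_REFLEXES = {"t": "k", "k": "ʔ", "ʔ": "", "ŋ": "n"}


def apply_shift(proto_form, reflexes):
    """Map every segment through the table in a single pass.

    Each segment is looked up exactly once, so the output of *t > k is
    not fed back into *k > ʔ; segments not in the table are kept as-is.
    """
    return "".join(reflexes.get(segment, segment) for segment in proto_form)


print(apply_shift("taŋata", HAWAIIAN_REFLEXES))  # kanaka
```

The single pass is the point of the design: applying the rules one after another would turn every *t into /ʔ/, which is not what a chain shift does. The vowels pass through untouched, matching the observation that they stay stable.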

  28. >I don't think such simultaneous changes of sound are common though

    The classic Chicago et'nic accent differs from many other American accents at minimum in the way d, r, and short a are pronounced. To achieve that accent, i hold my mouth differently. Likewise, i have a better accent in Spanish than my comprehension, which can cause me problems. I can speak well in Spanish because i adopt a new articulatory setting, and each phoneme comes effortlessly right. I would argue that all accents are driven by a shared articulatory setting, and that dialect and language divergence is simply that process writ large across time or into a non-native language.

  29. John Woldemar Cowan says

    Five interesting phonemes, of course. I had conflated *n and *ŋ in the first draft and had to go back and fix them, but missed the number.

    Mama Bat, Papa Bat, and Baby Bat were hanging upside down in a cave. Baby Bat said, “Boy, this cave is almost completely vacant — only the four of us are present!”

    Mama Bat said, “What four? There’s only you, me, and your father.”

    Baby Bat flew into a rage. “You know perfectly well I can’t count!”

  30. JC,

    Your data trumps my anecdote and self-analysis.

    Like most muttering amateurs, I still suspect my explanation has something going for it. So I’ll try to fit my ideas to your data. And when I fail, I’ll drop it to avoid trolling.

    There may be something here which is outside the ability of historical linguistics to capture. Verbal descriptions of phonemes are probably insufficiently fine to define articulatory setting, and for all I know, a given articulatory setting might be consistent with the entire range of human phonemes, while still giving each a particular cast and suggesting most likely elisions in clusters.

    Consider my Chicago accent anecdote. It’s hard for me to understand how the Chicago accent can consist of what I would describe as several simultaneous changes to the way sounds are articulated, and yet this wouldn’t apply to proto-Polynesian dialects and languages. It seems to me more likely that there are in fact broad changes to sound qualities going on, shared by the speakers of each of these divisions, but that they aren’t captured by our tools for writing them down.

    Notably, my sense of the question is driven mostly by my experience of accents, not dialects or languages. Accents are difficult to represent in writing.

    If languages are dialects with an army, is it fair to say dialects are just accents with a police force?

    Meaning, isn’t accent just the finest gradation on the continuum, with language the broadest? Isn’t the process of dialect formation just accent differentiation over a broader range, and the process of language formation (except in cases of pidgins or the swamping of a language by L2 speakers) usually just accent formation plus time, space and isolation?

    If accent can be produced by articulatory setting, and the process of language formation is different only fractally, from accent formation, then articulatory setting may have a lot to tell us, if we could figure out a better way of measuring articulatory setting and/or its immediate impact on accent.

    I would definitely differ with one comment:
    >the articulatory setting can’t move in different directions and not at all simultaneously.

    I would think that articulatory setting is describing the standard cast of multiple facial muscles and perhaps other muscles. While perhaps it can’t move in different directions and not at all simultaneously, it’s easy for me to envision how the outcomes in terms of phonemes might register that way, especially at the coarse level at which phonemes are described.

  31. @Brett degrees of separation,…. Incidentally, what the famous “six degrees of separation” experiment* actually showed …

    IIRC was that more than half the packets never made it to their target. “In one case, 232 of the 296 letters never reached the destination.” sez wp

    The six (or seven) degrees average was only for chains that did reach their target. And this was from US Mid-West to Boston, not from Timbuctoo to Ulan-Bator.

    Milgram himself never said “six degrees …”. Nevertheless the urban myth is that “we’re all connected by six degrees” (for some value of “all”, as with some value of words for snow).
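The point that the average covers only completed chains, and that longer chains are more likely to break, can be made concrete with a small reweighting sketch. Everything numeric here is an assumption for illustration: the counts and the per-link continuation probability are invented, not Milgram's data, and a fixed continuation probability is itself a simplification.

```python
def observed_mean_length(completed_counts):
    """Plain average length over the chains that reached the target."""
    total = sum(completed_counts.values())
    return sum(length * count for length, count in completed_counts.items()) / total


def corrected_mean_length(completed_counts, p_continue):
    """Average length after undoing the bias against long chains.

    A chain of `length` links survives with probability p_continue ** length,
    so each completed chain is weighted by the inverse of that probability.
    """
    weights = {
        length: count / p_continue ** length
        for length, count in completed_counts.items()
    }
    total = sum(weights.values())
    return sum(length * weight for length, weight in weights.items()) / total


# Hypothetical counts of completed chains by length (not real data).
completed = {4: 10, 5: 15, 6: 20, 7: 12, 8: 7}

print(observed_mean_length(completed))
print(corrected_mean_length(completed, p_continue=0.75))
```

Because the weights grow with chain length, the corrected mean always comes out above the observed one whenever the continuation probability is below 1, which is the direction of the "closer to seven than six" adjustment mentioned earlier in the thread.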

  32. David Eddyshaw says

    “Multilingual Parallel Bible Corpus” just made me think (for no reason) of “Teenage Mutant Ninja Turtles.”

    I can already envisage some plot lines …
    Matteo, Marco, Luca and Giovanni, with their mentor, the aged Origen Hexapla ….

  33. Trond Engen says

    @Ryan: I don’t think “articulatory setting” can explain every change ever, but it may be important for understanding such phenomena as the many palatalizations of Slavic, the neverending process of lenition in Danish and the multiple “Grimm” shifts in Upper High German. When I first got the idea I called it “default position of the mouth” and tried to relate it to the value of the neutral (schwa) or epenthetic vowel. I wondered at the time if the Grimm shift could be related to how Proto-Germanic apparently had an u-like epenthetic vowel, but I never got anywhere with that, I think.

  34. >I don’t think “articulatory setting” can explain every change ever,

    I’m enough of a crank to initially think grandly, but yes, you’re surely right. Interesting that you’ve had some thoughts about what more circumscribed but specific effects it might have.

  35. January First-of-May says

    The six (or seven) degrees average was only for chains that did reach their target. And this was from US Mid-West to Boston, not from Timbuctoo to Ulan-Bator.

    It also required the people involved to have a good guess for the best next point on the chain, which they almost by definition didn’t have anywhere near enough knowledge to do even remotely optimally. I’m sure Milgram didn’t correct for that.

    Modern computing techniques enable figuring out what actual optimal chains (given a known connection network, at least) look like; it turns out that, depending on how common connections are, the true typical length varies between 3 and 5, and something as high as 7 is already an outlier.
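Given a known connection network, the optimal chain mentioned above is a shortest path, which breadth-first search finds directly. The sketch below uses a made-up six-person network; the names and links are hypothetical.

```python
from collections import deque


def shortest_chain(network, start, target):
    """Return one shortest chain of acquaintances from start to target.

    `network` maps each person to the people they know. Breadth-first
    search visits people in order of distance from `start`, so the first
    time the target is reached the chain is as short as possible.
    Returns None if no chain exists.
    """
    if start == target:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in network.get(person, ()):
            if friend in previous:
                continue
            previous[friend] = person
            if friend == target:
                # Walk the back-pointers to rebuild the chain.
                chain = [friend]
                while previous[chain[-1]] is not None:
                    chain.append(previous[chain[-1]])
                return chain[::-1]
            queue.append(friend)
    return None


# Hypothetical toy network (symmetric links, one isolated person).
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],
}

chain = shortest_chain(network, "A", "E")
print(chain, "steps:", len(chain) - 1)
```

The study participants had no such global view: each could only guess the best next link, which is why their chains were longer than the ones a search over the full graph would find.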

  36. Interesting; the original experiment seems to have been pretty worthless, and it was misinterpreted to create an even more worthless popular impression, but it got people thinking in useful directions.

  37. David Marjanović says

    The difference between “accent” and “dialect/language” is that “accent” refers to pronunciation only, not to grammar or vocabulary.

  38. It wasn’t worthless insofar as the results were extremely contrary to expectations. When people didn’t know the answer, they might guess it would take a hundred connections for a package to find its way to its target.

    A related observation is that, while the study participants were unlikely to find the optimal path spanning the connection graph, they also did much, much better than pure chance. There are strong cues associated with our interpersonal relationships, suggesting with reasonable accuracy what is the right general direction toward an unknown individual. The most obvious cues are just geographical, but pure geography is also not as important as you might imagine. Starting the package relatively close to the target did not reduce the number of steps by all that much.

  39. John Woldemar Cowan says

    It’s hard for me to understand how the Chicago accent can consist of what I would describe as several simultaneous changes to the way sounds are articulated

    The short-a change is part of the Northern Cities Shift. I don’t know what the consonant changes that you mention are about.

    The difference between “accent” and “dialect/language” is that “accent” refers to pronunciation only, not to grammar or vocabulary.

    In most places. But people working on North American English, where true dialects are thin on the ground (Tidewater, Newfoundlander, AAVE, and that’s about it), usually use dialect to refer to accent differences, at least synchronically. I think this is at least partly because the old dialect boundaries (which mostly reflected differences in vocabulary, not in morphosyntax) are now reduced to accent boundaries (but not vice versa in all cases) with the spread of mesolect over almost the whole continent.

  40. So if I speak in a Chicago accent and also say lookit, it’s a dialect?

    I’m wondering whether you’re unwilling to accept my twist on the concept of accent, in which case maybe i just need a new word for accent plus mild incipient differentiation of vocabulary and usage, perhaps mostly in frequency of word choice and grammatical pattern rather than outright gain or loss. Or whether you believe the process of accent differentiation is not the first mark along the continuum of dialect / language differentiation, but instead a different process.

    Accents surely consist of multiple clustered changes in the exact expression of phoneme, and the changes take place more or less simultaneously, given the time scale of accent formation. Yet you asserted before that the changes that amount to dialect or language differentiation are not multiple, so maybe you really believe they’re distinct and essentially different processes. That seems startling to me. Not immediately convincing.

    I can believe that the incremental changes of accent cross thresholds that make a linguist switch from one ipa symbol to another at different moments. But the increments exist nonetheless, and i suspect they change simultaneously across multiple phonemes.

  41. Your concept of ‘dialect’ is common in Australia, too. That’s what led to a ridiculous interview on Australian radio about Chinese dialects (on my website) that I might have linked to before.

