Ur-etyma.

Victor Mair has an extremely interesting post up at the Log:

[...]I’ve long been intrigued by the fact that the number of basic morphemes in Sinitic is roughly comparable to the number of roots in Proto-Indo-European (PIE). I wondered whether this was purely a coincidence or a reflection of some fundamental feature of language and the human brain. So I started to look at other language families to see whether they too had a similar amount of root morphemes.

As I gathered and examined data, they seemed to confirm my initial impression that the essential etyma of many languages amount to approximately 1,000-2,000, with most falling at around 1,200-1,500. Wanting to secure more precise and reliable evidence, I asked colleagues who are specialists in various fields to share their expertise.

He quotes John Huehnergard on Semitic, Philip Jones on Sumerian, Michael Witzel on Nostratic and PIE, Allan Bomhard on Nostratic, John Colarusso on Caucasian languages, and Don Ringe, J. P. Mallory, and Douglas Adams on PIE, all very interesting, and himself discusses Sinitic, concluding:

[...]I think that the fact that the quantity of basic building blocks of various languages is roughly comparable is not merely coincidental, but may have something to do with the cognitive makeup of the brain. That is to say, at the bottom limit, for a language to become an organic, functioning entity, it needs to have a sufficient amount of constituent, core etyma from which a working vocabulary may be derived. At the other end of the scale, there seems to be an upper limit to the number of primary conceptual categories that the mind is capable of processing.

It seems that, in general, there are roughly 1,200-1,500 root concepts from which all others are generated. This appears to hold for many language families. Inventories of core etyma with a magnitude that are much over 2,000 or much under 1,000 are probably the result of differing definitions of what constitutes a basic root and how the computations are carried out.

Fascinating stuff, and I look forward to the ensuing discussion!

Comments

  1. des von bladet says:

    Why do they prefer languages of which so much is unknown for this sort of work?

  2. Because it gives free rein to speculation.

  3. That is to say, at the bottom limit, for a language to become an organic, functioning entity, it needs to have a sufficient amount of constituent, core etyma from which a working vocabulary may be derived.

    What is the precise meaning of the claim that languages are “organic, functioning entities”? Are we being asked to imagine that natural languages arose by discovery/stipulation of a set of “core etyma”, then derivation of a working vocabulary, then – what, somebody pushes the “speak” button? This sounds like a rehash of the Genesis or Golem myths, i.e. a technology myth.

    I would have thought that language speakers, not languages, are organic, functioning entities. “Core etyma” would be words for “things in the world”: counting, killing, praising, mounting.

    At the other end of the scale, there seems to be an upper limit to the number of primary conceptual categories that the mind is capable of processing.

    And yet “the mind” is capable of processing any number of secondary conceptual categories, such as the ones deployed in these speculations by Mair?

  4. Why do they prefer languages of which so much is unknown for this sort of work?

    I can see an argument for considering languages which were never written down separately from languages which were, because the act of writing surely helps preserve old and obscure etyma that would otherwise have been forgotten, artificially inflating the total stock (assuming that new ones keep emerging).

  5. Stefan Holm says:

    I would be surprised if the number of roots in different languages or families could not be described by some form of probability distribution. After all, every natural language is built, not by ‘somebody’, but in a spontaneous, random process with contributions from a large number of members of the community in question.

    There are however several influencing factors: Our brains are genetically the same, i.e. the way we humans perceive the world doesn’t basically differ from one (healthy) individual to another (I’m not a Sapir-Whorfian): We see things, concrete or abstract (nomina), we see them move, change or interact (verbs) and we apply objective or affective attributes to the things or processes (adjectives, adverbs, numbers, participles).

    There is also an impact from our phoneme and morpheme systems. Among all the sounds the human speech apparatus can produce, each specific language makes use of only a limited number. Click sounds are well known but much restricted. Front rounded vowels are actually not very common on a global scale. Even combinations of phonemes and the building of morphemes are restricted: Initial spr-, str-, skr- are rare outside Gmc, as are initial combinations like nd-, ng-, mb- outside Africa south of the Sahara.

    Under such circumstances it seems like gefundenes Fressen for the Law of large numbers and the Central limit theorem to start operating. But of course you can’t compare apples and oranges. You have to decide whether you are talking about morphemes, roots, stems, etyma or whatever. (However handy they may be, words like TV, laser, and mic must for obvious reasons be sorted out.) If this can be investigated practically I expect to see a nice probability distribution curve applicable to ‘the’ human language in the future. The interval 1200-1500 could very well include the expected mean value while I dare say little or nothing about the standard deviation.

  6. One could plausibly imagine that the number of recorded words and phrases in a language, at every point in its history, correlates strongly with the number of things (physical or imaginary) the contemporary speakers had notions for, talked about and wrote down.

    So the number of words – call them etyma, stems, whatever – is simply a correlate of the number of “things” identified and dealt with in the contemporary social context. No need to drag in “the brain”, “the mind”, statistics or genetics.

  7. First, there should be some criterion for ur-ness, besides what an expert thinks– otherwise you are making hypotheses about experts rather than about languages. Given that, the law of large numbers becomes a friend; one can make predictions and look for non-generic probability distributions.

  8. First, there should be some criterion for ur-ness, besides what an expert thinks– otherwise you are making hypotheses about experts rather than about languages.

    That’s an interesting hypothesis! Mair has gone to considerable lengths to find out what the experts think. How else are “criteria for ur-ness” arrived at if not through a consensus of experts?

    In any case, I find this “cognitive makeup of the brain” business to be so vague that statistics would only confuse things further.

  9. David Marjanović says:

    I can see an argument for considering languages which were never written down separately from languages which were, because the act of writing surely helps preserve old and obscure etyma that would otherwise have been forgotten, artificially inflating the total stock (assuming that new ones keep emerging).

    …and this, I think, is what really limits the number of basic roots in a language: words for concepts that almost never come up are so rare that they are easily forgotten and/or by chance not passed on to the next generation. If there’s something you only talk about once every 20 or 30 years, it’s easier to create a word for it on the spot by derivation, compounding, metaphor or whatever than to remember a root that is unrelated to all others.

  10. Stefan Holm says:

    Mair is clearly not talking about ‘words’ but basic morphemes in Sinitic and roots in (reconstructed) PIE. The number of words is a meaningless concept since they are practically innumerable. In a language open for borrowings and/or allowing for compounds there’s really no upper limit for the number of ‘words’.

    A comparison must be made between the number of productive morphemes in a language at a specific time. For instance English ‘erable’/‘arable’ are definitely words, but each consists of two morphemes. The latter, ‘-able’, would (if a non-native has a say) count as productive in contemporary English since it is added to numerous verbs and can be added to many more.

    The first morpheme however is PIE *ar-, meaning ‘to plow’ and widespread throughout the IE family. But it can’t be said to be productive and thus counted as an English morpheme. In the Swedish lexicon you will find the cognates ärja, ‘to plow’ and the noun årder (Icelandic: arðr), ‘a (wooden) plow’. They are both of the type David mentioned as being used every 30th year or so. The morphological components are the mentioned *ar-, ‘to plow’ plus in the latter word the instrumental suffix *-tr-, i.e. an item ‘to plow with’. But neither the words nor their components can count as Swedish morphemes today.

    For some reason words beginning with ‘ar-’ or ‘er-’ were in all Gmc languages except Gothic substituted by ‘plow’, which is alive and kicking today. So ‘plow’ could be seen as a valid modern English morpheme (but hardly a PIE one).

    In this sense I see no theoretical objections against testing Mair’s hypothesis. Practically though it seems like tedious work, to put it mildly.

  11. erable

    I was about to say “not English”, but I decided to be careful and check the OED. The English descendant of *ar- turns out to be ear, a verb that is totally forgotten today. It appears in Bailey’s dictionary of 1721 but not in Johnson’s dictionary of 1755; Johnson is known to have used Bailey, so if he omitted it, it was because it was no longer current. Bailey probably copied it from one of the earlier dictionaries that he used.

    The OED1 entry of 1891 calls ear “obsolete except archaic”, and the last quotation given is from an 1855 translation of Virgil; other than that, the last live use is in 1630. Shakespeare used it figuratively in the sense of ‘tear’, applied to what a boat does to the water. Earable is also in the OED, but with no quotations since 1598.

    But it can’t be said to be productive and thus counted as an English morpheme.

    There are lots of unproductive morphemes in English, particularly in classical and neoclassical compounds. It’s true that -able is productive, but it became so only because a vast number of loanwords in -able were added to the language. Isolated morphemes like ar- in arable are often the consequences of loanwords; cranberry, for example, was borrowed whole from a Low German dialect, but it is synchronically analyzed as cran- followed by the ordinary morpheme berry, where cran- is unique.

  12. As a sidetrack on searching for English erable, I found its French google-alike érable ‘maple tree’. This, it seems, is from Latin acer id. plus a Gaulish tree suffix -abulus, whose cognate shows up in Welsh cri-afol ‘rowan tree’. However, it’s also possible that acarabulus was the Gaulish word, in which case acar- was irregularly replaced by acer- during Latinization. A third theory is that acerabul- is dissimilation from acer arbor-.

  13. Stefan,
    “Mair is clearly not talking about ‘words’ but basic morphemes in Sinitic and roots in (reconstructed) PIE.”

    In Classical Chinese “word” and “morpheme” are functionally identical as far as anyone can tell from the script.

    And interestingly, in Modern Chinese the government sets 3,000 hanzi as the number it has determined is necessary for basic literacy. Morphemes in Chinese are generally monosyllabic, so in general 1 hanzi = 1 morpheme. Obviously, for actual literacy you need much more: you need a vocabulary of the compounded forms, many, many of which are not deducible just from the combination of hanzi used.

  14. Stefan Holm says:

    John, here’s the Online Etymology Dictionary’s entry on ‘arable’.

    early 15c., “suitable for plowing” (as opposed to pasture- or wood-land), from Old French arable (12c.), from Latin arabilis, from arare “to plow,” from PIE *are- “to plow” (cognates: Greek aroun, Old Church Slavonic orja, Lithuanian ariu “to plow;” Gothic arjan, Old English erian, Middle Irish airim, Welsh arddu “to plow;” Old Norse arþr “a plow”). Replaced by late 18c. native erable, from Old English erian “to plow,” from the same PIE source.

    As for ‘cranberry’ you may be right but it’s worth mentioning that the Swedish name is ‘tranbär’. It could be a coincidence but ‘crane’ (the bird, Grus grus) is ‘trana’ with some 20 compounds beginning with ‘tran-’ (without the final ‘-a’).

  15. That has to be an error in Etymonline; I suspect the author saw an abbreviation repl. in his source, and misread it as replaced by rather than replaced. No current dictionary (I looked at m-w.com, AHD5, RHD2, Collins, ODO) lists erable; all list arable.

    Low German kraanbere, from which cranberry is borrowed, is transparently ‘crane-berry’, but cran- is not recognizable as crane to anglophones, so in English the morpheme is isolated. Other English examples are mul(berry), rasp(berry) (unrelated or only distantly related to the English word rasp ‘scraping tool’), twi(light), cob(web), luke(warm), (blather)skite, hinter(land), taff(rail). Besides borrowing, other sources are native roots that happen to survive only in compounds, dialect mixture, and misanalysis.

  16. Maybe the “by” goes with “late 18c.”, so it’s saying “By the late 18th century, it replaced native erable.”

  17. Keith: Yes, you must be right. Pretty confusing, though. I’ve sent an email to the author.

    Language Hat discussion on Vaccinium.

  18. Aaaaaand… I got a reply thanking me, and Etymonline is fixed!

  19. That was fast!

  20. David Marjanović says:

    In Classical Chinese “word” and “morpheme” are functionally identical as far as anyone can tell from the script.

    With about 3 exceptions like “butterfly”, modern Mandarin pronunciation húdié – 2 syllables, 2 characters, 1 morpheme as far as anyone can tell (…despite Classical attempts to claim that they must once have meant “male butterfly” and “female butterfly” respectively).

  21. Saith Wiktionary, “From Middle Chinese *ɣo dep (first syllable unstressed), from Old Chinese, derived from a proto-form of *kʰleːp ~ *ɦleːp, a prefixed form of the root *lep ‘wide, flat’”. Since this transformation was not understood, both 蝴 and 蝶 were entered into dictionaries separately and glossed ‘butterfly’.

    In Mark Rosenfelder’s yingzi, which devises an analogous writing system for Modern English, a similar thing happens to language (see his page for the actual character images):

    Words, perceived as compounds, might lend themselves to abbreviation. After all, why write two yingzi when one will do, especially if it unmistakably implies its partner? For instance, language would be a two-character word, each character defined only as part of this compound and used nowhere else in the language. If you’ve written lang, you must write gwidge next. You might as well just write lang and leave it at that. Ultimately of course [lang] will acquire a meaning of its own — namely language. And for consistency’s sake lexicographers might well give gwidge a meaning of its own as well — namely, language.

    Anyway, in modern Mandarin there are plenty of multisyllabic/multi-hanzi morphemes, notably 布爾什維克 Bù’ěrshíwéikè ‘Bolshevik’ and 馬提尼 mǎtíní ‘martini, lit. horse-kicks-you’ (coined by Yuen Ren Chao).

  22. “Saith Wiktionary, “From Middle Chinese *ɣo dep (first syllable unstressed), from Old Chinese, derived from a proto-form of *kʰleːp ~ *ɦleːp, a prefixed form of the root *lep ‘wide, flat’”. Since this transformation was not understood, both 蝴 and 蝶 were entered into dictionaries separately and glossed ‘butterfly’.”

    Aha. This is one of those rare, rare examples of the prefix being preserved in a recognizable form. It also looks like the main syllable is semantically related to the word for leaf, as the form of the hanzi suggests anyway.

    Mandarin has a lot of bisyllabic morphemes. One is “dongxi” which clearly has nothing semantically to do with its component hanzi. It looks like a May Fourthism.

  23. David Marjanović says:

    Aha. This is one of those rare, rare examples of the prefix being preserved in a recognizable form.

    Cool – I didn’t know there were any.

  24. P'i-kou says:

    “butterfly”

    Old Chinese has hundreds of what look like disyllabic morphemes. The most common are those called 聯綿詞 liánmiáncí, in which the two syllables have either similar (not necessarily identical) onsets or similar rimes (e.g. 參差 cēncī ‘uneven’, 逶迤 wēiyí ‘winding’). A couple dozen others (like 蝴蝶 húdié ‘butterfly’, 芙蓉 fúróng ‘lotus flower’) neither alliterate nor rhyme, but the second syllable is reconstructed to start with an l- or an r- (‘butterfly’ is something like *ga-lep e.g. in Baxter-Sagart). These are called (besides more boring names) 嵌l qiàn l ‘l-encrusted words’.

    Some of those might actually be compounds of no longer active, unattested or undetected monosyllabic morphemes, or prefix-root combinations (as Jim suggests for ‘butterfly’, of which the second component indeed seems to have cognates); but the fact that l-‘inlay’ has been active in producing di- from monosyllables (some OC l-inlaid disyllables have monosyllabic parallels; some modern descendants of OC show similar derivations) makes people think that the ‘incomplete reduplication’ was also a derivation process – i.e. words like cēncī above would have also been monomorphemic, possibly derived from an earlier monosyllable.

    The idea that a character could be just ‘half’ a morpheme seems to have annoyed people for millennia – there are some really old (bookish and perhaps also popular) reinterpretations of the component syllables of disyllabic words as self-standing elements. A funny one is 首鼠 shǒushǔ ‘hesitate’, written (in one variant) with the characters for ‘head’ and ‘mouse’, later embedded into the idiom 首鼠兩端 shǒushǔliǎngduān head-rat-two-ends ‘undecided, as a rat whose head looks two ways when emerging from its lair’.

    Then there are some early di- (and poly-?) syllabic borrowings from e.g. Persian.

  25. David Marjanović says:

    Fascinating!
