A Proceedings of the National Academy of Sciences paper by Steven T. Piantadosi, Harry Tily, and Edward Gibson called “Word lengths are optimized for efficient communication” (pdf) proposes that “average information content is a much better predictor of word length than frequency.” You can read a summary of their findings, along with some background, here; it’s interesting stuff (“The research results held across all but one of the languages studied: Czech, Dutch, English, French, German, Italian, Portuguese, Romanian, Spanish and Swedish, with German being the outlier”). What bothers me is the meaning of Zipf’s Law. The linked article describes it as saying “word length is primarily determined by frequency of use” (which the NSF piece summarizes as “short words are used more than long ones”), but the Wikipedia page on the law doesn’t mention length at all, saying “Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.” Can anyone unravel this for those of us who have forgotten most of what statistical theory we ever knew? (Thanks, Hans!)
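For anyone who would like to see the rank-frequency version concretely, here is a quick sketch (corpus.txt is just a placeholder for any large plain-text file you have lying around). It looks only at frequency and rank; word length never enters the calculation, which is exactly the puzzle, since the frequency-length claim is a separate one.

```python
# Quick empirical look at Zipf's rank-frequency law: sort words by frequency
# and check whether rank * frequency stays roughly constant, as the law
# predicts. Word length plays no role here at all.
from collections import Counter
import re

with open("corpus.txt", encoding="utf-8") as f:   # placeholder: any large plain-text file
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    # Under Zipf's law, rank * freq should hover around the same value down the list.
    print(f"{rank:>3}  {word:<15} freq={freq:<8} rank*freq={rank * freq}")
```

If the law holds for your corpus, the last column stays in the same ballpark from the top of the list on down.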
Update (August 2017). See now Unzipping Zipf’s Law: Solution to a century-old linguistic problem; thanks, Kobi!
In statistics and economics, the definition you gave is the standard one. Perhaps Zipf wrote about frequency and word length as well, but the frequency/rank distribution is the well-known Zipf’s Law.
The late great Hugh Kenner, who was trained as a mathematician, wrote about Zipf’s Law in his studies of James Joyce, and so far as I’m aware he says nothing about word length. Anyway, I can think of a 12-letter noun beginning with M (and its 13-letter adjective form) that you hear all the time on certain comedy channels.
The Psycho-Biology of Language makes both claims: “the greater the frequency, the shorter the word,” and the inverse square relationship between frequency and the number of different words.
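For what it’s worth, the “inverse square” relation is usually written out like this, with f(r) the frequency of the r-th most frequent word and N(f) the number of distinct words occurring exactly f times:

```latex
% Zipf's rank-frequency law and the "inverse square" (number-frequency) law,
% as usually stated.
\[
  f(r) \;\propto\; \frac{1}{r},
  \qquad
  N(f) \;\propto\; \frac{1}{f^{2}}
\]
```

The second relation roughly follows from the first, so they are really the same curve sliced two ways; the frequency-length claim, by contrast, is logically independent of both.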
On page 27 of Manning & Schütze’s Foundations of Statistical Natural Language Processing we find the following:
Zipf’s Human Behavior and the Principle of Least Effort lays out his entire framework.
Zipf had many laws, it appears. Thanks for sharing this with your readers; I find their use of “predictability” as a stand-in for information density insightful but wonder if others agree.
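If it helps to see what “predictability” is standing in for: a word’s average information content is its average surprisal, -log P(word | context), over the contexts it actually occurs in. Here is a toy version of that idea; it is not the paper’s actual procedure (which uses much larger n-gram contexts and corpora), and the bigram context and function name are just for illustration.

```python
# Toy stand-in for "average information content": the mean surprisal of each
# word given only the immediately preceding word. Far cruder than the paper's
# n-gram models, but it shows how the quantity differs from raw frequency.
import math
from collections import Counter, defaultdict

def average_information(words):
    bigrams = Counter(zip(words, words[1:]))   # (previous word, word) counts
    context_totals = Counter(words[:-1])       # how often each word serves as a context
    surprisals = defaultdict(list)
    for (prev, w), n in bigrams.items():
        p = n / context_totals[prev]           # estimated P(w | prev)
        surprisals[w].extend([-math.log2(p)] * n)
    # Average over all occurrences of each word
    return {w: sum(v) / len(v) for w, v in surprisals.items()}

toy = "the cat sat on the mat and the cat sat on the hat".split()
print(average_information(toy))
```

Two words with the same raw frequency can come out quite differently here, which is the point: a word that always turns up in highly predictable slots carries less information per use than one that appears in many different contexts.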
Isn’t this selection of languages biased toward Indo-European? I wonder if the results hold for a wider selection.
Rather obviously IE-biased. There’s no real excuse, other than that they really wanted to confirm their initial hypothesis. Because if you tried something similar on Mandarin or Vietnamese you’d get really boring results, and if you tried it on, say, Plains Cree you’d get really strange results. But I suppose controlling for morphological parameters would be too difficult, particularly given that linguists (morphologists even!) don’t really understand the parameters involved.
@James C.: Your comment seems needlessly harsh. Demonstrating that this effect exists for any languages is already an interesting result. There is obviously room for future work, but there is absolutely nothing wrong with that.
Also, it seems much more difficult to do something similar for Mandarin or Vietnamese, since in those languages the main orthographic unit is the syllable rather than the word, and impossible to do something similar for Plains Cree, since their approach depends on huge corpora.
I wonder how Standard German compares to most dialects. Like most, mine has a lot of syncope and apocope… it’s probably at the extreme end, actually: zusammen “together” is one syllable, unstressed die and zu are just a lone voiceless consonant each, Badewanne has two syllables…
That may be a reason why we so often switch to Standard for emphasis.
I wonder if there’s a relationship between the distinctiveness of Mandarin morphemes and their frequency or information content. I understand it has lots of homophones; could it be that the number of morphemes assigned to a given syllable is correlated with some metric of the morphemes? i.e. some syllables would be reserved for the most frequent or informative morphemes.
Let’s not overthink things: surely this theory, if true, applies to the spoken, natural language, not written language per se.
Mandarin has plenty of multisyllabic words, and each syllable has between one and five phonemic components (tone being neutralized in some syllables), leaving room for the variation this theory seeks to make sense of. Whether it does make any sense is another matter…
There’s still the trouble of defining what a word is.
Badevannet is bathwater in Norwegian.
Updated (see above).