THE BIGGEST DICTIONARY.

Victor Mair at Language Log has posted about a new dictionary that ups the ante in the East Asian contest “to see who can produce a dictionary with the most entries”:

The Koreans at Dankook University have just pulled off the amazing feat of compiling a dictionary that has outstripped anything yet generated by the Japanese or the Chinese themselves. After 30 years of labor and investing more than 31,000,000,000 KRW (equal to more than 25 million USD), the South Koreans have just published the Chinese-Korean Unabridged Dictionary in 16 volumes. This humongous lexicon contains nearly half a million entries composed of 55,000 different characters.

Which is interesting in itself, but I’m linking to the entry for Victor’s discussion of why “there will never be an end to the compilation of ever larger single character dictionaries, since the Chinese writing system is essentially open-ended” and why it’s pointless to try to accumulate as many characters as possible: “most of the characters in these mega-dictionaries can only be attested as having occurred once in history, and that often in lexicons of obscure characters!” There’s a very interesting graph of “number of characters” versus “rate of coverage” that shows that 6,600 characters cover 99.999% of what’s found in actual text, which means a massive compilation like the Zhonghua zihai 中華字海, with over 85,500 different characters, is an exercise in overkill.

Comments

  1. I was interested in this paragraph from one of Victor’s links (http://www.universityworldnews.com/article.php?story=2008012509544767):
    Despite the title, the Dictionary of Chinese Characters in Korean Usage covers the entirety of the Chinese character-using sphere, including the various Chinese languages spoken in China and Taiwan, as well as the Chinese characters used in Japanese and Korean, where they are known as kanji and hanja respectively, and those found in pre-1950s Vietnamese texts.
    This suggests that this dictionary contains not only Chinese characters, but also Vietnamese chu nom. It would be interesting to know if this is the case. Expanding your set of “Chinese characters” to include chu nom would be a nice way to help ratchet up a bigger score than everyone else.

  2. John Emerson says

    An interesting case of a rare character is the first character in the given name of the XXc author Liang Shu-ming / Sou-ming. In what I’ve read, no one seems quite clear which pronunciation is correct.
    In my own casual leafing through dictionaries, a lot of the once-used characters seem to be geographical names or other proper names, the rough equivalent of such English words as “Seattle” or “Owyhee”.
    Google tells me that I made more or less this same comment once before.

  4. Crown, A.J.P. says

    At the risk of belaboring it I want to alert readers to the very good title Language came up with for this post.

  5. When I worked in the editorial department of a large accounting company we referred to Webster’s Third New International as “the Big Dic.”

  6. Overkill until you run up against a character or a combination that isn’t in the 字海, the 中文大字典, Ueda’s 大辞典, or the 국어대사전 [国語大辞典].

  7. Sure, there needs to be a massive repository like the OED, but there only needs to be one—for dictionary makers to compete in offering the greatest possible numbers of useless characters is kind of silly.

  8. John Emerson says

    Fortunately, Hat’s Luddite negativism is not universally shared. 120,000 characters is possible — The Hadron Supercollider Dictionary! The human spirit ever rises above something something something. That’s why we’re approaching Dow 36,000.

  10. komfo,amonan says

    Does it make sense to have two separate dictionaries, one of nonce characters and one of all the others? The former for commercial sale, the latter put up on the web by some foundation?

  11. You got your former and latter reversed, I think, but more importantly, I suspect a lot of scholars would still want a physical book. What happens if there’s a power outage?

  12. Crown (butting in) says

    The same as what happens if there’s a power outage with a paper book: everyone takes a nap until it’s repaired and the lights come back on.

  13. John Emerson says

    If Bill Gates gave me a million dollars I’d use it to produce an online deluxe Chinese reference sorted historically, with special sorts for Buddhist Chinese, geographical names historically sorted, alternate names for famous people, calendars of the various dynasties, government terminology of the various dynasties, genealogies, alternative graphic forms, and so on.
    As far as I know, the phonetic/phonemic representation of earlier Chinese is still undetermined. At one time I had at least four reconstructions each of the Book of Odes period and of the Tang period. Everyone knows Karlgren isn’t quite right, but there’s no consensus replacement.

  15. komfo,amonan says

    (Oof, I did reverse my former and latter; thanks, Hat.)

  16. John Emerson says

    Actually, it would cost more than a million. Fifty million please, Bill.
    And yes, the primary sort would leave out the nonce characters or list them as variants under the head of the common character.
    In Chinese, though, there’s always been the possibility that a poet or scholar (or the father of a famous person) will write something which becomes famous about some obscure event at an obscure location with a nonce name.
    I’ve read that the San Guo Wei Cao dynasty deliberately gave its princes nonce names, in order to avoid the inconveniences required by the Chinese naming taboos. The first official Wei emperor was named Cao Bi or Cao Pi and the question of which may be undecidable. The first two Han Emperors were named “Xuan” (= “dark”) and “Min” (= “people”), both very common words, and as a result, the most common texts of the Tao Te Ching (Daode Jing) are garbled with substitutes.
