Through an interesting Language Log post (“Semen, green rice and the rate of internet decay”) by Mark Liberman, I learned about the Unihan site (it was actually mentioned in the comments to this LH post from last year, but there was so much else being discussed I didn’t even notice it). The search page allows you to search for characters by meaning or transcription, the latter in “three varieties of Chinese (Cantonese, Mandarin, and Tang), the two basic Japanese pronunciations (Japanese On, or Sino-Japanese, and Japanese Kun, or native Japanese), and Sino-Korean,” and the radical-stroke index allows you to look them up as you would in a traditional dictionary. And the results page, e.g., for ren2 ‘man(kind), people,’ gives you not only its number in the most important dictionaries, readings in the six varieties mentioned above, and definitions, but a long series of phrases using the character in both Chinese (Mandarin and Cantonese readings) and Japanese (kanji and kana).

One problem is that if you search by meaning, what you enter is treated as a string of characters rather than a word, so that entering “man” gets you 355 matches, including characters with “manifest,” “manner,” “womanly,” “command,” and so on in the definition. There’s probably a way around this, but adding spaces before and after doesn’t work.
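To make the difference concrete, here is a minimal sketch of substring matching versus whole-word matching. It is not how the Unihan search is actually implemented (I have no idea what it does internally); the four definitions and the function names are made up for illustration.

```python
import re

# Hypothetical sample glosses for illustration; not copied from the Unihan data.
DEFINITIONS = {
    "人": "man; people; mankind",
    "令": "command, order",
    "式": "style, manner, form",
    "顯": "manifest, display, evident",
}

def substring_search(term):
    """Match the term anywhere in the gloss, the behavior described above."""
    return [ch for ch, gloss in DEFINITIONS.items() if term in gloss]

def whole_word_search(term):
    """Match the term only when it stands alone as a word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return [ch for ch, gloss in DEFINITIONS.items() if pattern.search(gloss)]

print(substring_search("man"))   # every sample entry contains the letters "man"
print(whole_word_search("man"))  # only the entry where "man" is its own word
```

The word-boundary test is what adding spaces around the query was trying to approximate. Spaces alone would miss a word at the start of a definition or one followed by punctuation, which is why the sketch uses `\b` instead.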


  1. Another peculiarity is that Mandarin pronunciations are also treated as strings, so a “han” search will get you shan, zhan, and chan. An “an” search gives you 5520 matches, including ang, zhang, han, chan, etc. I’ve fished for a workaround, but none seems to work. You can cut off the final “g” by entering the tone, but you still have 1518 matches for an4.
    On the other hand, li4, one of the most common character-tone combinations, gives you 226 matches, which is about right, though about half of them don’t display on my browser.

  2. I just checked out the Unihan page. From the link you listed above, nearly all the characters from the results page are very rare and archaic. If characters like those showed up in tattoos, I’d most probably dismiss them as yet another case of Hanzi Smatter (http://www.hanzismatter.com).

  3. That’s a good example of Unicode.org’s detailed CJK files being transformed into usable information by a web frontend. I don’t know why I had never thought of that before.

  4. This brings up something I’ve wondered about for a while: is there a Chinese equivalent to kakasi, which would take a URL for, e.g., a Big5-encoded page and spit out a pinyin version?

  5. For pinyin conversion (from GB- or Unicode-encoded pages), try David Lancashire’s Adsotrans. Use the “advanced” page and select “convert to pinyin.” (Give it time. The site is acting slow today.) Capitalization at the beginnings of sentences and other such niceties are coming soon. If prompted, use “guest” for both the login name and password.
    I hope to have something similar on my own site one of these days.

