Unicode Help.

Some useful sites: Unicode character table (great layout), shapecatcher (draw your own characters), amp-what (type a description). Via MetaFilter (where people will doubtless post other links that are useful and/or fun).


  1. Richard Ishida is also a good go-to person for Unicode, especially if you want the nitty gritty of how different scripts work with Unicode: http://rishida.net/

  2. This is the most comprehensive Unicode site I’ve found:

    And why on earth does Unicode not include Cyrillic vowels with stress marks? I know it’s possible to create them using &#x301; (COMBINING ACUTE ACCENT), but few fonts give an acceptable result.
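
For anyone who wants to try the combining-accent route mentioned above, here is a minimal Python sketch. The helper name `stress` and the sample word are my own choices for illustration; how well the result renders still depends on the font, as the comment says.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def stress(vowel: str) -> str:
    """Return the vowel followed by a combining acute accent (hypothetical helper)."""
    return vowel + COMBINING_ACUTE

# "\u043e" is CYRILLIC SMALL LETTER O; the word is "замок" with stress on the "о".
word = "зам" + stress("\u043e") + "к"

print(word)
print(len(word))  # counts code points, so the accent is counted separately
print([unicodedata.name(ch) for ch in stress("\u043e")])
```

The stressed vowel stays two code points (letter plus accent), which is why it is up to the font and the rendering engine to position the accent properly.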

  3. I guess shapecatcher is a great source for riddles. Here’s one. I gave it a reasonably well-drawn character to recognize and received back a long list of possibilities, which did not include the intended character (I’m not sure it is in Unicode at all) but did have some nice variants. Partial list:
    Latin capital letter l with middle dot: Ŀ
    Reverse solidus preceding subset: (Unicode hexadecimal: 0x27c8)
    Latin small letter k with acute: ḱ
    Vai syllable la: (Unicode hexadecimal: 0xa55e)
    Canadian syllabics taa: ᑖ
    Musical symbol c clef: (Unicode hexadecimal: 0x1d121)
    Cyrillic small letter i with grave: ѝ
    Greek capital dotted lunate sigma symbol: Ͼ
    Hiragana letter ni: に

    Try to guess what the intended symbol was.
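
The entries above that are given only as hexadecimal values can be turned back into characters with a few lines of Python. This is just a sketch: `describe` is a made-up helper, and the names it prints come from whatever Unicode database ships with your Python, so an older version may not know every character.

```python
import unicodedata

# Hexadecimal values quoted in the list above.
code_points = [0x27C8, 0xA55E, 0x1D121]

def describe(cp: int) -> str:
    """Format a code point as 'U+XXXX <char> <name>' (hypothetical helper)."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<no name in this Unicode database>")
    return f"U+{cp:04X} {ch} {name}"

for cp in code_points:
    print(describe(cp))
```

Whether the characters themselves show up or appear as boxes depends on the fonts installed on your machine.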

  4. For those using Emacs, the One True Editor allows you to insert any Unicode character by hitting C-x 8 RET and then typing in the name (with tab-completion speeding that up). It’s a lot faster to do e.g. C-x 8 RET LATIN SMALL LETTER DELTA than to open a character map, click around to find what you want, and copy and paste.
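
Outside Emacs, the same lookup-by-name idea can be sketched with Python's standard `unicodedata` module. This is only an illustration of the principle, not a description of how Emacs does it.

```python
import unicodedata

# From a character name to the character, like C-x 8 RET followed by the name.
ch = unicodedata.lookup("LATIN SMALL LETTER DELTA")
print(ch, f"U+{ord(ch):04X}")

# And the reverse direction: from a character back to its name.
print(unicodedata.name("\u00e9"))
```

Unlike Emacs, `lookup` needs the full, exact name; there is no tab-completion here.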

  5. For Windows: BabelStone.

  6. Unicode: The Movie.

    Alex: Because precomposed characters are, in general, only provided in Unicode when existing character sets already had them, so that 1-1 round-trip conversions were possible. That was not the case for any existing Cyrillic character set. Exceptions were sometimes made when the letter-with-diacritic is considered a distinct letter of the alphabet and/or the language in question never had computer support before, neither of which is the case for Russian vowel letters marked for stress.

    And also for Windows, don’t forget about my Moby Latin and Whacking Latin keyboards, which only handle about 1% of Unicode, but most likely the most important 1% for people using U.S. or UK physical keyboards.
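
The point about precomposed characters in the reply to Alex can be checked with Unicode normalization. The sketch below assumes NFC (canonical composition) as a reasonable test of whether a precomposed form exists; `nfc_code_points` is a name I made up.

```python
import unicodedata

def nfc_code_points(text: str) -> list:
    """Compose text with NFC and list the resulting code points (hypothetical helper)."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]

# Latin e + combining acute: a precomposed é exists, so NFC yields one code point.
print(nfc_code_points("e\u0301"))

# Cyrillic и + combining breve: й is a distinct letter, so this also composes.
print(nfc_code_points("\u0438\u0306"))

# Cyrillic а + combining acute: no precomposed form, so it stays as two code points.
print(nfc_code_points("\u0430\u0301"))
```

So a stressed Russian vowel is still perfectly representable; it just has no single-code-point form.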

  7. Suppose you work under Windows, and only occasionally want to insert text in non-English characters – as I do, say when commenting here with letters from a European language. Then there is an easy way to do this: use the Windows on-screen keyboard.

    I mostly use a standard character set – the one provided by my physical keyboard – and can switch in various “virtual” keyboards when I need them for text in different languages.

    What I get from this is what I wanted to get. I don’t have to struggle with Unicode to get it. That’s the great advantage.

  8. Stu, what does your physical keyboard look like? Is it an American QWERTY, or a German QWERTZ?

  9. John: QWERTZ. After making those bold claims about ease of use, I am now investigating which languages are actually non-problematic, given the way I work. Russian is not one of them, but I could pretend that it is not a “European language” …

    I use UltraEdit to create blog comments outside the blog editor. This works fine with English, Spanish and French, where the codepoints I need are in the standard upper-ASCII set. When I want to copy some Russian word from another comment into the text, I must work directly in the blog editor.

    In UltraEdit with the RU on-screen keyboard, to type Russian I had to change the charset to ISO8859-3 (Latin-3) and the font to “Arial Unicode MS”. The hex mode shows each letter as a single byte – some kind of upper-ASCII mapping – so clearly I couldn’t copy this text into the blog editor.

    So what I claim boils down to this – if you work with languages with ASCII codepoints, you don’t need Unicode. Who’da thunk it?

  10. “ASCII mappings”, not “ASCII codepoints”.

  11. I never understood what that Unicode stuff was for, and how it worked. Having a qwerty keyboard on a laptop gives me a bit of a headache while writing in French. Pressing “Alt 130”, “Alt 147” or “Alt 0156” usually does not improve typing speed, especially when you are left wondering whether the one you are looking for is 140, 141, 150 or 151. So how could this improve matters (given that I can’t install new software on that computer)?
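
The Alt codes in the comment above are byte values in legacy code pages, which is why they are so hard to remember. As a rough sketch, assuming the commonly described Windows behaviour (no leading zero means the OEM code page, a leading zero means the ANSI code page) and assuming code pages 437 and 1252, Python can show what each number stands for. The function `alt_code` is hypothetical, and the actual code pages depend on the system locale.

```python
def alt_code(digits: str, oem: str = "cp437", ansi: str = "cp1252") -> str:
    """Decode an Alt+numpad code as a single byte in a legacy code page.

    Assumption: a leading zero selects the ANSI code page, otherwise the OEM one.
    """
    codec = ansi if digits.startswith("0") else oem
    return bytes([int(digits)]).decode(codec)

for code in ["130", "147", "0156", "140", "141", "150", "151"]:
    print(f"Alt {code} -> {alt_code(code)}")
```

Under these assumptions the French letters are scattered around the table in no memorable order, which matches the headache described.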

  12. Stu, I’d like to develop a QWERTZ version of my keyboard driver. If you’re interested in beta-testing such a thing (it supports vast quantities of Latin-script letters, lots of symbols, and math-Greek, but not Cyrillic yet), drop me a note at cowan@ccil.org.

    UltraEdit supports Unicode. If you set the character set to UTF-8 or UTF-16, you can represent all characters. You can install a Russian keyboard driver (I use Russian Phonetic YaWERT) and then type Russian as well, switching keyboard drivers using the Windows Language Bar.

    Siganus: Yeah, if you can’t install a better keyboard driver, you are out of luck.
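
To make the difference between single-byte character sets and the Unicode encodings concrete, here is a small Python sketch. The sample text and the helper `can_encode` are mine, and I use ISO 8859-5 as an example of a single-byte Cyrillic character set; that is an assumption, not something taken from the comments above.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

text = "déjà vu, привет"  # French and Russian in one string

for codec in ("latin-1", "iso8859-5", "utf-8", "utf-16"):
    print(f"{codec:10} {can_encode(text, codec)}")

# Each single-byte set copes with its own language only.
print(can_encode("déjà vu", "latin-1"), can_encode("привет", "iso8859-5"))
```

The single-byte sets each handle one language, while UTF-8 and UTF-16 handle the mixed string, which is the practical reason to switch the editor to one of them.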

  13. Sig: Unicode is merely a system in which (binary) numbers are assigned to “glyphs”. A glyph is a letter or symbol in a writing/printing system.

    A computer “text file” is a sequence of bytes, i.e. binary numbers, stored on a medium. A display program (such as Word in Windows) reads those numbers from the medium and presents the corresponding sequence of glyphs on your monitor (another medium).

    That’s the basic principle. Unicode is a convention for translating back and forth between numerical and visual representations of letters.
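
The back-and-forth between numbers and letters described above can be tried directly in Python. This is a minimal sketch assuming UTF-8 as the way the numbers are stored as bytes; `utf8_hex` is a name I made up.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as spaced hexadecimal (hypothetical helper)."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

for ch in "Aé€":
    number = ord(ch)            # letter -> number (the code point)
    assert chr(number) == ch    # number -> letter
    print(ch, f"U+{number:04X}", number, "| bytes in a UTF-8 file:", utf8_hex(ch))
```

The number assigned to a letter is fixed by Unicode; the bytes in the file depend on the encoding, and the shape on the screen depends on the font.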

  14. The Unicode idea is extremely old. You find it in gematria, the Hebrew descendant of Assyro-Babylonian numerology. According to the German WiPe, gematria is based on the fact that special numeric symbols were a later addition to writing systems using letters. Before the numeric symbols were invented, already existing letters were used to represent numbers.

  15. I mention the German WiPe on Gematrie because the English one says nothing about the prior use of letters to represent numbers as a “hack” due to the absence of special numeric symbols. The English article rushes right into Rabbinic and Kabbalistic hermeneutics. If learned Jews had not been gobbled up by all that silliness, they might have found time to invent Word for Windows before the Baby Jesus burst on the world.

  16. Thanks Stu, but I’m left wondering what practical use that whole thing might have if you are not a programmer, i.e. for a layman like me. (Incidentally, I loved the gematria “games” in Potok’s novel The Chosen.)

  17. Sig, when you drive a car and it just stops, rudimentary knowledge of how a car works helps you to identify whether you’ve merely run out of gas, or need to contact a car mechanic.

    Knowing the Unicode principle should help you to identify certain problems on your computer as Unicode/keyboard/font mismatch problems, so that you know to contact a Unicode mechanic to fix them.

  18. Siganus Sutor says

    Stu, if there were funny signs suddenly appearing on my computer screen, like skulls and bones, smileys or ampersands, I would certainly not start to unscrew the back of the damn machine to feed it some unicode from a character table that might or might not put it back on the right track. I would certainly leave it to mechanics and their greasy hands!

  19. Well, even for people with no interest in Unicode as a coded character set, Unicode as a vast repertoire of characters can still be compelling. Go to the code charts and check out the stark angularity of Old South Arabian, the Greekness in disguise of Gothic, the still-mysterious pictographs of the Phaistos Disc, the bald heads of Oriya, the whorls of Saurashtra, the misleading familiarity of Cherokee, the Braille dots, the Yijing (I Ching) hexagrams, the dingbats, the emoticons. I can admire as well as anybody the powerful sweep of mighty generalizations in physics or real analysis that bring skrillions of separate examples under their control, explaining much with little. But my heart is given to the complicated domains of learning, the natural numbers and discrete mathematics generally, natural and constructed languages in all their diversity, writing systems.

  20. John, you are right, the possibilities are mind-boggling. By lifting your eyes to the top of this page maybe you could have also taken Steve’s banner into account: we could also write in cuneiform thanks to Unicode:
    Now how one would physically do that here without a proper calamus remains a mystery to me. Could it simply be by typing U12038 or U1203A here?

  21. No, it’s not. Unicode is even more mysterious than I thought. Maybe it comes from Rapa Nui as well.

  22. U12038 or U1203A

    That’s almost right. In fact, we must write 𒀸 or 𒀺 respectively to produce 𒀸 and 𒀺. In that way, the cuneiform characters become part of this very comment. If you try this yourself, don’t forget the semicolon at the end of each.

    Now whether these characters actually appear to you or me as cuneiform characters, outlined blocks, little groups of numbers, or “last resort” glyphs depends on what fonts we have installed on our computers, and not at all on what Steve or his blog software does. If you see something other than cuneiform, you can install a proper cuneiform font and then redisplay this page, and the Right Thing will appear. (Some older operating systems may not be able to handle characters with five-digit Unicodes such as these, however.)

    Maybe it comes from Rapa Nui as well

    “Nay,” said I, “I come not from heaven, but from Essex.”
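
The remark about characters with five-digit code points can be illustrated in Python. One plausible reason for the trouble (my assumption; the comment does not say) is that such characters do not fit in a single 16-bit unit and need a surrogate pair in UTF-16. `utf16_units` is a hypothetical helper.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units of a character in UTF-16, as hex strings."""
    data = ch.encode("utf-16-be")
    return [f"{int.from_bytes(data[i:i + 2], 'big'):04X}" for i in range(0, len(data), 2)]

print(utf16_units("\u00e9"))       # a four-digit code point: one unit
print(utf16_units("\U00012038"))   # the cuneiform sign above: two units
```

Software that assumes one 16-bit unit per character would mishandle the second case.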

  23. Arrgh, I made a Balls of it. We must write 𒀸 and 𒀺.

  24. Now this is going too far. We must write (but without spaces) & # x 12038 ; and & # x 1203A ;.
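
For the curious, the numeric character references spelled out above can be checked with Python's standard `html` module. A minimal sketch; `ncr` is a made-up helper that goes the other way, from a character to its reference.

```python
import html

def ncr(ch: str) -> str:
    """Build a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"

for ref in ("&#x12038;", "&#x1203A;"):
    ch = html.unescape(ref)          # reference -> character
    print(ref, "->", f"U+{ord(ch):04X}")
    assert ncr(ch) == ref            # character -> reference
```

A browser does the same decoding when it displays the comment, which is why typing the reference into a comment box makes it vanish and the character appear in its place.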

  25. In which case, perhaps the Cuneiform Digital Library Initiative may be of interest.
