Unicode is Kind of Insane.

Ben Frederickson has a post describing some of the weirdnesses of Unicode (“Things that seem like they should be very simple are often deceptively complicated when dealing with Unicode strings … Unicode also has lots of different characters that are visually identical to one another”) but ending with this upbeat conclusion:

Every change to Unicode has been a rational change by intelligent hard working people. While I can make fun of the poop emoji being included in the Unicode standard, it was the end result of a smart strategic decision by engineers at Google. Now that emoji are included in the Unicode standard, we have the rational follow on decision of supporting racial hints for people in emoji. Likewise by supporting emoji like a piece of pizza, the Unicode consortium has to now make the tough calls on including hot dogs and tacos in the next version of the standard while also excluding hoagies. Even having visually identical characters with different code points was a deliberate design decision – it’s necessary for lossless conversion to and from legacy character encodings.

Unicode is crazy complicated, but that is because of the crazy ambition it has in representing all of human language, not because of any deficiency in the standard itself. Human language is a complicated messy business, and Unicode has to be equally complicated to represent it. Thankfully we have people writing those long standards on how to display bidirectional strings appropriately, or sort strings, or the security implications of all this – so that the rest of us don’t have to think about it and just use standard library code to handle instead.

I’m deeply grateful for the existence of Unicode, and equally grateful that I don’t have to understand how it works, but I figure those with more understanding of coding than I (which is a very low bar) might find it interesting and/or have something to say about it.

Comments

  1. I liked this article too. Unicode is kind of like Eve Online in a way — people love the tales of mind-boggling intrigue within its domain, even if they have no desire to wade into the fray themselves.

  2. Bathrobe says

    The problems Unicode has with Chinese characters are well known. For instance, the simplified Chinese character 直 and the Japanese kanji 直 both occupy the same codepoint, even though their standard appearances are quite distinct. However, I doubt that anyone looking at this comment can see any difference, and therein lies the problem: how to get webpages to render them correctly.

    At one time I thought you just needed to stipulate the font, i.e., a Chinese or Japanese font respectively, but that doesn’t appear to work. While doing a rather inconsequential post at Bones of the Living, Bones of the Dead, I recently discovered that there is (or appears to be) a way of forcing the browser to show the correct form: use the language tag.

    Thus, using ⟨span lang="zh-Hans"⟩ (Simplified Chinese), ⟨span lang="zh-Hant"⟩ (Traditional Chinese), and ⟨span lang="ja"⟩ (Japanese) should make the characters look like they are supposed to look.

    I don’t know if that will work here but I’ll try it:

    直 直 直.

  3. Bathrobe says

    Well, that didn’t work!

    The language tags are ⟨span lang="zh-Hans"⟩, ⟨span lang="zh-Hant"⟩ and ⟨span lang="ja"⟩. Not sure why it doesn’t work here to differentiate 直, 直, and 直.

  4. It’s an excellent article. I’ll explain the ! (exclamation point) vs. ǃ (alveolar click, formerly called ‘retroflex’) distinction. Suppose you are looking at the IPA for the Xhosa word iqanda ‘egg’, which is [iǃanda]. If you double-click on the string, you’ll see that it selects the whole “iǃanda”, because ǃ counts as a letter. If you write it “i!anda”, though, then double-clicking will give you just “anda” (or “i”), because ! is not a letter. The former behavior is what you want, and the only way to achieve it is to be sure not to use a punctuation mark as a letter, but to have a separate letter character.
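
    A quick way to see the distinction is to ask the Unicode character database for each character’s general category; below is a minimal Python sketch (standard library only, so nothing here depends on any particular editor’s double-click behavior):

        import re
        import unicodedata

        exclam = "!"       # U+0021 EXCLAMATION MARK
        click = "\u01C3"   # U+01C3 LATIN LETTER RETROFLEX CLICK

        print(unicodedata.category(exclam))  # 'Po' -- punctuation, other
        print(unicodedata.category(click))   # 'Lo' -- letter, other

        # Word matching, roughly what double-click selection approximates:
        print(re.findall(r"\w+", "i\u01C3anda"))  # ['iǃanda'] -- treated as one word
        print(re.findall(r"\w+", "i!anda"))       # ['i', 'anda'] -- split at the punctuation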

  5. Bathrobe says

    Sorry for messing up your blog, Hat. This is my final try:

    直, 直, 直.

  6. Bathrobe, it looks like the blog strips out the span tags, but even in the “Bones of the Living, Bones of the Dead” post, which has the spans and lang attributes, the difference doesn’t show up in my Chrome on Linux. (Also LH may want to fix the link, which is missing the “http://”.)

  7. I have a peeve. Typically, similar looking characters in different scripts get their own code points, say A in Latin and Cyrillic. That gives font designers the freedom to have distinct-looking characters for the two scripts. θ, however, occupies a single code point, shared by both Greek and IPA. I understand why the basic Latin alphabet doesn’t get its own IPA-usage duplicate: we take IPA to be basically an extension of Latin. But a single font containing both Latin and Greek will be forced to reconcile the styles of the two scripts, to keep the θ from looking odd in either an IPA or a Greek context.

  8. Bathrobe says

    even in the “Bones of the Living, Bones of the Dead” post, which has the spans and lang attributes, the difference doesn’t show up in my Chrome on Linux

    Sigh, back to the drawing board…

  9. Y, it looks like that’s also true for β and χ, but oddly not for ɣ (which is distinct from the Greek γ) or ɸ (which is distinct from Greek ϕ and φ).

  10. cardinal gaius sextus von bladet says

    It is probably possible to read the history of Unicode’s Han unification programme as something other than a bunch of enlightened westerners helpfully imposing “reform” on the backward natives, who mysteriously seem not to be particularly grateful.

    Recent Unicode histories tend to underplay that there really was an explicit intent to squish everything into 16 bits (which was the main driver behind unifications generally) – these days Microsoft and Java are routinely mocked for hardcoding that obsolete intent, but the intent really was there.

    It’s still, warts and all, a tremendous boon to multilingual text management, obviously. I had occasional brief encounters with the bad old ways, and they really were pretty bad.

  11. dainichi says

    What annoys me the most about unicode is that some characters show up as boxes when reading some languages in some browsers. In the 21st century, I would expect the browser to automatically download characters if they’re missing, or maybe give me a dialog box asking me if I want to. Or at the very least give me easy-to-follow steps explaining how to get the characters to show. “This page uses unicode, you might see some characters as boxes”… “Indeed, I do. Now what do I do next?”

  12. Athel Cornish-Bowden says

    Something I found frustrating before I started using LaTeX for serious work is that most typefaces and word-processors make no distinction between an en dash and a minus, or even, much worse, between a hyphen and a minus sign. In LaTeX, em dash, en dash, hyphen and minus are all different: en dash and minus are the same length, but en dash is slightly thicker. However, LaTeX has one fault (probably others) that I find irritating: there is no command \omicron, as Greek ο and Roman o are regarded as identical. Maybe they are (they certainly look identical in the typeface I’m using to write this), but it’s still nice to be able to indicate which you mean.
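
    For what it is worth, Unicode itself does keep these marks apart as separate codepoints, whatever a given typeface or word-processor then does with them; a small Python listing (standard library only):

        import unicodedata

        for ch in "\u002D\u2010\u2013\u2014\u2212":
            print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

        # U+002D  HYPHEN-MINUS
        # U+2010  HYPHEN
        # U+2013  EN DASH
        # U+2014  EM DASH
        # U+2212  MINUS SIGN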

  13. Yeah, Han unification was a total disaster. Almost as bad is the fact that, as Bathrobe demonstrates, few software makers are interested in implementing the workarounds necessary to make the results at least acceptable to readers. But you can’t really blame the software makers: the whole point of Unicode is that you don’t have to account for a bunch of edge cases arising in languages you don’t understand.

  14. John Roth says

    There are other strange things lurking in the history. The UTF-8 encoding, for example, came in from Plan 9.

  15. dainichi, when you get characters as boxes it usually means that the font doesn’t have a glyph associated with that codepoint. That can mean that the font creator didn’t bother drawing that particular character (drawing all of the characters in Unicode is a huge job) or that there is something funny with language/character-set encodings going on in the text, the application, or the operating system.
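
    If you want to check for yourself whether a particular font file covers a codepoint, something like the following works; this is only a sketch, assuming the third-party fontTools package is installed, and "SomeFont.ttf" is a stand-in for whatever font you actually care about:

        from fontTools.ttLib import TTFont  # pip install fonttools

        font = TTFont("SomeFont.ttf")        # hypothetical path; substitute a real font file
        cmap = font["cmap"].getBestCmap()    # maps codepoints to glyph names

        for ch in "A直ǃ":
            status = "covered" if ord(ch) in cmap else "missing (would show as a box or a fallback glyph)"
            print(f"U+{ord(ch):04X}: {status}")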

  16. Sven, well, it really is the browser developers’ job (or that of the OS developers) to ensure that *something* useful displays if a given font doesn’t have anything for a given codepoint, by finding a glyph for that codepoint in another font.

    The browsers are getting better at this, but for a long time they signally failed to display anything useful if the font specified by the web page author didn’t have the corresponding glyph. This is a well-understood, common event, not an obscure edge case. Most fonts do not have anything close to full Unicode coverage, nor should they.

    If the browser or another app just displays a box, that app is failing at its job of displaying the information given.

  17. Aidan Kehoe, those are all good points, and I agree with you.

  18. total disaster

    Not. Japanese readers are more picky about the details of their ideographs, it’s true. But Unicode unification (of which Han unification is a large subset) is about bare legibility in plain text. Polish a-kreska, properly typeset, uses a shorter and stubbier accent than Spanish a-acute. But Unicode unifies them as á, because either version is legible to users of the other. To claim that Han unification seriously breaks legibility is to claim that Han-character users can’t read the text on their tablets and smartphones, all of which use Unicode. (For that matter, Chinese text in a Japanese context has traditionally used Japanese character forms, not Chinese.)

    shared by both Greek and IPA

    That may have been a mistake, but it also doesn’t impact bare legibility in plain text. See also Nick Nicholas’s writeup (his Greek Unicode pages are excellent on all aspects of Greek in Unicode). In any case, it would be too hard to change now.

    bunch of enlightened westerners

    For “Westerners” read “Easterners”. The details of Han unification have always been in the hands of the ISO representatives of the Han-using countries, specifically the Ideographic Rapporteur Group. Enlightenment I will not debate, except to say that chopping wood and carrying water is very necessary in either case: it is not mere bureaucratic delay that makes it take about two years for a character or group of characters to make it into Unicode.

    squish everything into 16 bits

    That was abandoned in 1996, by which time the modern architecture of Unicode was set. Discussions were ongoing for several years before that. Unfortunately, Java 1.0 was published in 1995. In any case the 16-bit design was going to bump up against the reality of Han characters for names sooner or later, and it was the Han-using countries that pointed out how necessary an extension of the 16-bit space was going to be.
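
    To make the 16-bit point concrete: any Han character outside the original 16-bit range takes two 16-bit code units (a surrogate pair) in UTF-16, which is the extension mechanism those 16-bit-era designs ended up carrying. A minimal Python illustration:

        import unicodedata

        ch = "\U00020000"  # first character of CJK Unified Ideographs Extension B, beyond U+FFFF

        print(unicodedata.name(ch))         # CJK UNIFIED IDEOGRAPH-20000
        print(len(ch.encode("utf-16-le")))  # 4 bytes, i.e. a surrogate pair of two 16-bit units
        print(len(ch.encode("utf-8")))      # 4 bytes in UTF-8 as well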

    there is no command \omicron

    Put “\newcommand{\omicron}{o}” somewhere at the top of your document. You can do analogous things for the other Greek/Latin unifications in TeX.

  19. SFReader says

    Unicode still lacks support for Tangut and Khitan scripts.

    I blame this delay on a conspiracy of “Han-using countries”…

  20. David Marjanović says

    While doing a rather inconsequential post at Bones of the Living, Bones of the Dead, I recently discovered that there is (or appears to be) a way of forcing the browser to show the correct form: use the language tag.

    …The character that shows up in your comments here, 直, doesn’t occur in your post; I can’t find it, and neither can Ctrl+F. I do see three different characters for “bone” in that post, but you say there are only two… the simplified and the Japanese version look like the pictures you provide, but the traditional version has 冫 (as in 习) instead of 二 inside the 冂. And interestingly, copying & pasting 骨 here makes it show up in the Japanese form in the comment window, while I was expecting the simplified form. ~:-|

    [iǃanda]

    Actually [ik͡ǃanda], in that you’re supposed to spell out that the click is voiceless (compare Xhosa gq [g͡ǃ], nq [ŋ͡ǃ]) and that its rear closure is velar (rather than uvular). But, anyway, there are languages that use Africanist click symbols – all separately encoded as letters in Unicode: ǀǁǂǃ – in their orthographies.

    Historically, BTW, ǃ is ǀ with an underdot that means “retroflex”; it’s wholly unrelated to the exclamation mark.

    Y, it looks like that’s also true for β and χ, but oddly not for ɣ (which is distinct from the Greek γ) or ɸ (which is distinct from Greek and ϕ and φ).

    It may be a factor here that ɣ and ɛ (distinct from Greek ε) are used in the orthographies of several languages in West Africa. (Capital letters: Ɣ, Ɛ.) ɸ is not, though.

    Polish a-kreska

    O, not A; Polish hasn’t had á in centuries.

  21. Bathrobe says

    No, I only used the example of 骨 at that particular page.

    Your experience with 骨 (with 冫) alerted me to another problem. The Traditional Character set doesn’t seem to be rendering correctly after all on my computer. For 骨 it gives the Japanese version. For 直 it gives the Simplified version. So Simplified and Japanese are rendering correctly, but Traditional isn’t working properly.

    I’m not sure of the determining factor in deciding what you get on your computer. The OS? The browser (all my browsers give the same rendering)? User preferences?

  22. But Unicode unification (of which Han unification is a large subset) is about bare legibility in plain text.

    I’ll grant that they achieved that goal, but that goal itself conflicts with the stated Unicode goal of nonambiguity if interpreted strictly — and if we’re not going to be interpreting things strictly, why are we bothering with an enormous, exhaustively documented standard at all?

    If bare legibility in plain text is the only goal, then there’s no reason to have a click character or separate capital Ks for Greek and Cyrillic. Similar to your accented a example, originally there was no s-with-comma or t-with-comma for Romanian – but they were added later. Clearly there are concerns other than “Can a human understand it?” in play. If those concerns are more pressing in the case of Romanian and Xhosa than Chinese and Japanese, fine, but let’s not pretend that they didn’t exist and that a single majestic standard was impartially applied to all the scripts of the world.
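
    For the record, the Romanian comma-below letters did end up with their own codepoints, distinct from the older cedilla forms; a quick check of the character names in Python (standard library):

        import unicodedata

        for ch in "\u015F\u0219\u0163\u021B":
            print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

        # U+015F  LATIN SMALL LETTER S WITH CEDILLA
        # U+0219  LATIN SMALL LETTER S WITH COMMA BELOW
        # U+0163  LATIN SMALL LETTER T WITH CEDILLA
        # U+021B  LATIN SMALL LETTER T WITH COMMA BELOW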

    (For that matter, Chinese text in a Japanese context has traditionally used Japanese character forms, not Chinese.)

    Yes, but so what? I would expect a Japanese page quoting Jōshū to use the Japanese versions of the characters, and a Chinese page quoting Zhàozhōu to use the Chinese versions. Would that make life difficult for people wanting to implement, say, search algorithms? Yes, undoubtedly! Maybe it would even be worse overall. Maybe I shouldn’t have said “Disaster” when what I really meant was “At times inconvenient for people with my interests, and in a way that’s basically unfixable.” But the current situation is not an elegant compromise that only a nitpicker could find fault with – it’s an ugly workaround with real disadvantages. (That said, even with Han unification using Unicode is still orders of magnitude better than any existing alternative.)

  23. David Marjanović says

    Unicode still lacks support for Tangut and Khitan scripts.

    UNICODE TANGUT COMING IN JUNE 2016

    Farther down the page is a link to the “Proposal on Encoding Khitan Large Script in UCS”.

    I’ll take this opportunity to state that the Tangut script is probably the most wrong-headed enterprise in human history.

  24. Bathrobe says

    a Japanese page quoting Jōshū to use the Japanese versions of the characters, and a Chinese page quoting Zhàozhōu to use the Chinese versions. Would that make life difficult for people wanting to implement, say, search algorithms?

    I can’t see why. Chinese searches nowadays return both Simplified and Traditional results. Perhaps it took a bit of work to set up initially but it’s certainly doable. Hell, half of my Japanese-language Google searches return Chinese-language results, anyway.

  25. By the way, Simplified and Traditional are not in general unified (except for unsimplified characters, obviously). So changing the language setting is not going to switch between zh-Hans and zh-Hant.
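
    A concrete illustration: the traditional forms 為 and 爲 and the PRC simplified form 为 are, to Unicode, simply three different characters with three different codepoints, so no language tag can turn one into another. A quick Python check:

        for ch in "為爲为":
            print(f"U+{ord(ch):04X}  {ch}")

        # Three distinct codepoints, unlike the 直 case above,
        # where a single codepoint serves both Chinese and Japanese.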

  26. Tangut script is probably the most wrong-headed enterprise in human history.

    Indeed, they took the worst feature of the Chinese script without any of its better ones.

  27. Max Pinton says

    The Unicode situation with Bengali doesn’t sound too fantastic either:

    https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name

  28. https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name

    This article is ludicrous. The Bengali thing is a font problem. On CJK languages, it says: “Some (but not all) of the characters trace their lineage back to a common set, but even these characters, known as Han characters, began to diverge and evolve independently over two thousand years ago.”

  29. The article is an ugly mixture of essentialism (and not the kind I peddle, either) and outright lies of omission.

    While yelling about seven of the nine full (i.e. corporate) members being U.S. technology companies, he carefully doesn’t mention that the institutional members have equal rights to the corporate ones, and that they include the Government of India and the Government of Bangladesh, the two countries where Bengali is chiefly spoken. He also suppresses all mention of ISO, which is explicitly nationality-based and which is co-responsible for Unicode: not just a rubber stamp, but a full partner. Nothing gets added until both the Consortium and the ISO working group (JTC1/SC2/WG2) have passed it. Any national standards body can join, and about 14 contribute regularly.

    It’s a bit misleading to call the suddenly/Aditya problem a font problem. The normal way in all Indic scripts to force a conjoining form of a dead consonant (one that isn’t followed by a vowel) is with consonant letter + virama/halant/hasant + zero width joiner, just as explained in the article. The claim that these are three separate and unrelated characters, or analogous to \/\/ for W, is absurd: you cannot write a single sentence in any Indic script without using such a sequence.

    The problem is that in Bengali there are two possible resulting shapes for ত (ta), the normal one and the “khanda-ta”, and the difference is not merely stylistic: for about a century now, khanda-ta has been used in native words and normal dead ta in non-Sanskrit borrowings, with Sanskrit borrowings varying over time but tending toward khanda-ta. Therefore, there have to be two different Unicode representations. This was not clearly understood by Unicode/ISO until about 2003, at which point the new character ৎ was encoded for khanda-ta specifically. Such errors and omissions are inevitable in any merely human endeavor.
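
    For the curious, the codepoints involved are easy to list (Python, standard library); whether you actually see a conjunct, a khanda-ta, or a row of boxes depends entirely on your font and renderer:

        import unicodedata

        ta = "\u09A4"         # BENGALI LETTER TA
        virama = "\u09CD"     # BENGALI SIGN VIRAMA (the hasanta)
        zwj = "\u200D"        # ZERO WIDTH JOINER
        khanda_ta = "\u09CE"  # BENGALI LETTER KHANDA TA, encoded separately as described above

        for ch in (ta, virama, zwj, khanda_ta):
            print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")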

    The claim in the title (also in the text itself, so it’s not merely an editor’s attempt at click bait) that “you can’t write your native language” can now be firmly dismissed as false, at least if your native language is written at all. There is a strong feeling in India that a language is not “real” if it doesn’t have its own script, and so there is pressure from minorities of minorities to employ such unique scripts, some but not all of which are in Unicode. For that matter, a lot could be said about the use of the name “Bengali” for the Eastern Nagari script, as if Assamese and many minority languages were not also written in it; if anyone is at all “subaltern” in the present situation, it is precisely the speakers/writers of non-scheduled languages. We of course hear nothing of this in the hegemonic Bengali discourse of the article’s author. 🙂

    Everyone who works for Unicode or WG2 is a volunteer, by the way (including yours truly in the teeny tiny capacity of occasional commenter on other people’s proposals), with a literal handful of exceptions. It’s all very well to set up quotas, but how do you arrange for people to fill them? Is he volunteering to come walk the Unicode walk? I doubt it.

  30. Thanks! Still more enlightenment from John Cowan.

    Han unification is mostly a contemporary (post-WWII) Japan-vs.-Sinophonia issue about the identity of characters. To Chinese-speaking people, a great range of minor variants of a character are considered equivalent in print. So the story of Han unification is intuitive to Chinese speakers, but not to the Japanese, who often insist that their names be spelt with a particular three-stroke or four-stroke variant of “艹”. So the Chinese won the Unicode war, and the Japanese resigned themselves to ideographic variation selectors.

  31. Before Unicode, we had code pages for i18n and l10n (internationalization and localization). Here’s a sample of the sort of thing I had to write to try to explain:
    “IBM code page 850 supports all the major Western European languages, and contains all the characters specified in the ISO 8859/1, Latin Alphabet No. 1 standard. But although the pages are roughly equivalent, they aren’t the same. The table headed “Translation Code Pages” is a bit confusing, because the value x’01’ is said to represent both the ISO 8859 and the IBM codepage 850, which isn’t possible. From experience, the answer is that the characters supported for translation by the “Translation Code Pages” are in fact a subset of the 8859 and IBM850 code pages: only the plain vanilla set of characters (a,b,c.. A,B,C..) and numbers is supported. Anything else requires a user-written translation routine. For example, an exclamation mark ! sent from the PC will appear in the mainframe as an umlaut (u with two dots on top). ”
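
    That kind of code-page mismatch is easy to reproduce today; a tiny Python sketch (the exact garbage you get depends, of course, on which code pages are involved):

        text = "café"

        # Encode under one single-byte code page, decode under another:
        garbled = text.encode("latin-1").decode("cp850")
        print(garbled)  # something like 'cafÚ' -- the same byte read against a different table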

    Unicode is a lavish blessing in comparison with all that.. it’s necessarily complex because it has to take account of humans and their peculiar ways of doing communication, and politics.

    John Cowan – just so. Preach it, brother..

    Bathrobe,
    “I’m not sure of the determining factor in deciding what you get on your computer. The OS? The browser (all my browsers give the same rendering)? User preferences?”
    It is one or more of those things, combinatorially. Plus other factors.

    Once Joel wrote this article in 2003, I started referencing it in all my character discussions:
    http://www.joelonsoftware.com/articles/Unicode.html
    Not all the programmers who need to read it have done so yet, unfortunately.
    Also need-to-read:
    http://www.w3.org/International/questions/qa-what-is-encoding.en

  32. IBM code page 850, by the way, was the character set of DOS PCs sold in non-anglophone Western Europe, for those still living in ignorance (= bliss, in this case). It included all the characters of ISO 8859-1 aka Latin-1, but assigned different numbers to many of them.

  33. Bathrobe says

    @JC By the way, Simplified and Traditional are not in general unified (except for unsimplified characters, obviously). So changing the language setting is not going to switch between zh-Hans and zh-Hant.

    So how do I force a web page to use the characters I want? How do I force browsers to show ⻍ instead of ⻌? Or the Japanese form of 直 rather than the Mainland Chinese form? Is it virtually impossible?

    Unicode for Traditional Mongolian is still pretty unsatisfactory and not widely implemented. The problem is the same as the one with Arabic/Hebrew. Mongolian letters have different forms depending on their place in the word. Unicode specifies only the letters of the Mongolian alphabet, plus a couple of Free Variation Selectors, an “MVS”, and a non-breaking space before case endings. For instance, if you want to specify that your final vowel is a tail separated from the body of the word, you have to use MVS. The FVS’s are used to force letters to take certain irregular forms (often found in foreign words) or to prevent letters from joining together when you don’t want them to. And if you don’t use the non-breaking space before case endings, they won’t appear properly. See this page (basic table of letters) and this page (for all the special techniques of input).
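
    For reference, here are the special-purpose characters mentioned above, as Unicode names them; this is just a listing of codepoints (Python, standard library) and says nothing about how well any particular renderer handles them:

        import unicodedata

        for cp in (0x180B, 0x180C, 0x180D, 0x180E, 0x202F):
            print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")

        # U+180B  MONGOLIAN FREE VARIATION SELECTOR ONE
        # U+180C  MONGOLIAN FREE VARIATION SELECTOR TWO
        # U+180D  MONGOLIAN FREE VARIATION SELECTOR THREE
        # U+180E  MONGOLIAN VOWEL SEPARATOR
        # U+202F  NARROW NO-BREAK SPACE (conventionally used before the case endings)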

    Microsoft implemented the Traditional Mongolian script several years ago and it works quite well in Windows. Apple is more problematic. The script works fairly well on iPhone but every time Apple updates their iOS it breaks the system until they fix it (which could take six months or more).

    The best implementation of the Mongolian alphabet that I’ve seen that can be used both on a Mac and a PC is that of Mongolfont, but I still haven’t figured out how to implement it properly on a web page. It involves CSS and embedding fonts. Mongolfont’s site renders fine on Android, but I haven’t managed to replicate this.

    In China, the commonest system for Traditional Mongolian is Menksoft, but that doesn’t use Unicode at all. It avoids all the problems associated with different forms of characters and their ligatures by encoding the combined forms. Unfortunately it’s proprietary and uses Microsoft software (I forget which specifically) to achieve its effects. There is apparently no possibility that it can be ported to a Mac.

  34. ⻍ and ⻌ aren’t actually hanzi: they are traditional and simplified radicals respectively, and are meant to be used as symbols when classifying hanzi by radical, not in running text.

  35. Well, ⻌ is a simplified version of ⻍, but either can be used in traditional characters (and probably also in simplified characters, though ⻌ is definitely the usual choice; in any case, both are simplified forms of 辵). As Bathrobe’s post shows, you can specify whether you want to display ⻌ or ⻍—what you can’t do is use unicode to specify which form to use when writing 道, for example. There are obviously more sophisticated potential solutions, but the only way I’m generally able to specify one form or the other is to select a font that uses the form I’d like, and this doesn’t work too well on the internet.

  36. My point is that in addition to the ordinary characters, which are considered letters, there is also a separate set of 214 Kang Xi radicals plus 115 additional simplified, Japanese, oddball, etc. radicals in the Unicode repertoire. These are symbols, not letters, and shouldn’t be used in ordinary documents.
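
    The character database itself makes that split visible, as a minimal Python check shows; the radicals carry the general category So (symbol, other), while an ordinary ideograph like 道 is Lo (letter, other):

        import unicodedata

        for ch in "⻌⻍道":  # the two radicals discussed above, plus an ordinary ideograph
            print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {ch}")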

  37. That’s true, but what I’m talking about—and what I think Bathrobe is talking about—is being able to specify whether characters like 道 display with one or two dots in the upper left corner. This is not an important issue in terms of legibility or accurately reproducing text (in the sense that there is no distinction in English orthography between ‘a’ and ‘ɑ’ or ‘g’ and ‘ℊ’), but there are times when it would be useful to be able to make this distinction, and it’s not possible within unicode (that is, as it displays in my browser, I can type ⻌+首 (道) but I cannot type ⻍+首, regardless of whether I use a simplified, traditional, or Japanese IME).

    So this is arguably unimportant, as this distinction doesn’t matter in ordinary documents, but it is not entirely different from the kind of distinction between, say, 真 and 眞 (two different forms of zhēn) or 為 and 爲 (two forms of wéi/wèi; the PRC simplified form of this graph is 为), both of which can be distinguished in unicode. (The distinction Bathrobe mentions between the two forms of 直 is very similar but not identical to the distinction between 眞 and 真, but only zhēn can be distinguished in unicode, for reasons I don’t understand).

    (This is not a traditional vs. PRC simplified distinction—in all the cases above, both forms are standard traditional characters—and someone copying a text in traditional characters would freely use either version, without regard to the form used in the original).

    I also find this frustrating at times, and I don’t think there is any consistent logic behind it.

    That said, I love unicode, which is a million times better than any other system of encoding CJKV characters that I’ve ever used, and I’ve unfortunately had occasion to use a few.

  38. Oh, and about 道—what I mean is that it is not possible to use unicode to specify how it appears. Depending on the font used to view the character, it could appear either as ⻌+首 or as ⻍+首, and as far as I know there is no reliable way to specify which should appear, other than specifying a certain font which the viewer may or may not have installed (I think that it’s possible to embed a font to guarantee that it displays the way you want, but this is something I don’t know how to do).

  39. There is in fact a logic behind it, though it’s true that human judgment is required. There are three reasons why characters with the same meaning/usage cannot be unified in Unicode:

    1) Because they are distinct in one of the underlying national standards. This rule trumps all other rules, because it has to be possible to round-trip from a national standard to Unicode and back without losing information. For example, there are six trivial variants of ‘sword’ in JIS X 0208, and so Unicode is stuck with six code points for them: 剣 劍 剱 劔 劒 釼. In Korean, there are a few hundred hanja that are 100% identical, but because they have different readings, the Korean standard gives them different code points, apparently to assist in transliteration to hangul. Perforce, so does Unicode.

    2) Because they are etymologically unrelated. The hanzi for ‘earth’ and for ‘warrior, scholar’ look very similar (an equal-armed cross sitting on a base), but only coincidentally so.

    3) Because they differ in what is called “abstract shape”. This is the trickiest distinction and I think it is the one operating in your examples. The idea here is to see how two different hanzi with the same meaning/usage can be dissected into ideographic components. If they decompose in different ways or into different components, they are not unified.

    The nitty-gritty is in Chapter 18 of the Unicode Standard, pp. 660-64 (physical pages 11-15 in the PDF). I recommend looking at the examples there.

  40. Thanks—that looks interesting. I don’t have time to look at it at the moment, but I’ll save it for when I do have time. It does seem like those examples must fall under the “abstract shape” category—there’s a fine line between the distinctions between 真 and 眞 and the two 直s, but I could see how the first could be categorized as an abstract shape difference while the second isn’t. It still seems a pretty strange distinction to make to me, but I’ll take a look at the chapter later and see what they have to say about it.

  41. Just glancing at the pictures should tell the tale: worth 1000 words and all that.

  42. For that matter, a lot could be said about the use of the name “Bengali” for Eastern Nagari script, as if Assamese and many minority languages were not also written in it, by the way

    And now I see a proposal to encode Assamese “script” entirely separately from Bengali, on the grounds that it is a scheduled language and deserves its own Unicode block. There are literally two characters used in Assamese writing but not in Bengali writing, and these have been encoded in Unicode since day one, if not before (Unicode descends from a private Xerox encoding standard).

    Fortunately, I doubt this disunification will go anywhere, as it would render all existing digital documents in Assamese obsolete. The only time this has happened before was when Coptic was disunified from Greek, which affected almost no existing documents and made plain-text mixed Coptic and Greek documents straightforward instead of painful.

  43. John Cowan says

    Well, ISO has asked Unicode to agree on changing the official name of the script to Bengali/Assamese, which is helpful if it allays unfounded fears (as the change from “Old Russian” to “Old East Slavic” in linguistics apparently did). Here’s an Assamese professor at MIT attempting to put those fears to rest more directly.

    I misspoke in saying the Coptic disunification was the only case. In Unicode 1.0, before the merger with the originally independent draft ISO 10646, a subset of Korean characters was in a different part of the space. They were expanded to every possible Korean character and moved to their current positions. But Unicode 1.0 was different enough from all following Unicodes from 1.1 on that it almost doesn’t count as “the same thing”, though no other characters changed places. Fortunately, most Korean text was encoded using Korean national standards at the time.
