Ben Frederickson has a post describing some of the weirdnesses of Unicode (“Things that seem like they should be very simple are often deceptively complicated when dealing with Unicode strings … Unicode also has lots of different characters that are visually identical to one another”) but ending with this upbeat conclusion:
Every change to Unicode has been a rational change by intelligent hard working people. While I can make fun of the poop emoji being included in the Unicode standard, it was the end result of a smart strategic decision by engineers at Google. Now that emoji are included in the Unicode standard, we have the rational follow on decision of supporting racial hints for people in emoji. Likewise by supporting emoji like a piece of pizza, the Unicode consortium has to now make the tough calls on including hot dogs and tacos in the next version of the standard while also excluding hoagies. Even having visually identical characters with different code points was a deliberate design decision – it’s necessary for lossless conversion to and from legacy character encodings.
Unicode is crazy complicated, but that is because of the crazy ambition it has in representing all of human language, not because of any deficiency in the standard itself. Human language is a complicated messy business, and Unicode has to be equally complicated to represent it. Thankfully we have people writing those long standards on how to display bidirectional strings appropriately, or sort strings, or the security implications of all this – so that the rest of us don’t have to think about it and just use standard library code to handle instead.
I’m deeply grateful for the existence of Unicode, and equally grateful that I don’t have to understand how it works, but I figure those with more understanding of coding than I (which is a very low bar) might find it interesting and/or have something to say about it.
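For anyone who wants to see the “visually identical characters” point concretely, here is a minimal sketch in Python using only the standard library. It is an illustration, not anything from Frederickson's post: the three sample letters and the choice of ISO-8859-7 as the legacy encoding are my own assumptions, and the helper name `describe` is hypothetical.

```python
import unicodedata

# Three capital letters that look alike in most fonts but are
# separate code points (the sample letters are an assumption
# chosen for illustration).
lookalikes = ["\u0041", "\u0391", "\u0410"]  # Latin, Greek, Cyrillic


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


descriptions = [describe(ch) for ch in lookalikes]
for line in descriptions:
    print(line)

# The strings compare unequal even though they render the same.
all_distinct = len(set(lookalikes)) == len(lookalikes)

# Why keep them apart: the legacy Greek encoding ISO-8859-7 stores
# Latin A and Greek Alpha as two different bytes. Because Unicode also
# keeps them as two code points, decoding and re-encoding loses nothing.
legacy_bytes = b"\x41\xc1"
decoded = legacy_bytes.decode("iso8859_7")
round_trip_ok = decoded.encode("iso8859_7") == legacy_bytes

# Some lookalikes exist purely for compatibility; normalization maps
# them onto their ordinary equivalents.
angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = unicodedata.normalize("NFC", angstrom_sign)

print(all_distinct, round_trip_ok, describe(a_with_ring))
```

If the two letters had been merged into one code point, the two legacy bytes would decode to the same character and the original text could not be recovered, which is the lossless-conversion argument in the quoted passage.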