On Unicode

Posted by Michał ‘mina86’ Nazarewicz on 25th of October 2015

There are a lot of misconceptions about Unicode. Most exist because people assume that what they know about ASCII or ISO-8859-* is also true of Unicode. They are usually harmless, but they tend to creep into the minds of people who work with text, which leads to badly designed software and technical decisions based on false information.

Without further ado, here are a few facts about Unicode that might surprise you.

UTF-16 is not a fixed-width encoding

Unicode defines 17 planes (the most famous being plane zero, the Basic Multilingual Plane or BMP). Each plane consists of 65 536 code points. Quick multiplication unveils a staggering 1 114 112 entries. It quickly becomes obvious that 16 bits, the size of a single UTF-16 code unit, aren’t enough to identify each code point uniquely.

To solve that problem, the somewhat awkward concept of surrogate pairs has been introduced. 2048 code points have been carved out to make room for high and low surrogates. In UTF-16, a high surrogate followed by a low surrogate — four octets total — encodes a single code point outside of the BMP.

The encoding method is relatively simple. For example, to represent U+1F574: man in business suit levitating (🕴 — does your browser support it yet?) one would:

  1. Subtract 10000₁₆ from the code point to produce a 20-bit number.

    1F574₁₆ - 10000₁₆ = F574₁₆ = 00001111010101110100₂

  2. Add D800₁₆ to the ten most significant bits of that number — that’s the high surrogate.

    D800₁₆ + 0000111101₂ = 110110 0000000000₂ + 0000111101₂ = 110110 0000111101₂ = D83D₁₆

  3. Add DC00₁₆ to the ten least significant bits of the same number — that’s the low surrogate.

    DC00₁₆ + 0101110100₂ = 110111 0000000000₂ + 0101110100₂ = 110111 0101110100₂ = DD74₁₆

  4. Output the high surrogate followed by the low surrogate.

    UTF-16 encoding of U+1F574 is U+D83D U+DD74.
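
The same steps are easy to express in code. Below is a minimal sketch in Python; to_surrogate_pair is a made-up helper written for illustration, not part of any standard library (Python itself produces surrogate pairs transparently when encoding to UTF-16):

    def to_surrogate_pair(code_point):
        """Split a code point outside the BMP into UTF-16 high and low surrogates."""
        assert 0x10000 <= code_point <= 0x10FFFF
        offset = code_point - 0x10000      # step 1: a 20-bit number
        high = 0xD800 + (offset >> 10)     # step 2: ten most significant bits
        low = 0xDC00 + (offset & 0x3FF)    # step 3: ten least significant bits
        return high, low                   # step 4: high surrogate first

    print(tuple(hex(u) for u in to_surrogate_pair(0x1F574)))  # ('0xd83d', '0xdd74')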

Case change is not reversible and may change length

German speakers will probably recognise ß, the small letter sharp s. It is a bit of a unique snowflake among Latin alphabets in that it has no corresponding upper case form. Or rather, even though a capital sharp s exists, the correct way to capitalise ß, according to German orthography, is to replace it with two letters S. Similarly, an fi ligature becomes FI. Other characters, such as ʼn, need to be decomposed, producing ʼN.

In case this isn’t confusing enough, strings may get shorter as well. ‘I◌̇stanbul’ (which starts with a capital I followed by U+0307: combining dot above) becomes ‘istanbul’ when lower-cased. One code point fewer.

Below is a table of some of the corner cases. Firefox fails at İstanbul when it is spelled using the combining dot above character, while Chrome and Opera fail to properly capitalise ‘film’.

Operation        Expected    Browser’s handling
uc(‘define’)     DEFINE      define
uc(‘heiß’)       HEISS       heiß
tc(‘film’)       Film        film
    Ligatures and digraphs often need to be converted into separate characters.
tc(‘nježan’)     Nježan      nježan
lc(‘Ⅷ’)          ⅷ
tc(‘ijs’)        IJs         ijs
    Some ligatures and digraphs have corresponding characters in desired case.
tc(‘ijs’)        IJs         ijs
    Interestingly, Firefox handles Dutch ij even if written as separate letters.
lc(‘ΌΣΟΣ’)       όσος        ΌΣΟΣ
    Lower case sigma is ‘σ’ in the middle but ‘ς’ at the end of a word.
uc(‘istanbul’)   İSTANBUL    istanbul
lc(‘İSTANBUL’)   istanbul    İSTANBUL
lc(‘IRMAK’)      ırmak       IRMAK
    Turkish has a dot-less (a.k.a. closed) and dotted ‘i’.
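
A quick way to observe some of these mappings is Python’s built-in string methods, which implement the default, untailored Unicode case rules (so locale-specific behaviour, such as the Turkish dotted and dot-less i, does not apply); a small sketch:

    print('heiß'.upper())              # HEISS - one letter expands to two
    print('\ufb01lm'.upper())          # FILM  - the fi ligature splits into F and I
    print('ΌΣΟΣ'.lower())              # όσος  - σ in the middle, ς at the end of the word
    print(len('ß'), len('ß'.upper()))  # 1 2   - the case change altered the length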

Single letter may map to multiple code points

The above examples show that the concepts of a letter or a character may be blurry and confusing. Is the aforementioned ‘ß’ a letter or a fancy way of writing ‘ss’? Or ‘sz’? What of ligatures and digraphs? But at least everyone agrees ‘é’ is a single letter, right? Here it is again: ‘é’, except this time it’s a regular letter e followed by U+0301: combining acute accent, i.e. ‘e◌́’.

The former, single-code-point representation, is called precomposed (or composed) while the latter, using combining characters, is called decomposed. What’s important is that both sequences are canonically equivalent and proper Unicode implementations should treat them identically. They should be indistinguishable based on rendering or behaviour (e.g. when selecting text).

In addition to the above, Polish ‘ą’ ≈ ‘a◌̨’, Korean ‘한’ ≈ ‘ㅎㅏㄴ’, ‘Ω’ (U+2126: Ohm sign) ≈ ‘Ω’ (U+03A9: Greek capital letter omega), Hebrew ‘שׂ‎’ ≈ ‘ש‎◌‎ׂ’ and more.

Based on canonical equivalence, Unicode defines Normalisation Form C (NFC) and Normalisation Form D (NFD). The former generally uses precomposed representations of characters while the latter uses decomposed ones.

Oh, and by the way, the aforementioned Ohm sign is a singleton, which means it disappears from text after any kind of normalisation (it is replaced by the Greek capital letter omega). There’s a bunch of those.

‘Converting to NFC’, a hopeful programmer will say, ‘guarantees that a single letter maps to no more than one code point!’ Alas, no… For example, no single Unicode code point exists for ‘ḍ̇’ (i.e. the letter d with a dot above and a dot below). No matter which form is used, the character must take more than one code point.

NFC is not even guaranteed to be the shortest representation of a given string. We’ve already seen that ‘שׂ’ is canonically equivalent to ‘ש◌ׂ’, but what’s more interesting is that the latter is in NFC. Yes, even though a precomposed character exists, the decomposed representation is the one in NFC. In fact, for the Hebrew letter shin with sin dot, Normalisation Forms C and D are the same.

So as not to leave the impression that NFC is the odd one out: even though NFD usually decomposes precomposed characters, it does not always do so. ‘ø’ (U+00F8: Latin small letter o with stroke) is in NFD as a single code point, even though a decomposed representation with a combining stroke also exists.
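
These claims are easy to verify with Python’s unicodedata module; a minimal sketch (any conformant normalisation implementation should behave the same way):

    import unicodedata as ud

    decomposed = 'e\u0301'                                  # 'é' as e + combining acute accent
    print(ud.normalize('NFC', decomposed) == '\u00e9')      # True - composes to the single code point
    print(ud.normalize('NFC', '\u2126') == '\u03a9')        # True - the Ohm sign singleton becomes omega
    print(ud.normalize('NFC', '\ufb2b') == '\u05e9\u05c2')  # True - shin with sin dot stays decomposed
    print(ud.normalize('NFD', '\u00f8') == '\u00f8')        # True - ø stays precomposed even in NFD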

There is also compatibility equivalence, which can be thought of as covering the ‘meaning’ of strings. For example, ‘ﬁ’ (U+FB01: Latin small ligature fi) means the same thing as ‘f + i’, ‘dž ∼ d + z + ◌̌’, etc. This is a somewhat simplified view though, since ‘5²’ has a distinct meaning from ‘52’ yet the two sequences are in the same compatibility equivalence class.
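
Compatibility normalisation (NFKC/NFKD) is therefore lossy, which the following Python sketch illustrates:

    import unicodedata as ud

    print(ud.normalize('NFKC', '\ufb01'))   # fi - the ligature decomposes into two letters
    print(ud.normalize('NFKC', '5\u00b2'))  # 52 - the distinction between 5² and 52 is lost
    print(ud.normalize('NFC', '5\u00b2'))   # 5² - canonical normalisation leaves it alone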

UTF-8 is better for CJK than UTF-16

An argument sometimes made in favour of UTF-16 (over UTF-8) is that it is better for Far Eastern scripts. For the majority of Chinese, Japanese and Korean (CJK) ideographs, UTF-16 takes two octets while UTF-8 takes three. Clearly, Asia should abandon UTF-8 and use UTF-16 then, right?

Block (Range)                                          UTF-8 octets   UTF-16 octets
CJK Unified Ideographs Extension A (U+3400–U+4DBF)          3              2
CJK Unified Ideographs (U+4E00–U+9FFF)                      3              2
CJK Unified Ideographs Extension B (U+20000–U+2A6DF)        4              4
CJK Unified Ideographs Extension C (U+2A700–U+2B73F)        4              4
CJK Unified Ideographs Extension D (U+2B740–U+2B81F)        4              4

Alas, the devil, as he often does, lies in the details, namely in the fact that in most cases CJK text is accompanied by markup which uses US-ASCII characters. Since those need only one octet in UTF-8, this often more than makes up for the octets ‘lost’ when encoding ideographs.
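
The effect is easy to reproduce with a toy example; the snippet below uses a made-up fragment of marked-up CJK text rather than any of the measured pages:

    html = '<p>\u6f22\u5b57</p>'                   # seven ASCII characters of markup, two ideographs
    utf8 = len(html.encode('utf-8'))               # 7 * 1 + 2 * 3 = 13 octets
    utf16 = len(html.encode('utf-16-le'))          # 9 * 2         = 18 octets
    print(utf8, utf16, f'{utf16 / utf8 - 1:.0%}')  # 13 18 38%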

To see how big a role this effect plays in real life, I looked at a bunch of websites popular in China, Japan and South Korea and compared their sizes (in kibibytes) when different encodings were used. The results are as follows:

Page            UTF-8 [KiB]   UTF-16 [KiB]   Increase
baidu.com            91            181          100%
tmall.com            46             90           97%
daum.net            155            300           94%
taobao.com           40             76           93%
amacon.co.jp        216            413           91%
rakuten.co.jp       291            548           88%
gmarket.co.kr        71            133           88%
weibo.com             6             11           86%
yahoo.co.jp          18             34           85%
naver.com            80            147           83%
ppomppu.co.kr       142            259           83%
zn.wiki/Japan       938          1 690           80%
kr.wiki/Japan       782          1 370           75%
zn.wiki/Korea        67            116           73%
fc2.com              35             60           72%
jp.wiki/Korea       123            211           71%
kr.wiki/Korea       180            303           69%
jp.wiki/Japan     1 012          1 616           60%

Yes, in the worst case, baidu.com’s size nearly doubled when using UTF-16.

If size is a concern, one might decide to use a dedicated encoding such as Shift_JIS, a variant of EUC, or GB2312. And indeed, some sites did that, but even then the advantage over UTF-8 was minimal:

Page                       Original [KiB]   UTF-8 [KiB]   Increase
ppomppu.co.kr (euc-kr)           136             142         4.5%
weibo.com (gb2312)                 5               6         3.6%
gmarket.co.kr (euc-kr)            69              71         3.2%
rakuten.co.jp (euc-jp)           283             291         3.1%
amacon.co.jp (Shift_JIS)         211             216         2.2%
taobao.com (gbk)                  39              40         1.8%

The truth of the matter is that to save space, a technique independent of Unicode should be used, one that has been around for years and that any modern browser supports: compression. And this is also true for storage. Even with a dense file containing virtually no markup (e.g. with a double newline separating paragraphs as the only ASCII characters), it is far better to simply compress the file than to mess around with the encoding.
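
As a rough sketch of that point, here is a comparison on made-up, markup-free CJK text using Python’s gzip module:

    import gzip

    # Dense, made-up CJK text: double newlines are the only ASCII characters.
    text = ('\u6f22\u5b57' * 1000 + '\n\n') * 50
    utf8, utf16 = text.encode('utf-8'), text.encode('utf-16-le')
    print(len(utf8), len(utf16))                                # 300100 200200 - UTF-16 is smaller raw
    print(len(gzip.compress(utf8)), len(gzip.compress(utf16)))  # both collapse to a tiny fraction of that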

There is no Apple logo in Unicode

A total of 137 468 code points (U+E000–U+F8FF in the BMP, U+F0000–U+FFFFD in plane 15 and U+100000–U+10FFFD in plane 16) are reserved for private use. In other words, the standard will never assign any meaning to them. If they are used in interchanged data, all parties must agree on a common interpretation, or unexpected results (maybe even corrupted data) may follow.

This describes the situation with the Apple logo. Within the realm of Cupertino-controlled software, U+F8FF is an Apple logo, but outside, in the world of the free (or at least freer), it’s usually a code point with no representation or meaning.

In other words, just don’t use U+F8FF.

Fruit fans should not despair though, but rather find consolation in the red apple (🍎, U+1F34E), the green apple (🍏, U+1F34F) and even the pineapple (🍍, U+1F34D, which isn’t an apple at all, nor does it grow on pine trees).

Shortening text isn’t quite as easy as you might think

Speaking of Apple, removing characters from the end of a Unicode string may expand its rendered representation. Tom Scott made a video about it, so rather than duplicating his observations, I’ll simply point to his work on the subject. It’s not a long video and is worth the watch.

Conclusion

There is far more that could be said about Unicode. The introduction of emoji and skin-tone modifiers makes the standard so much more… interesting, to name just one aspect.

But even with this limited exposure to the standard, this article has shown that Unicode is like localisation: it’s complicated, hard to get right and best left to professionals. And said poor souls, when implementing Unicode text handling, should remember to forget everything they know from other encodings.

Next, accessing code points by index never makes sense. The code point at position n corresponds neither to the nth character nor to the nth glyph. To take a sub-string of Unicode text, it needs to be interpreted from the beginning, and having random access to individual code points doesn’t speed anything up.
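
For instance (a Python sketch; the same pitfall exists in any language that exposes code point or code unit indices):

    s = 'Zoe\u0308'  # 'Zoë' written with a combining diaeresis
    print(len(s))    # 4 code points for what a reader sees as 3 characters
    print(s[:3])     # 'Zoe' - slicing by code point index strips the diaeresis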

And finally, don’t be fooled by UTF-16 propaganda. The encoding combines the disadvantages of UTF-8 (being variable-length) with those of UTF-32 (taking up a lot of space) and as such is the worst possible solution for text. Simply use UTF-8 everywhere and be done with it.