A Programmer's Guide to Unicode

Blog post on Unicode, recommended by Karpathy.

Date Created:
1 17

References



Notes


  • Unicode represents a large jump in complexity over character sets like ASCII because the goal of Unicode is to faithfully represent the entire world's writing systems.
  • The Unicode consortium's goal is enabling people around the world to use computers in any language. To date, Unicode supports 135 different scripts, covering some 1100 languages, and there's still a long tail of 100 unsupported scripts, both modern and historical, which people are still working to add.
  • Unicode embraces the diversity of language and accepts the complexity inherent in its mission to include all human writing systems.
  • The basic elements of Unicode - its "characters", although that term isn't quite right - are called code points. Code points are identified by a number, customarily written in hexadecimal with the prefix U+, such as U+0041 "A" LATIN CAPITAL LETTER A or U+03B8 "θ" GREEK SMALL LETTER THETA. Each code point has a short name and a few other properties specified in the Unicode Character Database.
  • The set of all possible code points is called the codespace. The Unicode codespace consists of 1,114,112 code points. However, only 128,237 of them - about 12% of the codespace - are actually assigned, to date.
  • Below is a map of the entire codespace, with one pixel per code point. It's arranged in tiles for visual coherence; each small square is 16x16=256 code points, and each large square is a plane of 65,536 code points. There are 17 planes altogether.
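These facts are easy to check from code; a quick sketch using Python's standard library (the characters are just sample picks):

```python
import unicodedata

# Code points are just numbers; ord()/chr() convert between
# characters and their code point indices.
assert ord("A") == 0x0041
assert chr(0x03B8) == "θ"

# Each assigned code point has a name in the Unicode Character Database.
print(unicodedata.name("θ"))   # GREEK SMALL LETTER THETA

# The codespace: 17 planes of 65,536 code points each.
assert 17 * 0x10000 == 1_114_112
```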

Unicode Code Space Visualized

White represents unassigned space. Blue is assigned code points, green is private-use areas, and the small red area is surrogates. The assigned code points are distributed somewhat sparsely, but concentrated in three planes. Plane 0 is the Basic Multilingual Plane, or BMP. The BMP contains essentially all the characters needed for modern text in any script, including Latin, Greek, Han, Arabic, Devanagari, and many more.

In the past, the codespace was just the BMP and no more - Unicode was originally conceived as a straightforward 16-bit encoding, with only 65,536 code points. It was expanded to its current size in 1996. However, the vast majority of code points in modern text belong to the BMP.

Plane 1 contains historical scripts, such as Sumerian cuneiform and Egyptian hieroglyphs, as well as emoji and various other symbols. Plane 2 contains a large block of less-common and historical Han characters. The remaining planes are empty, except for a small number of rarely used characters in Plane 14.

First Three Unicode Planes

The map color codes the 135 different scripts in Unicode. Han Chinese (teal) and Korean (light brown) take up most of the range of the BMP. By contrast, all of the European, Middle Eastern, and South Asian scripts fit into the first row of the BMP in this diagram. Many areas of the codespace are adapted or copied from earlier encodings - the first 128 code points are copied from ASCII.

Below is a heat map of planes 0-2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.

Twitter/Wikipedia Unicode Character Use Frequency

You can see the emoji usage in the bottom of plane 1 above.

The most convenient, computer-friendly thing to do would be to store the code point index as a 32-bit integer. This works, but it consumes 4 bytes per character, which is a lot. Consequently, there are several more compact encodings for Unicode. The 32-bit integer encoding is officially called UTF-32 (UTF = Unicode Transformation Format), but it's rarely used for storage. You will commonly see Unicode text encoded as either UTF-8 or UTF-16. These are both variable-length encodings, made up of 8-bit or 16-bit units, respectively. In these schemes, code points with smaller index values take up fewer bytes, which saves a lot of memory for typical texts. The trade-off is that processing UTF-8/16 texts is more programmatically involved, and likely slower.
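To make the size trade-off concrete, here's a small Python sketch encoding the same string in all three UTFs (the sample string is arbitrary):

```python
s = "hello θ 😀"   # 9 code points: 7 ASCII, one BMP Greek letter, one emoji

for enc in ("utf-32-le", "utf-16-le", "utf-8"):
    print(enc, len(s.encode(enc)), "bytes")
# utf-32-le 36 bytes   (4 bytes per code point, always)
# utf-16-le 20 bytes   (2 bytes each, except 4 for the emoji)
# utf-8     13 bytes   (1 byte per ASCII char, 2 for θ, 4 for 😀)
```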

In UTF-8, each code point is stored using 1 to 4 bytes, based on its index value. UTF-8 uses a system of binary prefixes, in which the highest bits of each byte mark whether it's a single byte, the beginning of a multi-byte sequence, or a continuation byte; the remaining bits, concatenated, give the code point index.

UTF-8 (binary)                      | Code point (binary)   | Range
0xxxxxxx                            | xxxxxxx               | U+0000–U+007F
110xxxxx 10yyyyyy                   | xxxxxyyyyyy           | U+0080–U+07FF
1110xxxx 10yyyyyy 10zzzzzz          | xxxxyyyyyyzzzzzz      | U+0800–U+FFFF
11110xxx 10yyyyyy 10zzzzzz 10wwwwww | xxxyyyyyyzzzzzzwwwwww | U+10000–U+10FFFF
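The rows of the table can be checked directly in Python (the sample characters below are just one pick per byte length):

```python
# Byte counts match the four rows of the table above.
for ch, expected in {"A": 1, "θ": 2, "中": 3, "😀": 4}.items():
    assert len(ch.encode("utf-8")) == expected

# Worked example: U+20AC "€" is 0010 0000 1010 1100 in binary,
# which packs into the 3-byte row: 1110_0010 10_000010 10_101100 = E2 82 AC.
assert "€".encode("utf-8").hex() == "e282ac"
```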

A handy property of UTF-8 is that code points below 128 (ASCII characters) are encoded as single bytes, and all non-ASCII code points are encoded using sequences of bytes in the range 128-255. UTF-8 is very widely used in the Unix/Linux and web worlds, and many argue that it should be the default encoding everywhere. When you measure the length of a string, you need to think about whether you want the length in bytes, the length in code points, the width of the text when rendered, or something else.
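In Python, for instance, the same string gives different answers depending on which length you ask for (the sample string is arbitrary; the ï is written as an escape so it is unambiguously one precomposed code point):

```python
s = "na\u00efve 🎉"   # "naïve 🎉", 7 code points

print(len(s))                            # 7  code points
print(len(s.encode("utf-8")))            # 11 bytes (ï takes 2, 🎉 takes 4)
print(len(s.encode("utf-16-le")) // 2)   # 8  UTF-16 words (🎉 is a surrogate pair)
```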

UTF-16 uses 16-bit words, with each code point stored as either 1 or 2 words. As with UTF-8, we can express the UTF-16 encoding rules in the form of binary prefixes:

UTF-16 (binary)                   | Code point (binary)            | Range
xxxxxxxxxxxxxxxx                  | xxxxxxxxxxxxxxxx               | U+0000–U+FFFF
110110xxxxxxxxxx 110111yyyyyyyyyy | xxxxxxxxxxyyyyyyyyyy + 0x10000 | U+10000–U+10FFFF

A more common way that people talk about UTF-16 encoding, though, is in terms of code points called surrogates. All the code points in the range U+D800-U+DFFF - or in other words, the code points that match binary prefixes 110110 and 110111 in the table above - are reserved specifically for UTF-16 encoding, and don't represent any valid characters on their own. They’re only meant to occur in the 2-word encoding pattern above, which is called a “surrogate pair”. Surrogate code points are illegal in any other context! They’re not allowed in UTF-8 or UTF-32 at all.
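The surrogate-pair arithmetic can be sketched in a few lines of Python (U+1F600 is just a sample code point outside the BMP):

```python
cp = 0x1F600                # 😀, outside the BMP
v = cp - 0x10000            # the 20-bit value to split across the pair
hi = 0xD800 | (v >> 10)     # high (leading) surrogate
lo = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate
assert (hi, lo) == (0xD83D, 0xDE00)

# Python's UTF-16 codec produces the same pair:
assert "\U0001F600".encode("utf-16-be").hex() == "d83dde00"
```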

Today, JavaScript uses UTF-16 as its standard string representation: if you ask for the length of a string, iterate over it, etc., the result will be in UTF-16 words, with any code points outside the BMP expressed as surrogate pairs. UTF-16 is also used by the Microsoft Win32 APIs, though Win32 supports either 8-bit or 16-bit strings.

UTF-16 words can be stored either little-endian or big-endian. Unicode has no opinion on that issue, although it does encourage the convention of putting U+FEFF ZERO WIDTH NO-BREAK SPACE at the top of a UTF-16 file as a byte-order mark, to disambiguate the endianness.
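Python's codecs illustrate the convention: the endianness-agnostic "utf-16" codec prepends a BOM in the machine's native byte order, while the explicit-endianness variants emit none:

```python
data = "Hi".encode("utf-16")   # native byte order, BOM prepended
print(data)                    # e.g. b'\xff\xfeH\x00i\x00' on little-endian machines

# The first two bytes are the BOM (U+FEFF), in one order or the other.
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")

# Explicit-endianness codecs emit no BOM:
assert "Hi".encode("utf-16-le") == b"H\x00i\x00"
```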

A Unicode character can be more complicated than just an individual code point. Unicode includes a system for dynamically composing characters by combining multiple code points together. This is used in various ways to gain flexibility without causing a huge combinatorial explosion in the number of code points. In European languages, this shows up in the application of diacritics to letters. A diacritic can be applied to any letter of any alphabet, and in fact, multiple diacritics can be used on a single letter.

For example, the accented character "Á" can be expressed as a string of two code points: U+0041 "A" LATIN CAPITAL LETTER A plus U+0301 "◌́" COMBINING ACUTE ACCENT. This string automatically gets rendered as a single character: "Á".
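In Python, such a sequence is two code points but renders (in a capable terminal) as one character:

```python
import unicodedata

s = "A\u0301"                 # U+0041 followed by U+0301 COMBINING ACUTE ACCENT
print(s)                      # renders as a single character: Á
assert len(s) == 2            # ...but it is two code points
assert unicodedata.combining("\u0301") != 0   # U+0301 is a combining mark
```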

Other places where character composition shows up in Unicode:

  • Vowel-pointing notation in Arabic and Hebrew. In these languages, words are normally spelled with some of their vowels left out. They then have diacritic notation to indicate the vowels (used in dictionaries, language-teaching materials, children's books, and such). These diacritics are expressed with combining marks: אֶת דַלְתִּי הֵזִיז הֵנִיעַ, קֶטֶב לִשְׁכַּתִּי יָשׁוֹד
  • Devanagari, the script used to write Hindi, Sanskrit, and many other South Asian languages, expresses certain vowels as combining marks attached to consonant letters. For example, “ह” + “​ि” = “हि” (“h” + “i” = “hi”).
  • Korean hangul characters stand for syllables, but they are composed of letters called jamo that stand for the vowels and consonants in the syllable. While there are code points for precomposed Korean syllables, it’s also possible to dynamically compose them by concatenating their jamo. For example, “ᄒ” + “ᅡ” + “ᆫ” = “한” (“h” + “a” + “n” = “han”).

In Unicode, precomposed characters exist alongside the dynamic composition system. A consequence of this is that there are multiple ways to express the same string - different sequences of code points that result in the same user-perceived characters. Unicode refers to a set of strings that can be composed in different ways as canonically equivalent. Canonically equivalent strings are supposed to be treated as identical for purposes of searching, sorting, rendering, text selection, and so on.

To address the problem of how to handle canonically equivalent strings, Unicode defines several normalization forms: ways of converting strings into a canonical form so that they can be compared code point by code point (or byte by byte).

The NFD normalization form fully decomposes every character down to its component base and combining marks, taking apart any precomposed code points in the string. It also sorts the combining marks in each character according to their rendered position.

The NFC form puts things back together into precomposed code points as much as possible. If an unusual combination of diacritics is called for, there may not be any precomposed code point for it, in which case NFC still precomposes what it can and leaves any remaining combining marks in place.
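Python's unicodedata module implements these normalization forms; a quick check using the Á example above (plus a Hangul jamo sequence, which NFC also composes):

```python
import unicodedata

precomposed = "\u00c1"        # Á as a single code point
decomposed = "A\u0301"        # A + combining acute accent

assert precomposed != decomposed   # different code point sequences
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# NFC also composes conjoining Hangul jamo into precomposed syllables:
assert unicodedata.normalize("NFC", "\u1112\u1161\u11ab") == "\ud55c"   # 한
```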

Grapheme cluster: a string of one or more code points that constitutes a single user-perceived character. The main thing grapheme clusters are used for is text editing: they're often the most sensible unit for cursor placement and text selection boundaries. Another place where grapheme clusters are useful is in enforcing a string length limit - say, on a database field.

While the true, underlying limit might be something like the byte length of the string in UTF-8, you wouldn’t want to enforce that by just truncating bytes. At a minimum, you’d want to “round down” to the nearest code point boundary; but even better, round down to the nearest grapheme cluster boundary.
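One way to do the code-point rounding: a minimal Python sketch that truncates UTF-8 at a code point boundary by inspecting the byte prefixes from the UTF-8 table earlier (the function name is my own; it assumes well-formed input and does not handle grapheme cluster boundaries):

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Truncate well-formed UTF-8 to at most max_bytes, at a code point boundary."""
    out = data[:max_bytes]
    # Walk back over continuation bytes (10xxxxxx) to find the last lead byte.
    i = len(out) - 1
    while i >= 0 and (out[i] & 0xC0) == 0x80:
        i -= 1
    if i < 0:
        return out
    lead = out[i]
    if lead < 0x80:
        expected = 1          # single-byte (ASCII) code point
    elif lead >= 0xF0:
        expected = 4
    elif lead >= 0xE0:
        expected = 3
    else:
        expected = 2
    # If the cut left the final sequence incomplete, drop it entirely.
    return out[:i] if len(out) - i < expected else out

assert truncate_utf8("na\u00efve".encode("utf-8"), 4) == "na\u00ef".encode("utf-8")
assert truncate_utf8("na\u00efve".encode("utf-8"), 3) == b"na"   # ï's bytes were split
```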


Further Reading:
