Character Encoding
Character encoding is an important topic in computer programming, so I am going to try to learn more about it.
Definitions
- Variable Length Encodings
- In coding theory, a variable-length code is a code which maps source symbols to a variable number of bits; in computer-science terms, each codeword is a bit string of varying length.
- Variable-length codes can allow sources to be compressed and decompressed with zero error (lossless data compression) and still be read back symbol by symbol. With the right coding strategy, an independent and identically distributed source may be compressed almost arbitrarily close to its entropy. A toy prefix code illustrating this appears after these definitions.
- Code Points
- A code point is a particular position in a table, where the position has been assigned a meaning. The table may be one dimensional (a column), two dimensional (like cells in a spreadsheet), three dimensional (sheets in a workbook), etc... in any number of dimensions.
- Diacritic
- A diacritic is a glyph added to a letter or to a basic glyph.
- Glyph
- A glyph is any kind of purposeful mark. In typography, a glyph is the special shape, design, or representation of a character.
Notes
Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as code points and collectively compose a code space, a code page, or a character map. (A quick sketch after these notes shows the character-to-number mapping in practice.)
- The low cost of digitally representing data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages. Character encoding using internationally accepted standards permits worldwide interchange of text in electronic form.
- The most used character encoding on the web is UTF-8, used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options.
History
- The history of character codes illustrates the evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means.
- The earliest well-known electrically transmitted character code, Morse code, introduced in the 1840s, used a system of four symbols (short signal, long signal, short space, long space) to generate codes of variable length.
- Most codes are of fixed per-character length or variable-length sequences of fixed-length codes (e.g., Unicode).
- Common examples of character encoding systems:
- Morse code
- Baudot code
- American Standard Code for Information Interchange (ASCII)
- First developed in 1963
- Addressed the shortcomings of the US military-created Fieldata code
- Unicode
- Unicode, a well-defined and extensible encoding system, has supplanted most earlier character encodings, but the path of code development to the present is fairly well known.
- In trying to develop universally interchangeable character encodings, researchers in the 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources.
- The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character would always directly correspond to a particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points. Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points higher than the length of the code unit, the solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point.
Terminology
- A character is a minimal unit of text that has semantic value
- Exactly what constitutes a character varies between character encodings.
- For letters with diacritics, there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as a precomposed character), or as separate characters that combine into a single glyph (see the normalization sketch after this list).
- A character set is a collection of elements used to represent text. For example, the Latin alphabet and Greek alphabet are both character sets.
- A coded character set is a character set mapped to a set of unique numbers. For historical reasons, this is often referred to as a code page.
- A character repertoire is the set of characters that can be represented by a particular coded character set. The repertoire may be closed, meaning that no additions are allowed without creating a new standard; or it may be open, allowing additions.
- A code point is a value or position of a character in a coded character set.
- A code point is represented by a sequence of code units. The mapping is defined by the encoding, so the number of code units required to represent a code point depends on the encoding (see the code-unit sketch after this list):
- UTF-8: code points map to a sequence of one, two, three or four code units
- UTF-16: code units are twice as long as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value of U+10000 or higher require two code units each.
- UTF-32: the 32-bit code unit is large enough that every code point is represented as a single code unit
- A code space is the range of numerical values spanned by a coded character set.
- A code unit is the minimum bit combination that can represent a character in a character encoding.
- A code unit in ASCII consists of 7 bits
- A code unit in UTF-8, EBCDIC and GB 18030 consists of 8 bits
- A code unit in UTF-16 consists of 16 bits
- A code unit in UTF-32 consists of 32 bits
Unicode Encoding Model
Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes, Unicode separately defines a coded character set that maps characters to unique natural numbers (code points), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets (bytes).
- The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways.
- Unicode uses its own set of terminology to describe its process:
- An abstract character repertoire (ACR) is the full set of abstract characters that a system supports.
- A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, the capital A in the Latin alphabet may be represented by code point 65.
- A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e., practically any computer system).
- A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes (such as UTF-8 and UTF-16BE) serialize code units directly; compound character encoding schemes (such as UTF-16 and UTF-32) switch between several simple schemes by using a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit. (A sketch at the end of this section traces one character through these layers.)
- Unicode Code Points
- In Unicode, a character can be referred to as U+ followed by its code point value in hexadecimal.
- The range of valid code points (the code space) for the Unicode standard is U+0000 to U+10FFFF inclusive, divided into 17 planes, identified by the numbers 0 to 16.
- In the Unicode standard, a plane is a contiguous group of code points.
Transcoding
- As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between character encoding schemes, a process known as transcoding.