Character Encoding

Character encoding is an important topic in computer programming, so I am going to try to learn more about it.

Date Created:


Definitions


  • Variable Length Encodings
    • In coding theory, a variable-length code is a code that maps source symbols to a variable number of bits. The equivalent concept in computer science is a bit string.
    • Variable-length codes can allow sources to be compressed and decompressed with zero error (lossless data compression) and still be read back symbol by symbol. With the right coding strategy an independent and identically-distributed source may be compressed almost arbitrarily close to its entropy.
  • Code Points
    • A code point is a particular position in a table, where the position has been assigned a meaning. The table may be one-dimensional (a column), two-dimensional (like cells in a spreadsheet), three-dimensional (sheets in a workbook), and so on, in any number of dimensions.
  • Diacritic
    • A diacritic is a glyph added to a letter or to a basic glyph.

  • Glyph
    • A glyph is any kind of purposeful mark. In typography, a glyph is the special shape, design, or representation of a character.
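
The relationship between code points, diacritics, and glyphs can be illustrated with a small sketch. It assumes Python's standard `unicodedata` module and uses the letter é as an arbitrary example; normalization (NFC/NFD) is not covered in these notes and is brought in only to show that two different code point sequences can stand for the same glyph.

```python
import unicodedata

# One glyph, two possible code point sequences.
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # base letter "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but are not equal code point for code point.
same_code_points = precomposed == combining   # False

# Normalization converts between the two forms.
composed = unicodedata.normalize("NFC", combining)      # compose into one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # split into base + diacritic

for ch in combining:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Comparing text after normalizing both sides to the same form is one common way to avoid treating these two spellings as different strings.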


Notes


Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that make up a character encoding are known as code points and collectively compose a code space, a code page, or a character map.
  • The low cost of digital representation of data in modern computer systems allows more elaborate character codes, which can represent most of the characters used in many written languages. Character encoding using internationally accepted standards permits worldwide interchange of text in electronic form.
  • The most used character encoding on the web is UTF-8, used in 98.2% of surveyed web sites, as of May 2024. In application programs and operating system tasks, both UTF-8 and UTF-16 are popular options.


History

  • The history of character codes illustrates the evolving need for machine-mediated character-based symbolic information over a distance, using once-novel electrical means.
  • The earliest well-known electrically transmitted character code, Morse code, introduced in the 1840s, used a system of four symbols (short signal, long signal, short space, long space) to generate codes of variable length.
  • Most codes are of fixed per-character length or are variable-length sequences of fixed-length codes (e.g., Unicode).
  • Common examples of character encoding systems:
    • Morse code
    • Baudot code
    • American Standard Code for Information Interchange (ASCII)
      • First developed in 1963
      • Addressed the shortcomings of the US military-created Fieldata code
    • Unicode
      • Unicode, a well-defined and extensible encoding system, has supplanted most earlier character encodings, but the path of code development to the present is fairly well known.
  • In trying to develop universally interchangeable character encodings, researchers in the 1980s faced a dilemma: on the one hand, it seemed necessary to add more bits to accommodate additional characters; on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources.
  • The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character would always directly correspond to a particular sequence of bits. Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points. Code points would then be represented in a variety of ways and with various default numbers of bits per character (code units) depending on context. To encode code points too large to fit in a single code unit, the solution was to implement variable-length encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point.
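
As a rough illustration of the variable-length idea, here is a minimal hand-written UTF-8 encoder. The function name `utf8_encode` is made up for this sketch; in practice Python's built-in `str.encode("utf-8")` does the same job. In UTF-8 the leading bits of the first byte play the signalling role: they announce how many continuation bytes follow.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against the built-in encoder for a few sample code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```

Latin letters stay at one byte each, which is how this design avoids wasting space for users of small character sets.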


Terminology

  • A character is a minimal unit of text that has semantic value
    • Exactly what constitutes a character varies between character encodings.
    • For letters with diacritics, there are two distinct approaches that can be taken to encode them: they can be encoded either as a single unified character (known as precomposed character), or as separate characters that combine into a single glyph.
  • A character set is a collection of elements used to represent text. For example, the Latin alphabet and Greek alphabet are both character sets.
  • A coded character set is a character set mapped to a set of unique numbers. For historical reasons, this is often referred to as a code page.
  • A character repertoire is the set of characters that can be represented by a particular coded character set. The repertoire may be closed, meaning that no additions are allowed without creating a new standard, or it may be open, allowing additions.
  • A code point is a value or position of a character in a coded character set.
    • A code point is represented by a sequence of code units. The mapping is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding:
      • UTF-8: code points map to a sequence of one, two, three or four code units
      • UTF-16: code units are twice as long as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value of U+10000 or higher require two code units each.
      • UTF-32: the 32-bit code unit is large enough that every code point is represented as a single code unit
  • A code space is the range of numerical values spanned by a coded character set.
  • A code unit is the minimum bit combination that can represent a character in a character encoding.
    • A code unit in ASCII consists of 7 bits
    • A code unit in UTF-8, EBCDIC and GB 18030 consists of 8 bits
    • A code unit in UTF-16 consists of 16 bits
    • A code unit in UTF-32 consists of 32 bits
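
The code unit counts listed above can be checked with a short sketch. The helper name `code_unit_counts` and the sample characters are assumptions chosen for illustration; the counts are derived by encoding with Python's built-in codecs and dividing the byte length by the code unit size.

```python
def code_unit_counts(ch: str) -> dict:
    """Return how many code units one character needs in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, and an emoji above U+FFFF
for ch in samples:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

The explicit `-le` codec names are used so that no byte order mark is added to the output, which would otherwise inflate the counts.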


Unicode Encoding Model

Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a unified standard for character encoding. Rather than mapping characters directly to bytes, Unicode separately defines a coded character set that maps characters to unique natural numbers (code points), how those code points are mapped to a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets (bytes).
  • The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways.
  • Unicode uses its own set of terminology to describe its process:
    • An abstract character repertoire (ACR) is the full set of abstract characters that a system supports.
    • A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, the capital A in the Latin alphabet may be represented by code point 65.
    • A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system).
    • A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, ...; compound character encoding schemes, such as UTF-16, UTF-32, ..., switch between several simple schemes by using a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit.
  • Unicode Code Points
    • In Unicode, a character can be referred to as U+ followed by its codepoint value in hexadecimal.
    • The range of valid code points (the code space) for the Unicode standard is U+0000 to U+10FFFF inclusive, divided into 17 planes, identified by the numbers 0 to 16.
      • In the Unicode standard, a plane is a contiguous group of 65,536 (2^16) code points.
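
The layers of the model can be traced for a single character. This is only a sketch: the function name `encoding_layers` and the dictionary keys are invented for illustration, and UTF-16 is picked as the encoding form because its big-endian and little-endian schemes make the code unit to octet step visible.

```python
import struct


def encoding_layers(ch: str) -> dict:
    """Walk one character through code point, code units, and octets."""
    code_point = ord(ch)                 # CCS: character -> code point
    raw = ch.encode("utf-16-be")
    # CEF: view the octets as 16-bit code units (big-endian unsigned shorts).
    code_units = list(struct.unpack(f">{len(raw) // 2}H", raw))
    return {
        "code_point": code_point,
        "label": f"U+{code_point:04X}",  # U+ notation, hexadecimal
        "plane": code_point >> 16,       # each plane spans 0x10000 code points
        "code_units": code_units,
        "utf16be": raw,                      # CES: code units -> octets, big-endian
        "utf16le": ch.encode("utf-16-le"),   # CES: same units, little-endian
    }


letter = encoding_layers("A")
emoji = encoding_layers("\U0001F600")
print(letter["label"], letter["plane"], letter["code_units"])
print(emoji["label"], emoji["plane"], [hex(u) for u in emoji["code_units"]])
```

The same code units produce different octet streams under UTF-16BE and UTF-16LE, which is the distinction the encoding scheme layer exists to capture.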


Transcoding

  • As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between character encoding schemes, a process known as transcoding.
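
A minimal transcoding sketch, assuming the source encoding is known in advance (here Latin-1, chosen arbitrarily). The helper name `transcode` is hypothetical; it simply decodes the bytes to code points and re-encodes them in the target scheme.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding, then re-encode them in another."""
    return data.decode(source).encode(target)


latin1_bytes = "café".encode("latin-1")                 # é is the single byte 0xE9
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")  # é becomes 0xC3 0xA9

print(latin1_bytes, utf8_bytes)
```

Transcoding in the other direction can fail: `str.encode` raises `UnicodeEncodeError` when the target encoding has no representation for a character, unless an `errors` handler is supplied.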
