UTF-8 and the question of how Unicode gets into the computer
In Part 1, we answered the question of why Unicode exists and why it is important to have such standards in our globally interconnected world. We also looked at how Unicode maps language-specific features. So far, however, we have only looked at Unicode code points, which are basically just numbers. Today, I'll show how those numbers end up as bytes in a file.
About numbers and bytes
Everything we have considered so far only assigns each character a number: its Unicode code point. But if we imagine a text file, it is organized in single bytes. A byte consists of 8 bits and can therefore represent 256 different values. On their own, however, the bytes do not reveal what they mean. In particular, I can't immediately tell from the bytes whether they contain ASCII-encoded text or ISO-8859-15. In fact, I can't even tell whether it's text at all and not a JPEG of the latest Bundesliga table. Even with pure text data, the information about the encoding used - i.e. the mapping of bytes to characters of a character set - must always be transported in addition. This is the only way to give the bare bits any meaning at all.
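To make this concrete, here is a small Python sketch (my illustration, not part of the original argument) showing that the very same byte stands for completely different characters depending on which encoding we assume:

# The byte 0xF6 on its own carries no meaning - the assumed encoding decides.
data = bytes([0xF6])

print(data.decode("iso-8859-1"))  # 'ö' - Western European (Latin-1)
print(data.decode("iso-8859-7"))  # 'φ' - Greek

# Interpreted as plain ASCII, the byte is simply invalid (ASCII only covers 0-127):
try:
    data.decode("ascii")
except UnicodeDecodeError as error:
    print("not valid ASCII:", error)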
In practice, it is rather rare that an explicit encoding is specified when data is exchanged, because much of this is settled by convention. For example, a text file created on a German Windows 98 will most likely use the encoding Windows-1252 (a superset of ISO-8859-1). Some programs have also become quite good at heuristically guessing which encoding is used.
In the area of networking - which of course includes the internet - it is common practice to build a way of specifying the encoding into the network protocols themselves. In the HTTP protocol, for example, there is the Content-Type header. The specification Content-Type: text/plain; charset=UTF-8 announces a UTF-8 encoded text file. More about this in a later article.
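As a small, hedged sketch of what a client can do with that header (Python standard library, with a placeholder URL used only for illustration):

import urllib.request

# Placeholder URL - only for illustration.
with urllib.request.urlopen("https://example.com/") as response:
    # get_content_charset() extracts the charset parameter from the Content-Type header.
    charset = response.headers.get_content_charset() or "utf-8"  # fall back if nothing is declared
    text = response.read().decode(charset)

print(charset, len(text))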
But what is an encoding anyway?
In the days when ASCII dominated the world, one byte per character was enough, so it was always clear where in a file a character began and where it ended. With Unicode, this is no longer so simple: since there are significantly more than 256 code points, some characters inevitably require more than one byte.
The first idea is not necessarily the best
As an example, let's look at the word "Unicode" and convert it to Unicode code points:
U+0055 U+006E U+0069 U+0063 U+006F U+0064 U+0065
One of the first approaches - and in principle the one Microsoft used from Windows NT onwards - is to store each character in two bytes. This leads to the so-called UCS-2 encoding (UCS = Universal Coded Character Set), which requires 2 bytes per character. In hexadecimal, our example then looks like this:
00 55 00 6E 00 69 00 63 00 6F 00 64 00 65
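For code points below U+10000, UTF-16 in big-endian byte order produces exactly the same bytes as UCS-2, so the sequence above can be reproduced with a short Python sketch (my illustration, not part of the original article):

text = "Unicode"

# The code points, exactly as listed above:
print(" ".join(f"U+{ord(character):04X}" for character in text))
# U+0055 U+006E U+0069 U+0063 U+006F U+0064 U+0065

# Two bytes per character, most significant byte first:
print(text.encode("utf-16-be").hex(" ").upper())
# 00 55 00 6E 00 69 00 63 00 6F 00 64 00 65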
However, one aspect of this is completely arbitrary. The following would work just as well:
55 00 6E 00 69 00 63 00 6F 00 64 00 65 00
The difference is that I swapped every byte pair. Computer scientists call this the byte order or endianness, and it is one of the things in which Intel CPUs differ from, say, PowerPC or ARM processors (yes, I know that some PPC and ARM CPUs have switchable endianness). The first example uses the big-endian format, i.e. each number is stored with its most significant byte first. The second example, on the other hand, uses - you guessed it - the little-endian format and thus exactly the convention Intel chose.
So, depending on whether a UCS-2 file was created on a little-endian system (such as Windows NT on an Intel processor) or on a big-endian system, either byte order could end up in the file by default.
Not so good if you don't keep this in mind but want to exchange data. To make the endianness recognizable, a marker can optionally be placed at the beginning of the file: the so-called Unicode BOM (Byte Order Mark). It consists of either FE FF (big-endian) or FF FE (little-endian), so the first two bytes already tell you how the rest of the file is encoded. To be fair, it must be said that in practice quite a few programs had problems with the two extra bytes at the beginning - PHP being among them in the past, for example.
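A decoder can use those two bytes to pick the right byte order. Here is a minimal Python sketch (the function name detect_utf16_byte_order is made up for illustration; real programs usually also handle the UTF-8 BOM and other cases):

def detect_utf16_byte_order(data: bytes) -> str:
    """Pick a byte order from a leading BOM; this sketch simply assumes big-endian without one."""
    if data[:2] == b"\xFE\xFF":
        return "utf-16-be"
    if data[:2] == b"\xFF\xFE":
        return "utf-16-le"
    return "utf-16-be"

data = "Unicode".encode("utf-16")           # Python prepends a BOM in native byte order
encoding = detect_utf16_byte_order(data)
print(encoding, data[2:].decode(encoding))  # skip the two BOM bytes before decoding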
The UCS-2 encoding has two significant disadvantages. First, it can only map 65,536 characters - more is simply not possible with two bytes. That covers only about six percent of the Unicode code space.
But the other disadvantage is much worse: each character occupies 2 bytes. In the days of Windows NT, when not every cell phone had 128 GB or more of storage - "640 K", we remember 😉 - this meant that every file became twice as big as an ASCII file with the same content. Especially in the USA, which had gotten along fine with ASCII until then, people questioned the rationale behind such a baroque use of precious storage space.
Can it be done more efficiently?
The first problem - being able to represent the entire Unicode code space - was initially approached rather crudely: by simply using four bytes instead of two. This is the UTF-32 encoding (UTF = UCS Transformation Format), which encodes each character in 32 bits (= 4 bytes). It does cover the entire set of code points. But blowing up every ASCII file to four times its size just to be able to use one or two kanji was something practically nobody was prepared to do, and so this idea quickly disappeared from the scene again.
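The blow-up is easy to see in a short sketch (my illustration):

text = "Unicode"                      # 7 ASCII characters
print(len(text.encode("ascii")))      # 7 bytes in ASCII
print(len(text.encode("utf-32-be")))  # 28 bytes in UTF-32: four bytes per character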
The breakthrough came with UTF-8. The first thing UTF-8 defines is that it is a variable-length encoding, a so-called multibyte character set. (At this point, a salute to the PHP developers: yes, that is exactly where the name of the mbstring extension comes from.) In UTF-8, characters are between one and (currently) four bytes long, although the scheme itself could be extended to longer sequences - early specifications allowed up to six bytes. For multi-byte characters, the first byte defines the length of the sequence by the number of binary ones it starts with, and all following bytes have the form 10xxxxxx.
Characters that are exactly one byte long always start with a binary 0 and therefore have a value of 0-127. Remember that the first 128 Unicode code points were defined to be identical to ASCII, so the binary representations also agree 1:1. Every (7-bit) ASCII file can therefore be interpreted as UTF-8 encoded; UTF-8 is backward compatible with ASCII. This aspect in particular certainly helped UTF-8 on its triumphant march.
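To make the bit fiddling concrete, here is a minimal Python sketch of the scheme just described, limited to the four-byte sequences that are valid today (the helper name utf8_encode is made up for illustration; this is not a reference implementation):

def utf8_encode(codepoint: int) -> bytes:
    """Encode a single code point (up to U+10FFFF) using the UTF-8 bit patterns."""
    if codepoint < 0x80:                 # 0xxxxxxx - identical to ASCII
        return bytes([codepoint])
    if codepoint < 0x800:                # 110xxxxx 10xxxxxx
        return bytes([0xC0 | codepoint >> 6,
                      0x80 | codepoint & 0x3F])
    if codepoint < 0x10000:              # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | codepoint >> 12,
                      0x80 | codepoint >> 6 & 0x3F,
                      0x80 | codepoint & 0x3F])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | codepoint >> 18,
                  0x80 | codepoint >> 12 & 0x3F,
                  0x80 | codepoint >> 6 & 0x3F,
                  0x80 | codepoint & 0x3F])

print(utf8_encode(ord("U")).hex())   # 55       - a plain ASCII letter stays one byte
print(utf8_encode(0x1F937).hex())    # f09fa4b7 - the four-byte example in the next section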
UTF-8 in practice, ...
Let's take another look at the character 🤷 (U+1F937) and how it looks encoded in UTF-8:
11110000 10011111 10100100 10110111
We'll assume in each case that a file was created containing only the bytes shown - nothing comes before or after the displayed byte sequence.
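The same bytes can be reproduced with Python's built-in codec (a quick sanity check of mine, not from the original article):

shrug = "\U0001F937"                                # 🤷, U+1F937
encoded = shrug.encode("utf-8")

print(encoded.hex(" ").upper())                     # F0 9F A4 B7
print(" ".join(f"{byte:08b}" for byte in encoded))  # 11110000 10011111 10100100 10110111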
It is easy to see that the first byte starts with four ones, which means the character is encoded in four bytes. Just as claimed above, all further bytes start with 10. So a UTF-8 compatible program always knows where a character starts - a crucial property. If I were - out of pure malice, say - to simply swap the first two bytes, ...
10011111 11110000 10100100 10110111
... then it is immediately clear that the result is no longer valid UTF-8. Not only does the file now start with a byte that cannot be the first byte of a UTF-8 character, but right after it comes the start byte of a four-byte character with only three bytes left in the file.
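A strict decoder indeed refuses such input, as this small Python sketch shows:

broken = bytes([0b10011111, 0b11110000, 0b10100100, 0b10110111])

try:
    broken.decode("utf-8")              # strict error handling is the default
except UnicodeDecodeError as error:
    print(error)                        # complains about an invalid start byte at position 0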
... but what if things don't go as expected?
So it is immediately obvious that something is wrong here, and a program could simply abort and display an error message. Instead, most programs try to make the best of the bytes and keep reading until the file continues with valid byte sequences again. This is usually the more sensible approach, because the error may have external causes: a scratch on a Blu-ray, a flipped bit on the network link (which happens more often than you might think), or whatever.
To signal that a decoding error occurred, Unicode defines the so-called replacement character, U+FFFD (�) - probably best known from the pun on the well-known "Schei�-Encoding" t-shirts ("Scheiß-Encoding" is German for "shitty encoding"). It is displayed as a substitute whenever a program encounters a byte sequence that must not occur in the current Unicode encoding - whether UTF-8 or one of the less common ones.
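In Python, this lenient behaviour corresponds to the "replace" error handler, which substitutes U+FFFD for every invalid sequence (again just a sketch of mine):

broken = bytes([0b10011111, 0b11110000, 0b10100100, 0b10110111])

# Instead of raising an error, invalid sequences come out as U+FFFD (�):
print(broken.decode("utf-8", errors="replace"))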
In the next blog post on Unicode, we'll look at web pages and the use of Unicode encodings in web applications. Above all, we'll take a look at the dubious workarounds that database manufacturers have come up with in the meantime.
Do you always start every HTML page with <meta charset="UTF-8">? Then apply to join us!
Please feel free to share this article.