Unicode, ISO-8859-1 and even more character salad
Today we want to look at the topic of content from a completely different perspective, namely its presentation. No, no, I don't mean which layout is used or which color scheme is chosen. I'm talking about writing, i.e. the legible representation of text - and, of course, all the pitfalls that lurk here in IT and especially on the web.
Off to the good ol' days
Let's look back at the year 1993: non-Latin scripts were still hardly common in our latitudes. IT (at least in Germany) was still called EDV (Elektronische Datenverarbeitung, i.e. electronic data processing), employees stared at 80-character-wide text screens (whether they showed light gray, green or amber text on a black background was up to you), and the concerns of the German data processing guild often ended at the Oder-Neisse border (translator's note: this is the border with Poland; German IT specialists often cared only about their own problems and paid little attention to foreign languages). With ASCII you could easily get through every situation - 7 or 8 bits per character, depending on the definition - and 640K ought to be enough for everyone anyway.
Desktop publishing (DTP) was already a thing, but it wasn't until the widespread introduction of graphical operating systems (Windows 3.11 sends its regards) that this technology became suitable for the masses. Now every school newspaper was designed on a PC and TrueType conquered the world. The western world spoke Latin-1 (ISO-8859-1 to be exact) throughout and everyone was happy. Until, at the turn of the millennium, the euro and its € were introduced and all of Europe had to deal with something like character sets.
Suddenly it was important that the corporate typeface was "€-capable", and the first hacks were hastily cobbled together. Who knows for how long afterwards an "=" was diligently printed over a "C" because the systems and fonts in use did not yet know the character?
Unicode to the rescue!
After a single new character had already turned the digital scene upside down, it was realized that a more sustainable solution was needed. Since it quickly became clear that the 256 combinations offered by 1-byte characters would never suffice to represent all the writing systems of the world, it was decided to go all out right away. With Unicode, a standard was created that currently allows for 1,114,112 theoretically possible code points, of which roughly 143,000 are assigned so far. Unicode defines so-called code points; put simply, it takes all known characters and simply numbers them. Each character gets a number, and you can look up every character in the big list, e.g. no. 129335 for 🤷. For reasons of backwards compatibility, the first 128 entries are defined to be identical to ASCII.
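What "numbering characters" means in practice can be seen in a small Python sketch (the language choice is mine, not the article's): `ord()` and `chr()` translate between a character and its code point.

```python
# Looking up a character's code point and vice versa.
shrug = "🤷"

print(ord(shrug))        # 129335 - the number from the "big list"
print(hex(ord(shrug)))   # 0x1f937 - usually written as U+1F937
print(chr(129335))       # 🤷 - and back again

# The first 128 code points are identical to ASCII:
print(ord("A"))          # 65, exactly as in the ASCII table
```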
Lesson learned: as IPv4 has already shown us, things always get exciting in IT when someone says: "This will do." 🙂
At this point, this story could be over. However, there are still two problems.
Text and writing are not so simple, ...
Let's start with the more complicated one: to do so, we must briefly take off our Germanic glasses and ask ourselves: what exactly is a "character"?
The first answer is most likely: well, probably a letter. But one look west of Alsace and the answer becomes more complicated. In addition to the "e", the French also know the "é" (with accent aigu) and the "è" (with accent grave). Everyone knows that on a German keyboard you type these characters in two steps - first the accent and then the "e". Unicode can map this in exactly the same way: there is a code point for the "e" and one for each of the accents, which belong to the so-called diacritical marks.
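To illustrate this two-step idea in code (again a Python sketch of my own, not from the original article): a base letter followed by a combining accent is a perfectly valid way to spell "é".

```python
# An "é" built the same way it is typed: base letter plus accent.
# U+0301 is the COMBINING ACUTE ACCENT, one of the diacritical marks.
e_two_steps = "e" + "\u0301"

print(e_two_steps)       # é - rendered as a single glyph
print(len(e_two_steps))  # 2 - but it consists of two code points
```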
The slyboots might now say: "Wait, just a moment ago we were told that ASCII characters can still be used and an 'é' certainly already existed in DOS times".
That is true as well. In fact, many characters appear in Unicode several times and in several combinations - the "é" from the example above, for instance, exists both as a complete package and as individual parts. One reason for this is that in Asian scripts it is common to combine characters in many different ways. It is therefore worthwhile to represent diacritical marks separately, so that fewer combinations and thus fewer code points have to be defined. The print experts among you may forgive me for not going into ligatures and the like, which also exist as separate Unicode code points - but that would really lead too far here.
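This duplication can be made visible with Unicode normalization. The following Python sketch (my illustration, not part of the original text) puts the precomposed "é" (U+00E9) and the decomposed variant side by side:

```python
import unicodedata

precomposed = "\u00e9"   # é as a "complete package": LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\u0301"  # é in "individual parts": e + COMBINING ACUTE ACCENT

print(precomposed, decomposed)    # both render as é
print(precomposed == decomposed)  # False - different code point sequences!

# Normalization converts between the two representations:
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```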
… but would you have thought they were so complex?
Another example of how complex all this is: if two Hindi speakers were to greet each other in a WhatsApp chat, they might do so with the phrase नमस्ते - better known to us in its Latinized form as "namaste". Now the prize question: how many characters is that?
Hmm. If we select the little word bit by bit with the mouse, we may notice, as I do (Chrome on macOS), that the cursor "clicks" forward three times; so maybe three characters? Unfortunately wrong. The correct answer is, as so often: it depends 🙂
Namely, on whether "characters" refers to the individual Unicode code points (the indivisible units - the atoms of the script, so to speak): then the answer would be six. Or whether one would rather count so-called Unicode grapheme clusters. In Unicode terms, a grapheme cluster is not a tabloid newspaper in the waste paper container, but a base character plus all the diacritical marks that complement it. In the end, this comes quite close to the familiar notion of a "letter". With this approach, we arrive at four such "clusters": "न", "म", "स्" and "ते". So how do you arrive at three? The unsatisfying truth: ideally not at all, because three is simply wrong. Unicode support beyond Western languages is often still incomplete and buggy all over the place. Hordes of developers have been driven to despair by the fact that strings like the one above consist of four characters (always think of the clusters when you say "characters"!), which are composed of six Unicode code points and then occupy 18 bytes when encoded in UTF-8.
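The different counts can be reproduced in Python (again my sketch; the third-party regex module is an assumption on my part, it is not mentioned in the article - it supports grapheme cluster segmentation via the \X pattern):

```python
import unicodedata
# pip install regex   (third-party module; supports \X for grapheme clusters)
import regex

namaste = "नमस्ते"

# Six code points ...
print(len(namaste))                  # 6
for cp in namaste:
    print(f"U+{ord(cp):04X}", unicodedata.name(cp))

# ... which occupy 18 bytes in UTF-8 (3 bytes per Devanagari code point) ...
print(len(namaste.encode("utf-8")))  # 18

# ... and form noticeably fewer grapheme clusters. The exact count depends on
# which Unicode segmentation rules the installed regex version implements.
print(regex.findall(r"\X", namaste))
```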
In the next blog post of this series, I'll bring the computer into play in the context of Unicode and tell you why every Unicode-plagued developer sooner or later needs a Schei�-Encoding T-shirt (translator's note: shitty encoding) 😉
Do you wear such a T-shirt out of conviction? Then apply to join us!