Note: This article may not display properly if your browser has no font installed with some special Unicode characters.
Unicode is a beautiful thing. Where in the past, we had to be content with text sequences of merely 128 different characters, we can now use over 110,000 characters from almost every script in use today or throughout history. But there’s a problem: despite the fact that Unicode has been around and relatively unchanged for the past 16 years, most modern programming languages do not give programmers a proper abstraction over Unicode strings, which means that we must carefully deal with a range of encoding issues or suffer a plethora of bugs. And there are still languages coming out today that fall into this trap. This blog post is a call to language designers: your fundamental string data type should abstract over Unicode encoding schemes, such that your programmer sees only a sequence of Unicode characters, thus preventing most Unicode bugs.
Over the past few years, I have encountered a surprising amount of resistance to this idea, with counter-arguments including performance reasons, compatibility reasons and issues of low-level control. In this post, I will argue that having a consistent, abstract Unicode string type is more important, and that programmers can be given more control in the special cases that require it.
To give a quick example of why this is a problem, consider the Hello World program on the front page of the Go programming language website. This demo celebrates Go’s Unicode support by printing “Hello, 世界”, which uses the Chinese word for “world”, “世界”.
But while Go ostensibly has Unicode strings, they are actually just byte strings, with a strong mandate that they be treated as UTF-8-encoded Unicode text. The difference is apparent if we wrap the string in the built-in len function and print len(“Hello, 世界”) instead.
This code, perhaps surprisingly, prints 13, where one might have expected it to print 9. This is because while there are 9 characters in the string, there are 13 bytes if the string is encoded as UTF-8: in Go, the string literal “Hello, 世界” is equivalent to “Hello, \xe4\xb8\x96\xe7\x95\x8c”. Similarly, indexing into the string uses byte offsets, and retrieves single bytes instead of whole characters. This is a dangerous leaky abstraction: the string appears to be a simple sequence of characters until certain operations are performed. Worse, English-speaking programmers have a nasty habit of testing only with ASCII input, which means that any bug in which the programmer assumes they are dealing with characters (but is actually dealing with bytes) will go undetected.
In a language that properly abstracts over Unicode strings, the above string should have a length of 9. Programmers should only be exposed to encoding details when they explicitly ask for them.
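For comparison, here is a minimal sketch of the same experiment in Python 3 (assuming version 3.3 or later), where the default string type is a sequence of code points and bytes only appear after an explicit call to encode. The variable names are illustrative only.

```python
# The greeting from the Go example, as a Python 3 str (a sequence of code points).
greeting = "Hello, 世界"

# Length in characters (code points): what an abstract Unicode string reports.
char_count = len(greeting)

# Byte counts only appear once we explicitly ask for an encoding.
utf8_bytes = greeting.encode("utf-8")
byte_count = len(utf8_bytes)

print(char_count)         # 9
print(byte_count)         # 13
print(greeting[7])        # 世 (indexing yields a whole character)
print(utf8_bytes[7:10])   # b'\xe4\xb8\x96' (the three bytes that encode 世)
```

The programmer sees 13 only after asking for UTF-8, which is the behaviour this post argues for.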
The folly of leaky abstraction
The fundamental problem lies in the difference between characters and code units. Characters are the abstract entities that correspond to code points. Code units are the pieces that represent a character in the underlying encoding, and they vary depending on the encoding. For example, consider the character OSMANYA LETTER BA or ‘𐒁’ (with code point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we would say this character is represented by four code units in UTF-8. In UTF-16, the same character is represented by two code units: D801 DC81; code units are 16-bit numbers. The problem with many programming languages is that their string length, indexing and, often, iteration operations are in code units rather than in characters (or code points).
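The code units quoted above can be reproduced with a short Python 3 sketch. The helper code_units is a hypothetical name introduced for illustration, and it handles only the two encodings discussed here.

```python
def code_units(text, encoding):
    """Return the code units of text as a list of ints (illustrative helper)."""
    data = text.encode(encoding)
    if encoding == "utf-8":
        return list(data)  # UTF-8 code units are single bytes
    if encoding == "utf-16-be":
        # UTF-16 code units are 16-bit numbers; big-endian avoids a byte order mark
        return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    raise ValueError("unsupported encoding: " + encoding)

ba = "\U00010481"  # OSMANYA LETTER BA
utf8_units = code_units(ba, "utf-8")
utf16_units = code_units(ba, "utf-16-be")

print([format(u, "02X") for u in utf8_units])   # ['F0', '90', '92', '81']
print([format(u, "04X") for u in utf16_units])  # ['D801', 'DC81']
print(len(ba))                                  # 1 (one character, one code point)
```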
(An aside: I find it interesting that the Unicode Standard itself defines a “Unicode string” as “an ordered sequence of code units.” That’s code units, not code points, suggesting that the standard itself disagrees with my thesis that a string should be thought of as merely a sequence of code points. This section of the standard seems concerned with how the strings should be represented, not the programming interface for accessing such strings, so I feel that it doesn’t directly contradict this post. This post is about the string interface, not the underlying representation.)
There are three main types of culprit languages that exhibit this issue:
- Languages with byte-oriented string operations and no specified encoding. These leave Unicode support up to programmers and third-party library authors, resulting in a general mish-mash of encoding across different code bases. These include C, Python 2, Ruby, PHP, Perl, Lua, and almost all pre-Unicode (1992) programming languages.
- Languages with code-unit-oriented string operations and no specified encoding. These are probably the worst offenders, as the language manages the encoding, but leaks abstraction details that the programmer has no control over (for example, different builds of the compiler may have different underlying encoding schemes, affecting the behaviour of programs). These include certain builds of Python 3.2 and earlier, and Mercury.
- Languages with code-unit-oriented string operations and a specified encoding. These behave consistently across builds, but their length and indexing operations still count code units rather than characters. These include Go (UTF-8), and Java and C# (UTF-16).
Edit: A number of comments indicate that some of the other languages have optional Unicode support. In Perl 5, you can write “use utf8” to turn on proper Unicode strings. In Ruby 1.9, you can attach encodings to strings to make them behave better. In Python 2, as I’ll get to later, you have a separate Unicode string type.
I must stress how harmful the second class of languages is in terms of string encoding. In 2011, the Mercury language switched from a type 1 language (strings are just byte sequences) to a type 2 language: the language now specifies all of the basic string operations in terms of code units, but does not specify the encoding scheme. Worse, Mercury has several back-ends: if compiling to C or Erlang, it uses UTF-8, whereas compiling to Java or C# results in UTF-16-encoded strings. Now programmers must contend with the fact that string.length(“Hello, 世界”) might be 13 or 9 depending on the back-end. (Fortunately, character-oriented alternatives, such as count_codepoints, are provided.) This is a major blow to code portability, which I would recommend language designers avoid at all costs.
Python too has traditionally had this problem: the interpreter can be compiled in “UCS-2” or “UCS-4” modes, which represent strings as UTF-16 or UTF-32 respectively, and expose those details to the programmer (the UCS-2 build behaves much like the other UTF-16 languages, while the UCS-4 build behaves exactly as I want, with one character per code unit). Fortunately, Python is about to correct this little wart entirely with the introduction of PEP 393 in the upcoming version 3.3. This version will remove the distinction between UCS-2 and UCS-4 builds, so all future versions will behave as the UCS-4 build did, but with a nice optimisation: all strings with only Latin-1 characters are encoded in Latin-1; all strings with only basic multilingual plane (BMP) characters are encoded in UCS-2 (UTF-16); all strings with astral characters are encoded in UTF-32. (It’s more complicated than that, but that’s the gist of it.) This ensures the correct semantics, but allows for compact string representation in the overwhelmingly common case of BMP-only strings.
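The storage scheme described above can be observed indirectly. The following is a rough sketch that assumes CPython 3.3 or later; sys.getsizeof reports an implementation detail, so other Python implementations may give different numbers. The helper bytes_per_character is a hypothetical name, and it measures how much a string grows per added character rather than relying on any particular header size.

```python
import sys

def bytes_per_character(ch, extra=1000):
    """Estimate per-character storage for strings made only of ch (CPython-specific)."""
    short = ch * 10
    long = ch * (10 + extra)
    # The fixed object header cancels out, leaving only the per-character cost.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // extra

# ASCII, Latin-1, BMP, and astral sample characters.
samples = ["a", "é", "世", "\U00010481"]
widths = [bytes_per_character(ch) for ch in samples]

print(widths)  # expected on CPython 3.3+: [1, 1, 2, 4]
print([len(ch * 10) for ch in samples])  # [10, 10, 10, 10] regardless of storage width
```

Whatever the internal width, len and indexing report code points, so the optimisation never leaks into the string interface.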
Somewhat surprisingly, Bash (1989) appears to support character-oriented Unicode strings. The only other languages I have found which properly abstract over Unicode strings are from the functional world: Haskell 98 (1998) and Scheme R6RS (2007). Haskell 98 specifies that “the character type Char is an enumeration whose values represent Unicode characters” (with a link to the Unicode 5.0 specification). Scheme R6RS was the first version of Scheme to mention Unicode, and got it right on the first go. It specifies that “characters are objects that represent Unicode scalar values,” and goes into details explicitly stating that a character value is in the range [0, D7FF₁₆] ∪ [E000₁₆, 10FFFF₁₆], then defines strings as sequences of characters.
With only four languages that I know of properly providing character-oriented string operations by default (are there any more?), this is a pretty poor track record. Sadly, many modern languages are being built on top of either the JVM or .NET framework, and so naturally absorb the poor character handling of those platforms. Still, it would be nice to see some more languages that behave correctly.
Some real-world problems
So far, the discussion has been fairly academic. What are some of the actual problems that have come about as a result of the leaky abstraction?
Edit: I mention a heap of technologies here, in a way that could be interpreted to be derisive. I don’t mean any offense towards the creators of these technologies, but rather, I intend to point out how difficult it can be to work with Unicode on a language that doesn’t provide the right abstractions.
An obvious example of an operation that goes wrong in these situations is the length operator. If you use UTF-8, Chinese characters will each report a length of 3, while in UTF-16, astral characters will each report a length of 2. Think this doesn’t matter? What about Twitter? On Twitter, you have to fit messages into 140 characters. That’s 140 characters, not 140 bytes. Imagine if Twitter ate up three “characters” every time you hit a key (if you spoke Chinese, for example). Fortunately, Twitter does it correctly, even for astral characters. I just tweeted the following:
𝐓𝐰𝐢𝐭𝐭𝐞𝐫 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐚𝐬𝐭𝐫𝐚𝐥 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐬 — 𝐞𝐚𝐜𝐡 𝐨𝐧𝐞 𝐜𝐨𝐮𝐧𝐭𝐬 𝐚𝐬 𝐚 𝐬𝐢𝐧𝐠𝐥𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫. 𝐈𝐭’𝐬 𝐧𝐨𝐭 𝐭𝐡𝐞 𝐬𝐢𝐳𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐭𝐰𝐞𝐞𝐭 𝐭𝐡𝐚𝐭 𝐜𝐨𝐮𝐧𝐭𝐬, 𝐢𝐭’𝐬 𝐡𝐨𝐰 𝐲𝐨𝐮 𝐮𝐬𝐞 𝐢𝐭.
Let’s quickly look at the way Tomboy (a note-taking app written in C# on the Mono/.NET platform) loads notes from XML files. I discovered a bug in Tomboy where the entire document would be messed up if you typed an astral character, saved, quit, and loaded the note again. It turns out that one function keeps an int “offset” which counts the number of “characters” (actually code units) read, and passes the offset to another function, which “returns the location at a particular character offset.” This one really means character. Needless to say, one astral character means that the character offset is too high, resulting in a complete mess. Why is it the programmer’s responsibility to deal with these concepts?
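The mismatch behind this kind of bug can be sketched in Python 3. This is not Tomboy’s code: the function names are hypothetical, and the example simulates a UTF-16 platform by counting code units explicitly.

```python
def utf16_length(text):
    """Number of UTF-16 code units in text (what a UTF-16 string API counts)."""
    return len(text.encode("utf-16-le")) // 2

def code_unit_offset_to_char_offset(text, unit_offset):
    """Convert a UTF-16 code unit offset into a character (code point) offset.

    An offset that falls inside a surrogate pair is rounded up to the next character.
    """
    units = 0
    for char_offset, ch in enumerate(text):
        if units >= unit_offset:
            return char_offset
        units += 2 if ord(ch) > 0xFFFF else 1  # astral characters need two units
    return len(text)

note = "a\U00010481b"  # 'a', OSMANYA LETTER BA, 'b'

unit_offset_of_b = utf16_length(note[:2])  # code units read before 'b'
char_offset_of_b = note.index("b")         # characters read before 'b'

print(unit_offset_of_b)  # 3
print(char_offset_of_b)  # 2
print(code_unit_offset_to_char_offset(note, unit_offset_of_b))  # 2
```

Passing the first number to a function that expects the second is off by one for every astral character read so far, which is why a single astral character can scramble everything after it.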
Other software I’ve found that doesn’t handle astral characters correctly includes Bugzilla (ironically, I discovered this when it failed to save my report of the above Tomboy bug) and FileFormat.info (an otherwise-fantastic resource for Unicode character information). The latter shows what can happen when a UTF-16 string is encoded to UTF-8 without a great deal of care, presumably due to an underlying language exposing a UTF-16 string representation.
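One way such a conversion can go wrong is sketched below in Python 3. This is a guess at the general failure mode, not a reconstruction of any particular site’s code: each UTF-16 code unit is treated as if it were a code point and encoded to UTF-8 on its own. The surrogatepass error handler is used only to make Python emit the bytes a careless encoder would produce.

```python
ba = "\U00010481"  # OSMANYA LETTER BA

# Correct: encode the code point directly.
correct = ba.encode("utf-8")

# Split the character into UTF-16 code units, as a UTF-16 language stores it.
data = ba.encode("utf-16-be")
units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# Careless: encode each code unit separately, surrogates included.
careless = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

print(correct.hex())   # f0909281
print(careless.hex())  # eda081edb281

# A strict UTF-8 decoder rejects the careless output.
try:
    careless.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False
print(decodes_cleanly)  # False
```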
A common theme here is programmer confusion. While these problems can be solved with enough effort, the leaky abstraction forces the programmer to expend effort that could be spent elsewhere. Importantly, it makes documentation more difficult. Any language with byte- or code-unit-oriented strings forces programmers to clarify what they mean every time they say “length” or “n characters”. Only when the language completely hides the encoding details are programmers free to use the phrase “n characters” unambiguously.
What about performance?
The most common objection to character-oriented strings is that they hurt performance. If you have an abstract Unicode string that uses UTF-8 or UTF-16, your indexing operation (charAt) goes from O(1) to O(n), as you must scan past all of the preceding variable-length characters. It is easier to let the programmer suffer an O(1) charAt that returns code units rather than characters. Alternatively, if you have an abstract Unicode string that uses UTF-32, you’ll be fine from a performance standpoint, but you’ve made strings take up two or four times as much space.
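To make the cost concrete, here is a minimal sketch of a character-oriented charAt over UTF-8 storage, written in Python 3 over a bytes object. The name char_at is hypothetical, and the sketch assumes the input is valid UTF-8. It has to walk the bytes from the start, which is where the O(n) comes from.

```python
def char_at(utf8_data, index):
    """Return the character at a code point index by scanning UTF-8 bytes."""
    if index < 0:
        raise IndexError("negative indices are not supported in this sketch")
    seen = -1     # index of the most recent character whose lead byte we passed
    start = None  # byte offset where the requested character begins
    for pos, byte in enumerate(utf8_data):
        if byte & 0xC0 == 0x80:
            continue  # continuation byte (10xxxxxx): still inside a character
        seen += 1
        if seen == index:
            start = pos
        elif seen == index + 1:
            # The next character has started, so the requested one ends here.
            return utf8_data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("character index out of range")
    return utf8_data[start:].decode("utf-8")  # requested character was the last one

data = "Hello, 世界".encode("utf-8")
print(char_at(data, 7))  # 世
print(char_at(data, 8))  # 界
print(data[7])           # 228, the O(1) alternative: a single byte, not a character
```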
But let’s think about what this really means. The performance argument really says, “it is better to be fast than correct.” That isn’t something I’ve read in any software engineering book. Conventional wisdom is the opposite: 1. Be correct (in all cases, not just the common ones), 2. Be fast (here it is OK to optimise for the common cases). While performance is an important consideration, it is unbelievable to me that designers of high-level languages would prioritise speed over correctness. (Again, you could argue that a code unit interface is not “incorrect,” merely “low level,” but hopefully this post shows that it frequently results in incorrect code.)
I don’t know exactly what the best implementation is, and I’m not here to tell you how to implement it. But for what it’s worth, I don’t consider UTF-32 to be a tremendous burden in this day and age (in moving to 64-bit platforms, we doubled the size of all pointers — why can’t we also double the size of strings?) Alternatively, Python’s PEP 393 shows that we can have correct semantics and optimise for common cases as well. In any case, we are talking about software correctness here — can we get the semantics right first, then worry about optimisation?
But it isn’t supposed to be abstract — that’s what libraries are for
The second most common objection to character-oriented strings is “but the language just provides the basic primitives, and the programmer can build whatever they like on top of that.” The problem with this argument is that it ignores the ecosystem of the language. If the language provides, for example, a low-level byte-string data type, and programmers can choose to use it as a UTF-8 string (or any other encoding of their choice), then there is going to be a problem at the seams between code written by different programmers. If you see a library function that takes a byte-string, what should you pass? A UTF-8 string? A Latin-1 string? Perhaps the library hasn’t been written with Unicode awareness at all, and any non-ASCII string will break. At best, the programmer will have written a comment explaining how you should encode Unicode characters (and this is what you need to do in C if you plan to support Unicode), but most programmers will forget most of the time. It is far better to have everybody on the same page by having the main string type be Unicode-aware right out of the box.
The ecosystem argument can be seen very clearly if we compare Python 2 and 3. Python 3 is well known for better Unicode support than its predecessor, but fundamentally, it is a rather simple difference: Python 2’s byte-string type is called “str”, while its Unicode string type is called “unicode”. By contrast, Python 3’s byte-string type is called “bytes”, while its Unicode string type is called “str”. They behave pretty much the same, just with different names. But in practice, the difference is immense. Whenever you call the str() function, it produces a Unicode string. Quoted literals are Unicode strings. All Python 3 functions, regardless of who wrote them, deal with Unicode, unless the programmer had a good reason to deal with bytes, whereas in Python 2, most functions dealt with bytes and often broke when given a Unicode string. The lesson of Python 3 is: give programmers a Unicode string type, make it the default, and encoding issues will mostly go away.
What about binary strings and different encodings?
Similarly, a common argument is that text might come into the program in all manner of different encodings, and so abstracting away the string representation means programmers can’t handle the wide variety of encodings out there. That’s nonsense — of course even high-level languages need encoding functions. Python 3 handles this ideally — the string (str) data type has an ‘encode’ method that can encode an abstract Unicode string into a byte string using an encoding of your choice, while the byte string (bytes) data type has a ‘decode’ method that can interpret a byte string using a specified encoding, returning an abstract Unicode string. If you need to deal with input in, say, Latin-1, you should not allow the text to be stored in Latin-1 inside your program — then it will be incompatible with any other non-Latin-1 strings. The correct approach is to take input as a byte string, then immediately decode it into an abstract Unicode string, so that it can be mixed with other strings in your program (which may have non-Latin characters).
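The decode-at-the-boundary approach can be sketched in Python 3 as follows. The sample bytes and variable names are made up for illustration; only the str.encode and bytes.decode methods come from the language.

```python
# Input arrives as bytes in a legacy encoding: "café" in Latin-1.
latin1_input = b"caf\xe9"

# Decode immediately at the boundary into an abstract Unicode string.
name = latin1_input.decode("latin-1")

# Inside the program, it mixes freely with any other text.
message = name + " 世界"

# Encode only when the text leaves the program.
utf8_output = message.encode("utf-8")

print(message)           # café 世界
print(len(message))      # 7
print(len(utf8_output))  # 12

# Asking for an encoding that cannot represent the text fails loudly.
try:
    message.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

Had the text been kept as Latin-1 bytes internally, the concatenation with “世界” would have had no sensible meaning.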
What about non-Unicode characters?
There’s one last thorny issue: it’s all well and good to say “strings should just be Unicode,” but what about characters that cannot be represented in Unicode? “Like what,” you may ask. “Surely all of the characters from other encodings have made their way into Unicode by now.” Well, unfortunately, there is a controversial topic called Han unification, which I won’t go into details about here. Essentially, some languages (notably Japanese) have borrowed characters from Chinese and changed their appearance over thousands of years. The Unicode consortium officially considers these borrowed characters as being the same as the original Chinese, but with a different appearance (much like a serif versus a sans-serif font). But many Japanese speakers consider them to be separate to the Chinese characters, and Japanese character encodings reflect this. As a result, Japanese speakers may want to use a different character set than Unicode.
This is unfortunate, because it means that a programming language designed for Japanese users needs to support multiple encodings and expose them to the programmer. As Ruby has a large Japanese community, this explains why Ruby 1.9 added explicit encodings to strings, instead of going to all-Unicode as Python did. I personally find that the advantages of a single high-level character sequence interface outweigh the need to change character sets, but I wanted to mention this issue anyway, by way of justifying Ruby’s choice of string representation.
What about combining characters?
I’ve faced the argument that my proposed solution doesn’t go far enough in abstracting over a sequence of characters. I am suggesting that strings be an abstract sequence of characters, but the argument has been put to me that characters (as Unicode defines them) are not the most logical unit of text — combining character sequences (CCSes) are.
For example, consider the string “𐒁̸”, which consists of two characters: ‘𐒁’ (U+10481 OSMANYA LETTER BA) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY). What is the length of this string?
- Is it 6? That’s the number of UTF-8 bytes in the string (4 + 2).
- Is it 3? That’s the number of UTF-16 code units in the string (2 + 1).
- Is it 2? That’s the number of characters (code points) in the string.
- Is it 1? That’s the number of combining character sequences in the string.
My argument is that the first two answers are unacceptable, because they expose implementation details to the programmer (who is then likely to pass them on to the user), so I am in favour of the third answer (2). The argument put to me is that the third answer is also unacceptable, because most people are going to think of “𐒁̸” as a single “character”, regardless of what the Unicode consortium thinks. I think that either the third or the fourth answer is acceptable, but the third is a better compromise, because we can count characters in constant time (just use UTF-32), whereas it is not possible to count CCSes in constant time.
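The four candidate lengths can be computed with a short Python 3 sketch. The function count_ccs is a hypothetical helper and only an approximation: it treats any character whose general category starts with “M” as a combining mark, and starts a new sequence at every other character (or at the start of the string).

```python
import unicodedata

def count_ccs(text):
    """Approximate the number of combining character sequences in text."""
    return sum(
        1
        for i, ch in enumerate(text)
        # A sequence starts at the first character and at every non-mark character.
        if i == 0 or not unicodedata.category(ch).startswith("M")
    )

# OSMANYA LETTER BA followed by COMBINING LONG SOLIDUS OVERLAY
text = "\U00010481\u0338"

utf8_len = len(text.encode("utf-8"))            # UTF-8 code units (bytes)
utf16_len = len(text.encode("utf-16-le")) // 2  # UTF-16 code units
codepoint_len = len(text)                       # characters (code points)
ccs_len = count_ccs(text)                       # combining character sequences

print(utf8_len, utf16_len, codepoint_len, ccs_len)  # 6 3 2 1
```

Note that count_ccs must inspect every character, whereas len on a code point string needs no scan.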
I know, I know, I said above that performance arguments don’t count. But I see a much less compelling case for counting CCSes over characters than for counting characters over code units.
Firstly, users are already exposed to combining characters, but they aren’t exposed to code units. We typically expect users to type the combining characters separately on a keyboard. For instance, to type the CCS “e̸”, the user would type ‘e’ and then type the code for ‘◌̸’. Conversely, we never expect users to type individual UTF-8 or UTF-16 code units — the representation of characters is entirely abstract from the user. Therefore, the user can understand when Twitter, for example, counts “e̸” as two characters. Most users would not understand if Twitter were to count “汉” as three characters.
Secondly, dealing with characters is sufficient to abstract over the encoding implementation details. If a language deals with code units, a program written for a UTF-8 build may behave differently if compiled with a UTF-16 string representation. A language that deals with characters is independent of the encoding. Dealing with CCSes doesn’t add any further implementation independence. Similarly, dealing with characters is enough to allow strings to be spliced, re-encoded, and re-combined on arbitrary boundaries; dealing with CCSes isn’t required for this either.
Thirdly, since Unicode allows arbitrarily long combining character sequences, dealing with CCSes could introduce subtle exploits into a lot of code. If Twitter counted CCSes, a user could write a multi-megabyte tweet by piling an unbounded number of combining characters onto a single base character.
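A small Python 3 sketch of this exploit follows. The number of marks is an arbitrary value chosen for illustration, and the sequence count uses a simplified rule (every character that is not a combining mark starts a new sequence), so treat it as an approximation.

```python
import unicodedata

marks = 500  # arbitrary; nothing in Unicode stops this from being far larger
tweet = "e" + "\u0338" * marks  # one base character plus many combining overlays

# Simplified CCS count: characters that are not combining marks.
ccs_count = sum(1 for ch in tweet if not unicodedata.category(ch).startswith("M"))
char_count = len(tweet)
byte_count = len(tweet.encode("utf-8"))

print(ccs_count)   # 1
print(char_count)  # 501
print(byte_count)  # 1001
```

A limit expressed in CCSes would accept this string as a single unit however large marks grows, while a limit expressed in characters grows with it.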
In any case, I’m happy to have a further debate about whether we should be counting characters or CCSes — my goal here is to convince you that we should not be counting code units. Lastly, I’ll point out that Python 3 (in UCS-4 build, or from version 3.3 onwards) and Haskell both consider the above string to have a length of 2, as does Twitter. I don’t know of any programming languages that count CCSes. Twitter’s page on counting characters goes into some detail about this.
My point is this: the Unicode Consortium has provided us with a wonderful standard for representing and communicating characters from all of the world’s scripts, but most modern languages needlessly expose the details of how the characters are encoded. This means that all programmers need to become Unicode experts to make high-quality internationalised software, which means it is far less likely that programs will work when non-ASCII characters, or even worse, astral characters, are used. Since most English programmers don’t test with non-ASCII characters, and almost nobody tests with astral characters, this makes it pretty unlikely that software will ever see such inputs before it is released.
Programming language designers need to become Unicode experts, so that regular programmers don’t have to. The next generation of languages should provide character-oriented string operations only (except where the programmer explicitly asks for text to be encoded). Then the rest of us can get back to programming, instead of worrying about encoding issues.