The importance of language-level abstract Unicode strings

In Language design, Language implementation, Unicode on April 19, 2012 by Matt Giuca

Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

Unicode is a beautiful thing. Where in the past, we had to be content with text sequences of merely 128 different characters, we can now use over 110,000 characters from almost every script in use today or throughout history. But there’s a problem: despite the fact that Unicode has been around and relatively unchanged for the past 16 years, most modern programming languages do not give programmers a proper abstraction over Unicode strings, which means that we must carefully deal with a range of encoding issues or suffer a plethora of bugs. And there are still languages coming out today that fall into this trap. This blog post is a call to language designers: your fundamental string data type should abstract over Unicode encoding schemes, such that your programmer sees only a sequence of Unicode characters, thus preventing most Unicode bugs.

Over the past few years, I have encountered a surprising amount of resistance to this idea, with counter-arguments including performance reasons, compatibility reasons and issues of low-level control. In this post, I will argue that having a consistent, abstract Unicode string type is more important, and that programmers can be given more control in the special cases that require it.

To give a quick example of why this is a problem, consider the Hello World program on the front page of the Go programming language website. This demo celebrates Go’s Unicode support by using the Chinese word for “world”, “世界”:

fmt.Println("Hello, 世界")

But while Go ostensibly has Unicode strings, they are actually just byte strings, with a strong mandate that they be treated as UTF-8-encoded Unicode text. The difference is apparent if we wrap the string in the built-in len function:

fmt.Println(len("Hello, 世界"))

This code, perhaps surprisingly, prints 13, where one might have expected it to print 9. This is because while there are 9 characters in the string, there are 13 bytes if the string is encoded as UTF-8: in Go, the string literal “Hello, 世界” is equivalent to “Hello, \xe4\xb8\x96\xe7\x95\x8c”. Similarly, indexing into the string uses byte offsets, and retrieves single bytes instead of whole characters. This is a dangerous leaky abstraction: it appears to be a simple sequence of characters until certain operations are performed. Worse, English-speaking programmers have a nasty habit of only testing with ASCII input, which means that any bugs where the programmer has presumed to be dealing with characters (but is actually dealing with bytes) will go undetected.

In a language that properly abstracts over Unicode strings, the above string should have a length of 9. Programmers should only be exposed to encoding details when they explicitly ask for it.
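For comparison, here is a minimal sketch of the same measurement in Python 3, whose str type is a sequence of code points. The variable names are my own, and the byte count only appears once an encoding is explicitly requested.

```python
# Python 3: str is a sequence of code points, so len() counts characters.
greeting = "Hello, 世界"

char_count = len(greeting)                   # 9 code points
byte_count = len(greeting.encode("utf-8"))   # 13 bytes, only after an explicit encode

print(char_count, byte_count)  # prints: 9 13
print(greeting[7])             # prints: 世 (indexing yields a whole character)
```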

The folly of leaky abstraction

The fundamental problem lies in the difference between characters and code units. Characters are the abstract entities that correspond to code points. Code units are the pieces that represent a character in the underlying encoding, and they vary depending on the encoding. For example, consider the character OSMANYA LETTER BA or ‘𐒁’ (with code point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we would say this character is represented by four code units in UTF-8. In UTF-16, the same character is represented by two code units: D801 DC81; code units are 16-bit numbers. The problem with many programming languages is that their string length, indexing and, often, iteration operations are in code units rather than in characters (or code points).
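The code unit counts quoted above can be checked with a short Python 3 sketch. It encodes the single character U+10481 into each encoding form and divides the byte length by the code unit size; the variable names are illustrative only.

```python
# Count the code units needed for one character in each encoding form.
ch = "\U00010481"  # OSMANYA LETTER BA

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")

utf8_units = len(utf8)           # UTF-8 code units are bytes
utf16_units = len(utf16) // 2    # UTF-16 code units are 16-bit numbers
utf32_units = len(utf32) // 4    # UTF-32 code units are 32-bit numbers

print(utf8.hex(), utf8_units)    # prints: f0909281 4
print(utf16.hex(), utf16_units)  # prints: d801dc81 2
print(utf32.hex(), utf32_units)  # prints: 00010481 1
print(len(ch))                   # prints: 1 (one character, whatever the encoding)
```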

(An aside: I find it interesting that the Unicode Standard itself defines a “Unicode string” as “an ordered sequence of code units.” That’s code units, not code points, suggesting that the standard itself disagrees with my thesis that a string should be thought of as merely a sequence of code points. This section of the standard seems concerned with how the strings should be represented, not the programming interface for accessing such strings, so I feel that it doesn’t directly contradict this post. This post is about the string interface, not the underlying representation.)

There are three main types of culprit languages that exhibit this issue:

  1. Languages with byte-oriented string operations and no specified encoding. These leave Unicode support up to programmers and third-party library authors, resulting in a general mish-mash of encoding across different code bases. These include C, Python 2, Ruby, PHP, Perl, Lua, and almost all pre-Unicode (1992) programming languages.
  2. Languages with code-unit-oriented string operations and no specified encoding. These are probably the worst offenders, as the language manages the encoding, but leaks abstraction details that the programmer has no control over (for example, different builds of the compiler may have different underlying encoding schemes, affecting the behaviour of programs). These include certain builds of Python 3.2 and earlier, and Mercury.
  3. Languages with code-unit-oriented string operations, but a specific encoding scheme. These languages at least have well-defined behaviour, but still expose the programmer to implementation details, which can result in bugs. Most modern languages fit into this category, including Go (UTF-8), JVM-based languages such as Java and Scala (UTF-16), .NET-based languages such as C# (UTF-16) and JavaScript (UTF-16).

Edit: A number of comments indicate that some of the other languages have optional Unicode support. In Perl 5, you can write “use utf8” to turn on proper Unicode strings. In Ruby 1.9, you can attach encodings to strings to make them behave better. In Python 2, as I’ll get to later, you have a separate Unicode string type.

Above, I talked about the leaky abstraction of UTF-8 in Go, but it is equally important to recognise the leaky abstraction of UTF-16 in many modern language frameworks; in particular, the Java platform (1996) and the .NET platform (2001). Both treat strings as a sequence of UTF-16 code units (and the corresponding “char” data type as a single UTF-16 code unit). This is disappointingly close to the ideal. In Java, for instance, you can essentially treat a string as a sequence of characters. ‘e’ (U+0065) is a character, and ‘汉’ (U+6C49) is a character; it feels like the language is abstracting over Unicode characters, until you realise that it isn’t. ‘𐒁’ (U+10481) is not a valid Java character. It is an astral character (one that cannot be represented in a single UTF-16 code unit), so it needs to be represented by the two-code-unit sequence D801 DC81. So it is possible to store this character in a Java string, but not without understanding the underlying encoding details. For example, the Java string “𐒁” has length 2, and worse, if you ask for “charAt(0)” you will get the character U+D801; “charAt(1)” gives U+DC81. Any code which directly manipulates the characters of the string is likely to fail in the presence of astral characters. Java has the excuse of history: Java was first released in January 1996; Unicode 2.0 was released in July 1996, and before that, there were no astral characters (so for six months, Java had it right!). The .NET languages, such as C#, behave exactly the same, and don’t have the excuse of time. JavaScript is the same again. This is perhaps worse than the UTF-8 languages, because even non-English-speaking programmers are likely to forget about astral characters: there are no writing systems in use today that use astral characters. But astral characters include historical scripts of interest to historians, emoticons which may be used by software, and large numbers of mathematical symbols which are commonly used by mathematicians, including myself, so it is important that software gets them right.
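To make the two-code-unit sequence less mysterious, here is a small Python sketch of the standard UTF-16 surrogate arithmetic. The helper name to_surrogate_pair is hypothetical; it simply reproduces the values a UTF-16 language hands back from charAt(0) and charAt(1).

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (above U+FFFF) into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x10481)
print(f"{high:04X} {low:04X}")  # prints: D801 DC81
```

Neither half is a character in its own right, which is why code that walks a string one code unit at a time goes wrong as soon as an astral character appears.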

I must stress how harmful the second class of languages is in terms of string encoding. In 2011, the Mercury language switched from a type 1 language (strings are just byte sequences) to a type 2 language: the language now specifies all of the basic string operations in terms of code units, but does not specify the encoding scheme. Worse, Mercury has several back-ends: if compiling to C or Erlang, it uses UTF-8, whereas compiling to Java or C# results in UTF-16-encoded strings. Now programmers must contend with the fact that string.length(“Hello, 世界”) might be 13 or 9 depending on the back-end. (Fortunately, character-oriented alternatives, such as count_codepoints, are provided.) This is a major blow to code portability, which I would recommend language designers avoid at all costs.

Python too has traditionally had this problem: the interpreter can be compiled in “UCS-2” or “UCS-4” modes, which represent strings as UTF-16 or UTF-32 respectively, and expose those details to the programmer (the UCS-2 build behaves much like the other UTF-16 languages, while the UCS-4 build behaves exactly as I want, with one character per code unit). Fortunately, Python is about to correct this little wart entirely with the introduction of PEP 393 in upcoming version 3.3. This version will remove the separate UCS-2 and UCS-4 builds, so all future versions will behave as the UCS-4 build did, but with a nice optimisation: all strings with only Latin-1 characters are encoded in Latin-1; all strings with only basic multilingual plane (BMP) characters are encoded in UCS-2 (UTF-16); all strings with astral characters are encoded in UTF-32. (It’s more complicated than that, but that’s the gist of it.) This ensures the correct semantics, but allows for compact string representation in the overwhelmingly common case of BMP-only strings.
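The effect of PEP 393 can be observed, at least roughly, from inside Python. The sketch below assumes CPython 3.3 or later and relies on sys.getsizeof, whose numbers are an implementation detail rather than a language guarantee; the helper bytes_per_char is my own invention.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate per-character storage by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char("a"))           # Latin-1 range: expected 1 on CPython 3.3+
print(bytes_per_char("世"))          # BMP: expected 2 on CPython 3.3+
print(bytes_per_char("\U00010481"))  # astral: expected 4 on CPython 3.3+

# Whatever the storage width, the interface stays character-oriented.
print(len("a" * 3), len("世" * 3), len("\U00010481" * 3))  # prints: 3 3 3
```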

Somewhat surprisingly, Bash (1989) appears to support character-oriented Unicode strings. The only other languages I have found which properly abstract over Unicode strings are from the functional world: Haskell 98 (1998) and Scheme R6RS (2007). Haskell 98 specifies that “the character type Char is an enumeration whose values represent Unicode characters” (with a link to the Unicode 5.0 specification). Scheme R6RS was the first version of Scheme to mention Unicode, and got it right on the first go. It specifies that “characters are objects that represent Unicode scalar values,” and goes into details explicitly stating that a character value is in the range [0, D7FF₁₆] ∪ [E000₁₆, 10FFFF₁₆], then defines strings as sequences of characters.
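The range R6RS gives is simply every code point except the surrogates. As a quick sketch (in Python rather than Scheme, with a function name of my own choosing), the membership test looks like this:

```python
def is_scalar_value(code_point):
    """True if code_point lies in [0, D7FF] or [E000, 10FFFF] (hexadecimal)."""
    return 0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF

print(is_scalar_value(0x10481))   # prints: True
print(is_scalar_value(0xD801))    # prints: False (a surrogate, not a character)
print(is_scalar_value(0x110000))  # prints: False (beyond the Unicode range)
```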

With only four languages that I know of properly providing character-oriented string operations by default (are there any more?), this is a pretty poor track record. Sadly, many modern languages are being built on top of either the JVM or .NET framework, and so naturally absorb the poor character handling of those platforms. Still, it would be nice to see some more languages that behave correctly.

Some real-world problems

So far, the discussion has been fairly academic. What are some of the actual problems that have come about as a result of the leaky abstraction?

Edit: I mention a heap of technologies here, in a way that could be interpreted to be derisive. I don’t mean any offense towards the creators of these technologies, but rather, I intend to point out how difficult it can be to work with Unicode on a language that doesn’t provide the right abstractions.

A memorable example for me was a program I worked on which sent streams of text from a Python 2 server (UTF-8 byte strings) to a JavaScript client (UTF-16 strings). We had chosen the arbitrary message size of 512 bytes, and were chopping up text arbitrarily on those boundaries, and sending them off encoded in UTF-8 to the client, where they were concatenated back together. This almost worked, but we discovered a strange problem: some non-ASCII characters were being messed up some of the time. It occurred if you happened to have a multi-byte character split across a 512-byte boundary. For example, “ü” (U+00FC) is encoded as hex C3 BC. If byte 511 of a message was C3 and byte 512 was BC, the first packet would be sent ending in the byte C3 (which JavaScript wouldn’t know how to convert to UTF-16), and the second packet would be sent beginning with the byte BC (which JavaScript also wouldn’t know how to convert to UTF-16). So the result is “��”. In general, if a language exposes its underlying code units, then strings may not be split, converted to a different encoding, and re-concatenated.
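The failure is easy to reproduce. The following Python 3 sketch is not the original server or client code; it just splits UTF-8 bytes on a 512-byte boundary and compares decoding each chunk separately against using an incremental decoder from the standard codecs module, which holds back an incomplete sequence until the next chunk arrives.

```python
import codecs

text = "x" * 511 + "ü"   # 'ü' (C3 BC) straddles the 512-byte boundary
data = text.encode("utf-8")
chunks = [data[i:i + 512] for i in range(0, len(data), 512)]

# Naive: decode each chunk on its own; the split character is lost.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# Incremental: the decoder buffers an incomplete sequence until the next chunk.
decoder = codecs.getincrementaldecoder("utf-8")()
careful = "".join(decoder.decode(chunk) for chunk in chunks)
careful += decoder.decode(b"", final=True)

print(naive == text)    # prints: False
print(careful == text)  # prints: True
print(naive[-2:])       # prints: ��
```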

An obvious example of an operation that goes wrong in these situations is the length operator. If you use UTF-8, Chinese characters are going to each report a length of 3, while in UTF-16, astral characters will each report a length of 2. Think this doesn’t matter? What about Twitter? On Twitter, you have to fit messages into 140 characters. That’s 140 characters, not 140 bytes. Imagine if Twitter ate up three “characters” every time you hit a key (if you spoke Chinese, for example). Fortunately, Twitter does it correctly, even for astral characters. I just tweeted the following:

𝐓𝐰𝐢𝐭𝐭𝐞𝐫 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐚𝐬𝐭𝐫𝐚𝐥 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐬 — 𝐞𝐚𝐜𝐡 𝐨𝐧𝐞 𝐜𝐨𝐮𝐧𝐭𝐬 𝐚𝐬 𝐚 𝐬𝐢𝐧𝐠𝐥𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫. 𝐈𝐭’𝐬 𝐧𝐨𝐭 𝐭𝐡𝐞 𝐬𝐢𝐳𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐭𝐰𝐞𝐞𝐭 𝐭𝐡𝐚𝐭 𝐜𝐨𝐮𝐧𝐭𝐬, 𝐢𝐭’𝐬 𝐡𝐨𝐰 𝐲𝐨𝐮 𝐮𝐬𝐞 𝐢𝐭.

Notice that the text is unusually bold — these are no ordinary letters. They are MATHEMATICAL BOLD letters, which are astral characters. For example, the ‘𝐓’ is U+1D413 MATHEMATICAL BOLD CAPITAL T. The tweet contains exactly 140 characters — 28 ASCII, 3 in the basic multilingual plane, and 109 astral. In UTF-16, it comprises 249 code units, while in UTF-8, it comprises 473 bytes. Yet it was accepted in Twitter’s 140 character limit, because they count characters. I would guess that Twitter’s counting routines are written in both Scala and JavaScript, which means that in both versions, the engineers would need to have specially considered surrogate pairs as a single character. Had Twitter been written in a more Unicode-aware language, they could just have called the length method and not given it another thought.

Let’s quickly look at the way Tomboy (a note taking app written in C# on the Mono/.NET platform) loads notes from XML files. I discovered a bug in Tomboy where the entire document would be messed up if you type an astral character, save, quit, and load. It turns out that one function is keeping an int “offset” which counts the number of “characters” (actually code units) read, and passes the offset to another function, which “returns the location at a particular character offset.” This one really means character. Needless to say, one astral character means that the character offset is too high, resulting in a complete mess. Why is it the programmer’s responsibility to deal with these concepts?
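To illustrate the kind of bookkeeping this forces on the programmer, here is a hedged Python sketch of converting an offset counted in UTF-16 code units into an offset counted in characters. It is not Tomboy's code (which is C#), and the function name is hypothetical.

```python
def utf16_offset_to_char_offset(text, unit_offset):
    """Convert an offset in UTF-16 code units into an offset in code points.

    An offset that falls inside a surrogate pair is rounded up to the next character.
    """
    units = 0
    for char_offset, ch in enumerate(text):
        if units >= unit_offset:
            return char_offset
        units += 2 if ord(ch) > 0xFFFF else 1   # astral characters need a surrogate pair
    return len(text)

note = "a\U00010481bc"
# In UTF-16 the code units are: a, D801, DC81, b, c. So 'b' sits at unit offset 3.
print(utf16_offset_to_char_offset(note, 3))        # prints: 2
print(note[utf16_offset_to_char_offset(note, 3)])  # prints: b
print(note[3])                                     # prints: c (the unconverted offset is one too high)
```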

Other software I’ve found that doesn’t handle astral characters correctly includes Bugzilla (ironically, I discovered this when it failed to save my report of the above Tomboy bug) and FileFormat.Info (an otherwise-fantastic resource for Unicode character information). The latter shows what can happen when a UTF-16 string is encoded to UTF-8 without a great deal of care, presumably due to an underlying language exposing a UTF-16 string representation.

A common theme here is programmer confusion. While these problems can be solved with enough effort, the leaky abstraction forces the programmer to expend effort that could be spent elsewhere. Importantly, it makes documentation more difficult. Any language with byte- or code-unit-oriented strings forces programmers to clarify what they mean every time they say “length” or “n characters”. Only when the language completely hides the encoding details are programmers free to use the phrase “n characters” unambiguously.

What about performance?

The most common objection to character-oriented strings is that they hurt performance. If you have an abstract Unicode string that uses UTF-8 or UTF-16, your indexing operation (charAt) goes from O(1) to O(n), as you must scan past all of the preceding variable-length characters. It is easier to let the programmer suffer an O(1) charAt that returns code units rather than characters. Alternatively, if you have an abstract Unicode string that uses UTF-32, you’ll be fine from a performance standpoint, but you’ve made strings take up two or four times as much space.
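The O(n) cost is visible in a sketch of character indexing over UTF-8 bytes. This Python function (a hypothetical helper that assumes its input is valid UTF-8) has to walk from the start of the data, counting lead bytes, before it can return the requested character.

```python
def utf8_char_at(data, index):
    """Return the index-th code point of UTF-8 bytes by scanning from the start (O(n))."""
    seen = -1
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:    # not a continuation byte: a new character starts here
            seen += 1
            if seen == index:
                end = pos + 1
                while end < len(data) and (data[end] & 0xC0) == 0x80:
                    end += 1         # absorb the continuation bytes of this character
                return data[pos:end].decode("utf-8")
    raise IndexError("character index out of range")

encoded = "Hello, 世界".encode("utf-8")
print(utf8_char_at(encoded, 8))  # prints: 界
print(encoded[8:9])              # prints: b'\xb8' (the O(1) byte index lands mid-character)
```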

But let’s think about what this really means. The performance argument really says, “it is better to be fast than correct.” That isn’t something I’ve read in any software engineering book. Conventional wisdom is the opposite: 1. Be correct (in all cases, not just the common ones), 2. Be fast (here it is OK to optimise for the common cases). While performance is an important consideration, it is unbelievable to me that designers of high-level languages would prioritise speed over correctness. (Again, you could argue that a code unit interface is not “incorrect,” merely “low level,” but hopefully this post shows that it frequently results in incorrect code.)

I don’t know exactly what the best implementation is, and I’m not here to tell you how to implement it. But for what it’s worth, I don’t consider UTF-32 to be a tremendous burden in this day and age (in moving to 64-bit platforms, we doubled the size of all pointers — why can’t we also double the size of strings?) Alternatively, Python’s PEP 393 shows that we can have correct semantics and optimise for common cases as well. In any case, we are talking about software correctness here — can we get the semantics right first, then worry about optimisation?

But it isn’t supposed to be abstract — that’s what libraries are for

The second most common objection to character-oriented strings is “but the language just provides the basic primitives, and the programmer can build whatever they like on top of that.” The problem with this argument is that it ignores the ecosystem of the language. If the language provides, for example, a low-level byte-string data type, and programmers can choose to use it as a UTF-8 string (or any other encoding of their choice), then there is going to be a problem at the seams between code written by different programmers. If you see a library function that takes a byte-string, what should you pass? A UTF-8 string? A Latin-1 string? Perhaps the library hasn’t been written with Unicode awareness at all, and any non-ASCII string will break. At best, the programmer will have written a comment explaining how you should encode Unicode characters (and this is what you need to do in C if you plan to support Unicode), but most programmers will forget most of the time. It is far better to have everybody on the same page by having the main string type be Unicode-aware right out of the box.

The ecosystem argument can be seen very clearly if we compare Python 2 and 3. Python 3 is well known for better Unicode support than its predecessor, but fundamentally, it is a rather simple difference: Python 2’s byte-string type is called “str”, while its Unicode string type is called “unicode”. By contrast, Python 3’s byte-string type is called “bytes”, while its Unicode string type is called “str”. They behave pretty much the same, just with different names. But in practice, the difference is immense. Whenever you call the str() function, it produces a Unicode string. Quoted literals are Unicode strings. All Python 3 functions, regardless of who wrote them, deal with Unicode, unless the programmer had a good reason to deal with bytes, whereas in Python 2, most functions dealt with bytes and often broke when given a Unicode string. The lesson of Python 3 is: give programmers a Unicode string type, make it the default, and encoding issues will mostly go away.

What about binary strings and different encodings?

Of course, programming languages, even high-level ones, still need to manipulate data that is actually a sequence of 8-bit bytes, that doesn’t represent text at all. I’m not denying that; I just see it as a totally different data structure. There is no reason a language can’t have a separate byte array type. UTF-16 languages generally do. Java has separate types “byte” and “char”, with separate semantics (for example, byte is a signed 8-bit integer, whereas char holds a single 16-bit UTF-16 code unit). In Java, if you want a byte array, just say “byte[]”. Python 3, similarly, has separate “bytes” and “str” types. JavaScript recently added a Uint8Array type for this purpose (since JavaScript previously only had UTF-16 strings). Having a byte array type is not an excuse for not having a separate abstract string type; in fact, it is better to have both.

Similarly, a common argument is that text might come into the program in all manner of different encodings, and so abstracting away the string representation means programmers can’t handle the wide variety of encodings out there. That’s nonsense — of course even high-level languages need encoding functions. Python 3 handles this ideally — the string (str) data type has an ‘encode’ method that can encode an abstract Unicode string into a byte string using an encoding of your choice, while the byte string (bytes) data type has a ‘decode’ method that can interpret a byte string using a specified encoding, returning an abstract Unicode string. If you need to deal with input in, say, Latin-1, you should not allow the text to be stored in Latin-1 inside your program — then it will be incompatible with any other non-Latin-1 strings. The correct approach is to take input as a byte string, then immediately decode it into an abstract Unicode string, so that it can be mixed with other strings in your program (which may have non-Latin characters).
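A minimal Python 3 sketch of this decode-at-the-edges pattern follows. The input bytes and the assumption that they are Latin-1 are made up for the example.

```python
# Bytes arriving from the outside world, known (by assumption) to be Latin-1.
incoming = b"caf\xe9"

# Decode at the edge: from here on the text is an abstract Unicode string.
word = incoming.decode("latin-1")
message = word + " 世界"   # mixes freely with non-Latin text

# Encode only when the text leaves the program again.
outgoing = message.encode("utf-8")

print(message)        # prints: café 世界
print(len(message))   # prints: 7
print(len(outgoing))  # prints: 12
```

Inside the program nothing needs to know that the first word started life as Latin-1; the choice of encoding only matters at the two boundaries.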

What about non-Unicode characters?

There’s one last thorny issue: it’s all well and good to say “strings should just be Unicode,” but what about characters that cannot be represented in Unicode? “Like what,” you may ask. “Surely all of the characters from other encodings have made their way into Unicode by now.” Well, unfortunately, there is a controversial topic called Han unification, which I won’t go into details about here. Essentially, some languages (notably Japanese) have borrowed characters from Chinese and changed their appearance over thousands of years. The Unicode consortium officially considers these borrowed characters as being the same as the original Chinese, but with a different appearance (much like a serif versus a sans-serif font). But many Japanese speakers consider them to be separate to the Chinese characters, and Japanese character encodings reflect this. As a result, Japanese speakers may want to use a different character set than Unicode.

This is unfortunate, because it means that a programming language designed for Japanese users needs to support multiple encodings and expose them to the programmer. As Ruby has a large Japanese community, this explains why Ruby 1.9 added explicit encodings to strings, instead of going to all-Unicode as Python did. I personally find that the advantages of a single high-level character sequence interface outweigh the need to change character sets, but I wanted to mention this issue anyway, by way of justifying Ruby’s choice of string representation.

What about combining characters?

I’ve faced the argument that my proposed solution doesn’t go far enough in abstracting over a sequence of characters. I am suggesting that strings be an abstract sequence of characters, but the argument has been put to me that characters (as Unicode defines them) are not the most logical unit of text — combining character sequences (CCSes) are.

For example, consider the string “𐒁̸”, which consists of two characters: ‘𐒁’ (U+10481 OSMANYA LETTER BA) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY). What is the length of this string?

  • Is it 6? That’s the number of UTF-8 bytes in the string (4 + 2).
  • Is it 3? That’s the number of UTF-16 code units in the string (2 + 1).
  • Is it 2? That’s the number of characters (code points) in the string.
  • Is it 1? That’s the number of combining character sequences in the string.

My argument is that the first two answers are unacceptable, because they expose implementation details to the programmer (who is then likely to pass them on to the user) — I am in favour of the third answer (2). The argument put to me is that the third answer is also unacceptable, because most people are going to think of “𐒁̸” as a single “character”, regardless of what the Unicode consortium thinks. I think that either the third or fourth answers are acceptable, but the third is a better compromise, because we can count characters in constant time (just use UTF-32), whereas it is not possible to count CCSes in constant time.
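All four candidate lengths can be computed in a few lines of Python 3. The combining character sequence count below is a deliberate simplification of my own: it treats every code point with canonical combining class 0 as the start of a new sequence, which is enough for this string but is not a full text segmentation algorithm.

```python
import unicodedata

s = "\U00010481\u0338"  # OSMANYA LETTER BA + COMBINING LONG SOLIDUS OVERLAY

utf8_bytes = len(s.encode("utf-8"))            # 4 + 2
utf16_units = len(s.encode("utf-16-be")) // 2  # 2 + 1
code_points = len(s)
# Simplified CCS count: a code point with combining class 0 starts a new sequence.
ccs_count = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(utf8_bytes, utf16_units, code_points, ccs_count)  # prints: 6 3 2 1
```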

I know, I know, I said above that performance arguments don’t count. But I see a much less compelling case for counting CCSes over characters than for counting characters over code units.

Firstly, users are already exposed to combining characters, but they aren’t exposed to code units. We typically expect users to type the combining characters separately on a keyboard. For instance, to type the CCS “e̸”, the user would type ‘e’ and then type the code for ‘◌̸’. Conversely, we never expect users to type individual UTF-8 or UTF-16 code units — the representation of characters is entirely abstract from the user. Therefore, the user can understand when Twitter, for example, counts “e̸” as two characters. Most users would not understand if Twitter were to count “汉” as three characters.

Secondly, dealing with characters is sufficient to abstract over the encoding implementation details. If a language deals with code units, a program written for a UTF-8 build may behave differently if compiled with a UTF-16 string representation. A language that deals with characters is independent of the encoding. Dealing with CCSes doesn’t add any further implementation independence. Similarly, dealing with characters is enough to allow strings to be spliced, re-encoded, and re-combined on arbitrary boundaries; dealing with CCSes isn’t required for this either.

Thirdly, since Unicode allows arbitrarily long combining character sequences, dealing with CCSes could introduce subtle exploits into a lot of code. If Twitter counted CCSes, a user could write a multi-megabyte tweet by stacking millions of combining characters onto a single base character.

In any case, I’m happy to have a further debate about whether we should be counting characters or CCSes — my goal here is to convince you that we should not be counting code units. Lastly, I’ll point out that Python 3 (in UCS-4 build, or from version 3.3 onwards) and Haskell both consider the above string to have a length of 2, as does Twitter. I don’t know of any programming languages that count CCSes. Twitter’s page on counting characters goes into some detail about this.


My point is this: the Unicode Consortium has provided us with a wonderful standard for representing and communicating characters from all of the world’s scripts, but most modern languages needlessly expose the details of how the characters are encoded. This means that all programmers need to become Unicode experts to make high-quality internationalised software, which means it is far less likely that programs will work when non-ASCII characters, or even worse, astral characters, are used. Since most English-speaking programmers don’t test with non-ASCII characters, and almost nobody tests with astral characters, this makes it pretty unlikely that software will ever see such inputs before it is released.

Programming language designers need to become Unicode experts, so that regular programmers don’t have to. The next generation of languages should provide character-oriented string operations only (except where the programmer explicitly asks for text to be encoded). Then the rest of us can get back to programming, instead of worrying about encoding issues.

37 Responses to “The importance of language-level abstract Unicode strings”

  2. Hang on – Perl?

    brong@prin:~$ cat
    #!/usr/bin/perl -w

    use 5.010;
    use utf8;

    my $string = "Hello, 世界";
    say length($string);
    brong@prin:~$ perl


    You need the “use utf8” for historical reasons, the default encoding of the perl code itself is otherwise in bytes… but if you read data into perl characters, all the builtins work correctly. The pain of encoding handling is relegated to talking to and reading from external data sources – and there’s no way around that – you need to indicate the charsets you communicate with, because they are ambiguous.

  3. Ruby 1.9’s behavior is similar to Perl’s:

    > 'Hello, 世界'.encoding
    => #
    > 'Hello, 世界'.length
    => 9
    > 'Hello, 世界'[7,2]
    => "世界"
    > '𐒁'.length
    => 1
    > 'Hello, 世界'.bytes.to_a.length
    => 13
    > '𐒁'.bytes.to_a.length
    => 4
    > 'e̸'.length
    => 2

    You need to keep telling it what charset you’re using (default is ASCII/“binary”), which is kind of a PITA, but as long as you do it works as expected.

    I do see Ruby programmers who don’t understand character encodings struggle with the problems such a lack of understanding can generate, but I don’t think languages or their libraries should be designed with programmer laziness in mind. Pure Unicode character-based strings with a separate byte-array type would probably be better, but Ruby’s compromise works well as long as you understand it.

    (For anyone interested in learning how Ruby 1.9 handles encoding, James Edward Gray II’s articles at are excellent.)

    • The first two lines from IRB were supposed to be:

      > 'Hello, 世界'.encoding
      => #<Encoding:UTF-8>


      Thank you, Unicode 😃

    • Thanks. I have edited the article to be a bit fairer to Ruby. From your example (and I verified it), it looks like UTF-8 is the default encoding. What makes you say that ASCII/binary is? Anyway, my problem with the Ruby approach is that it is possible to have two objects which you think of as “strings” but which are actually incompatible. I’m not sure what happens if you take, say a Latin-1-encoded string and concatenate it onto a UTF-8 string, but it sounds like it could be problematic. I prefer a language where you deal with encoding at the edges (reading and writing from streams), but once they are inside the program, the encoding details are safely abstracted away. But as I said in the third-last section, I do see the rationale for Ruby not constraining itself to the set of Unicode characters.

  4. Amen. I’ve encountered practically all of these issues myself.

    Though this post would not be complete without mention of the most blatant problem with astral characters on the web: MySQL. Initially it supported “utf8”, but only for characters 0-65535. If you want real utf-8, you have to ask for “utf8mb4”. But programmers have no reason to doubt the advertised “utf8”, so they’ll rarely notice, and software remains broken.

    Unicode should be unicode should be unicode.

  5. The language specification is not really very big and if the results were surprising then I guess you missed the section on strings.

    “The elements of strings have type byte and may be accessed using the usual indexing operations. It is illegal to take the address of such an element; if s[i] is the ith byte of a string, &s[i] is invalid. The length of string s can be discovered using the built-in function len. The length is a compile-time constant if s is a string literal.”

    Ergo, the length of a string in Go using the builtin len call is the length in bytes. Which to me makes perfect sense because the level of abstraction you are looking for brings more drawbacks and pain to implement at the language level.

    Use the utf8 package for utf8 string operations which is actually really the level of abstraction you are advocating, or simply cast the byte string to a slice of runes.

    fmt.Println(len([]rune("Hello, 世界")))

    • “The language specification is not really very big and if the results were surprising then I guess you missed the section on strings.”

      Do you mean Go? I’ve read it. Obviously when I said “perhaps surprisingly,” I meant it’s surprising to my initial expectation of what a language should be. Having read the spec, it is not surprising (the language implementation conforms to the spec), it’s just inconvenient, for the reasons I outlined in the article.

      True, I could use rune slices instead of strings. But that is roughly equivalent to using ‘unicode’ in Python 2 — it works, but it isn’t the true string type, so it doesn’t “mesh” with the language’s ecosystem. (Consider if I wrote a library which accepted []rune wherever you expected to pass a string — it would seem quite weird.) Search the article above for “ecosystem” where I made this argument in full.
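For comparison, here is a small sketch of the same two measurements in Python 3, where the default string type counts code points and the byte count only appears once you encode explicitly. The variable names are just for illustration.

```python
s = "Hello, 世界"

code_points = len(s)                  # length of the abstract string
utf8_bytes = len(s.encode("utf-8"))   # length of one particular encoding

print(code_points)  # 9
print(utf8_bytes)   # 13
```

The byte count is still available when it is needed, but a string literal gives you the abstract type by default.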

  6. Python handles this just fine:

    Python 2.6.4 (r264:75706, Jun 4 2010, 18:20:31)
    [GCC 4.4.4 20100503 (Red Hat 4.4.4-2)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len(u"Hello, 世界")

    • Of course, but that is not the basic “string” type. It’s the “unicode” type, which is somewhat incompatible with normal strings. To give an example of where this breaks horribly, imagine I wrote a Twitter API which takes a parameter s, checks its length, and if it’s <=140 posts to Twitter, otherwise giving an error. This function is going to behave differently depending on whether the client passes a str or a unicode. Search the article above for "ecosystem" where I made this argument in full.

      It's all fixed in Python 3 — the one true string type is a Unicode string. There is a separate binary "bytes" type but it is completely incompatible with str, so there can be no confusion.
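A rough sketch of the Twitter example, written in Python 3. The function `fits_in_tweet` and the constant `LIMIT` are hypothetical, not a real Twitter API. The same check gives different answers for text and for its encoded bytes, but Python 3 at least refuses to mix the two types.

```python
LIMIT = 140

def fits_in_tweet(s):
    """Hypothetical length check: True if s has at most LIMIT elements."""
    return len(s) <= LIMIT

text = "世" * 50              # 50 characters
raw = text.encode("utf-8")    # the same text as 150 UTF-8 bytes

print(fits_in_tweet(text))  # True
print(fits_in_tweet(raw))   # False

try:
    text + raw
except TypeError as exc:
    print("cannot mix str and bytes:", exc)
```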

  7. I am the author of FileFormat.Info and was horrified to be called out as an example of “encoding without a great deal of care”. I went to the source code to fix the problem, and discovered:

    1. I had originally taken a great deal of care getting this to work. I had tried about 6 different ways of doing it before finding one that worked.

    2. When I switched from Tomcat to Jetty (the server is running Java), I hit a bug in the Jetty encoder. This means that none of my pages work with astral characters, which is pretty disappointing.

    Sigh. So it will take longer than I expected to fix…

    • Hi fileformat. I’m ashamed now — I thought I had sent you an email about this bug before posting this, but it must have skipped my mind. Please don’t take offense. I was intending to point out a flaw in the underlying programming language (Java, as I suspected), not your programming. I can tell by your site that you care a great deal about getting Unicode right, and I want to thank you for providing the world with such a great user interface for learning about characters and their properties.

      I think your example proves my point a bit more: even if YOU take special care to get it right, you are relying on an ecosystem that simply is not built to think in terms of astral characters (in Java’s case), and so is likely to contain bugs at all levels.

  8. As soon as your article got to the “counting characters” topic, I had to stop reading.

    It’s no good to have an opinion on how programming languages should do things if you don’t know the first thing about unicode.

    In unicode, counting characters is less than useless: at best dangerous. It’s certainly no reason to put a clumsy, imperfect abstraction over the top of the encoding.

    Perl is a nightmare to use with unicode. Precisely because it attempts to do what you advocate.

    C on the other hand, is a dream, with most programs requiring zero changes to adapt from ascii to unicode.

    The reason for the difference is that C treats strings as UTF-8 encoded, as all languages should.

    • Hi anon. I’m not sure why you say that “counting characters is less than useless” — do you think that it would be more helpful to count a lower level of abstraction, like bytes? Or a higher level, like combining character sequences. I wrote a section later on combining character sequences.

      I think that counting bytes is important in certain key circumstances — when you are either a) dealing with binary (non-textual) data, or b) serializing text that is entering or exiting the program from another source, such as the network or files. In case (a), you should not be dealing with “string” objects at all — use a dedicated byte array type.

      For case (b), you need to be very careful. When you read or write strings, they can come in all manner of different encodings. You can’t blindly assume they are UTF-8. Let’s assume you’re using C. If you take some bytes from the Internet and dump them in an array, what if they aren’t UTF-8? You’ll have corrupted strings when you try to do anything with them. So the only correct course of action is to determine the encoding somehow (e.g., from the charset parameter of the Content-Type HTTP header) and convert them into UTF-8. Then you can treat your strings as UTF-8. So for strings entering and exiting the program, you had to do exactly the same amount of work that I would in a language like Python 3, with a separate Unicode string type. Once the strings are inside the program, there is no need to be concerned with the underlying encoding at all. In C, you still have to be concerned with that non-abstraction. In Python 3, I don’t have to worry about it.
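As a sketch of "deal with encoding at the edges" in Python 3: read the charset from a Content-Type header value, decode the incoming bytes once, and work with abstract strings from then on. The helper names `charset_from_content_type` and `decode_body` and the UTF-8 fallback are assumptions for illustration; the standard library's `email.message.Message` is used only as a convenient header parser.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)

def decode_body(raw, content_type):
    """Turn incoming bytes into an abstract string, once, at the boundary."""
    return raw.decode(charset_from_content_type(content_type))

raw = "caf\u00e9".encode("latin-1")   # bytes as they might arrive off the wire
text = decode_body(raw, "text/plain; charset=ISO-8859-1")
print(text, len(text))

try:
    raw.decode("utf-8")               # blindly assuming UTF-8 fails here
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc)
```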

      I haven’t used Perl, so I don’t know.

      “The reason for the difference is that C treats strings as UTF-8 encoded, as all languages should.”
      Please be very careful when you make statements like this. The C language specification says nothing about character encodings (other than to explicitly deny any particular one). A C char is a byte, and has no additional semantics. A C string is a sequence of bytes. You are free to interpret those bytes as UTF-8, but someone else passing a string into your function may not. Please search for “ecosystem” in the above article where I make this argument in full. I think you will find that a language like C, which only has a byte array type, is no match for a language like Python 3, which has a byte array type PLUS a proper Unicode string type.

  9. great post. as a Phoenician language user having to operate in the 10900-1091F range I find communicating with other computers extremely challenging. 𐤈𐤈𐤈𐤈𐤈𐤈𐤈𐤈𐤈𐤈

  10. Go uses the 32-bit rune type to represent Unicode code points. There are no leaky abstractions when working with []rune.

    The Go language specification is very clear that strings are sequences of bytes, not sequences of runes.

    Go provides built-in conversions between []rune and a UTF-8 encoded string. There are also standard packages for working with utf-8 and utf-16 encodings.

    • Apologies for not publishing this comment sooner … it was awaiting moderation and I don’t know why.

      Yes, Go has a dedicated rune type, and you could consider []rune to be the sort of data structure I am interested in. However, as I detailed in the section “But it isn’t supposed to be abstract — that’s what libraries are for”, it isn’t sufficient for a language to provide an abstract string type, but it also has to be the default. Otherwise, the majority of libraries will not accept or return it. Furthermore, it would be considered bad practice for me to write a library that deals in []rune wherever a string is expected. For example, it wouldn’t be possible to pass a string literal there.

      To fulfil my requirements, a language needs to have an abstract string type, and it has to be the default string type (which basically means it has to be the type you get when you write a string literal, and be called ‘string’ or ‘str’ or similar).

  11. Tom Christiansen made a bit more detailed analysis of Unicode support across languages:

  12. I agree with the above poster on the matter of counting characters (counting bytes makes sense for capacity reasons). There are very few instances where you would need to do that. Twitter is one of those few instances. In my subjective opinion this is the exception rather than the rule; most of the time you need byte length (nothing to do with characters).

    The problem you mentioned with sending/receiving strings wouldn’t occur in Go. In Go most transfer is done through Read/Write which both deal with byte arrays, never strings.

    This whole problem you speak of depends very much on habits. To many people (including myself) Go’s way of doing things just makes sense and is incredibly easy to use. There are very few rules to keep in mind if you want to implement Unicode correctly (which is an inherently difficult task, no matter the language):

    1. Don’t index into strings (why would you?); you pretty much always need to parse a string once (at least until the point of interest), you can keep the bits you’re interested in in substrings; if you really need to remember positions you can remember byte offsets but even this situation is twisted, normally you shouldn’t need to do that, just make a substring; iterate on strings using range (extremely simple to use)

    2. Count runes, don’t use len(string) when you want character count (in the few instances you would need this)

    3. When transferring data always use byte arrays.

  13. We’ve recently added full unicode support to our Rascal Meta-programming language. Our language runs on Java, and String does have newer methods just for dealing with the surrogate pairs. There is a discussion about the impact of using these APIs on performance, and the Java i18n docs also give good enough advice on how to handle this…

  14. Both the SBCL and CLISP implementations of Common Lisp support proper strings (as defined in this article). Common Lisp specifies strings to be sequences of characters, not bytes.

  15. Just try to come up with an “abstraction” that handles all cases and you’ll soon learn why the burden is still on the developer. There is no one-size-fits-all abstraction possible.

    Go ahead and post your “solutions” (if you actually have any) on a language mailing list and prepare to be educated.

    tl;dr: You are at level 1:

    • Care to elaborate? I’ve already made my argument in full, including precisely defining the abstraction I mean, giving an example implementation (Python 3.3), admitting a counterexample my abstraction can’t handle (Japanese, cf. Ruby 1.9), and listing a number of languages that work exactly the way I describe. Numerous people have also pointed out other such languages in the comments. Did you read the article or do you just enjoy going on blogs and calling authors incompetent?

  16. Great article, worth pointing people to. I’ve run into quite a few of these sorts of issues, in various languages. But the news isn’t all doom-and-gloom! I can offer you two positive examples – systems that *correctly* handle the full Unicode range. One’s a programming language, one’s not.

    Pike is an obscure language that’s mainly used for long-running servers (eg MUDs). Its inbuilt string type is actually very similar to a PEP 393 string – it can handle any codepoint, but its underlying representation varies according to the highest codepoint in it (if you’re curious, check out the String.width() and Pike.count_memory() functions). In both Python and Pike, strings are immutable; this makes this optimization practical – it’d be hopelessly slow if modifying a string could widen it, or (even worse) shrink it to a narrower representation.

    PostgreSQL also has perfect Unicode handling. And there’s a very handy Pike-Postgres module that lets me abstract the whole issue of Unicode away – presumably the data flows from one process to the other as a stream of bytes, but it doesn’t concern me. Both ends count string length in code points.

    The main downside of Pike is that, having a smaller community than many languages, it tends to lack clear documentation. If Python is “batteries included”, then Pike is “the batteries are there, just dig through the packing peanuts”. But it’s an excellent open-source language, and one that I heartily recommend to anyone who likes C-style syntax and Python-style semantics. I also strongly recommend Python 3.3 for anyone who likes Python-style syntax and Pike-style semantics. 🙂
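A small sketch of how the PEP 393 behaviour can be observed from Python, assuming CPython 3.3 or later (this is an implementation detail, not something the language guarantees). The helper `bytes_per_char` is made up for illustration: it estimates storage per character from how much `sys.getsizeof` grows when a string doubles in length.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character from the growth between n and 2n copies."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n

for label, ch in [("ASCII", "a"), ("BMP", "\u4e16"), ("astral", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

On CPython the three figures should come out increasingly wide, reflecting a representation chosen by the highest code point in the string, while `len()` still counts code points in every case.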

  17. I just tested, and both Haskell and Scala fail the unicode test. But in their case it’s a different unicode char:

    scala> val b = "𝐹"
    b: java.lang.String = 𝐹
    scala> println(b.length)
    scala> println(b)

    I think the simplest solution to this kind of issue is that only two languages really matter:

    C and Java, because most languages use these languages’ string implementations to handle unicode. C is already UTF-32 compliant. That just leaves Java.

    • Which version of Haskell are you using? With GHCi version 7.4.1:

      > let b = "𝐹"
      > length b
      > putStrLn b

      As for Scala, yes, it does fail the test, and I used it as an example in the above post. It is based on the JVM so it uses the same String representation as Java.

      “Only two languages really matter” — what? Why do only C and Java matter?

      “C is already UTF-32 compliant”. Not really.
      The wchar_t type is implementation-defined. It is allowed to be 8-bits, and it’s typically 16-bits on Windows and 32-bits on Unix. So that is NOT a portable way to get UTF-32 behaviour. Apparently there is a new (as of 2011) char32_t type, which is a good start, but it is far from useful as all existing APIs still accept either char* or wchar_t. For example, the basic string type in C++ is still defined as a vector of char, so it is not UTF-32 compliant. As I said in the above post, it is not sufficient for a language to provide a 32-bit char type. It has to be the default string representation, or the ecosystem will not be built to support it.
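To make the difference concrete, here is a sketch in Python 3 that counts the same astral character both ways. The helper `utf16_code_units` is a made-up name; it reports what a length based on UTF-16 code units, as on the JVM, would see.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2

b = "𝐹"  # the astral character from the comment above

print(len(b))               # 1
print(utf16_code_units(b))  # 2
```

A string type that exposes code units reports 2 for this one character, because it is stored as a surrogate pair, while a code-point-based type reports 1.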

  18. imho, insisting on “unicode everywhere” is stupid idea anyway. short-sighted.
    and it generates whole bunch of problems.
    sure you can “make money” by solving problems which you introduce earlier 😉
    and _thats_ is how IT and all industry after 1990 works… but…
    looks at it so: better way to get/save money is to not create problems for yourself.

    so, unicode… unicode is _badly designed_ encoding and may
    be used only for exchange of information, sometimes…
    but NOT for daily _processing_ it.
    and there exist better methods for representing multilanguage
    _documents_. simpler, efficient, and _less error prone_.

    i give you key idea – do not try to encode each character,
    it’s plainly stupid and meaningless.
    encode whole words and sentences, bigger parts of text in document.
    simply use normal _document_ format, not raw text.

    …or use very smarty Rosetta encoding, if you wish.
    in this multilanguage encoding text contains _words_ in different languages.
    not _characters_. each word has 7 bit characters in _same_ encoding.
    encoding defined by one of 127 _space codepage characters_, which set
    encoding for next word.
    get it? all processing is 8 bit, you can use mmx, you don’t have processor
    slowing jumps, use memory bus efficiently and without latency,
    you can get charAt() in O(1)…
    just permit to have more than one space character.
    ok, enough about unknown Rosetta…

    get the idea? do not encode _characters_. it’s meaningless.
    just attach encoding to larger units of text, use documents, not texts.

    because this is purpose of documents – to encode structure of text.
    by definition.

    • Wow. I don’t really know whether you’re trolling or not, but I’ll reply as if you’re serious. You seem to like this Rosetta encoding scheme, so I read the spec. (I assume this is what you’re talking about.) From my quick reading of it, it seems like an utter disaster. It’s trying to solve the same problem as Unicode already solved about five years earlier. Your rhetoric about Unicode simply doesn’t hold up.

      Here is why it is natural to deal with characters, and not words. Characters are the fundamental unit of text storage. Yes, characters (in most languages) have no individual meaning. But that isn’t the point. When we type with a keyboard, we type one character at a time — that means computer systems need to be able to store incomplete words. We select text on a per-character basis, so we need to be able to copy and paste incomplete words. Yes, Rosetta can handle all of these, but it is awkward and requires special programming support.

      When you say Unicode is error-prone because you can’t do processing on it, I assume you’re making the same argument as the Rosetta spec: “The lack of generic processing capabilities makes Unicode practically useless for such applications as databases, search engines, or text indexing because those applications routinely perform case conversion and lexicographical sorting.” Clearly that’s a load of crap because modern databases, search engines and text indices all use Unicode. What you and the Rosetta author seem to have missed is that Unicode has thorough (if a little complex) support for both sorting (see: collation) and case conversion, as well as conversion to and from ligatures (see: normalization). Yes, you do need to download a database file containing information about how to sort and case-convert on a character-by-character basis, and I can vaguely see how Rosetta might allow software to sort and convert without an external database. But there is so much added complexity in this scheme that it is hardly worth it.
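A brief sketch of the normalization and case-conversion support mentioned above, using Python 3's standard `unicodedata` module and `str.casefold`. Locale-aware sorting (collation) needs extra data and is not shown here; the sample strings are just illustrations.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
composed = "\u00e9"      # precomposed "é"

print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKC", ligature))               # fi

# Case-insensitive comparison that knows about ß.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())      # True
```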

      Here are some of the problems that Rosetta introduces:
      – It is a stateful protocol, with an explicit switch between languages. Any given byte could have one of hundreds of different meanings depending on the word context. Compare to UTF-8 where each byte identifies itself as the start of a character or a continuation, and if I find myself mid-stream, I can pick up at the next character.
      – It is vastly more complex. Just look at the code at the end of the document — the getc code has twelve separate cases, and the Rune struct needs to carry around a language ID. UTF-8 has just four cases when reading a character.
      – It has arbitrary limits like words cannot be more than 127 octets, languages cannot have more than 16,384 characters, and words cannot have characters from more than one language. (Is there a reason to go over these limits? Maybe, maybe not, but the best computer systems do not impose unnecessary limitations.)
      – The concept of a substring does not work with Rosetta encoding. Say that I select some text including the end of one word and the start of the next, then Ctrl+C it. Now a naive implementation would copy just the selected bytes and not know what language the word was in, so you need special code to deal with it. Now say that I do a search on the page for a particular string — I want to find all instances of that string, not just whole words. Your find algorithm can’t just do a byte match. It needs to deal with all of the complexities of word boundaries and language switching. Programming against Rosetta-encoded strings is batshit insane.
      – There may be multiple ways to represent the same character in different languages. For example, there will presumably be a “German” language, and German has the letter “e” just as English and a huge range of other languages do. How will you be able to say “English e is equal to German e”?
      – It lacks many of the advanced features of Unicode. For example, there doesn’t appear to be any real support for combining characters. Unicode allows arbitrary combining characters such as accents to be applied to other characters, but the Rosetta spec glosses over ligatures, saying “ligatures are purely [a] rendering issue, and should not be assigned separate letter codes.” But this ignores other combining characters. For example, Sinhalese requires special combining characters to be applied (sometimes more than one per base character), so you need to assign code points to combining characters, and an algorithm for how to apply them and in what order. They are not just a rendering issue; they have semantics.
      – The bidirectional support is insane. Words are always left-to-right but letters within a word are stored right-to-left for RTL languages? There is a reason Unicode stores right-to-left characters in natural order — because that’s the order you type them and that’s also the order you sort them. Rosetta would require every program (not just font rendering) to deal with the complexities of right-to-left. This also destroys any advantage you get with sorting, because now you need a database to know which languages are right-to-left.
      – Most importantly: Unicode is already everywhere. Almost every computer system on the planet now understands the concept of a Unicode code point. Every font has a table allowing you to look up a code point and get a glyph. Almost all modern software and programming languages store text in Unicode, and only deal with other encodings at the boundary. The purpose of Unicode is to unify all of the existing standards, and it has succeeded emphatically. Rosetta is incompatible with every other system on the planet (except ASCII). It is vastly inferior simply because it is not in use, and for a language system that is kind of important.
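The self-synchronizing property of UTF-8 mentioned in the first point can be sketched in a few lines of Python 3. The function names are made up for illustration. Continuation bytes always have the bit pattern 10xxxxxx, so from any offset you can skip forward to the next character boundary without any other context.

```python
def is_continuation(byte):
    """True if this byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0b11000000) == 0b10000000

def next_boundary(data, pos):
    """Return the first offset >= pos that starts a character (or len(data))."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

data = "a世界".encode("utf-8")   # 1 byte, then two 3-byte characters
start = next_boundary(data, 2)   # offset 2 lands in the middle of 世

print(start)                         # 4
print(data[start:].decode("utf-8"))  # 界
```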

      Neither you nor the Rosetta spec has really identified any problems with UTF-8 that are solved by Rosetta, except perhaps the space usage. And even that is dubious — for non-ASCII text, you appear to need three extra bytes per word to identify the language, so for two-byte languages like Chinese, that is a net loss of 3 bytes per word. For European languages, the vast majority of letters are ASCII, so UTF-8 wins easily. Only for non-European single-byte languages (such as Cyrillic and Arabic) are you more compact than UTF-8, and even then, only for words of four letters or more. On balance, since I’d wager the vast majority of Internet traffic is ASCII or Chinese, I’d say UTF-8 wins.

      • I wouldn’t criticize Rosetta too much; in 1997, Unicode’s popularity wasn’t as well established, and certainly UTF-8 wasn’t as ubiquitous as it is now; most people’s understanding of “Unicode” was “it inserts nulls between all the characters in my string”. Nothing wrong with having a competing protocol.

        But in 2013, there’s absolutely no reason to go for Rosetta. It offers little or nothing that Unicode and UTF-8 don’t, and is arbitrarily incompatible with the rest of the world. A programming language that uses UTF-32 as its internal coding needs only to do a simple encode/decode to turn that into a globally-comprehensible stream of bytes (take one character, output 1-4 bytes UTF-8); one that uses Rosetta internally has to do a complete lookup and translation. That’s going to make it pretty much useless for anything involving the internet, or displaying text on a console, or any modern GUI, or anything. Hmm. Here’s an idea: Reimplement GNU Readline to work with Rosetta strings. See how much work *that* would be.
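The "take one character, output 1-4 bytes" step is small enough to write out by hand. This is an illustrative sketch in Python 3 with made-up function names; in practice you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(cp):
    """Encode one Unicode code point as 1-4 UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:          # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def encode_code_points(code_points):
    """Encode a sequence of code points (a UTF-32-style array) as UTF-8."""
    return b"".join(encode_utf8(cp) for cp in code_points)

print(encode_code_points([ord(c) for c in "Hello, 世界"]))
```

There are exactly four cases and no lookup tables, which is why converting between a fixed-width internal form and UTF-8 is cheap.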

        • Yes, good point that it was 1997. However, I still note that even though “most people”’s understanding of Unicode was UCS-2, UTF-8 was available since 1993, and Rosetta clearly talks about UTF-8 when it says “Unicode”. While there is nothing wrong with a competing protocol, I don’t think the authors of that spec really understood Unicode when they wrote it, nor did they offer any real advantages over UTF-8, whilst adding a huge amount of complexity.

          (And this is 2013 and apparently some people still think Rosetta is a good idea…)

  19. > I assume this is what you’re talking about

    yep, those rosetta. no, i not trolling)) don’t have time for it.
    just try to say some again about unicode alternatives.

    and i _really_ think that unicode is not solution, but problem.
    it is used over the world now _only_ because java use it.
    java get it from plan9. who now knows about plan9 and inferno (forget android)?
    but java pushed everywhere now. and so unicode is.

    and i _really_ think that algorithms processing of texts must do
    it in 8bit fixed charlen. that’s enough for all “european languages”,
    and thats 90% of world languages.
    it’s better to leave chinese/japanese problems to chinese/japanese programmers))

    do not let specific of keyboard _input_ of text or _transfer_of texts…
    to complicate _processing_ and _storage_ of text. thats different tasks with
    different requirements. unicode and xml is for transfer, not for processing.
    “unicodisation” of programs was painful and do not finished until now.
    “rosettisation” is simpler. just replace “char=space” with “char>127” everywhere.
    (just joking)
    utf8 is also “stateful”, at letter level. its not better than rosetta here.
    algorithm is same here and also can be placed to libs.

    to make it clear – rosetta _fixed_ width _8_ bit.
    1) fixed is means less branching in cycles.
    that’s good for memory latency and efficient caching. and mmx-able.
    2) 8 bit is good for memory throughput.

    modern compiler optimisations is all about memory access patterns.
    utf8 bad for latency because it complicates those patterns
    (on other side fixed unicode have good patterns but need bigger throughput).
    you want more characters? really? make 16-bit rosetta)) just joking.

    looks like unicode is just one more argument to hardware upgrade? ))
    yep, this way commercial world works now. it “stimulating” needs.

    yes, all unicode sorting procedures etc is already exist in
    libraries. so what? inefficiencies remains.
    it just now hidden in those libraries or bulky language runtimes.
    rosetta _also_ can have all such libraries, and those will be more
    efficient on modern hardware. that’s the point.
    unicode is not “magic snap, voila!”. it has it’s price.
    no free lunch, as usual.

    ok, i stop now. no time for flaming.
    i do _not_ try to praise rosetta.
    i just want to say that unicode and utf8 badly designed.
    and unicode also have many many arbitrary limits.
    it’s not rosetta is interesting. it’s unicode deficiencies is interesting.
    it’s interesting that with unicode we need to deal with all those problems everywhere.
    you _always_ need to know about ligatures, code points, etc.
    with old 8-bit language specific encodings in 90% of cases we was able
    just work and not bother. having simple common case is good engineering.

    … well, enough about all that simple things…world is simple. just open eyes 😉
    …sure we will continue to use unicode in future.
    _use_, but must not _praise_. thats’s all i want to say.
    there is nothing in unicode that can be _praised_.

    we have many errors-of-past to be compatible with.
    so better think like “one more old error do not make situation much worse”))
    yep, unicode is just one more mistake, i think. nonsolution for nonproblem.

  20. I am excited to see the hard work on Twitter’s character counting used as a positive example (full disclosure: I’m the author). I spent much of my early time at Twitter wading through this same topic and coming to the same conclusion. Great write up I recommend for all new engineers, especially on the JVM where they seem to be convinced the problem is “solved”. Worth noting is the recent rise in Emoji use outside of SMS and Twitter, which include astral characters. That trend is uncovering these issues left and right.

    • Thanks, it’s good to hear from the guy that wrote that spec :). Interesting post about MySQL. I haven’t seen astral problems with UTF-8 before. I assumed that UTF-16 was a natural breeding ground for astral problems because a naive implementation completely ignores surrogate pairs. I assumed that UTF-8 would avoid these problems because you have to implement multi-byte encoding anyway, and the 4-byte encoding is just as simple as the 3-byte encoding. It’s disheartening (but I suppose not surprising) that people found a way to screw it up.

      • It’s crazy. Somehow they managed to support UTF-8 but not all of it… and then when the full range was supported, instead of making it a bug-fix to the “utf8” charset, they left that one buggy and made a new one. So you have to say “utf8mb4”, PLUS it’s not the default. Thank you very much.

        I’ve moved almost completely off MySQL (just a few legacy things remaining), and intend to avoid using it anywhere in future. This isn’t its only problem.

        • In 2016, ditching MySQL because of encoding is still the correct answer. Just saying. What PHP is to programming languages, MySQL is to databases.
