Archive for the ‘Language design’ Category


The importance of language-level abstract Unicode strings

In Language design,Language implementation,Unicode on April 19, 2012 by Matt Giuca Tagged: ,

Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

Unicode is a beautiful thing. Where in the past, we had to be content with text sequences of merely 128 different characters, we can now use over 110,000 characters from almost every script in use today or throughout history. But there’s a problem: despite the fact that Unicode has been around and relatively unchanged for the past 16 years, most modern programming languages do not give programmers a proper abstraction over Unicode strings, which means that we must carefully deal with a range of encoding issues or suffer a plethora of bugs. And there are still languages coming out today that fall into this trap. This blog post is a call to language designers: your fundamental string data type should abstract over Unicode encoding schemes, such that your programmer sees only a sequence of Unicode characters, thus preventing most Unicode bugs.

Over the past few years, I have encountered a surprising amount of resistance to this idea, with counter-arguments including performance reasons, compatibility reasons and issues of low-level control. In this post, I will argue that having a consistent, abstract Unicode string type is more important, and that programmers can be given more control in the special cases that require it.

To give a quick example of why this is a problem, consider the Hello World program on the front page of the Go programming language website. This demo celebrates Go’s Unicode support by using the Chinese word for “world”, “世界”:

fmt.Println("Hello, 世界")

But while Go ostensibly has Unicode strings, they are actually just byte strings, with a strong mandate that they be treated as UTF-8-encoded Unicode text. The difference is apparent if we wrap the string in the built-in len function:

fmt.Println(len("Hello, 世界"))

This code, perhaps surprisingly, prints 13, where one might have expected it to print 9. This is because while there are 9 characters in the string, there are 13 bytes if the string is encoded as UTF-8: in Go, the string literal “Hello, 世界” is equivalent to “Hello, \xe4\xb8\x96\xe7\x95\x8c”. Similarly, indexing into the string uses byte offsets, and retrieves single bytes instead of whole characters. This is a dangerous leaky abstraction: it appears to be a simple sequence of characters until certain operations are performed. Worse, English-speaking programmers have a nasty habit of only testing with ASCII input, which means that any bugs where the programmer has presumed to be dealing with characters (but is actually dealing with bytes) will go undetected.

In a language that properly abstracts over Unicode strings, the above string should have a length of 9. Programmers should only be exposed to encoding details when they explicitly ask for it.

The folly of leaky abstraction

The fundamental problem lies in the difference between characters and code units. Characters are the abstract entities that correspond to code points. Code units are the pieces that represent a character in the underlying encoding, and they vary depending on the encoding. For example, consider the character OSMANYA LETTER BA or ‘𐒁’ (with code point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we would say this character is represented by four code units in UTF-8. In UTF-16, the same character is represented by two code units: D801 DC81; code units are 16-bit numbers. The problem with many programming languages is that their string length, indexing and, often, iteration operations are in code units rather than in characters (or code points).

(An aside: I find it interesting that the Unicode Standard itself defines a “Unicode string” as “an ordered sequence of code units.” That’s code units, not code points, suggesting that the standard itself disagrees with my thesis that a string should be thought of as merely a sequence of code points. This section of the standard seems concerned with how the strings should be represented, not the programming interface for accessing such strings, so I feel that it doesn’t directly contradict this post. This post is about the string interface, not the underlying representation.)

There are three main types of culprit languages that exhibit this issue:

  1. Languages with byte-oriented string operations and no specified encoding. These leave Unicode support up to programmers and third-party library authors, resulting in a general mish-mash of encoding across different code bases. These include C, Python 2, Ruby, PHP, Perl, Lua, and almost all pre-Unicode (1992) programming languages.
  2. Languages with code-unit-oriented string operations and no specified encoding. These are probably the worst offenders, as the language manages the encoding, but leaks abstraction details that the programmer has no control over (for example, different builds of the compiler may have different underlying encoding schemes, affecting the behaviour of programs). These include certain builds of Python 3.2 and earlier, and Mercury.
  3. Languages with code-unit-oriented string operations, but a specific encoding scheme. These languages at least have well-defined behaviour, but still expose the programmer to implementation details, which can result in bugs. Most modern languages fit into this category, including Go (UTF-8), JVM-based languages such as Java and Scala (UTF-16), .NET-based languages such as C# (UTF-16) and JavaScript (UTF-16).

Edit: A number of comments indicate that some of the other languages have optional Unicode support. In Perl 5, you can write “use utf8” to turn on proper Unicode strings. In Ruby 1.9, you can attach encodings to strings to make them behave better. In Python 2, as I’ll get to later, you have a separate Unicode string type.

Above, I talked about the leaky abstraction of UTF-8 in Go, but it is equally important to recognise the leaky abstraction of UTF-16 in many modern language frameworks; in particular, the Java platform (1996) and the .NET platform (2001). Both treat strings as a sequence of UTF-16 code units (and the corresponding “char” data type as a single UTF-16 code unit). This is disappointingly close to the ideal. In Java, for instance, you can essentially treat a string to be a sequence of characters. ‘e’ (U+0065) is a character, and ‘汉’ (U+6C49) is a character — it feels like the language is abstracting over Unicode characters, until you realise that it isn’t. ‘𐒁’ (U+10481) is not a valid Java character. It is an astral character (one that cannot be represented in a single UTF-16 code unit) — it needs to be represented by the two-code-unit sequence D801 DC81. So it is possible to store this character in a Java string, but not without understanding the underlying encoding details. For example, the Java string “𐒁” has length 2, and worse, if you ask for “charAt(0)” you will get the character U+D801; “charAt(1)” gives U+DC81. Any code which directly manipulates the characters of the string is likely to fail in the presence of astral characters. Java has the excuse of history: Java was first released in January 1996; Unicode 2.0 was released in July 1996, and before that, there were no astral characters (so for six months, Java had it right!). The .NET languages, such as C#, behave exactly the same, and don’t have the excuse of time. JavaScript is the same again. This is perhaps worse than the UTF-8 languages, because even non-English-speaking programmers are likely to forget about astral characters: there are no writing systems in use today that use astral characters. But astral characters include historical scripts of interest to historians, emoticons which may be used by software, and large numbers of mathematical symbols which are commonly used by mathematicians, including myself, so it is important that software gets them right.

I must stress how harmful the second class of languages are in terms of string encoding. In 2011, the Mercury language switched from a type 1 language (strings are just byte sequences) to a type 2 language — the language now specifies all of the basic string operations in terms of code units, but does not specify the encoding scheme. Worse, Mercury has several back-ends — if compiling to C or Erlang, it uses UTF-8, whereas compiling to Java or C# results in UTF-16-encoded strings. Now programmers must contend with the fact that string.length(“Hello, 世界”) might be 13 or 9 depending on the back-end. (Fortunately, character-oriented alternatives, such as count_codepoints, are provided.) This is a major blow to code portability, which I would recommend language designers avoid at all costs.

Python too has traditionally had this problem: the interpreter can be compiled in “UCS-2” or “UCS-4” modes, which represent strings as UTF-16 or UTF-32 respectively, and expose those details to the programmer (the UCS-2 build behaves much like the other UTF-16 languages, while the UCS-4 build behaves exactly as I want, with one character per code unit). Fortunately, Python is about to correct this little wart entirely with the introduction of PEP 393 in upcoming version 3.3. This version will remove UCS-2/UCS-4 build, so all future versions will behave as the UCS-4 build did, but with a nice optimisation: all strings with only Latin-1 characters are encoded in Latin-1; all strings with only basic multilingual plane (BMP) characters are encoded in UCS-2 (UTF-16); all strings with astral characters are encoded in UTF-32. (It’s more complicated than that, but that’s the gist of it.) This ensures the correct semantics, but allows for compact string representation in the overwhelmingly common case of BMP-only strings.

Somewhat surprisingly, Bash (1989) appears to support character-oriented Unicode strings. The only other languages I have found which properly abstract over Unicode strings are from the functional world: Haskell 98 (1998) and Scheme R6RS (2007). Haskell 98 specifies that “the character type Char is an enumeration whose values represent Unicode characters” (with a link to the Unicode 5.0 specification). Scheme R6RS was the first version of Scheme to mention Unicode, and got it right on the first go. It specifies that “characters are objects that represent Unicode scalar values,” and goes into details explicitly stating that a character value is in the range [0, D7FF16] ∪ [E00016, 10FFFF16], then defines strings as sequences of characters.

With only four languages that I know of properly providing character-oriented string operations by default (are there any more?), this is a pretty poor track record. Sadly, many modern languages are being built on top of either the JVM or .NET framework, and so naturally absorb the poor character handling of those platforms. Still, it would be nice to see some more languages that behave correctly.

Some real-world problems

So far, the discussion has been fairly academic. What are some of the actual problems that have come about as a result of the leaky abstraction?

Edit: I mention a heap of technologies here, in a way that could be interpreted to be derisive. I don’t mean any offense towards the creators of these technologies, but rather, I intend to point out how difficult it can be to work with Unicode on a language that doesn’t provide the right abstractions.

A memorable example for me was a program I worked on which sent streams of text from a Python 2 server (UTF-8 byte strings) to a JavaScript client (UTF-16 strings). We had chosen the arbitrary message size of 512 bytes, and were chopping up text arbitrarily on those boundaries, and sending them off encoded in UTF-8 to the client, where they were concatenated back together. This almost worked, but we discovered a strange problem: some non-ASCII characters were being messed up some of the time. It occurred if you happened to have a multi-byte character split across a 512-byte boundary. For example, “ü” (U+00FC) is encoded as hex C3 BC. If byte 511 of a message was C3 and byte 512 was BC, the first packet would be sent ending in the byte C3 (which JavaScript wouldn’t know how to convert to UTF-16), and the second packet would be sent beginning with the byte BC (which JavaScript also wouldn’t know how to convert to UTF-16). So the result is “��”. In general, if a language exposes its underlying code units, then strings may not be split, converted to a different encoding, and re-concatenated.

An obvious example of an operation that goes wrong in these situations is the length operator. If you use UTF-8, Chinese characters are going to each report a length of 3, while in UTF-16, astral characters will each report a length of 2. Think this doesn’t matter? What about Twitter? On Twitter, you have to type messages into 140 characters. That’s 140 characters, not 140 bytes. Imagine if Twitter ate up three “characters” every time you hit a key (if you spoke Chinese, for example). Fortunately, Twitter does it correctly, even for astral characters. I just tweeted the following:

𝐓𝐰𝐢𝐭𝐭𝐞𝐫 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐚𝐬𝐭𝐫𝐚𝐥 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐬 — 𝐞𝐚𝐜𝐡 𝐨𝐧𝐞 𝐜𝐨𝐮𝐧𝐭𝐬 𝐚𝐬 𝐚 𝐬𝐢𝐧𝐠𝐥𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫. 𝐈𝐭’𝐬 𝐧𝐨𝐭 𝐭𝐡𝐞 𝐬𝐢𝐳𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐭𝐰𝐞𝐞𝐭 𝐭𝐡𝐚𝐭 𝐜𝐨𝐮𝐧𝐭𝐬, 𝐢𝐭’𝐬 𝐡𝐨𝐰 𝐲𝐨𝐮 𝐮𝐬𝐞 𝐢𝐭.

Notice that the text is unusually bold — these are no ordinary letters. They are MATHEMATICAL BOLD letters, which are astral characters. For example, the ‘𝐓’ is U+1D413 MATHEMATICAL BOLD CAPITAL T. The tweet contains exactly 140 characters — 28 ASCII, 3 in the basic multilingual plane, and 109 astral. In UTF-16, it comprises 249 code units, while in UTF-8, it comprises 473 bytes. Yet it was accepted in Twitter’s 140 character limit, because they count characters. I would guess that Twitter’s counting routines are written in both Scala and JavaScript, which means that in both versions, the engineers would need to have specially considered surrogate pairs as a single character. Had Twitter been written in a more Unicode-aware language, they could just have called the length method and not given it another thought.

Let’s quickly look at the way Tomboy (a note taking app written in C# on the Mono/.NET platform) loads notes from XML files. I discovered a bug in Tomboy where the entire document would be messed up if you type an astral character, save, quit, and load. It turns out that one function is keeping an int “offset” which counts the number of “characters” (actually code units) read, and passes the offset to another function, which “returns the location at a particular character offset.” This one really means character. Needless to say, one astral character means that the character offset is too high, resulting in a complete mess. Why is it the programmer’s responsibility to deal with these concepts?

Other software I’ve found that doesn’t handle astral characters correctly includes Bugzilla (ironically, I discovered this when it failed to save my report of the above Tomboy bug) and (an otherwise-fantastic resource for Unicode character information). The latter shows what can happen when a UTF-16 string is encoded to UTF-8 without a great deal of care, presumably due to an underlying language exposing a UTF-16 string representation.

A common theme here is programmer confusion. While these problems can be solved with enough effort, the leaky abstraction forces the programmer to expend effort that could be spent elsewhere. Importantly, it makes documentation more difficult. Any language with byte- or code-unit-oriented strings forces programmers to clarify what they mean every time they say “length” or “n characters”. Only when the language completely hides the encoding details are programmers free to use the phrase “n characters” unambiguously.

What about performance?

The most common objection to character-oriented strings is that it impacts the performance of strings. If you have an abstract Unicode string that uses UTF-8 or UTF-16, your indexing operation (charAt) goes from O(1) to O(n), as you must iterate over all of the variable-length code points. It is easier to let the programmer suffer an O(1) charAt that returns code units rather than characters. Alternatively, if you have an abstract Unicode string that uses UTF-32, you’ll be fine from a performance standpoint, but you’ve made strings take up two or four times as much space.

But let’s think about what this really means. The performance argument really says, “it is better to be fast than correct.” That isn’t something I’ve read in any software engineering book. Conventional wisdom is the opposite: 1. Be correct — in all cases, not just the common ones, 2. Be fast (here it is OK to optimise for the common cases). While performance is an important consideration, is unbelievable to me that designers of high-level languages would prioritise speed over correctness. (Again, you could argue that a code unit interface is not “incorrect,” merely “low level,” but hopefully this post shows that it frequently results in incorrect code.)

I don’t know exactly what the best implementation is, and I’m not here to tell you how to implement it. But for what it’s worth, I don’t consider UTF-32 to be a tremendous burden in this day and age (in moving to 64-bit platforms, we doubled the size of all pointers — why can’t we also double the size of strings?) Alternatively, Python’s PEP 393 shows that we can have correct semantics and optimise for common cases as well. In any case, we are talking about software correctness here — can we get the semantics right first, then worry about optimisation?

But it isn’t supposed to be abstract — that’s what libraries are for

The second most common objection to character-oriented strings is “but the language just provides the basic primitives, and the programmer can build whatever they like on top of that.” The problem with this argument is that it ignores the ecosystem of the language. If the language provides, for example, a low-level byte-string data type, and programmers can choose to use it as a UTF-8 string (or any other encoding of their choice), then there is going to be a problem at the seams between code written by different programmers. If you see a library function that takes a byte-string, what should you pass? A UTF-8 string? A Latin-1 string? Perhaps the library hasn’t been written with Unicode awareness at all, and any non-ASCII string will break. At best, the programmer will have written a comment explaining how you should encode Unicode characters (and this is what you need to do in C if you plan to support Unicode), but most programmers will forget most of the time. It is far better to have everybody on the same page by having the main string type be Unicode-aware right out of the box.

The ecosystem argument can be seen very clearly if we compare Python 2 and 3. Python 3 is well known for better Unicode support than its predecessor, but fundamentally, it is a rather simple difference: Python 2’s byte-string type is called “str”, while its Unicode string type is called “unicode”. By contrast, Python 3’s byte-string type is called “bytes”, while its Unicode string type is called “str”. They behave pretty much the same, just with different names. But in practice, the difference is immense. Whenever you call the str() function, it produces a Unicode string. Quoted literals are Unicode strings. All Python 3 functions, regardless of who wrote them, deal with Unicode, unless the programmer had a good reason to deal with bytes, whereas in Python 2, most functions dealt with bytes and often broke when given a Unicode string. The lesson of Python 3 is: give programmers a Unicode string type, make it the default, and encoding issues will mostly go away.

What about binary strings and different encodings?

Of course, programming languages, even high-level ones, still need to manipulate data that is actually a sequence of 8-bit bytes, that doesn’t represent text at all. I’m not denying that — I just see it as a totally different data structure. There is no reason a language can’t have a separate byte array type. UTF-16 languages generally do. Java has separate types “byte” and “char”, with separate semantics (for example, byte is a numeric type, whereas char is not). In Java, if you want a byte array, just say “byte[]”. Python 3, similarly, has separate “bytes” and “str” types. JavaScript recently added a Uint8Array type for this purpose (since JavaScript previously only had UTF-16 strings). Having a byte array type is not an excuse for not having a separate abstract string type — in fact, it is better to have both.

Similarly, a common argument is that text might come into the program in all manner of different encodings, and so abstracting away the string representation means programmers can’t handle the wide variety of encodings out there. That’s nonsense — of course even high-level languages need encoding functions. Python 3 handles this ideally — the string (str) data type has an ‘encode’ method that can encode an abstract Unicode string into a byte string using an encoding of your choice, while the byte string (bytes) data type has a ‘decode’ method that can interpret a byte string using a specified encoding, returning an abstract Unicode string. If you need to deal with input in, say, Latin-1, you should not allow the text to be stored in Latin-1 inside your program — then it will be incompatible with any other non-Latin-1 strings. The correct approach is to take input as a byte string, then immediately decode it into an abstract Unicode string, so that it can be mixed with other strings in your program (which may have non-Latin characters).

What about non-Unicode characters?

There’s one last thorny issue: it’s all well and good to say “strings should just be Unicode,” but what about characters that cannot be represented in Unicode? “Like what,” you may ask. “Surely all of the characters from other encodings have made their way into Unicode by now.” Well, unfortunately, there is a controversial topic called Han unification, which I won’t go into details about here. Essentially, some languages (notably Japanese) have borrowed characters from Chinese and changed their appearance over thousands of years. The Unicode consortium officially considers these borrowed characters as being the same as the original Chinese, but with a different appearance (much like a serif versus a sans-serif font). But many Japanese speakers consider them to be separate to the Chinese characters, and Japanese character encodings reflect this. As a result, Japanese speakers may want to use a different character set than Unicode.

This is unfortunate, because it means that a programming language designed for Japanese users needs to support multiple encodings and expose them to the programmer. As Ruby has a large Japanese community, this explains why Ruby 1.9 added explicit encodings to strings, instead of going to all-Unicode as Python did. I personally find that the advantages of a single high-level character sequence interface outweigh the need to change character sets, but I wanted to mention this issue anyway, by way of justifying Ruby’s choice of string representation.

What about combining characters?

I’ve faced the argument that my proposed solution doesn’t go far enough in abstracting over a sequence of characters. I am suggesting that strings be an abstract sequence of characters, but the argument has been put to me that characters (as Unicode defines them) are not the most logical unit of text — combining character sequences (CCSes) are.

For example, consider the string “𐒁̸”, which consists of two characters: ‘𐒁’ (U+10481 OSMANYA LETTER BA) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY). What is the length of this string?

  • Is it 6? That’s the number of UTF-8 bytes in the string (4 + 2).
  • Is it 3? That’s the number of UTF-16 code units in the string (2 + 1).
  • Is it 2? That’s the number of characters (code points) in the string.
  • Is it 1? That’s the number of combining character sequences in the string.

My argument is that the first two answers are unacceptable, because they expose implementation details to the programmer (who is then likely to pass them on to the user) — I am in favour of the third answer (2). The argument put to me is that the third answer is also unacceptable, because most people are going to think of “𐒁̸” as a single “character”, regardless of what the Unicode consortium thinks. I think that either the third or fourth answers are acceptable, but the third is a better compromise, because we can count characters in constant time (just use UTF-32), whereas it is not possible to count CCSes in constant time.

I know, I know, I said above that performance arguments don’t count. But I see a much less compelling case for counting CCSes over characters than for counting characters over code units.

Firstly, users are already exposed to combining characters, but they aren’t exposed to code units. We typically expect users to type the combining characters separately on a keyboard. For instance, to type the CCS “e̸”, the user would type ‘e’ and then type the code for ‘◌̸’. Conversely, we never expect users to type individual UTF-8 or UTF-16 code units — the representation of characters is entirely abstract from the user. Therefore, the user can understand when Twitter, for example, counts “e̸” as two characters. Most users would not understand if Twitter were to count “汉” as three characters.

Secondly, dealing with characters is sufficient to abstract over the encoding implementation details. If a language deals with code units, a program written for a UTF-8 build may behave differently if compiled with a UTF-16 string representation. A language that deals with characters is independent of the encoding. Dealing with CCSes doesn’t add any further implementation independence. Similarly, dealing with characters is enough to allow strings to be spliced, re-encoded, and re-combined on arbitrary boundaries; dealing with CCSes isn’t required for this either.

Thirdly, since Unicode allows arbitrarily long combining character sequences, dealing with CCSes could introduce subtle exploits into a lot of code. If Twitter counted CCSes, a user could write a multi-megabyte tweet by adding hundreds of combining characters onto a single base character.

In any case, I’m happy to have a further debate about whether we should be counting characters or CCSes — my goal here is to convince you that we should not be counting code units. Lastly, I’ll point out that Python 3 (in UCS-4 build, or from version 3.3 onwards) and Haskell both consider the above string to have a length of 2, as does Twitter. I don’t know of any programming languages that count CCSes. Twitter’s page on counting characters goes into some detail about this.


My point is this: the Unicode Consortium has provided us with a wonderful standard for representing and communicating characters from all of the world’s scripts, but most modern languages needlessly expose the details of how the characters are encoded. This means that all programmers need to become Unicode experts to make high-quality internationalised software, which means it is far less likely that programs will work when non-ASCII characters, or even worse, astral characters, are used. Since most English programmers don’t test with non-ASCII characters, and almost nobody tests with astral characters, this makes it pretty unlikely that software will ever see such inputs before it is released.

Programming language designers need to become Unicode experts, so that regular programmers don’t have to. The next generation of languages should provide character-oriented string operations only (except where the programmer explicitly asks for text to be encoded). Then the rest of us can get back to programming, instead of worrying about encoding issues.


Why Python’s whitespace rule is right

In Language design,Python on October 18, 2011 by Matt Giuca Tagged: , , ,

Python is famous among programming languages for its fairly unique syntax: rather than being delimited by curly braces or “begin/end” keywords, blocks are delimited by indentation. Indenting a line is like adding an opening curly brace, and de-denting is like a closing curly brace. When people criticise Python, it is usually the first complaint: “why would I want to use a language which requires me to indent code?” Indeed, while programmers are very used to indenting their code, they are very un-used to being forced to do so, and I can understand why they may take it as an insult that a language tells them how to write code. I don’t usually like to get into syntax arguments, because I find them very superficial — it is much more important to discuss the semantics of a language than its syntax. But this is such a common argument among Python detractors, I wanted to address it. Python is right, and it’s just about the only language that is.

I think the rub is that programmers like to think of languages as a tool, and tools should be as flexible as possible. I think in general it is a good principle for programming languages not to enforce conventions. Languages that do tend to annoy people who don’t subscribe to the same conventions. For example, the Go programming language enforces the “One True Brace Style” — every opening curly brace must appear on the same line as the function header or control statement. This irritates me because that’s not my preferred convention. But the indentation convention is so universal that it is considered bad programming practice to not indent in all cases. (There is disagreement over tabs vs spaces, the number of spaces, etc, but we all agree that indentation is good.) There is not a single situation in any country, in any programming language, or at any skill level, in which is it acceptable to not indent your code the way Python requires it. Therefore, it is technically redundant to have a language that is not whitespace-sensitive. Any language that is not whitespace-sensitive requires (by universal convention) that programmers communicate the scoping of the code in two distinct manners for every single line of code: braces (or begin/end) and indentation. You are required to make sure that these two things match up, and if you don’t, then you have a program that doesn’t work the way it looks like it works, and the compiler isn’t going to tell you.

There are two solutions to this problem. 1: Make the compiler tell you. Force the programmer to indent and put in curly braces, and have the compiler check the indentation and give either a warning or error if they don’t match up. Now you’ve solved the problem of accidentally getting it wrong, but now what is the point of requiring curly braces at all? The programmer would just be doing extra work to please the compiler. We may as well go with 2: take out the curly braces and just have the compiler determine the blocks based on indentation.

When you really analyse it, Python’s whitespace sensitivity is actually the only logical choice for a programming language, because you only communicate your intent one way, and that intent is read the same way by humans and computers. The only reason to use a whitespace-insensitive language is that that’s the way we’ve always done things, and that’s never a good reason. That is why my programming language, Mars, has the same indentation rule as Python.

* * *

An interesting aside: there is a related syntax rule in Python which doesn’t seem quite so logical: you are required to place a colon at the end of any line preceding an indent. I haven’t fully tested this, but I’m pretty sure there is no technical reason for that (the parser could still work unambiguously without that colon), and it doesn’t seem to add much to the readability either. I slavishly followed this rule in Mars too, because as a Python programmer it “feels right” to me. But perhaps it would be better to drop it.


A Ruby porting challenge

In Language design on January 27, 2011 by Matt Giuca Tagged: ,

Having an in-agreement discussion with Tim about Ruby. I’ll write a full post later, but for now (to make sure I don’t make an ass of myself later, if anyone has a better answer), who can port this small Python function to Ruby?

def allprefixes(x):
    prefixes = []
    for i in range(len(x)+1):
    return prefixes

Expected output:

>>> allprefixes([1,2,3])
[[], [1], [1, 2], [1, 2, 3]]

See the problem? Any non-hacky solutions? (I have mine, but it is hacky, and I’ll save it for the follow-up post.)

Edit: Arg, OK I just found out about Ruby’s … operator. That makes things significantly easier (having previously known only about the .. operator). Never mind then! But it is still a nasty default.


Still totally undecided about implementation inheritance

In Language design,Object-orientation on September 14, 2010 by Matt Giuca Tagged: , ,

Everyone keeps saying it’s a bad thing. They cite the fragile base class problem and other various issues which I don’t really have a problem with. (The FBC problem as far as I can tell almost always is caused when someone violates the Liskov substitution principle.)

I recently went to a Go talk. Go is another in a string of languages which extol the virtues of interface inheritance and not implementation inheritance. But in most of the OO code I write (including the code I encourage and expect my students to write), I use class hierarchies with a lot of data and code re-use in the child classes, I think to great effect. Am I mistaken?

To this end I shall undergo an experiment: Having never used Go, I will rewrite the Java project I gave my students in Go, making extensive use of interface inheritance and see if there’s a natural “Go-ism” for achieving code and data re-use where my existing code uses implementation inheritance — I assume using ordinary composition and functions.


Why nondeterminism is awesome

In Language design on June 19, 2010 by Matt Giuca Tagged: , ,

I’ve been using the Mercury programming language for several years for my PhD project. Now Mercury is a functional logic language (predominantly logic), which means instead of functions, it has predicates. Predicates are like subroutines (they have “output parameters” but don’t actually return values), but with an added power called “nondeterminism”. I’ll explain what that means shortly, but the point of this post is: I’ve never used it. I understand it fully but I’ve never used it in a real program because the opportunity never came up. I think a lot of people think it just adds complexity to the language with no real benefit. But the other day, I discovered how useful it can be.

Basically, nondeterminism means that a predicate can succeed multiple times, producing multiple results. The caller is largely unaware that the callee produced multiple results, as it isn’t returned as a “set” or “list”, it simply causes the caller to also produce multiple results. To give a quick example, say I have the nondeterministic predicate p:

:- pred p(int::out) is nondet.
p(X) :-
        X = 1
        X = 2
        X = 3

This predicate succeeds three times, with the solutions {X = 1, X = 2 and X = 3}. Now when you call p, you don’t have to do anything special — you just automatically become nondeterministic too. So take predicate q:

:- pred q(int::out) is nondet.
q(Y) :-
    p(X),         % Set X to the result of p
    Y = X + 1.    % Return X + 1

Now q just looks like regular code, but because p is nondeterministic, q also has three solutions: {Y = 2, Y = 3 and Y = 4}.

So what’s it all useful for?

Well in my compiler, I had a lot of code looking like this (only much more complex):

% A predicate that a tree, and returns a set of values
:- pred tree_processor(tree::in, set(val)::out) is det.
tree_processor(X, Vals) :-
    % Call another predicate to get a set of values
    get_values(X, Vals1),

    % Recurse on the children of X and union all results together
    XChildren = tree_children(X),   % Children as a set, XChildren, Vals2Power),
    Vals2 = set.power_union(Vals2Power),

    % Union both sets together into the final set of results
    Vals = set.union_list([Vals1, Vals2]).

This code is what I call “set-oriented”. Not only does this predicate deal with set objects all over the place, but the “get_values” and “tree_children” predicates must return a set, and any predicate which calls this code must accept a set.

There are a couple of problems with this code. Firstly, I need to number each of the sets “Vals1” and “Vals2” (the actual code had four sets). Secondly, note that I do a good deal of work in the recursion. The recursive call produces a set, but since I am recursing over a set, I need to map over each child, resulting in a set of sets of results. Then I have to power_union (flatten the set of sets into a set). This furious mapping and unioning is a hallmark of set-oriented code and it was steadily spreading throughout my code.

Then I realised: This is what nondeterminism is for. The properties of my code (here and in several other places) were:

  • A predicate which produces a set of values as a result,
  • The final set is a union of a number of separately derived sets (Vals1, Vals2),
  • The predicate deals with other sets of values,
  • When dealing with a set of values, it produces results for each value in the set, independently (it maps over the set, doesn’t fold).

What this means is, I could convert this and other predicates to nondeterminism to save a lot of complexity, making the code shorter and easier to read (assuming the reader understands nondeterminism):

% A predicate that takes a tree, and returns single value (with multiple solutions)
:- pred tree_processor(tree::in, val::out) is nondet.
tree_processor(X, Val) :-
        % Call another predicate to get a set of values
        get_values(X, Val)          % Many solutions
        % Recurse on the children of X
        tree_child(X, XChild),      % Many children
        tree_processor(XChild, Val) % Many solutions

Look what I’ve got now:

  • Just one name for all of the results (“Val”).
  • Cleanly separated (via the “;” disjunction operator) the distinct computations of the result set (which were previously bound to Vals1 and Vals2).
  • No calls to map or union. Once a value is bound nondeterministically (such as XChild), it automatically “maps” all subsequent code. And the resulting set is automatically a “union” of every binding of Val.
  • A nice declarative reading: “tree_processor(X, Val) succeeds if Val is a ‘get_values’ of X (whatever that means), or if XChild is any child of X, and tree_processor(XChild, Val) succeeds.”

I did have to convert all the functions to nondeterminism at once, but I can stop at any level. It is fairly easy to convert back and forth between “sets of values” and “nondeterminism” (use solutions.solutions_set to convert a nondeterministic predicate’s results into a set; use set.member to convert a set into nondeterministic results).

So hey, this turned out to be quite a useful pattern, and it’s only possible in a logic language such as Prolog or Mercury.

In other languages

Haskell allows a similar nondeterministic style, perhaps surprisingly, via the list data type (treating it as a monad), but that’s a completely separate topic. As ever, it’s really clean if you can wrap your head around monads.

Python’s generators offer something like nondeterminism. It sort of gives you half the features — it is easy to produce results in nondeterministic style, but not to consume them. To explain, using the example above:

# A function that takes a tree, and yields a sequence of values
def tree_processor(x):
    # Call another function to get a sequence of values
    for val in get_values(x):   # Sequence of solutions
        yield val

    # Recurse on the children of X
    for xchild in tree_child(x):    # Many children
        for val in tree_processor(xchild):
            yield val

This has the advantage of not having to produce sets (vals1 and vals2) and union them. However, unlike nondeterminism, calling a generator doesn’t automatically cause the rest of the caller to map over the generator’s results — you have to explicitly map or for-loop over the results. As a language designer, I wonder if there is a middle-ground — some generator-like mechanism which does automatically map over the results but which doesn’t take me all the way to a logic language.

Also note that while nondeterminism in Prolog/Mercury represents a set of results, in Haskell and Python it represents a sequence as the results come out in a definitive order.


Literate programming is a terrible idea

In Language design on June 4, 2010 by Matt Giuca

The code samples for the subject I’m teaching are all written in Literate Haskell. Having worked with a bit of this code, I can say it’s a complete nightmare. The Wikipedia article discusses the philosophy behind it — I’m not here to debate that. My criticism is primarily that literate programs are god awful to work with in a text editor.

What I’m talking about is a programming style (specially supported by the Haskell language) in which comments are the default, and if you want code, you need to explicitly delimit it with a “>”. It inverts the comment/code relationship so that comment is the default. It means that instead of this:

-- This program calculates the factorial of a number.
fact 0 = 1
fact n = n * (fact (n-1))

You write this:

This program calculates the factorial of a number.
>fact 0 = 1
>fact n = n * (fact (n-1))

I’ve compiled this list of gripes after working with this style for just a few minutes:

  • It encourages extremely verbose comments. This seems at odds with the programmers’ mantra that “good code is self-documenting” and therefore, ideally, code doesn’t need comments at all (except to briefly explain what each function does, and explain any tricky bits).
  • Text editors don’t highlight the comments (as I demonstrate above). At least vim doesn’t. As far as I can tell, this is by design — the idea is that the program is actually an essay with code snippets, and thus you wouldn’t want the majority of the characters to be blue/green.
  • You have to write a > on every line of code. I find this much more annoying than having to write a “–” on comments, possibly because when I’m coding, I need to concentrate more than when I’m just writing English.
  • Working with code is a nightmare. The main reason is that code is indented. So if you forget to add a > on an indented line, you have to go to the left and add it, then you’ve offset your line by 1 character and have to fix it up. If you join a line (Shift+J in vim), it won’t nicely delete the indentation from the subsequent line, because there’ll be a “>” in the way followed by a lot of spaces. You spend a good percentage of your time messing about with “>” characters.
  • Can’t use line or block indentation commands (such as Shift+> in vim to indent the current line or block), as they’ll indent the > (which must be in column 0). Shift+< doesn’t work either, as vim considers all lines to be unindented. You have to put the cursor at the real start of each line and hit tab.
  • Shift+^ no longer takes you to the start of the line; it takes you to column 0.
  • Forgetting a “–” usually means a syntax error. Forgetting a “>” means a missing line of code, which may or may not generate a compiler error (could be a missing pattern, for example, which is a runtime error in Haskell).

This could be a useful style for writing an actual essay with code snippets, but not for an actual program.


On Python’s Whitespace

In Language design,Python on January 13, 2009 by Matt Giuca Tagged: , ,

It seems like a certainty that any given Slashdot article involving Python will feature at least a handful of top-level comments where people complain about Python’s indentation/whitespace rule, regardless of the topic of the news item. There’s probably a race to be the first to complain about it in any given post.

I usually bite. This time, I decided to bite hard, and I’ve decided to re-publish my response here because I think it was rather blog-worthy.

AuMatar says:

On the other hand, I’ve spent at least a full work week of my life fixing problems due to whitespace. Guido made a major f*** up there- by removing braces but not strictly defining whitespace, he’s created a language where it’s possible to have two identical looking pieces of code do very different things. If he had said that it must be indented by exactly 1 tab or exactly 4 spaces or whatever other measure and everything else would throw a syntax error, it would have been fine. As it is I’d say about 15-20% of the time I spent doing Python was spent fixing these kinds of bugs.

To which I replied:

Guido made a major f*** up there- by removing braces but not strictly defining whitespace

Stop. First, the whitespace rule in Python *is* strictly defined.

The formal, exact, unambiguous specification of how Python interprets whitespace is in the official language reference – Lexical analysis [].

It’s pretty wordy, but I’ve studied it and it’s quite precise. The relevant section is here:

“Firstly, tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight”

This is exactly the same as the default behaviour of Unix `expand`.

[Guido has] created a language where it’s possible to have two identical looking pieces of code do very different things.

It depends what you mean by “looking”. To you, perhaps 1 tab looks the same as 4 spaces. To me, maybe it looks the same as 2 spaces. To Jeff, maybe it looks like a red dot in his specially-configured editor. To Python, it happens to look the same as 8 spaces.

DO NOT MIX TABS AND SPACES. Then, I guarantee you that any two pieces of code which look the same to you (whether they use tabs or spaces) will also look the same to Python. (You don’t have to enforce this across a whole file, just on a per-block basis, but it’s best if your whole project has an agreed indentation standard).

If he had said that it must be indented by exactly 1 tab or exactly 4 spaces or whatever other measure and everything else would throw a syntax error.

That’s silly. Then you’d be at Guido’s whim; you’d have to indent the way he chose. This way, you can choose any indentation you like. Tabs, 2 spaces, 4 spaces, 3 tabs if you like. As long as you are internally-consistent, Python will be happy.

My second point to you: If you are pasting code from somewhere into your code, and you do not fix up indentation so it matches the surrounding code, you are worse than Hitler. Or at least very lazy. I don’t care if you are using Python or C or Brainf***.

If you carelessly paste 1-tab-indented code into a surrounding block which is 4-tab-indented, and don’t fix it up, then how do you think I will feel when I open it in my editor configured to expand tabs to 2 spaces instead. It will be totally unreadable — and this is why we indent in the first place (in any language, that is).

Python forces you to tidy this up, and that can only be a good thing. If your code is confusing Python, it’s probably confusing a bunch of other readers as well.


Concise writing

In Language design on January 5, 2009 by Matt Giuca Tagged: ,

I’ve been known to ramble a lot in my writing. Being more concise is something I know I have to work on. Apologies for some of the long posts on this blog.

I was working on the spec for my language, Mars, just now. Writing language specs, it turns out, is pretty hard. You have to precisely capture the semantics of the language in a way that’s clear and understandable. I got up to that awfully mundane bit where you explain the really basic things like if statements and while loops. Things for which everybody knows how they work, but they have to be there anyway, because it is, after all, a formal language spec. And here’s what I wrote under the heading “If statements”:

The if statement conditionally executes a sequence of statements:

if_stmt ::=  "if" expression ":" NEWLINE
                 INDENT statement+ DEDENT
             [ "else" ":" NEWLINE
                 INDENT statement+ DEDENT ]

The expression on the first line is called the “condition”. The condition’s type must be Int, or the compiler MUST reject the program.

The condition is evaluated, and any side-effects of evaluating the expression take place during the execution of the if statement, and before the execution of any sub-expressions.

The first nested block of statements is called the “then” clause. If the optional else part is supplied, then the second nested block of statements is called the “else” clause. One of these clauses, but not both, will be executed, by executing each of the sequence of statements in order.

If the condition’s value is not 0, the “then” clause is executed.

If the condition’s value is 0, then the “else” clause, if supplied, is executed. If it is not supplied, the if statement has no effect, besides the evaluation of the initial expression.

Quite a mouthful! Then I went to the Python language reference, which I absolutely idolise, and have been referring to periodically for guidance. And here’s what Python says on the matter:

The if statement is used for conditional execution:

if_stmt ::=  "if" expression ":" suite
             ( "elif" expression ":" suite )*
             ["else" ":" suite]

It selects exactly one of the suites by evaluating the expressions one by one until one is found to be true (see section Boolean operations for the definition of true and false); then that suite is executed (and no other part of the if statement is executed or evaluated). If all expressions are false, the suite of the else clause, if present, is executed.

“Right!” I said, “If they can describe it in a paragraph, then so can I!” I said, and rewrote mine to a much more condensed version, which I believe describes the same logic:

The expression on the first line is called the “condition”. The condition’s type must be Int, or the compiler MUST reject the program. The first nested block of statements is called the “then” clause. If the optional else part is supplied, then the second nested block of statements is called the “else” clause.

The condition is evaluated, and any side-effects take place immediately. If the value is not 0, the “then” clause is executed. If the value is 0, then the “else” clause, if supplied, is executed.

My point is that if you read my original version, it’s a lot more stiff and formal, and dare I say, condescending? It is clear but not concise. It tells you every little detail about how an if statement works!

The way the Python language reference is written is really assuming some intelligence on the part of the reader. It uses more conversational phrases such as “one by one until one is found to be true; then that suite is executed”. There’s no formal notion of “one by one”, nor do they formally describe what is meant by “that suite”. But of course, any reader with an ounce of sense will know what is meant by “one by one” and “that suite”.

It almost seems counter-intuitive, but assuming some intuition or thought on behalf of your readers can actually make documents easier to read, not harder.

If you compare the size of, say, the C++ language spec (~1300 pages), or even the C language spec (~500 pages) to Python (~100 pages), you can see the difference this style makes.