Archive for the ‘Python’ Category


How do you escape a complete URI?

In JavaScript, Python, Unicode, URI, Web development on February 12, 2012 by Matt Giuca

This question comes up quite a lot: “I have a URI. How do I percent-encode it?” In this post, I want to explore this question fully, because it is a hard question, and one that, on the face of it, is nonsense. The short answer to this question is: “You can’t. If you haven’t percent-encoded it yet, you’re too late.” But there are some specific reasons why you might want to do this, so read on.

Let’s get at the meaning of this question. I’ll use the term URI (Uniform Resource Identifier), but I’m mostly talking about URLs (a URI is just a bit more general). A URI is a complete identifier of a resource, such as:

http://example.com/admin/login?name=Helen&gender=f

A URI component is any of the individual atomic pieces of the URI, such as “example.com”, “admin”, “login”, “name”, “Helen”, “gender” or “f”. The components are the parts that the URI syntax has no interest in parsing; they represent plain text strings. Now the problem comes when we encounter a URI such as this: http://example.com/admin/login?name=Helen Ødegård&gender=f

This isn’t a legal URI because the last query argument “Helen Ødegård” is not percent-encoded — the space (U+0020, meaning that this is the Unicode character with hexadecimal value 20), as well as the non-ASCII characters ‘Ø’ (U+00D8) and ‘å’ (U+00E5), are forbidden in any URI. So the answer to “can’t we fix this?” is “yes” — we can retroactively percent-encode the URI so it appears like this:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

A digression: note that the Unicode characters were first encoded to bytes with UTF-8: ‘Ø’ (U+00D8) encodes to the byte sequence hex C3 98, which then percent-encodes to %C3%98. This is not actually part of the standard: none of the URI standards (the most recent being RFC 3986) specify how a non-ASCII character is to be converted into bytes. I could also have encoded them using Latin-1: “Helen%20%D8deg%E5rd,” but then I couldn’t support non-European scripts. This is a mess, but it isn’t the subject of this article, and the world mostly gets along fine by using UTF-8, which I’ll assume we’re using for the rest of this article.
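A quick illustration of the digression in Python 3, whose quote function happens to take the character encoding as a parameter:

```python
from urllib.parse import quote

# The percent-encoding of 'Ø' (U+00D8) depends entirely on which
# character encoding is used to turn the character into bytes first.
print(quote('Ø', encoding='utf-8'))    # %C3%98
print(quote('Ø', encoding='latin-1'))  # %D8
```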

Okay, so that’s solved, but will it work in all cases? How about this URI:

http://example.com/login?redirect=http://example.com/#funny&name=Helen&gender=f

Clearly, a human looking at this can tell that the value of the “redirect” argument is “http://example.com/#funny”, which means that the “#” (U+0023) needs to be percent-encoded as “%23”:

http://example.com/login?redirect=http://example.com/%23funny&name=Helen&gender=f

But how did we know to encode the “#”? What if whoever typed this URI genuinely meant for there to be a query of “redirect=http://example.com/” and a fragment of “funny&name=Helen&gender=f”? It would be wrong for us to meddle with the given URI by assuming that the “#” was intended to be a literal character and not a delimiter. The answer to “can we fix it?” is “no”. Fixing the above URI would only introduce bugs. The answer is “if you wanted that ‘#’ to be interpreted literally, you should have encoded it before you stuck it in the URI.”

The idea that you can:

  1. Take a bunch of URI components (as bare strings),
  2. Concatenate them together using URI delimiters (such as “?”, “&” and “#”),
  3. Percent-encode the URI.

is nonsense, because once you have done step #2, you cannot possibly know (in general) which characters were part of the original URI components, and which are delimiters. Instead, error-free software must:

  1. Take a bunch of URI components (as bare strings),
  2. Percent-encode each individual URI component,
  3. Concatenate them together using URI delimiters (such as “?”, “&” and “#”).
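The correct algorithm can be sketched in Python (make_query_uri is a hypothetical helper; urllib.parse.quote with safe='' performs step #2):

```python
from urllib.parse import quote

def make_query_uri(base, params):
    # Step 2: percent-encode each component individually. safe='' also
    # escapes '/' and ':', which is harmless inside a query value.
    encoded = ('%s=%s' % (quote(k, safe=''), quote(v, safe=''))
               for k, v in params)
    # Step 3: only now concatenate using the delimiters '?', '=' and '&'.
    return base + '?' + '&'.join(encoded)

uri = make_query_uri('http://example.com/admin/login',
                     [('name', 'Helen Ødegård'), ('gender', 'f')])
print(uri)  # http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f
```

Because the space, ‘Ø’ and any would-be delimiters are encoded before the string ever touches a ‘?’ or ‘&’, there is no ambiguity to resolve afterwards.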

This is why I previously recommended never using JavaScript’s encodeURI function, and instead using encodeURIComponent. The encodeURI function “assumes that the URI is a complete URI” — it is designed to perform step #3 in the bad algorithm above, which, by definition, is meaningless. The encodeURI function will not encode the “#” character, because it might be a delimiter, so it would fail to interpret the above example in its intended meaning.

The encodeURIComponent function, on the other hand, is designed to be called on the individual components of the URI before they are concatenated together — step #2 of the correct algorithm above. Calling that function on just the component “http://example.com/#funny” would produce:

http%3A%2F%2Fexample.com%2F%23funny

which is a bit of overkill (the “:” and “/” characters do not strictly need to be encoded in a query parameter), but perfectly valid — when the data reaches the other end it will be decoded back into the original string.

So having said all of that, is there any legitimate need to break the rule and percent-encode a complete URI?

URI cleaning

Well, yes there is. (I have been bullish in the past that there isn’t, such as in my answer to this question on Stack Overflow, so this post is me reconsidering that position a bit.) It happens all the time: in your browser’s address bar. If you type this URL into the address bar:

http://example.com/admin/login?name=Helen Ødegård&gender=f

it is not typically an error. Most browsers will automatically “clean up” the URL and send an HTTP request to the server with the line:

GET /admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f HTTP/1.1

(Unfortunately, IANA will redirect this immediately, but if you inspect the packets or try it on a server you control, then check the logs, you will see this is true.) Most browsers don’t show you that they’re cleaning up the URIs — they attempt to display them as nicely as possible (in fact, if you type in the escaped version, Firefox will automatically convert it so you can see the space and Unicode characters in the address bar).

Does this mean that we can relax and encode our URIs after composing them? No. This is an application of Postel’s Law (“Be liberal in what you accept, and conservative in what you send.”) The browser’s address bar, being a human interface mechanism, is helpfully attempting to take an invalid URI and make it valid, with no guarantee of success. It wouldn’t help on my second example (the “redirect” URI above). I think this is a great idea, because it lets users type spaces and Unicode characters into the address bar, and it isn’t necessarily a bad idea for other software to do it too, particularly where user interfaces are concerned. As long as the software is not relying on it.

In other words, software should not use this technique to construct URIs internally. It should only ever use this technique to attempt to “clean up” URIs that have been supplied from an external source.

So that is the point of JavaScript’s encodeURI function. I don’t like to call this “encoding” because that implies it is taking something unencoded and converting it into an encoded form. I prefer to call this “URI cleaning”. That name is suggestive of the actual process: taking a complete URI and cleaning it up a bit.

Unfortunately (as pointed out by Tim Cuthbertson), encodeURI is not quite good for this purpose — it encodes the ‘%’ character, meaning it will double-escape any URI that already has percent-escaped content. More on that later.

We can formalise this process by describing a new type of object called a “super URI.” A super URI is a sequence of Unicode characters with the following properties:

  1. Any character that is in any way valid in a URI is interpreted as normal for a URI,
  2. Any other character is interpreted as a URI would interpret the sequence of characters resulting from percent-encoding the UTF-8 encoding of the character (or some other character encoding scheme).

Now it becomes clear what we are doing: URI cleaning is simply the process of transforming a super URI into a normal URI. In this light, rather than saying that the string:

http://example.com/admin/login?name=Helen Ødegård&gender=f

is “some kind of malformed URI,” we can say it is a super URI, which is equivalent to the normal URI:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that super URIs have nothing to do with not percent-encoding of delimiter characters — delimiters such as “#” must still be percent-escaped in the super URI. They are only about not percent-encoding invalid characters. We can consider super URIs to be human-readable syntax, while proper URIs are required for data transmission. This means that we can also take a proper URI and convert it into a more human-readable super URI for display purposes (as web browsers do). That is the purpose of JavaScript’s decodeURI function. Note that, again, I don’t consider this to be “decoding,” rather, “pretty printing.” It doesn’t promise not to show you percent-encoded characters. It only decodes characters that are illegal in normal URIs.

It is probably a good idea for most applications that want to “pretty print” a URI to not decode control characters (U+0000 — U+001F and U+007F), to avoid printing garbage and newlines. Note that decodeURI does decode these characters, so it is probably unwise to use it for display purposes without some post-processing.

Update: My “super URI” concept is similar to the formally specified IRI (Internationalized Resource Identifier) — basically, a URI that can have non-ASCII characters. However, my “super URIs” also allow other ASCII characters that are illegal in URLs.

Which characters?

Okay, so exactly which characters should be escaped for this URI cleaning operation? I thought I’d take the opportunity to break down the different sets of characters described by the URI specification. I will address two versions of the specification: RFC 2396 (published in 1998) and RFC 3986 (published in 2005). 2396 is obsoleted by 3986, but since a lot of encoding functions (including JavaScript’s) were invented before 2005, it gives us a good historical explanation for their behaviour.

RFC 2396

This specification defines two sets of characters: reserved and unreserved.

  • The reserved characters are: $&+,/:;=?@
  • The unreserved characters are: ALPHA and NUM and !'()*-._~

Where ALPHA and NUM are the ASCII alphabetic and numeric characters, respectively. (They do not include non-ASCII characters.)

There is a semantic difference between reserved and unreserved characters. Reserved characters may have a syntactic meaning in the URI syntax, and so if one of them is to appear as a literal character in a URI component, it may need to be escaped. (This will depend upon context — a literal ‘?’ in a path component will need to be escaped, whereas a ‘?’ in a query does not need to be escaped.) Unreserved characters do not have a syntactic meaning in the URI syntax, and never need to be escaped. A corollary to this is that escaping or unescaping an unreserved character does not change its meaning (“Z” means the same as “%5A”; “~” means the same as “%7E”), but escaping or unescaping a reserved character might change its meaning (“?” may have a different meaning to “%3F”).

The URI component encoding process should percent-encode all characters that are not unreserved. It is safe to escape unreserved characters as well, but not necessary and generally not preferable.
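As a sketch, a strict RFC 2396 component encoder in Python (encode_component_2396 is a hypothetical name; non-ASCII characters are converted to UTF-8 first, as discussed earlier):

```python
import string

# RFC 2396 unreserved characters: alphanumerics plus the "mark" set.
UNRESERVED = set(string.ascii_letters + string.digits + "!'()*-._~")

def encode_component_2396(s):
    # Percent-encode every character outside the unreserved set,
    # converting each character to UTF-8 bytes before escaping.
    return ''.join(
        c if c in UNRESERVED
        else ''.join('%%%02X' % b for b in c.encode('utf-8'))
        for c in s)

print(encode_component_2396('Helen Ødegård'))  # Helen%20%C3%98deg%C3%A5rd
```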

Together, these two sets comprise the valid URI characters, along with two other characters: ‘%’, used for encoding, and ‘#’, used to delimit the fragment (the ‘#’ and fragment were not considered to be part of the URI). I would suggest that both ‘%’ and ‘#’ be treated as reserved characters. All other characters are illegal. The complete set of illegal characters, under this specification, follows:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: “<>[\]^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

The URI cleaning process should percent-encode precisely this set of characters: no more and no less.

RFC 3986

The updated URI specification from 2005 makes a number of changes, both to the way characters are grouped, and to the sets themselves. The reserved and unreserved sets are now as follows:

  • The reserved characters are: !#$&'()*+,/:;=?@[]
  • The unreserved characters are: ALPHA and NUM and -._~

This version features ‘#’ as a reserved character, because fragments are now considered part of the URI proper. There are two more important additions to the restricted set. Firstly, the characters “!'()*” have been moved from unreserved to reserved, because they are “typically unsafe to decode.” This means that, while these characters are still technically legal in a URI, their encoded form may be interpreted differently to their bare form, so encoding a URI component should encode these characters. Note that this is different than banning them from URIs altogether (for example, a “javascript:” URI is allowed to contain bare parentheses, and that scheme simply chooses not to distinguish between “(” and “%28”). Secondly, the characters ‘[‘ and ‘]’ have been moved from illegal to reserved. As of 2005, URIs are allowed to contain square brackets. This unfortunate change was made to allow IPv6 addresses in the host part of a URI. However, note that they are only allowed in the host, and not anywhere else in the URI.

The reserved characters were also split into two sets, gen-delims and sub-delims:

  • The gen-delims are: #/:?@[]
  • The sub-delims are: !$&'()*+,;=

The sub-delims are allowed to appear anywhere in a URI (although, as reserved characters, their meaning may be interpreted differently if they are unescaped). The gen-delims are the important top-level syntactic markers used to delimit the fields of the URI. The gen-delims are assigned meaning by the URI syntax, while the sub-delims are assigned meaning by the scheme. This means that, depending on the scheme, sub-delims may be considered unreserved. For example, a program that encodes a JavaScript program into a “javascript:” URI does not need to encode the sub-delims, because JavaScript will interpret them the same whether they are encoded or not (such a program would need to encode illegal characters such as space, and gen-delims such as ‘?’, but not sub-delims). The gen-delims may also be considered unreserved in certain contexts — for example, in the query part of a URI, the ‘?’ is allowed to appear bare and will generally mean the same thing as “%3F”. However, it is not guaranteed to compare equal: under the Percent-Encoding Normalization rule, encoded and bare versions of unreserved characters must be considered equivalent, but this is not the case for reserved characters.

Taking the square brackets out of the illegal set leaves us with the following illegal characters:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: “<>\^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

A modern URI cleaning function must encode only the above characters. This means that any URI cleaning function written before 2005 (hint: encodeURI) will encode square brackets! That’s bad, because it means that a URI with an IPv6 address:

http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login?name=Helen Ødegård&gender=f

would be cleaned as:

http://%5B2001:db8:85a3:8d3:1319:8a2e:370:7348%5D/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

which refers to the domain name “[2001:db8:85a3:8d3:1319:8a2e:370:7348]” (not the IPv6 address). Mozilla’s reference on encodeURI contains a work-around that ensures that square brackets are not encoded. (Note that this still double-escapes ‘%’ characters, so it isn’t good for URI cleaning.)

So what exactly should I do?

If you are building a URI programmatically, you must encode each component individually before composing them.

  • Escape the following characters: space and !”#$%&'()*+,/:;<=>?@[\]^`{|} and U+0000 — U+001F and U+007F and greater.
  • Do not escape ASCII alphanumeric characters or -._~ (although it doesn’t matter if you do).
  • If you have specific knowledge about how the component will be used, you can relax the encoding of certain characters (for example, in a query, you may leave ‘?’ bare; in a “javascript:” URI, you may leave all sub-delims bare). Bear in mind that this could impact the equivalence of URIs.

If you are parsing a URI component, you should unescape any percent-encoded sequence (this is safe, as ‘%’ characters are not allowed to appear bare in a URI).

If you are “cleaning up” a URI that someone has given you:

  • Escape the following characters: space and “<>\^`{|} and U+0000 — U+001F and U+007F and greater.
  • You may (but shouldn’t) escape ASCII alphanumeric characters or -._~ (if you really want to; it will do no harm).
  • You must not escape the following characters: !#$%&'()*+,/:;=?@[]
  • For an advanced URI cleaning, you may also fix any other syntax errors in an appropriate way (for example, a ‘[‘ in the path segment may be encoded, as may a ‘%’ in an invalid percent sequence).
  • An advanced URI cleaner may be able to escape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.
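Those cleaning rules, minus the advanced context-sensitive fixes, can be sketched as a Python function (clean_uri is a hypothetical name, not a standard API):

```python
def clean_uri(uri):
    # Escape exactly the characters that are illegal in an RFC 3986 URI:
    # space, the characters "<>\^`{|}, control characters, and non-ASCII
    # characters (as UTF-8). Everything else, including '%' and the
    # reserved characters, passes through untouched -- so no double-escaping.
    illegal = ' "<>\\^`{|}'
    out = []
    for c in uri:
        if c in illegal or ord(c) < 0x20 or ord(c) >= 0x7F:
            out.extend('%%%02X' % b for b in c.encode('utf-8'))
        else:
            out.append(c)
    return ''.join(out)

print(clean_uri('http://example.com/a b'))  # http://example.com/a%20b
```

Note that, unlike encodeURI, this function leaves ‘%’, ‘[’ and ‘]’ alone, so it can safely be applied to a URI that already contains percent-escapes or an IPv6 host.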

If you are “pretty printing” a URI and want to display escaped characters as bare, where possible:

  • Unescape the following characters: space and “-.<>\^_`{|}~ and ASCII alphanumeric characters and U+0080 and greater.
  • It is probably not wise to unescape U+0000 — U+001F and U+007F, as they are control characters that could cause display problems (and there may be other Unicode characters with similar problems.)
  • You must not unescape the following characters: !#$%&'()*+,/:;=?@[]
  • An advanced URI printer may be able to unescape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.
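A matching pretty-printer can be sketched the same way (pretty_uri is a hypothetical name): decode runs of percent-escapes as UTF-8, then re-escape anything that must stay escaped, namely the reserved characters, ‘%’, and the control characters.

```python
import re

def pretty_uri(uri):
    must_stay = set("!#$%&'()*+,/:;=?@[]")
    def decode_run(m):
        # Collect the whole run of %XX escapes and decode it as UTF-8.
        raw = bytes(int(h, 16)
                    for h in re.findall(r'%([0-9A-Fa-f]{2})', m.group(0)))
        try:
            text = raw.decode('utf-8')
        except UnicodeDecodeError:
            return m.group(0)  # not valid UTF-8; leave the escapes alone
        return ''.join('%%%02X' % ord(c)
                       if c in must_stay or ord(c) < 0x20 or ord(c) == 0x7F
                       else c
                       for c in text)
    return re.sub(r'(?:%[0-9A-Fa-f]{2})+', decode_run, uri)

print(pretty_uri('name=Helen%20%C3%98deg%C3%A5rd&x=%23'))
# name=Helen Ødegård&x=%23
```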

These four activities roughly correspond to JavaScript’s encodeURIComponent, decodeURIComponent, encodeURI and decodeURI functions, respectively. In the next section, we look at how they differ.

Some implementations

JavaScript

As I stated earlier, never use escape. First, it is not properly specified. In Firefox and Chrome, it encodes all characters other than the following: *+-./@_. This makes it unsuitable for URI construction and cleaning. It encodes the unreserved character ‘~’ (which is harmless, but unnecessary), and it leaves the reserved characters ‘*’, ‘+’, ‘/’ and ‘@’ bare, which can be problematic. Worse, it encodes Latin-1 characters with Latin-1 (instead of UTF-8) — not technically a violation of the spec, but likely to be misinterpreted, and even worse, it encodes characters above U+00FF with the malformed syntax “%uxxxx”. Avoid.

JavaScript’s “fixed” URI encoding functions behave according to RFC 2396, assuming Unicode characters are to be encoded with UTF-8. This means that they are lacking the 2005 changes:

  • encodeURIComponent does not escape the previously-unreserved characters ‘!’, “‘”, “(“, “)” and “*”. Mozilla’s reference includes a work-around for this.
  • decodeURIComponent still works fine.
  • encodeURI erroneously escapes the previously-illegal characters ‘[‘ and ‘]’. Mozilla’s reference includes a work-around for this.
  • decodeURI erroneously unescapes ‘[‘ and ‘]’ (although there doesn’t seem to be a practical case where this is a problem).

Edit: Unfortunately, encodeURI and decodeURI have a single, critical flaw: they escape and unescape (respectively) percent signs (‘%’), which means they can’t be used to clean a URI. (Thanks to Tim Cuthbertson for pointing this out.) For example, assume we wanted to clean the URI:

http://example.com/login?redirect=http://example.com/%23funny&name=Helen Ødegård&gender=f

This URI has the ‘#’ escaped already, because no URI cleaner can turn a ‘#’ into a “%23”, but it doesn’t have the space or Unicode characters escaped. Passing this to encodeURI produces:

http://example.com/login?redirect=http://example.com/%2523funny&name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that the “%23” has been double-escaped so it reads “%2523” — completely wrong! We can fix this by extending Mozilla’s work-around to also correct against double-escaped percent characters:

function fixedEncodeURI(str) {
    return encodeURI(str).replace(/%25/g, '%').replace(/%5[Bb]/g, '[').replace(/%5[Dd]/g, ']');
}

Note that decodeURI is similarly broken. The fixed version follows:

function fixedDecodeURI(str) {
    return decodeURI(str.replace(/%25/g, '%2525').replace(/%5[Bb]/g, '%255B').replace(/%5[Dd]/g, '%255D'));
}

Edit: Fixed fixedEncodeURI and fixedDecodeURI so they work on lowercase escape codes. (Thanks to Tim Cuthbertson for pointing this out.)

Python

Python 2’s urllib.quote and urllib.unquote functions perform URI component encoding and decoding on byte strings (non-Unicode).

  • urllib.quote works as I specify above, except that it does escape ‘~’, and does not escape ‘/’. This can be overridden by supplying safe='~'.
  • urllib.unquote works as expected, returning a byte string.

Note that these do not work properly at all on Unicode strings — you should first encode the string using UTF-8 before passing it to urllib.quote.

In Python 3, the quote and unquote functions have been moved into the urllib.parse module, and upgraded to work on Unicode strings (by me — yay!). By default, these will encode and decode strings as UTF-8, but this can be changed with the encoding and errors parameters (see urllib.parse.quote and urllib.parse.unquote).

I don’t know of any Python built-in functions for doing URI cleaning, but urllib.quote can easily be used for this purpose by passing safe="!#$%&'()*+,/:;=?@[]~" (the set of reserved characters, as well as ‘%’ and ‘~’; note that alphanumeric characters, and ‘-’, ‘.’ and ‘_’ are always safe in Python).
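In Python 3 terms, the same trick uses urllib.parse.quote; a minimal sketch:

```python
from urllib.parse import quote

# Reserved characters, plus '%' and '~'. Alphanumerics and -._~ are
# never escaped by quote anyway, so they need not be listed.
SAFE = "!#$%&'()*+,/:;=?@[]~"

def clean(uri):
    # Leaves delimiters and existing percent-escapes alone; escapes
    # only spaces, other illegal ASCII, and non-ASCII (as UTF-8).
    return quote(uri, safe=SAFE)

print(clean('http://example.com/a b?x=%23'))  # http://example.com/a%20b?x=%23
```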

Mozilla Firefox

Firefox 10’s URL bar performs URL cleaning, allowing the user to type in URLs with illegal characters, and automatically converting them to correct URLs. It escapes the following characters:

  • space, “‘<>` and U+0000 — U+001F and U+007F and greater. (Note that this includes the double and single quote.)
  • Note that the control characters for NUL, tab, newline and carriage return don’t actually transmit.

I would say this is erroneous: on a minor note, it should not be escaping the single quote, as that is a reserved character. It also fails to escape the following illegal characters: \^{|}, sending them to the server bare.

Firefox also “prettifies” any URI, decoding most of the percent-escape sequences for the characters that it knows how to encode.

Google Chrome

Chrome 16’s URL bar also performs URL cleaning. It is rather similar to Firefox, encoding the following characters:

  • space, “<> and U+0000 — U+001F and U+007F and greater. (Note that this includes only the double quote.)

So Chrome also fails to escape the illegal characters \^`{|} (including the backtick, which Firefox escapes correctly), but unlike Firefox, it does not erroneously escape the single quote.


Why Python’s whitespace rule is right

In Language design, Python on October 18, 2011 by Matt Giuca

Python is famous among programming languages for its fairly unique syntax: rather than being delimited by curly braces or “begin/end” keywords, blocks are delimited by indentation. Indenting a line is like adding an opening curly brace, and de-denting is like a closing curly brace. When people criticise Python, it is usually the first complaint: “why would I want to use a language which requires me to indent code?” Indeed, while programmers are very used to indenting their code, they are very un-used to being forced to do so, and I can understand why they may take it as an insult that a language tells them how to write code. I don’t usually like to get into syntax arguments, because I find them very superficial — it is much more important to discuss the semantics of a language than its syntax. But this is such a common argument among Python detractors, I wanted to address it. Python is right, and it’s just about the only language that is.

I think the rub is that programmers like to think of languages as a tool, and tools should be as flexible as possible. I think in general it is a good principle for programming languages not to enforce conventions. Languages that do tend to annoy people who don’t subscribe to the same conventions. For example, the Go programming language enforces the “One True Brace Style” — every opening curly brace must appear on the same line as the function header or control statement. This irritates me because that’s not my preferred convention. But the indentation convention is so universal that it is considered bad programming practice to not indent in all cases. (There is disagreement over tabs vs spaces, the number of spaces, etc, but we all agree that indentation is good.) There is not a single situation in any country, in any programming language, or at any skill level, in which it is acceptable to not indent your code the way Python requires it. Therefore, it is technically redundant to have a language that is not whitespace-sensitive. Any language that is not whitespace-sensitive requires (by universal convention) that programmers communicate the scoping of the code in two distinct manners for every single line of code: braces (or begin/end) and indentation. You are required to make sure that these two things match up, and if you don’t, then you have a program that doesn’t work the way it looks like it works, and the compiler isn’t going to tell you.

There are two solutions to this problem. 1: Make the compiler tell you. Force the programmer to indent and put in curly braces, and have the compiler check the indentation and give either a warning or error if they don’t match up. Now you’ve solved the problem of accidentally getting it wrong, but now what is the point of requiring curly braces at all? The programmer would just be doing extra work to please the compiler. We may as well go with 2: take out the curly braces and just have the compiler determine the blocks based on indentation.

When you really analyse it, Python’s whitespace sensitivity is actually the only logical choice for a programming language, because you only communicate your intent one way, and that intent is read the same way by humans and computers. The only reason to use a whitespace-insensitive language is that that’s the way we’ve always done things, and that’s never a good reason. That is why my programming language, Mars, has the same indentation rule as Python.

* * *

An interesting aside: there is a related syntax rule in Python which doesn’t seem quite so logical: you are required to place a colon at the end of any line preceding an indent. I haven’t fully tested this, but I’m pretty sure there is no technical reason for that (the parser could still work unambiguously without that colon), and it doesn’t seem to add much to the readability either. I slavishly followed this rule in Mars too, because as a Python programmer it “feels right” to me. But perhaps it would be better to drop it.


Python’s new classes vs old classes

In Python on November 18, 2010 by Matt Giuca

I just got my head around the “method resolution order” (MRO) rules for new classes in Python. If you aren’t aware, in Python 2.2 (now nine years old, from 2001) they introduced so-called “new-style classes”. Basically, to create a new-style class, you inherit from the built-in class object or any other new-style class. If you don’t inherit from anything, you create an old-style class. (And in Python 3 onwards, there are no more old-style classes, so you no longer need to write (object) on every class.)

I’ve known for a while that they’re “good” and I should use them. Aside from the immediate benefits of being able to inherit from built-in types and them being unified with the type system (type(myobject) == MyClass and type(MyClass) == type, rather than type(myobject) == instance and type(MyClass) == classobj), I knew there was something “fixed” about the method resolution order — the set of rules for deciding which version of an overridden method to call in a complex multiple-inheritance hierarchy.

Now having read The Python 2.3 Method Resolution Order, I understand it, but that document is a little wordy, so here is the simplest example I can think of. I’ll build the classic diamond problem, out of old-style classes:

class D:       # Note: Old-style
    def f(self): return "D.f()"
class B(D): pass
class C(D):
    def f(self): return "C.f()"
class A(B, C): pass

>>> a = A()
>>> a.f()
'D.f()'

The f method of B returns “D.f()” (due to inheriting from D and not overriding). The f method of C returns “C.f()”. Obviously since A inherits both B and C, there is some contention about which to use. Under the old rules (depth first left to right), it chooses the version in B, even though C is closer to A and in some sense “more authoritative”. It’s certainly a tricky question: it seems like it should have chosen the version in C, because after all, C has explicitly overridden the version in D and B hasn’t. But on the other hand, why should B behave differently depending on whether it inherited a definition or supplied its own?

In any event, the new-style rules favour the “most authoritative” version:

class D(object):       # Note: New-style
    def f(self): return "D.f()"
class B(D): pass
class C(D):
    def f(self): return "C.f()"
class A(B, C): pass

>>> a = A()
>>> a.f()
'C.f()'

Because C overrides D, it “wins”. The actual logic here is to go up depth-first but stop when you find a class which is subclassed later on. For example, when looking for the definition of f to use, it tries to go up A – B – D, but stops when it realises that D is being subclassed by a class which we haven’t considered yet (C). So it considers C first before looking at D, and finds a “stronger” definition there.
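In Python 3, where every class is new-style, you can inspect this linearisation directly via the __mro__ attribute:

```python
class D:
    def f(self): return "D.f()"
class B(D): pass
class C(D):
    def f(self): return "C.f()"
class A(B, C): pass

# C comes before D in the MRO, so its override "wins".
print([cls.__name__ for cls in A.__mro__])  # ['A', 'B', 'C', 'D', 'object']
print(A().f())                              # C.f()
```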

Most importantly, this new rule ensures monotonicity, and that is explained in the article.

[Edit: I’m pretty sure when the article talks about “Python 2.2 classes”, it does not refer to “old-style classes”, but rather the initial implementation of new-style classes in Python 2.2, which was flawed, and fixed in Python 2.3. So it is no longer possible to try out the “flawed” behaviour mentioned there without going back and installing version 2.2.]


Python 2: My new URI/Unicode crusade

In Python on March 14, 2010 by Matt Giuca

You may recall in 2008 I filed a bug on unicode URIs in Python 3, had a massive argument with the Python community, and ended up successfully getting a patch (a complete rewrite of urllib.parse.quote and unquote) accepted in Python 3.

Well two years later, I finally had the stamina to check out the situation with unicode URIs in Python 2. It’s just as bad, if not worse, than it was in Python 3. So I’m doing it all over again!

I’ve just submitted three patches (1, 2, 3) on four separate bugs relating to urllib.quote and urllib.unquote, all of which I already fixed in Python 3. Hopefully this time, the existing Python 3 precedent will mean less arguing. Also the fact that I made three separate patches will mean they’ll be accepted or rejected individually, rather than what happened last time, which was me having to maintain a giant patch fixing a dozen bugs over two months.


Thou Shalt Not Modify A List During Iteration

In Python on February 12, 2009 by Matt Giuca

This came up recently when discussing a friend’s code. It relates specifically to Python, but I think the rule is more general.

Firstly, let me be clear that in this article, when I say “modify”, I mean inserting or removing items from the list. Merely updating or mutating the list items is fine.

Now, consider the following Python code:

elems = ['a','b','c']
for e in elems:
    print e
    elems.remove(e)

This prints:

a
c

The author of this code snippet probably expected it to iterate over each element of the list, print it out, then remove it so that the list ends up empty (this may be desirable if, for example, you are removing items conditionally). Hence, he is probably surprised that 'b' was skipped, and furthermore, remains in the list:

>>> elems
['b']

We can see why only 'a' and 'c' were touched by imagining that Python translates the loop into the following equivalent lower-level code:

elems = ['a','b','c']
i = 0
while i < len(elems):
    e = elems[i]
    print e
    elems.remove(e)
    i += 1

This has the same result. Now it’s clear what’s happening: when you remove the 'a' from elems, the remaining elements slide down the list. The list is now ['b', 'c']. Now 'b' is at index 0, and 'c' is at index 1. Since the next iteration is going to look at index 1 (which is the 'c' element), the 'b' gets skipped entirely.

This rule doesn’t seem to be documented officially anywhere. It’s just a generally-followed rule. I believe the behaviour is, well, unspecified behaviour. Here is a discussion of the issue.

Also note that the same rule applies to the set and dict types (and similar), only those types will actually raise a RuntimeError. (If you know how hash tables work, it should be clear why — there’s no safe way to let that slide).
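This is easy to see for yourself (Python 3 syntax; my own illustration, not from the original post):

```python
# Deleting dict keys while iterating raises RuntimeError on the very
# next iteration step, rather than silently skipping elements.
d = {'a': 1, 'b': 2, 'c': 3}
try:
    for k in d:
        del d[k]
except RuntimeError as e:
    print("RuntimeError:", e)  # dictionary changed size during iteration
```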

This rule should be applied in all languages, not just Python. While other languages have different rules, the basic principle above still applies — you’ll just get shot in different ways. (It would even apply to C, if you wrote your own code to insert or remove elements from an array). Also, this sort of code is invariably confusing.

So how can you work around it?

  • You could translate your code into the while-loop form, as above. Then you have explicit access to the variable i, and you can hand-code a fix. (For example, I could choose not to increment i when I am removing an element).
  • You could compute the set of modifications to make to the list during iteration, then apply them afterwards.
  • You could construct a new list during iteration rather than mutating the existing one. (For example, rather than removing all the elements which satisfy a condition, insert into a new list all the elements which don’t).
  • A solution recommended in the discussion linked above is to duplicate the input list and iterate over the copy, mutating the original.
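The first and third workarounds above can be sketched like this (modern Python 3 syntax; the function names and predicate style are mine, not from the post):

```python
def remove_if(elems, pred):
    """Workaround 1: hand-coded index that only advances when the
    current element is kept."""
    i = 0
    while i < len(elems):
        if pred(elems[i]):
            del elems[i]   # the next element slides into slot i, so don't advance
        else:
            i += 1
    return elems

def kept(elems, pred):
    """Workaround 3: build a new list of the elements we keep,
    instead of mutating the original."""
    return [e for e in elems if not pred(e)]

print(remove_if(['a', 'b', 'c'], lambda e: e != 'b'))  # ['b']
print(kept(['a', 'b', 'c'], lambda e: e != 'b'))       # ['b']
```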

There is always a way around. You just have to be creative!


On Python’s Whitespace

In Language design, Python on January 13, 2009 by Matt Giuca

It seems like a certainty that any given Slashdot article involving Python will feature at least a handful of top-level comments where people complain about Python’s indentation/whitespace rule, regardless of the topic of the news item. There’s probably a race to be the first to complain about it in any given post.

I usually bite. This time, I decided to bite hard, and I’ve decided to re-publish my response here because I think it was rather blog-worthy.

AuMatar says:

On the other hand, I’ve spent at least a full work week of my life fixing problems due to whitespace. Guido made a major f*** up there- by removing braces but not strictly defining whitespace, he’s created a language where it’s possible to have two identical looking pieces of code do very different things. If he had said that it must be indented by exactly 1 tab or exactly 4 spaces or whatever other measure and everything else would throw a syntax error, it would have been fine. As it is I’d say about 15-20% of the time I spent doing Python was spent fixing these kinds of bugs.

To which I replied:

Guido made a major f*** up there- by removing braces but not strictly defining whitespace

Stop. First, the whitespace rule in Python *is* strictly defined.

The formal, exact, unambiguous specification of how Python interprets whitespace is in the official language reference, under “Lexical analysis”.

It’s pretty wordy, but I’ve studied it and it’s quite precise. The relevant section is here:

“Firstly, tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight”

This is exactly the same as the default behaviour of Unix `expand`.
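The column arithmetic is simple to reproduce; here is a sketch of the rule (my own helper, not the actual CPython tokenizer):

```python
def expand_tabs(line, tabsize=8):
    # Each tab advances the column to the next multiple of tabsize,
    # mirroring the language reference (and Unix expand's default).
    out = []
    col = 0
    for ch in line:
        if ch == '\t':
            spaces = tabsize - (col % tabsize)
            out.append(' ' * spaces)
            col += spaces
        else:
            out.append(ch)
            col += 1
    return ''.join(out)

print(repr(expand_tabs('ab\tc')))  # 'ab      c' (the tab becomes 6 spaces)
```

Python’s built-in str.expandtabs implements the same rule, so the helper is purely illustrative.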

[Guido has] created a language where it’s possible to have two identical looking pieces of code do very different things.

It depends what you mean by “looking”. To you, perhaps 1 tab looks the same as 4 spaces. To me, maybe it looks the same as 2 spaces. To Jeff, maybe it looks like a red dot in his specially-configured editor. To Python, it happens to look the same as 8 spaces.

DO NOT MIX TABS AND SPACES. Then, I guarantee you that any two pieces of code which look the same to you (whether they use tabs or spaces) will also look the same to Python. (You don’t have to enforce this across a whole file, just on a per-block basis, but it’s best if your whole project has an agreed indentation standard).
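As a side note, Python 3 later turned this advice into law: indentation whose meaning depends on how wide a tab is gets rejected outright with a TabError. A quick demonstration (my own snippet):

```python
# Line 2 is indented with one tab, line 3 with eight spaces. Whether
# they match depends on the tab width, so Python 3 refuses to guess.
src = "if True:\n\tx = 1\n        y = 2\n"
try:
    compile(src, "<demo>", "exec")
except TabError as e:
    print("TabError:", e)
```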

If he had said that it must be indented by exactly 1 tab or exactly 4 spaces or whatever other measure and everything else would throw a syntax error.

That’s silly. Then you’d be at Guido’s whim; you’d have to indent the way he chose. This way, you can choose any indentation you like. Tabs, 2 spaces, 4 spaces, 3 tabs if you like. As long as you are internally-consistent, Python will be happy.

My second point to you: If you are pasting code from somewhere into your code, and you do not fix up indentation so it matches the surrounding code, you are worse than Hitler. Or at least very lazy. I don’t care if you are using Python or C or Brainf***.

If you carelessly paste 1-tab-indented code into a surrounding block which is 4-tab-indented, and don’t fix it up, then how do you think I will feel when I open it in my editor configured to expand tabs to 2 spaces instead? It will be totally unreadable — and this is why we indent in the first place (in any language, that is).

Python forces you to tidy this up, and that can only be a good thing. If your code is confusing Python, it’s probably confusing a bunch of other readers as well.


Py3K: Solving the “outer scope” problem

In Python on June 27, 2008 by Matt Giuca

I recently built the beta of Python 3000 – the upcoming total revamp of Python (due to be released in September – 992 years before they promised!) Because Py3K is unashamedly “backwards incompatible”, they are finally fixing all the major language flaws and making things “the way they should be!” (Note there will be a somewhat automated conversion process from Python 2 to 3 code).

And I love it! Everything is fixed the way I hoped. Hence this is the first in the “Py3K rox my sox” series of blog posts. You can see a summary of new features here.

OK, so one of the major problems I’ve complained about (and heard) in Python is the so-called “outer scope” problem. This is a very definite limitation of what you can do in Python. Read on!

How globals really work

First a bit of background you may not know. This applies to all versions of Python, not just 3.0.

In Python if you don’t declare a variable, Python figures out whether you’re referring to a local or global based on whether you write to it. For example:

x = 4
def f():
    return x

Here, Python figures out that the x you refer to is actually the global x, and returns 4. It figures this out because the function never writes to x, anywhere. Not just because it hasn’t written to x yet, but because it has no statement which assigns to x. (It figures this out statically, not at runtime). So, for example:

x = 4
def f():
    if True:
        return x
        x = 2

This would be a neat quiz question actually: What does f() evaluate to?

Answer: UnboundLocalError: local variable ‘x’ referenced before assignment.

The mere fact that x is assigned somewhere in the function (even somewhere which will never be executed) causes Python to treat it as a local, and hence it is undefined when you go to return it.
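You can verify this in a runnable form (Python 3 syntax; the try/except wrapper is mine):

```python
x = 4

def f():
    if True:
        return x
        x = 2  # dead code, but its mere presence makes x local to f

try:
    f()
except UnboundLocalError as e:
    print("UnboundLocalError:", e)
```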

The correct solution is to declare it “global” explicitly, which is the only way to make a function which writes to a global.

x = 4
def f():
    global x
    if True:
        return x
        x = 2

This works well in practice, because you can define constants like MAX_FOO and use them all over the place without declaring them global, but you need to be explicit if you want to update a global (which is usually a good idea, since updating globals is dangerous – see JavaScript for a counter-example).

The “outer scope” problem

On to the “outer scope” problem. Basically, Python lets you write nested functions, and the nested functions have access to the local variables of their containing code. For example:

def outer():
    x = 9
    def inner_read():
        return x
    return inner_read()

If you call outer(), it will return 9. The variable x is local to the outer function. But the inner function can read it, and return it.

The problem comes when you want to write to a non-local variable, like this:

def outer():
    x = 9
    def inner_read():
        return x
    def inner_write():
        x = 3
    inner_write()
    return inner_read()

As with global variables, Python can find outer scope variables if you only read them (as inner_read does), but if you write to them anywhere in the function, it assumes you are making a new local variable (as inner_write does). Hence inner_write creates a new local x, and assigns it 3, and the function outer returns 9. I would like for inner_write to update the existing x, and hence have outer return 3.

The solution is pretty simple: Have a keyword like global, but rather than going all the way to the top scope, it just tells Python to look for the innermost scope with a bound variable of that name.

Python 3.0 introduces exactly that: the nonlocal keyword. Let’s give it a try!

def outer():
    x = 9
    def inner_read():
        return x
    def inner_write():
        nonlocal x
        x = 3
    inner_write()
    return inner_read()

Woot! Python 3.0 compiles this code and the outer function returns 3.
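Beyond this toy example, nonlocal makes genuinely stateful closures possible. The classic illustration is a counter (my example, not from the post):

```python
def make_counter():
    count = 0
    def increment():
        nonlocal count   # rebind the enclosing count, not a fresh local
        count += 1
        return count
    return increment

c = make_counter()
print(c(), c(), c())  # 1 2 3
```

Without nonlocal, the `count += 1` line would make count local to increment and raise UnboundLocalError on the first call.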

The funny thing is, this problem seems to be specific to Python. In most static languages, all variables are declared. In Haskell, all variables are read-only. In Ruby, you refer to global variables by prefixing them with a dollar sign ($). In JavaScript, it’s the inverse of Python: you declare all local variables, and undeclared variables default to global (which is a hideous idea – if you forget to declare a variable, you implicitly start sharing where you didn’t expect to be sharing). Of course there are probably other languages with this problem, but Python is the only one in which I’ve ever seen it.