Archive for the ‘Web development’ Category


How do you escape a complete URI?

In JavaScript,Python,Unicode,URI,Web development on February 12, 2012 by Matt Giuca

This question comes up quite a lot: “I have a URI. How do I percent-encode it?” In this post, I want to explore this question fully, because it is a hard question, and one that, on the face of it, is nonsense. The short answer to this question is: “You can’t. If you haven’t percent-encoded it yet, you’re too late.” But there are some specific reasons why you might want to do this, so read on.

Let’s get at the meaning of this question. I’ll use the term URI (Uniform Resource Identifier), but I’m mostly talking about URLs (a URI is just a bit more general). A URI is a complete identifier of a resource, such as:

A URI component is any of the individual atomic pieces of the URI, such as “”, “admin”, “login”, “name”, “Helen”, “gender” or “f”. The components are the parts that the URI syntax has no interest in parsing; they represent plain text strings. Now the problem comes when we encounter a URI such as this: Ødegård&gender=f

This isn’t a legal URI because the last query argument “Helen Ødegård” is not percent-encoded — the space (U+0020, meaning that this is the Unicode character with hexadecimal value 20), as well as the non-ASCII characters ‘Ø’ (U+00D8) and ‘å’ (U+00E5) are forbidden in any URI. So the answer to “can’t we fix this?” is “yes” — we can retroactively percent-encode the URI so it appears like this:

A digression: note that the Unicode characters were first encoded to bytes with UTF-8: ‘Ø’ (U+00D8) encodes to the byte sequence hex C3 98, which then percent-encodes to %C3%98. This is not actually part of the standard: none of the URI standards (the most recent being RFC 3986) specify how a non-ASCII character is to be converted into bytes. I could also have encoded them using Latin-1: “Helen%20%D8deg%E5rd,” but then I couldn’t support non-European scripts. This is a mess, but it isn’t the subject of this article, and the world mostly gets along fine by using UTF-8, which I’ll assume we’re using for the rest of this article.

Okay, so that’s solved, but will it work in all cases? How about this URI:

Clearly, a human looking at this can tell that the value of the “redirect” argument is “”, which means that the “#” (U+0023) needs to be percent-encoded as “%23”:

But how did we know to encode the “#”? What if whoever typed this URI genuinely meant for there to be a query of “redirect=” and a fragment of “funny&name=Helen&gender=f”. It is wrong for us to meddle with the given URI, assuming that the “#” was intended to be a literal character and not a delimiter. The answer to “can we fix it?” is “no“. Fixing the above URI would only introduce bugs. The answer is “if you wanted that ‘#’ to be interpreted literally, you should have encoded it before you stuck it in the URI.”

The idea that you can:

  1. Take a bunch of URI components (as bare strings),
  2. Concatenate them together using URI delimiters (such as “?”, “&” and “#”),
  3. Percent-encode the URI.

is nonsense, because once you have done step #2, you cannot possibly know (in general) which characters were part of the original URI components, and which are delimiters. Instead, error-free software must:

  1. Take a bunch of URI components (as bare strings),
  2. Percent-encode each individual URI component,
  3. Concatenate them together using URI delimiters (such as “?”, “&” and “#”).

This is why I previously recommended never using JavaScript’s encodeURI function, and instead to use encodeURIComponent. The encodeURI function “assumes that the URI is a complete URI” — it is designed to perform step #3 in the bad algorithm above, which by definition, is meaningless. The encodeURI function will not encode the “#” character, because it might be a delimiter, so it would fail to interpret the above example in its intended meaning.

The encodeURIComponent function, on the other hand, is designed to be called on the individual components of the URI before they are concatenated together — step #2 of the correct algorithm above. Calling that function on just the component “” would produce:

which is a bit of overkill (the “:” and “/” characters do not strictly need to be encoded in a query parameter), but perfectly valid — when the data reaches the other end it will be decoded back into the original string.

So having said all of that, is there any legitimate need to break the rule and percent-encode a complete URI?

URI cleaning

Well, yes there is. (I have been bullish in the past that there isn’t, such as in my answer to this question on Stack Overflow, so this post is me reconsidering that position a bit.) It happens all the time: in your browser’s address bar. If you type this URL into the address bar: Ødegård&gender=f

it is not typically an error. Most browsers will automatically “clean up” the URL and send an HTTP request to the server with the line:

GET /admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f HTTP/1.1

(Unfortunately, IANA will redirect this immediately, but if you inspect the packets or try it on a server you control, then check the logs, you will see this is true.) Most browsers don’t show you that they’re cleaning up the URIs — they attempt to display them as nicely as possible (in fact, if you type in the escaped version, Firefox will automatically convert it so you can see the space and Unicode characters in the address bar).

Does this mean that we can relax and encode our URIs after composing them? No. This is an application of Postel’s Law (“Be liberal in what you accept, and conservative in what you send.”) The browser’s address bar, being a human interface mechanism, is helpfully attempting to take an invalid URI and make it valid, with no guarantee of success. It wouldn’t help on my second example (“redirect=”). I think this is a great idea, because it lets users type spaces and Unicode characters into the address bar, and it isn’t necessarily a bad idea for other software to do it too, particularly where user interfaces are concerned. As long as the software is not relying on it.

In other words, software should not use this technique to construct URIs internally. It should only ever use this technique to attempt to “clean up” URIs that have been supplied from an external source.

So that is the point of JavaScript’s encodeURI function. I don’t like to call this “encoding” because that implies it is taking something unencoded and converting it into an encoded form. I prefer to call this “URI cleaning”. That name is suggestive of the actual process: taking a complete URI and cleaning it up a bit.

Unfortunately (as pointed out by Tim Cuthbertson), encodeURI is not quite good for this purpose — it encodes the ‘%’ character, meaning it will double-escape any URI that already has percent-escaped content. More on that later.

We can formalise this process by describing a new type of object called a “super URI.” A super URI is a sequence of Unicode characters with the following properties:

  1. Any character that is in any way valid in a URI is interpreted as normal for a URI,
  2. Any other character is interpreted as a URI would interpret the sequence of characters resulting from percent-encoding the UTF-8 encoding of the character (or some other character encoding scheme).

Now it becomes clear what we are doing: URI cleaning is simply the process of transforming a super URI into a normal URI. In this light, rather than saying that the string: Ødegård&gender=f

is “some kind of malformed URI,” we can say it is a super URI, which is equivalent to the normal URI:

Note that super URIs have nothing to do with not percent-encoding of delimiter characters — delimiters such as “#” must still be percent-escaped in the super URI. They are only about not percent-encoding invalid characters. We can consider super URIs to be human-readable syntax, while proper URIs are required for data transmission. This means that we can also take a proper URI and convert it into a more human-readable super URI for display purposes (as web browsers do). That is the purpose of JavaScript’s decodeURI function. Note that, again, I don’t consider this to be “decoding,” rather, “pretty printing.” It doesn’t promise not to show you percent-encoded characters. It only decodes characters that are illegal in normal URIs.

It is probably a good idea for most applications that want to “pretty print” a URI to not decode control characters (U+0000 — U+001F and U+007F), to avoid printing garbage and newlines. Note that decodeURI does decode these characters, so it is probably unwise to use it for display purposes without some post-processing.

Update: My “super URI” concept is similar to the formally specified IRI (Internationalized Resource Identifier) — basically, a URI that can have non-ASCII characters. However, my “super URIs” also allow other ASCII characters that are illegal in URLs.

Which characters?

Okay, so exactly which characters should be escaped for this URI cleaning operation? I thought I’d take the opportunity to break down the different sets of characters described by the URI specification. I will address two versions of the specification: RFC 2396 (published in 1998) and RFC 3986 (published in 2005). 2396 is obsoleted by 3986, but since a lot of encoding functions (including JavaScript’s) were invented before 2005, it gives us a good historical explanation for their behaviour.

RFC 2396

This specification defines two sets of characters: reserved and unreserved.

  • The reserved characters are: $&+,/:;=?@
  • The unreserved characters are: ALPHA and NUM and !'()*-._~

Where ALPHA and NUM are the ASCII alphabetic and numeric characters, respectively. (They do not include non-ASCII characters.)

There is a semantic difference between reserved and unreserved characters. Reserved characters may have a syntactic meaning in the URI syntax, and so if one of them is to appear as a literal character in a URI component, it may need to be escaped. (This will depend upon context — a literal ‘?’ in a path component will need to be escaped, whereas a ‘?’ in a query does not need to be escaped.) Unreserved characters do not have a syntactic meaning in the URI syntax, and never need to be escaped. A corollary to this is that the escaping or unescaping an unreserved character does not change its meaning (“Z” means the same as “%5A”; “~” means the same as “%7E”), but escaping or unescaping a reserved character might change its meaning (“?” may have a different meaning to “%3F”).

The URI component encoding process should percent-encode all characters that are not unreserved. It is safe to escape unreserved characters as well, but not necessary and generally not preferable.

Together, these two sets comprise the valid URI characters, along with two other characters: ‘%’, used for encoding, and ‘#’, used to delimit the fragment (the ‘#’ and fragment were not considered to be part of the URI). I would suggest that both ‘%’ and ‘#’ be treated as reserved characters. All other characters are illegal. The complete set of illegal characters, under this specification, follows:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: “<>[\]^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

The URI cleaning process should percent-encode precisely this set of characters: no more and no less.

RFC 3986

The updated URI specification from 2005 makes a number of changes, both to the way characters are grouped, and to the sets themselves. The reserved and unreserved sets are now as follows:

  • The reserved characters are: !#$&'()*+,/:;=?@[]
  • The unreserved characters are: ALPHA and NUM and -._~

This version features ‘#’ as a reserved character, because fragments are now considered part of the URI proper. There are two more important additions to the restricted set. Firstly, the characters “!'()*” have been moved from unreserved to reserved, because they are “typically unsafe to decode.” This means that, while these characters are still technically legal in a URI, their encoded form may be interpreted differently to their bare form, so encoding a URI component should encode these characters. Note that this is different than banning them from URIs altogether (for example, a “javascript:” URI is allowed to contain bare parentheses, and that scheme simply chooses not to distinguish between “(” and “%28”). Secondly, the characters ‘[‘ and ‘]’ have been moved from illegal to reserved. As of 2005, URIs are allowed to contain square brackets. This unfortunate change was made to allow IPv6 addresses in the host part of a URI. However, note that they are only allowed in the host, and not anywhere else in the URI.

The reserved characters were also split into two sets, gen-delims and sub-delims:

  • The gen-delims are: #/:?@[]
  • The sub-delims are: !$&'()*+,;=

The sub-delims are allowed to appear anywhere in a URI (although, as reserved characters, their meaning may be interpreted differently if they are unescaped). The gen-delims are the important top-level syntactic markers used to delimit the fields of the URI. The gen-delims are assigned meaning by the URI syntax, while the sub-delims are assigned meaning by the scheme. This means that, depending on the scheme, sub-delims may be considered unreserved. For example, a program that encodes a JavaScript program into a “javascript:” URI does not need to encode the sub-delims, because JavaScript will interpret them the same whether they are encoded or not (such a program would need to encode illegal characters such as space, and gen-delims such as ‘?’, but not sub-delims). The gen-delims may also be considered unreserved in certain contexts — for example, in the query part of a URI, the ‘?’ is allowed to appear bare and will generally mean the same thing as “%3F”. However, it is not guaranteed to compare equal: under the Percent-Encoding Normalization rule, encoded and bare versions of unreserved characters must be considered equivalent, but this is not the case for reserved characters.

Taking the square brackets out of the illegal set leaves us with the following illegal characters:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: “<>\^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

A modern URI cleaning function must encode only the above characters. This means that any URI cleaning function written before 2005 (hint: encodeURI) will encode square brackets! That’s bad, because it means that a URI with an IPv6 address:

http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login?name=Helen Ødegård&gender=f

would be cleaned as:


which refers to the domain name “[2001:db8:85a3:8d3:1319:8a2e:370:7348]” (not the IPv6 address). Mozilla’s reference on encodeURI contains a work-around that ensures that square brackets are not encoded. (Note that this still double-escapes ‘%’ characters, so it isn’t good for URI cleaning.)

So what exactly should I do?

If you are building a URI programmatically, you must encode each component individually before composing them.

  • Escape the following characters: space and !”#$%&'()*+,/:;<=>?@[\]^`{|} and U+0000 — U+001F and U+007F and greater.
  • Do not escape ASCII alphanumeric characters or -._~ (although it doesn’t matter if you do).
  • If you have specific knowledge about how the component will be used, you can relax the encoding of certain characters (for example, in a query, you may leave ‘?’ bare; in a “javascript:” URI, you may leave all sub-delims bare). Bear in mind that this could impact the equivalence of URIs.

If you are parsing a URI component, you should unescape any percent-encoded sequence (this is safe, as ‘%’ characters are not allowed to appear bare in a URI).

If you are “cleaning up” a URI that someone has given you:

  • Escape the following characters: space and “<>\^`{|} and U+0000 — U+001F and U+007F and greater.
  • You may (but shouldn’t) escape ASCII alphanumeric characters or -._~ (if you really want to; it will do no harm).
  • You must not escape the following characters: !#$%&'()*+,/:;=?@[]
  • For an advanced URI cleaning, you may also fix any other syntax errors in an appropriate way (for example, a ‘[‘ in the path segment may be encoded, as may a ‘%’ in an invalid percent sequence).
  • An advanced URI cleaner may be able to escape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

If you are “pretty printing” a URI and want to display escaped characters as bare, where possible:

  • Unescape the following characters: space and “-.<>\^_`{|}~ and ASCII alphanumeric characters and U+0080 and greater.
  • It is probably not wise to unescape U+0000 — U+001F and U+007F, as they are control characters that could cause display problems (and there may be other Unicode characters with similar problems.)
  • You must not unescape the following characters: !#$%&'()*+,/:;=?@[]
  • An advanced URI printer may be able to unescape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

These four activities roughly correspond to JavaScript’s encodeURIComponent, decodeURIComponent, encodeURI and decodeURI functions, respectively. In the next section, we look at how they differ.

Some implementations


As I stated earlier, never use escape. First, it is not properly specified. In Firefox and Chrome, it encodes all characters other than the following: *+-./@_. This makes it unsuitable for URI construction and cleaning. It encodes the unreserved character ‘~’ (which is harmless, but unnecessary), and it leaves the reserved characters ‘*’, ‘+’, ‘/’ and ‘@’ bare, which can be problematic. Worse, it encodes Latin-1 characters with Latin-1 (instead of UTF-8) — not technically a violation of the spec, but likely to be misinterpreted, and even worse, it encodes characters above U+00FF with the malformed syntax “%uxxxx”. Avoid.

JavaScript’s “fixed” URI encoding functions behave according to RFC 2396, and assuming Unicode characters are to be encoded with UTF-8. This means that they are lacking the 2005 changes:

  • encodeURIComponent does not escape the previously-unreserved characters ‘!’, “‘”, “(“, “)” and “*”. Mozilla’s reference includes a work-around for this.
  • decodeURIComponent still works fine.
  • encodeURI erroneously escapes the previously-illegal characters ‘[‘ and ‘]’. Mozilla’s reference includes a work-around for this.
  • decodeURI erroneously unescapes ‘[‘ and ‘]’ (although there doesn’t seem to be a practical case where this is a problem).

Edit: Unfortunately, encodeURI and decodeURI have a single, critical flaw: they escape and unescape (respectively) percent signs (‘%’), which means they can’t be used to clean a URI. (Thanks to Tim Cuthbertson for pointing this out.) For example, assume we wanted to clean the URI: Ødegård&gender=f

This URI has the ‘#’ escaped already, because no URI cleaner can turn a ‘#’ into a “%23”, but it doesn’t have the space or Unicode characters escaped. Passing this to encodeURI produces:

Note that the “%23” has been double-escaped so it reads “%2523” — completely wrong! We can fix this by extending Mozilla’s work-around to also correct against double-escaped percent characters:

function fixedEncodeURI(str) {
    return encodeURI(str).replace(/%25/g, '%').replace(/%5[Bb]/g, '[').replace(/%5[Dd]/g, ']');

Note that decodeURI is similarly broken. The fixed version follows:

function fixedDecodeURI(str) {
    return decodeURI(str.replace(/%25/g, '%2525').replace(/%5[Bb]/g, '%255B').replace(/%5[Dd]/g, '%255D'));

Edit: Fixed fixedEncodeURI and fixedDecodeURI so they work on lowercase escape codes. (Thanks to Tim Cuthbertson for pointing this out.)


Python 2’s urllib.quote and urllib.unquote functions perform URI component encoding and decoding on byte strings (non-Unicode).

  • urllib.quote works as I specify above, except that it does escape ‘~’, and does not escape ‘/’. This can be overridden by supplying safe=’~’.
  • urllib.unquote works as expected, returning a byte string.

Note that these do not work properly at all on Unicode strings — you should first encode the string using UTF-8 before passing it to urllib.quote.

In Python 3, the quote and unquote functions have been moved into the urllib.parse module, and upgraded to work on Unicode strings (by me — yay!). By default, these will encode and decode strings as UTF-8, but this can be changed with the encoding and errors parameters (see urllib.parse.quote and urllib.parse.unquote).

I don’t know of any Python built-in functions for doing URI cleaning, but urllib.quote can easily be used for this purpose by passing safe=”!#$%&'()*+,/:;=?@[]~” (the set of reserved characters, as well as ‘%’ and ‘~’; note that alphanumeric characters, and ‘-‘, ‘.’ and ‘_’ are always safe in Python).

Mozilla Firefox

Firefox 10’s URL bar performs URL cleaning, allowing the user to type in URLs with illegal characters, and automatically converting them to correct URLs. It escapes the following characters:

  • space, “‘<>` and U+0000 — U+0001F and U+007F and greater. (Note that this includes the double and single quote.)
  • Note that the control characters for NUL, tab, newline and carriage return don’t actually transmit.

I would say this is erroneous: on a minor note, it should not be escaping the single quote, as that is a reserved character. It also fails to escape the following illegal characters: \^{|}, sending them to the server bare.

Firefox also “prettifies” any URI, decoding most of the percent-escape sequences for the characters that it knows how to encode.

Google Chrome

Chrome 16’s URL bar also performs URL cleaning. It is rather similar to Firefox, but encoding the following characters:

  • space, “<> and U+0000 — U+0001F and U+007F and greater. (Note that this includes only the double quote.)

So Chrome also fails to escape the illegal characters \^`{|} (including the backtick, which Firefox escapes correctly), but unlike Firefox, it does not erroneously escape the single quote.


You should XML-escape your URIs

In Web development on June 2, 2011 by Matt Giuca

A quick post about a sneaky little edge case I thought of. I don’t know if this is common practice, but when I encode a URI, I usually consider it “safe for XML”. It isn’t.

In other words, once I have taken a string like “dog’s bed & breakfast” and URI-encoded it to “dog%27s%20bed%20%26%20breakfast”, I would consider it safe to slap into an XML or HTML attribute value, such as <a href=”dog%27s%20bed%20%26%20breakfast”>…</a>. But it isn’t.

There’s one little problem: URIs allow (and frequently use) the ampersand character (“&”) to separate query string arguments. XML attribute values specifically disallow this character. It isn’t a problem with the above string, because the ampersand was part of the string, and so it was escaped as “%26”, which is a perfectly legal XML attribute value. But any URI with multiple query string parameters that you put into an attribute value is technically invalid XML. For instance, if you had the query parameters {“name”: “henry”, “age”: “27”}, that encodes to the query string “name=henry&age=27”. The XML element <a href=”″>…</a&gt; is invalid, because it contains a bare ampersand in the attribute value. Browsers, however, don’t seem to mind, and will process the above link properly.

The problem happens on the edge cases. Consider the query parameters {“name”: “henry”, “lt;”: “27”}. They encode to the query string “name=henry&lt;=27″, and if you put that unquoted into XML, you get <a href=”;=27”>…</a&gt;, which is valid, and completely different to what you intended (it parses as query parameters {“name”: “henry<=27”}). Even if your URI encoder escapes the “;” (which it should, as “;” is a reserved character), you’ll still get <a href=”″>…</a&gt;, once again invalid, but both Firefox and Chrome still manage to parse &lt as “<“.

So even if you have URI-encoded (which you should do), you still need to XML-encode before putting it into the attribute value: <a href=”;lt;=27”>…</a&gt;. A minor point is that if you are using single-quoted XML attribute values, you also need to make sure that if your URI has single quotes (which it is technically allowed to have), that they are being XML-encoded as well.

Also see my full post about URI encoding, URI Encoding Done Right.


This Week in Google’s discussion of H.264

In Web development on January 27, 2011 by Matt Giuca Tagged: , , , , , ,

I’ve been an avid listener of This Week in Google for the past few months now. Don’t get me wrong — I really love Leo, Jeff and Gina. They are usually very insightful and nearly always defend Google when they need defending (e.g., in the Wi-Fi spying debacle), but they’ll go against Google when they do the wrong thing (e.g., the Verizon net neutrality deal). Being three web entrepreneurs, these guys seem to really “get” the concept of the open web. So I couldn’t understand why, in episode #77, the three of them unanimously and almost without any debate or counter-arguments, bagged out the recent Google decision to drop H.264 from Chrome (which I blogged about previously). They brought up all the same arguments: Google are just doing this to get back at Apple, Google are trying to support Flash, Google made an Evil move. I tweeted “Surprised & dismayed to hear a unanimously anti-Google stance re #h264 on #TWiG. Was expecting a less short-sighted and Apple-centric view.”

So I was keen to catch their follow-up discussion this week, and it wasn’t what I expected. They came back a little sheepish (from having been too hard on Google), a little apologetic, and a little more understanding. Unfortunately, though, they seem to have misunderstood some crucial technical details, rendering their new defense of Google (“it was a bad PR move”) almost entirely invalid. I took the time to transcribe the entire conversation on this topic, from TWiG #78 (from 26 minutes in — underlines mine):

Leo: Now it’s been a week. We’ve absorbed the dropping of H.264. We talked a lot about it, if you didn’t hear last week’s TWiG, Kevin Marks was on. He was great, he explained it all. We’ve had time to think about it. Do either of you feel a little bit better about Google dropping H.264, cause I’ll say up front, I kinda do. I kind of, maybe I’m a true believer, but I kind of think that they took a pretty big hit in the name of openness. Do you feel better, Gina?
Gina: They definitely took a pretty big hit. We were pretty hard on them. I do feel a little better about it, although I have to say I’m looking at the Engadget article now. Did they issue a public defense?
Leo: Well it wasn’t very …
Gina: It wasn’t very public, was it?
Leo: It was public, but I don’t think it was that compelling. They basically just re-iterated, “no, we did this for openness.”
Gina: And the core of this argument is “we’re not evil, like we’re doing this thing that could be interpreted as evil or not, just trust us, our motives are not evil.”
Leo: I don’t think that’s a good argument. (laughs) “Trust us, we’re not evil.  Trust us! Really! Honest!”
Gina: Well, you know, we care about the future of the web, we want to make it open.
Leo: I mean, it carried water for me because I believe fundamentally that’s true. I know far too many Google people, which is always a mistake, I admit it. In fact one of the things I’ve always done in my career is distance myself from the companies I cover, cause once you start knowing them and liking them it’s very hard to believe ill of them.
Jeff: Amen.
Leo: So maybe I know them too well.
Gina: I admit to that too.
Leo: But all the engineers I know at Google really are highly committed to open. So even if corporate says “No” for business reasons, the troops at Google aren’t going to go along for the ride. And I do believe you can say, look, we understand WebM may be slightly encumbered, may be slightly undesirable, nobody’s using WebM, no browser supports it, but we want to put a flag in the sand. We want to say “open is better.” There’s gonna be some pain. We’re gonna suffer some pain. I understood their argument. Don’t confuse Flash (cause we said that this de facto helps flash) — but they said OK, yes, maybe, but don’t confuse Flash with H.264. These are different things. What we are talking about is the video, and I did say this last week, it’s the thing. What is HTML5 gonna do when it sees . And you can all add plugins for every codec you want, that’s fine, but we are gonna put a stake in the sand and say, “What it should do is use WebM. It should use an unencumbered open standard.”
Jeff: Isn’t that a different way to say it, Leo, is that that’s a default, rather than support. They said “We’re going to stop supporting H.264. You can ship the codec with it, but we’re gonna default to WebM.”
Leo: If you use Windows, and Chrome, it will support H.264, cause Windows builds in the codec.
Gina: Well the Chromium project never supported H.264.
Leo: Nor did Mozilla.
Gina: But Chromium and Chrome are different, right. So Chrome is the Google product that they’re gonna ship, so they used to support H.264. Chromium didn’t, but now they’re not. I think honestly this was just a bad PR move.
Jeff: I think that’s what I’m saying, is you can say it differently, and say “we’re gonna default to open now“, and everyone would have hugged them.
Leo: I think the problem is it’s too technical of a … that’s kind of what they’re saying but it’s technical. They said “”, but I don’t think that most even technical people understand the difference between a plugin and native support. There has never been in HTML a definition of what happens when you see video. In every case, up until now, when you see video you have to have a plugin. HTML does not support video. HTML5 will support a tag. What happens when that tag happens? What does the browser do? Does it go out and and launch a plugin? Well it could, in fact it will have to in Safari, and that’s why Google says they’re gonna make WebM plugins for Safari, we’re gonna make a WebM plugin for IE9, but that’s interim. The real question is if it’s an HTML5 browser, and it runs across a tag, has no plugins, has no Flash, there’s nothing installed, what’s it gonna do when it sees ? And we believe that all browsers should default to a baseline of WebM. If you want to add more, that’s fine.
Jeff: That’s fine. The problem here is the precedent set by Apple, with Flash. It sounded like that.
Leo: You’re right. You know what, you’re exactly right. You hit the nail on the head. In the context of what Apple did, it sounded like they were doing the same thing.
Jeff: Exactly. That’s the PR problem they had, and so it was so easy to say “you can install whatever you want, folks, but we are now going to default to open.” Would have been fine.
Leo: Perfect. You’re right.
Jeff: “We’re gonna not support H.264,” that sounds like such a Steve Jobs thing (love you, Steve, hope you’re better).
Leo: They should hire us to do their PR. (laughs)
Gina: Yeah, I wonder. I don’t think they expected the backlash they got, like some project manager posted this on the Chromium blog, I don’t think they maybe expected the response that they got.
Leo: I think that’s the case, and this is one of the reasons I like Google. They’re not that polished actually.

So firstly, some factual issues. “No browser supports it” is bullshit — WebM is fully supported in recent releases of Chrome/Chromium and Opera, and in recent betas of Firefox. As for the claim that Chrome on Windows will use H.264 regardless, I believe this is false (and Leo won’t like it). It is my understanding (supported by this blog post) that Chrome, like Firefox and Opera, but unlike IE and Safari, uses only its own bundled codecs (Theora and WebM, and for the time being H.264), not the system ones. This means that dropping H.264 support from Chrome would mean it couldn’t play H.264 video, even if Windows had the codec installed. I think that’s the way it should be, because otherwise web admins will upload videos in a thousand random codecs of whatever they have installed on their system, and the web will fragment. As argued in the same post, it is better to have a small set of standard codecs in the browser.

Now I really don’t get what Gina meant by “this thing that could be interpreted as evil or not, just trust us, our motives are not evil.” As I argued previously, I don’t like having to trust companies not to be evil, and in this case, I don’t have to. Google has made a perpetual royalty-free license on WebM so they cannot turn around and stab us in the back. Google’s intentions here are out in the open. They don’t want browser manufacturers (which includes themselves) and video content hosts (which includes themselves) to have to pay a license to serve and play video. This isn’t an altruistic position. It is pragmatic and in their own interests (and happily, ours too). When a company does something altruistic, you need to be suspicious and look for the hidden agenda. There is no hidden agenda here; they are looking out for their own bottom line, and the health of the web (which they profit from).

And then we get bogged down in this very murky discussion of “defaults” versus “support”. Re-reading the transcript, Leo seems to be suggesting that which codec to use is up to the browser — somewhat true, but only if the website supports both codecs. They seem relieved to discover that this news means not that Chrome won’t support H.264, but only that it will prioritise WebM over H.264 if given the choice. Is that what they’re saying? It isn’t correct. It is true that Chrome won’t support H.264. If Chrome finds a website that only supports HTML5/H.264, it will not play the video. Jeff claims that saying “we won’t support H.264” is an Apple thing to say, whereas saying “we default to WebM” is what they should have said. I disagree — “we won’t support H.264” is the truth, whether you like it or not. “We default to WebM” sounds exactly like the sort of spin Apple would put on this announcement. It seems we’re now in such an Apple-centric world that the truth is considered “bad PR”, whereas a vague and confusing spin which hides the true nature of the underlying technology is something people are happy with.

But it gets murkier. They seem to be confusing “no support” as in “this software does not contain this feature” with “no support” as in “you are expressly, technically and legally, forbidden from adding this feature to your own device.” This is what distinguishes Google from Apple, and that’s what they missed. Because it’s true that Google will no longer be supplying the H.264 feature on Chrome, just as Apple decided not to supply the Flash feature on iOS — in that regard, this is the same thing as Steve Jobs saying “we will not support Flash.” Which, by the way, I have no problem with — Jobs can put any software he wants in his product, and leave out any he doesn’t want. The difference is that not only is Chromium open source, so you can add the feature back if you really want it, but that even the closed-source Chrome supports plugins, so you could add a H.264 plugin. iOS explicitly forbids you from installing Flash.

I feel like it’s been a week, and we’ve had all this arguing, and finally have forgiven Google for all the wrong reasons. Rather than celebrating a major step towards removing the last holdout for the open web, rather than realising as much as you like H.264, it would never have been supported by Firefox anyway (not because Mozilla are pig-headed, but because as an open source company they literally can’t support it, and even if they did, Linux distros would have to take it out again) — and that that is precisely why it is a bad thing, we have instead forgiven Google because it turns out we can work around this decision and use H.264 after all. Way to entirely miss the point.


A response to criticism over the Chrome H.264 debacle

In Web development on January 14, 2011 by Matt Giuca Tagged: , , , , ,

There has been a lot of arguing on the Internet in the past few days over Google’s decision to drop H.264 video support from Chrome. I was really surprised to see the majority of posts (I read) were negative towards Google. I thought the Internet had more sense. I think this move just might save us from twelve years of patent hell so I absolutely applaud Google for doing it.

So here is a rebuttal to pretty much every negative comment on the Chromium blog (linked above), as well as this Ars Technica opinion. If there are any negative arguments I’ve missed, yell in the comments and I’ll either add a rebuttal, or acknowledge it as a fair point. Now on the exact licensing terms of the MPEG-LA, I refer to this document and this additional press release. Firstly, from what I understand it, here are the terms of the license, divided up (simplified) into three use cases:

  • For manufacturers of encoders (that is, programs which create H.264 video), there is a license fee with a maximum of $6.5 million per year.
  • For manufacturers of decoders (that is, programs which view H.264 video, such as web browsers), there is also a license fee with a maximum of $6.5 million per year.
  • For distributors of content (that is, websites which serve H.264 video), there is a license fee with a maximum of $6.5 million per year (with a maximum of $100,000 per video). However, for distributors of content where the end user does not pay, there are no royalties for the life of the patent.

H.264 supporters are quick to jump on that last point, claiming that this makes H.264 free. But it doesn’t, because you still need to pay if you are serving video behind a firewall. It is still illegal to make a web browser without paying the fee (assuming H.264 becomes a mandatory feature of all web browsers), and it is illegal to make software for encoding video without paying the fee. Importantly, these fees still apply even if you are giving away your software for free. This means the end of open source web browsers and video encoders.

Now basically every comment boils down to one of the following complaints:

  • It’s not like Google can’t afford the $6.5 million-per-year fee.
    • True, but that misses the point. Firstly, when you say “Google should continue to pay the license fee,” you are making a business decision for them. If a company has a $6.5M-per-year expense and they want to cut it, they are well within their rights to do so, even if it affects you personally or they could afford it. They don’t have a contract with you to provide this service. Secondly, not all browser manufacturers can afford it. Mozilla can’t (or won’t), and if I decided to write a browser tomorrow, which suddenly became popular, I wouldn’t be able to afford it either. If Google paid the yearly fee, they would be asserting their vast wealth, and saying “look, we’re one of the few companies in the world that can afford to make a popular web browser.” By not paying the fee, and dealing a blow to H.264, they are saying “we want smaller companies and people to be able to make browsers too. More browsers is better for the web.”
  • Google is a hypocrite because they are dropping H.264 in the name of openness, but still support Flash. / H.264 is patented but at least fully open, whereas Flash is closed source. (This makes up about a third of the comments.)
    • Not even Google can change the world over night. Flash is currently so entrenched that you couldn’t possibly drop support for it (unless you are Apple and millions of developers and users will bend to your every whim). Google will probably eventually drop support for Flash, once HTML5 is far enough along. But for now it is simply impractical to drop Flash support, while it is quite practical to drop H.264 support.
    • In terms of which is more open out of H.264 and Flash, both are published standards. H.264 has open source implementations, but is patented, whereas Flash is a closed-source implementation that nobody has fully replicated yet. This is a totally different issue. With Flash, Adobe is protecting their implementation — their own work (as is their right), but they won’t stop competing implementations. With H.264, MPEG-LA is outlawing all possible implementations, even those which they didn’t write.
    • Update: Can I just re-iterate the first point? Google has to pay to put H.264 in the browser. They don’t have to pay to put Flash in the browser. It’s their wallet, not yours. It isn’t hypocritical to use something someone gives you for free (even if it’s “bad”), yet not be willing to pay for something else which is “bad”. (Hey, I should know: I used an iPod which my parents bought me for years, but like hell I would pay for one!)
  • Google are probably being paid by Adobe to hold back adoption of HTML5 video
    • Google have ties with Adobe due to their support of Flash on Android. But supporting Flash natively is just a way to make the browsing experience better. As I said above, Flash is too entrenched to get rid of, whatever your ideals are — for now. I’ll believe accusations of back-room deals when I see them.
  • Google is a hypocrite because YouTube supports H.264.
    • Yes. Your point? Everybody supports H.264 (at least in a Flash wrapper). That’s precisely the problem Google is trying to break away from. YouTube also supports WebM.
  • Stupid move, and it won’t have any impact. Chrome doesn’t have enough market share. / Nobody will bother to encode WebM just for the benefit of Chrome users.
    • True, Chrome only has 10% market share, and that might not be enough to convince web admins to support the format. But Firefox doesn’t support H.264 either (and will soon support WebM). Combined, they have over 30% of the market.
  • This is bad for the open web because sites will just go back to supporting Flash. / This will slow the adoption of HTML5 video.
    • Perhaps a bit, but since Firefox has double the market share of Chrome, and it doesn’t support H.264, this was already a problem — we can’t move to a pure HTML5+H.264 web while Firefox doesn’t support it. It is better to stay with a proprietary old standard while we build towards an open new standard than transition from a proprietary old standard to a proprietary new standard. Once the transition is complete, we’ll be too exhausted to do it again for another ten years. I disagree that transitioning to H.264 is better for the “open web”.
    • I couldn’t believe this extremely narrow-minded comment on the Ars Technica article, under the heading “This hurts the open web”: “even Firefox users would be able to use H.264 video through Microsoft’s plugin for that browser” — how the hell can we call it the “open web” if users of the leading open source browser are forced to use a proprietary plugin which only works on a single proprietary operating system? That’s just the same as Flash, only worse, because it’s for Windows only.
  • This is bad for site admins because now they have to encode their video in two formats.
    • True. But it’s already bad for site admins who have to support both Flash and HTML5+H.264 for Apple devices (though to be fair these both support the same underlying H.264 codec). These same admins are looking forward to a future where they can drop Flash support, but cannot due to browsers which don’t support HTML5. The problem is, H.264 video will never be supported by the open source browsers, so they will always have to support either Flash or an open video codec. This move might help move towards a single, open video codec.
    • Also, since Adobe has announced that Flash will soon support WebM, site admins will be able to provide HTML5+WebM content with a fallback to Flash+WebM for browsers which don’t support WebM directly, leaving only iOS (which supports neither). That would be just as reasonable as HTML5+H.264 with a fallback to Flash+H.264, only Apple can implement WebM if they want to (whereas open browsers cannot implement H.264).
  • This is bad for users because suddenly a whole bunch of sites will stop working.
    • False. Currently, there are no websites which exclusively support HTML5 video; they all fall back to Flash (obviously this won’t break the web, because otherwise Firefox would already be broken). Therefore, now is the time for any willing browser manufacturers to drop support for H.264 without Flash, before it becomes the standard. It is too late to drop support for Flash without it first being replaced by another standard. Dropping H.264 at this early stage will not affect any users. If we wait, it will be too late.
    • By contrast, when Apple dropped support for Flash, that was bad for users because it broke a shitload of existing websites. Apple was so powerful that they managed to get pretty much the whole Internet to switch over to their new patent-encumbered standard, H.264. That was bad for site admins and users.
  • All the other browsers support H.264. If only Google continued to support it, we could finally agree on a standard.
    • False. Firefox has never supported H.264 (without Flash), and never will. An open source product can’t ever support it, so therefore we will never have an open source browser supporting this standard. Bear in mind that Chromium, the open source version of Chrome, has never supported H.264 either, for the same reason. This is part of Google’s motivation: to make Chrome more open source (the H.264 part of Chrome is proprietary, by definition). Hence the announcement: “we are changing Chrome’s HTML5 support to make it consistent with the codecs already supported by the open Chromium project.”
  • H.264 was around in HTML5 first. Why is Google trying to change it now?
    • False. HTML5 video was first introduced with Ogg Theora as the standard format. Due to refusal by certain browser manufacturers to support Theora (largely Apple’s support of H.264 on iOS), the codec was removed from the standard. As it stands now, browser manufacturers are free to implement any codec.
  • This won’t kill open source browsers. You can still distribute source code, just not binaries.
    • False. Or, maybe technically true, but that isn’t how open source works. “Open source” does not refer to programs distributed only in source code. It refers to programs whose source code is available. The vast majority of open source users do not build their software from source. If, for example, Mozilla were to put H.264 support in Firefox, it would become illegal to distribute Ubuntu with a binary of Firefox, and all Ubuntu users would need to compile Firefox from source. Even Google, who currently pays the $6.5M-per-year fee, does not include H.264 support in the open source version of Chromium.
    • Furthermore, even if you could argue in court that you didn’t distribute an implementation of the patent, only the source code, this is not a risk many small open source developers would be willing to take. The cost of implementing a web browser is simply too high under this regime.
    • Edit: Nick Chadwick points out that FFmpeg (an open source video encoder/decoder) supports H.264 for both encoding and decoding. I am openly wondering how they are able to distribute binaries (and the implications for distros such as Ubuntu). Edit: David Coles points out that free distributions such as Debian and Ubuntu have removed H.264 support (and other codecs) for this reason.
  • This is about free-as-in-cost (gratis) not free-as-in-speech (libre). H.264 is open, you just have to pay for it.
    • This is a point the Ars Technica article raised. It treads the subtle line between gratis and libre — the implication being that you shouldn’t make this a moral issue when it is merely a financial issue. The problem is, patents are a free-as-in-speech issue. Gratis is about how much something costs, whereas libre is about what you can do with something once you have it (i.e., your freedoms). Now if MPEG-LA had developed an H.264 encoder and decoder, for example, and were charging for it, that would be a gratis issue. You would have to pay for it, but if someone figured out the protocol, they could make their own without being restricted. Instead, the MPEG-LA has given away the spec for free (gratis). But in doing so, they have told you what you can and cannot do with it, and what you cannot do is make a web browser without paying them. Imposing a financial cost on somebody if they take a certain action (once they already have your property) is limiting their freedoms (libre), not charging for a service (gratis).
  • H.264 is an open standard. The MPEG-LA have promised not to charge any royalties.
    • No, it isn’t. As I outlined above, MPEG-LA will still seek royalties from encoder and browser manufacturers, and site operators distributing behind a paywall.
  • WebM is an inferior codec / WebM isn’t as fast because H.264 is implemented in hardware.
    • Maybe it is inferior, but it’s the best codec we have that isn’t patent-encumbered. Here is a technical analysis (which is way over my head) of the WebM codec, which doesn’t speak well for it. Edit: See the analysis of WebM and its patent risk (in reply to the above link), which basically explains that WebM was specifically designed to be inferior to H.264 to avoid treading on MPEG-LA’s patents. It is sad times indeed.
    • Now of course, H.264 is implemented in very fast hardware on iPhones and many other devices, whereas WebM needs to be decoded in software. But of course, if WebM took off, newer devices could support it in hardware instead, so that argument doesn’t work in the long run.
  • Nobody other than Google supports WebM; it isn’t going anywhere.
    • For what it’s worth, the WebM Project page shows a very large list of supporters, including Mozilla (WebM will be implemented in Firefox 4), Opera (Opera already supports WebM), Adobe (WebM will be implemented in an upcoming version of Flash), FFmpeg, AMD (owner of ATI), NVidia and ARM.
    • Not to mention, Google. Between Chrome, Android and YouTube, that’s a significant chunk of the browser, hardware and content delivery markets.
  • WebM might be patented too.
    • Of course the problem with patents is you never know when you’ve infringed one. Unlike copyright, where if I create something entirely on my own then I own it, with patents I can infringe someone’s patent merely by inventing the same thing they did. Therefore, nobody can say for sure that WebM doesn’t infringe on any patents and MPEG-LA has audaciously suggested they will begin charging for use of WebM — typical behaviour of a patent troll. But so far, nobody has named any specific patents infringed by WebM.
    • Edit: Here is an analysis of WebM and its patent risk.
  • This will put Apple at a disadvantage; if the web moves over to WebM, iPhone won’t be able to play video any more / This is a power play by Google to lock out the iPhone.
    • If WebM does become the standard, Apple can easily implement it. It is open source and patent free, so it’s not like Google is trying to make everyone use a format they control and lock out the competition. (Of course, implementing WebM in hardware isn’t trivial, but I’m sure Apple have the resources to do it if they were pressed to.)

Edit: I just found this post by an Oracle developer which provides some similar rebuttals to mine.


Browser URI encoding: The best we can do

In Web development on July 8, 2008 by Matt Giuca Tagged: , ,

This is a follow-up post to last week’s post Do browsers encode URLs correctly?, in which I asked whether Firefox encodes form data correctly when submitted with the application/x-www-form-urlencoded content type.

I’ve done a bit more research into URI encoding since then. Firstly, I think I was a bit over-zealous in stating that “UTF-8 is the only way to go”. Clearly, URI encoding is an octet encoding, not a character encoding, so it’s still perfectly valid to have a non-UTF-8 encoded URI.

As for what browsers should do for form submission, I think I was reading the wrong place. I should have been reading the HTML specification – after all, that is what governs how to submit HTML forms, not the generic URI syntax. It seems once again, it’s very vague because we’re dealing with technology originally designed only for ASCII. I still cannot find anyone willing to say “you must use UTF-8” – so basically this means UAs are free to do whatever they want. That is incredibly annoying for everybody involved.

For the record, the two important parts of the HTML 4.01 specification are as follows. The appendix section B.2.1 (informatively) suggests that while URLs in documents should not contain non-ASCII characters, UAs encountering non-ASCII characters should encode them with UTF-8. It also warns about implementations which rely on the page encoding rather than UTF-8. However note that this deals with URLs in links, not form submission, so it doesn’t count. I just bring it up because it’s an official suggestion of a strategy for URI-encoding.

The HTML 4.01 spec does not, however, mention what to do with non-ASCII characters in form submissions. Well, actually, it explicitly forbids them, in section 17.13.1:

Note. The “get” method restricts form data set values to ASCII characters. Only the “post” method (with enctype=”multipart/form-data”) is specified to cover the entire [ISO10646] character set.

This is … kind of … unacceptable. We need to be able to write apps which accept non-ASCII characters. So basically all the browsers are allowing “illegal” activity, for a very good reason. And that means no standards!

So as I found out in the previous post, Firefox uses the document’s charset to encode the URI. I’ve since empirically discovered that at least two other major browsers (Safari and Internet Explorer 7) do the same. I’ve also discovered the form’s “accept-charset” attribute, which gives document authors a lot more explicit control. All three browsers will respect this attribute and use it to encode URIs. If unspecified, it falls back to the document’s charset.

For example, if you specify accept-charset=”utf-8″ on a form element, it will be submitted with that encoding, regardless of the document encoding. I thereby strongly recommend you use this on all forms, even if your documents are already encoded in UTF-8. This is so that if the document is transcoded, the browser behaviour doesn’t change.

The good news as far as standardization is concerned is that the upcoming HTML5 *may* explicitly sanction this behaviour. While this section of the spec is currently empty, it points you to a “Web Forms 2.0” draft, which states:

The submission character encoding is selected from the form’s accept-charset attribute. UAs must use the encoding that most completely covers the characters found in the form data set of the encodings specified. If the attribute is not specified, then the client should use either the page’s character encoding, or, if that cannot encode all the characters in the form data set, UTF-8. Character encodings that are not mostly supersets of US-ASCII must not be used (this includes UTF-16 and EBCDIC) even if specified in the accept-charset attribute.

Okay! I would like to see this get standardized. Note that it also states:

Authors must not specify an encoding other than UTF-8 or US-ASCII in the accept-charset attribute when the method used is get and the action indicates an HTTP resource.

So this means ISO-8859-1 (Latin-1) is out. Hence you should always use accept-charset=”utf-8″.

So to answer the question: “Do browsers encode URIs correctly?”, the answer is, “they do the best they can”. So I retract my initial accusations that Firefox is doing something wrong, and point my finger at the document authors – be explicit and no harm shall befall you! Also at the W3C – hurry up and get HTML5 standardised so there are some official guidelines on this matter!


Do browsers encode URLs correctly?

In Web development on June 29, 2008 by Matt Giuca

This post is an open question, which I have just discovered. I haven’t fully tested it or researched it.

As you may recall from my “URI Encoding Done Right” post, I said that non-ASCII characters in a URI are supposed to be first encoded using UTF-8 into a byte stream, then percent-encoded. I got this information from RFC 3986 – URI Generic Syntax Section 2.5, which states:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.  For example, the character A would be represented as “A”, the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as “%C3%80”, and the character KATAKANA LETTER A would be represented as “%E3%82%A2”.

This document is the current official reference on URI syntax, and it seems to pretty clearly state that UTF-8 is the way to go. (Perhaps though, I am confused about what “a new URI scheme” means).

Well I noticed that Mozilla Firefox (3.0) does not always encode URIs using UTF-8. When you submit a form, the contents of the input boxes are encoded in the current page encoding. For instance, say you serve a page with the encoding “iso-8859-1” which has a text box. If the user types “À” (LATIN CAPITAL LETTER A WITH GRAVE) into the box and clicks submit, the data should be encoded as “%C3%80”, as given in the example (and it would be, were the content type “utf-8”). However, in Firefox at least, it is encoded as “%C0”, which is the ISO-8859-1 code for “À”.

Even more bizarre, if you give a character outside the character set’s repertoire, Firefox first escapes it as an XML character reference, then percent-encodes that! So if you type “ア” (KATAKANA LETTER A, a character not found in ISO-8859-1) and click submit, it should be encoded as “%E3%82%A2”, but is in fact encoded as “%26%2312450%3B” – which decodes to “& #12450;” (12450 being the decimal value of KATAKANA LETTER A’s Unicode code point, this will render correctly if displayed as HTML).

This first behaviour (encoding as whatever the current character set is) seems logical. If the server is dealing entirely in that character set, then it will work. But if I am interpreting RFC 3986 correctly, then it’s problematic, because a cooperating server will always decode the URI as UTF-8, in which case Firefox will be producing invalid input for pages not served as UTF-8.

The second behaviour (escaping as an XML character reference) seems completely wrong. If the server is going to echo the text back as raw HTML, then it will display correctly, but surely most applications are going to do some processing of input, and they should be HTML-escaping them anyway, making this pretty much a retarded idea.

So I’d like to know if anyone knows of a justification for this behaviour. I’ll also investigate some other browsers’ behaviour. There’s a discussion here on the subject: The Secret of Character Encoding. This page states:

There is no official way of determining the character encoding of such a request, since the percent encoding operates on a byte level, so it is usually assumed that it is the same as the encoding the page containing the form was submitted in.

It seems like this page’s author has not read RFC 3986! [Ed: The author has since contacted me, and we agree, even after 3986, there is no mandate to use UTF-8. It is just a suggestion for new protocols.]

It should be noted that prior to 2005, when 3986 was written, the syntax was governed by RFC 2396 (now obsolete), which is not specific about encodings.

I think the bottom line is, the browser is “wrong” because if you look at the HTTP request it is sending, it doesn’t actually state the charset at all (if it’s a GET request then it has no Content-Type header). Hence the character set is implicit in the previous page response. That doesn’t make sense because the request could be sent to a completely different site. It’s just the browser assuming that since the server is sending pages in a particular charset, it would like to receive URLs in that same charset, and I think that’s an invalid assumption.

[Ed: I’ve since done a follow-up post: Browser URI encoding: The best we can do, where I make a few more corrections to this post.]


Uploading files like Gmail client

In Web development on June 17, 2008 by Matt Giuca

One of those tutorials I wish I had written, since I already “reverse engineered” it (if I can call it that) on my own and coded up my own Ajax-style upload box for a web app I worked on last summer.

But someone beat me to it: Sajith.M.R explains how to upload files like the Gmail client. The first article on Gmail was very cool as well (link from there).

A small correction: “Ajax does not support multipart/form-data posts” – well you can certainly make it support such posts! The real problem here is that JavaScript provides no way to access the contents of, or dynamically send, a file upload box. Hence this trick is a nifty workaround.

In my experience, the biggest pain using this method is actually debugging it. If you’re used to using Firebug (for example) to review XMLHttpRequests, well you don’t get none of that here. (Though Wireshark should still be a handy debugging tool in this case).


Portable JavaScript: String indexing

In JavaScript,Web development on June 15, 2008 by Matt Giuca

A short community service announcement. I recently got burned on a JavaScript project which, as usual, wasn’t working in Internet Explorer. I decided to track down one of the issues which simply caused an “error on page” message in IE.

It turns out that string indexing (such as str[index]) doesn’t work in IE. You need to do this:


The technical reason for this is that IE doesn’t implement JavaScript, it implements JScript. Both are derivatives of ECMAScript. The str[index] feature is a feature of JavaScript alone, not ECMAScript or JScript.

(I think Microsoft were forced to call it JScript for legal reasons, but it does give them the convenient excuse “hey, this isn’t JavaScript, it’s JScript”).

From Mozilla’s reference:

The second way (treating the string as an array) is not part of the ECMAScript; it’s a JavaScript feature.

So there you go – charAt from now on!

(Another similar tip: Don’t leave a trailing comma at the end of an object literal – it works in JavaScript but not in ECMAScript, or IE).

Now if you’ll excuse me, I’m off to grep for ‘[‘. 😦

Update: The square-bracket indexing feature is included in ECMAScript 5 so should become a standard feature. From Mozilla’s reference:

Array-like character access (the second way above) is not part of ECMAScript 3. It is a JavaScript and ECMAScript 5 feature.

Also, it appears to be supported in Internet Explorer 8 and above (although I haven’t tested it). Thanks to Jean-Marc Desperrier and Chris Donnelly (in the comments) for pointing these out.


URI Encoding Done Right

In Web development on May 24, 2008 by Matt Giuca Tagged: , , ,

Well I guess it’s about time to actually make some content-filled posts!

I’m going to start by talking about web development, and correct handling URIs on the client and server side. This is something that I expect all good web frameworks should be able to deal with, but I’m not a big fan of such contrivances, and I believe whether you use one or not, you should understand how this stuff works.

With no work at all, you can make URIs that work for a bunch of strings you might try, but they’ll fail as soon as someone uses a special character. With a tiny amount of work, you can make your URIs encode and decode correctly for most things people might try. But there are still a bunch of edge cases you might not know about. This article is about the extra work you need to do (or at least the extra things you need to think about) for your URIs to encode and decode correctly for all characters including non-ASCII / Unicode ones.


  • I use the term “URI” – Uniform Resource Identifier – to refer to what are commonly called “URLs”. It’s just a slightly more general term for the same concept.
  • Apologies for the curly quotes in this article. They’re done automatically by the blogging software; please consider them to be straight quotes when quoting characters and strings.
  • You should know a bit about Unicode – if you don’t then go read up on it! You should not be programming if you don’t know at least a bit about Unicode.

URI Encoding Rules

The issue here is that URI syntax defines a set of “reserved characters”, including, the following: :, /, ?, #, &, =.

These characters are used in the URI syntax as delimiters. For example, the ‘?‘ indicates the beginning of the query string, and the ‘&‘ separates key/value pairs in the query string. When you put arbitrary strings into a URI, it becomes important to escape characters to avoid confusing them with delimiters.

For example, say you are writing a wiki, and you have a page called “black&white”, you might create a URL like this: “?page=black%26white&action=view”. Note the “%26” is a proper URL-encoded form for the ‘&’ character – it is a ‘%’ sign followed by the 2-digit hexadecimal ASCII code for ‘&’ (0x26). It is very common to see space characters (‘ ‘) encoded as “%20”, the ASCII code for ‘ ‘. Spaces may also be encoded as ‘+’ characters (though this is the behaviour of HTML forms, not part of the URI syntax).

The lesson is, you need to escape such characters, or they’ll be misconstrued as delimiters. If you wrote this wiki in JavaScript, and had code to construct part of a URI:

qs = "?page=" + pagename + "&action=view";

Then if pagename == “black&white”, your URI would be “?page=black&white&action=view”. This will be interpreted as key/value pairs: page = “black”, white (with no value) and action = “view”. So “black&white” should have been escaped as “black%26white”.

You also need to escape other characters in order to correctly generate the URI syntax. In particular, all non-ASCII characters (ie. Unicode characters) need to be first encoded as UTF-8 (a stream of bytes), and then those bytes need to be URI-encoded. For example, the string “Ω” (character code U+2126) is encoded in UTF-8 as the 3-byte sequence “\xe2\x84\xa6”. So it is encoded as the URI “%E2%84%A6”. URI encoding libraries should take care of this automatically.

In JavaScript

JavaScript has always provided the “escape” function which takes a string and produces a URI-encoded version of that string. Don’t use it. It’s poorly-specified (differs across browsers), doesn’t escape enough characters, and doesn’t handle non-ASCII characters properly. Newer (new enough so it’s safe to use) JavaScript versions provide two more escaping functions: “encodeURI” and “encodeURIComponent“. These three functions differ only in what characters they escape, and how they tread non-ASCII characters.

All of these results are empirically verified in Mozilla Firefox 3.

escape does not escape:

* + - . / @ _

encodeURI does not escape:

! # $ & ' ( ) * + , - . / : ; = ? @ _ ~

encodeURIComponent does not escape:

! ' ( ) * - . _ ~

(Also all three do not escape ASCII alphabetic or numeric characters).

RFC 3986 defines the “unreserved” characters as alphanumeric, and . _ ~ – these are the only characters that should not be escaped. Which makes you wonder why the JavaScript functions differ so much!


As you can see, escape doesn’t encode ‘+‘, which could cause trouble if the decoder decides to treat “+” as a space (as it may do if it’s expecting an HTML form). It also doesn’t encode ‘/’ which means you can get away with escaping a path, but will cause problems if you are encoding a string with an actual ‘/’ in it (which isn’t a path delimiter)! It does encode ‘~’, which should not be encoded according to the RFC.

Finally, it epically fails dealing with non-ASCII characters. escape(“\xd3”) gives “%D3”. This seems correct, but it isn’t. Remember, these things are not actually ASCII values – they are UTF-8 byte values. The UTF-8 value of U+00D3 is not “\xd3”, but “\xc3\x93”. So the correct URI-encoding for “\xd3” is “%C3%93”. For character codes above ‘\xff’, it fails even harder. escape(“\u2126”) gives “%u2126”. Now there is nowhere in the URI syntax which says to do that! Once again, it should be UTF-8-encoded first, so the correct URI-encoding for “\u2126” is “%E2%84%A6”.

So escape is crap. Don’t use it.

Fortunately, encodeURI and encodeURIComponent both behave correctly with respect to non-ASCII characters, so I won’t talk about that behaviour (it matches the expected output above).


As you can see, encodeURI is very relaxed about what it encodes. It deliberately does not encode any URI delimiters (such as : / ? & and =). The reason for this is that it is designed so that you shove a whole bunch of strings together into a URI-like thing, and then call encodeURI to encode as much as you can, still preserving the delimiter characters.

As far as I can tell, this makes it totally useless. It’s only useful if you’ve already constructed a malformed URI. You should never be shoving strings together like this without escaping their components first. By the time you’re ready to call encodeURI, you’ve already blurred the distinction between what is a delimiter and what is an actual character.

To continue our wiki example, you would use it like this:

qs = encodeURI("?page=" + pagename + "&action=view");

If pagename is “my rôle”, it will correctly encode the URI as “?page=my%20r%C3%B4le&action=view”. But if pagename is “black&white”, it will be just as useless as no encoding at all, because ‘&’ is not escaped.

The only use for this function is if you’re positive your strings don’t contain delimiter characters. But IMHO if you’re making that assumption, you’re asking for trouble.

Basically, (IMHO) you should never use encodeURI, because you should never construct the sort of string it’s expecting.


Of course, the proper solution is to use encodeURIComponent, which escapes just about everything. The important trick is to use it on all the components, before concatenating them together. So to fix our example, you would do this:

qs = "?page=" + encodeURIComponent(pagename) + "&action=view";

Now this will correctly encode any string you throw at it. A ‘&’ character in pagename will correctly become “%26”, while the ‘&’ in “&action=view” will remain a ‘&’ delimiter.

Note that if you’re encoding a path with ‘/’ characters in it, and you want to keep them unescaped, you need to split on ‘/’, encode all the path segments, and then recombine! This sucks, and I suspect it’s a motivation for using encodeURI. But don’t be tempted! Just write your own function to do it. (It would be nice if there was an encodeURIPath which is the same as encodeURIComponent but doesn’t escape ‘/’ characters).

I’m still confused as to why encodeURIComponent doesn’t escape ! ( ) or * (all of these are reserved characters). However, it doesn’t seem to do much harm, so I won’t complain.


JavaScript provides decoding functions for each encoding function.

escape <=> unescape

encodeURI <=> decodeURI

encodeURIComponent <=> decodeURIComponent

Each of the decoding functions comes with the same pitfalls as their encoding counterparts. For instance, decodeURI is just as useless as encodeURI, because if you use it on a full URI, it will give you a malformed URI. If you use it on a URI component, it won’t have decoded certain characters.

decodeURIComponent is the correct solution. It is guaranteed to decode all %xx sequences, but as with encodeURIComponent, you have to break up the URI into components first, or the output will be meaningless.

In Python

On the server side, you’ll be using some other language, and you’ll have the same problem. Since I primarily use Python, I’ll mention it here. Every language has its own library, and every library has slightly different rules. Read the documentation carefully before blindly sending your strings off to battle!

In Python, all of this is handled with the urllib module. This provides two quoting functions, quote and quote_plus.

By default, quote escapes all characters except alphanumeric and ‘_‘ ‘‘, ‘.‘ and ‘/‘. So it’s basically encodeURIComponent from JavaScript, except it also doesn’t escape ‘/‘ – this means it can be used to encode paths. Also note that it does escape ‘~‘, which it should not.

The good thing though, is that it lets you override what it doesn’t escape (except alphanum, _, and .). So quote(…, safe=”~”) gives you a version that does escape ‘/‘, but doesn’t escape ‘~‘. I’d recommend you use this most of the time. Only if you are escaping a path should you use the default, and I’d still recommend you allow ‘~‘ to be unescaped: quote(…, safe=”/~”).

There is also quote_plus, which converts spaces into ‘+’ symbols instead. You may want to do this for aesthetics, but be aware that this is not considered a space in URI syntax, so you will need to fix this up on the other end (and if you’re talking to JavaScript, remember none of JavaScript’s functions see this as a space).

Unquoting is straightforward. Choose unquote_plus if your URLs use ‘+’ for spaces, and remember that HTML forms do this automatically. (ie. if you are reading a URL from an HTML form, for some reason by hand, you would use this).

Lastly, Python’s urllib doesn’t yet know how to deal with Unicode strings, so these functions do not handle non-ASCII characters properly. Recall that in JavaScript, escape(“\xd3”) gives “%D3”, and it should have given “%C3%93”. Well in Python, urllib.quote(“\xd3”) also gives “%D3”, but I consider this to be “okay”. The reasoning is that in JavaScript, strings are considered Unicode strings, and should be treated as such. In Python (pre 3.0), strings are considered just 8-bit byte strings, so it is valid to encode this as the byte 0xd3, not the character U+00D3.

This simply means if you need to encode a Unicode string, you should manually encode it as UTF-8 first, using the encode(“utf-8″) method.

urllib.quote(u”\xd3”.encode(“utf-8”)) gives “%C3%93” (correct).

Similarly, if you decode a URI, technically it will come to you as a UTF-8-encoded byte string.

urllib.unquote(“%C3%93”) gives “\xc3\x93”.

If you want to treat this as a Unicode string, just decode it using the decode(“utf-8”) method.

urllib.unquote(“%C3%93”).decode(“utf-8″) gives u”\xd3”.

Also note the existence of the “urlencode” function, which encodes a dictionary into a query string, using quote_plus on all the strings, largely automating this whole process. Similarly, the cgi module provides a lot of functionality for parsing these things.

Double-encoding and double-decoding

It’s a pretty common problem to accidentally double-encode a URI. That means you’ve got two places in the code where they get escaped (or perhaps you encode it, then pass it to a library function which also encodes it). You end up with a string like this:


(Note the ‘%’ in “%26” was encoded to “%25”). The only way to prevent this is careful documentation and reasoning about the properties of the strings (ie. “this string is a raw string”, “this string is a URI component”). Check to see if your library expects a raw string or a URI component, for example.

A harder issue to catch is double-decoding. This can appear harmless unless you have good test cases. Consider some code which accidentally decodes a URI twice. The URI component “black%26white” is decoded to “black&white” then decoded again to “black&white” – it looks fine.

However, there are special cases where this won’t be fine – specifically cases with % signs in them. The URI component “26%2524” is an encoding of the string “26%24”. If you accidentally decode it twice, you will get “26$”. So it is a bug if you decode something twice, even if it rarely shows through. Once again, careful documentation.


As indicated by the length of this post, URI encoding is a harsh mistress. In general, you should carefully read and test any library functions you use to do encoding or decoding, and think about all the characters that may be encoded or decoded. Think about non-ASCII / Unicode characters and how they will be treated.

If you can get away with it, find higher-level functions such as Python’s urllib.urlencode which takes a lot of the work out of it. But even if your web framework does all of this for you, it’s a good thing to know what’s going on.


RFC 3986

Wikipedia: URI scheme