URI Encoding Done Right

In Web development on May 24, 2008 by Matt Giuca

Well I guess it’s about time to actually make some content-filled posts!

I’m going to start by talking about web development, and correctly handling URIs on the client and server side. This is something that I expect all good web frameworks should be able to deal with, but I’m not a big fan of such contrivances, and I believe whether you use one or not, you should understand how this stuff works.

With no work at all, you can make URIs that work for a bunch of strings you might try, but they’ll fail as soon as someone uses a special character. With a tiny amount of work, you can make your URIs encode and decode correctly for most things people might try. But there are still a bunch of edge cases you might not know about. This article is about the extra work you need to do (or at least the extra things you need to think about) for your URIs to encode and decode correctly for all characters including non-ASCII / Unicode ones.

Notes:

  • I use the term “URI” – Uniform Resource Identifier – to refer to what are commonly called “URLs”. It’s just a slightly more general term for the same concept.
  • Apologies for the curly quotes in this article. They’re done automatically by the blogging software; please consider them to be straight quotes when quoting characters and strings.
  • You should know a bit about Unicode – if you don’t then go read up on it! You should not be programming if you don’t know at least a bit about Unicode.

URI Encoding Rules

The issue here is that URI syntax defines a set of “reserved characters”, including the following: :, /, ?, #, &, =.

These characters are used in the URI syntax as delimiters. For example, the ‘?‘ indicates the beginning of the query string, and the ‘&‘ separates key/value pairs in the query string. When you put arbitrary strings into a URI, it becomes important to escape characters to avoid confusing them with delimiters.

For example, say you are writing a wiki, and you have a page called “black&white”, you might create a URL like this: “?page=black%26white&action=view”. Note the “%26” is a proper URL-encoded form for the ‘&’ character – it is a ‘%’ sign followed by the 2-digit hexadecimal ASCII code for ‘&’ (0x26). It is very common to see space characters (‘ ‘) encoded as “%20”, the ASCII code for ‘ ‘. Spaces may also be encoded as ‘+’ characters (though this is the behaviour of HTML forms, not part of the URI syntax).

The lesson is, you need to escape such characters, or they’ll be misconstrued as delimiters. If you wrote this wiki in JavaScript, and had code to construct part of a URI:

qs = "?page=" + pagename + "&action=view";

Then if pagename == “black&white”, your URI would be “?page=black&white&action=view”. This will be interpreted as key/value pairs: page = “black”, white (with no value) and action = “view”. So “black&white” should have been escaped as “black%26white”.
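To see the misparse concretely, here is a minimal sketch that feeds the unescaped query string to URLSearchParams. That parser is a standard API in current browsers and Node.js, but it postdates this post, so treat its use here as an illustration rather than part of the wiki code.

```javascript
// Build the query string without escaping, exactly as in the snippet above.
const pagename = "black&white";
const qs = "?page=" + pagename + "&action=view";

// URLSearchParams splits on '&' and '=' and ignores a leading '?'.
const parsedPairs = [...new URLSearchParams(qs)];

console.log(parsedPairs);
// [ [ 'page', 'black' ], [ 'white', '' ], [ 'action', 'view' ] ]
```

The page name has been silently split in two, and the server has no way of recovering the original string.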

You also need to escape other characters in order to correctly generate the URI syntax. In particular, all non-ASCII characters (i.e. Unicode characters) need to be first encoded as UTF-8 (a stream of bytes), and then those bytes need to be URI-encoded. For example, the string “Ω” (character code U+2126) is encoded in UTF-8 as the 3-byte sequence “\xe2\x84\xa6”, so in a URI it is written as “%E2%84%A6”. URI encoding libraries should take care of this automatically.
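The two steps (UTF-8 first, then percent-encoding each byte) can be sketched by hand as follows. The helper name percentEncodeAll is made up for this example, and it deliberately encodes every byte, including ones that would normally be left alone, just to make the mechanism visible.

```javascript
// Hypothetical helper: percent-encode every byte of a string's UTF-8 form.
function percentEncodeAll(str) {
  const bytes = new TextEncoder().encode(str); // step 1: UTF-8 bytes
  let out = "";
  for (const b of bytes) {
    // step 2: '%' followed by two uppercase hex digits per byte
    out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(percentEncodeAll("\u2126"));   // %E2%84%A6
console.log(encodeURIComponent("\u2126")); // %E2%84%A6 (the library does both steps)
```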

In JavaScript

JavaScript has always provided the “escape” function which takes a string and produces a URI-encoded version of that string. Don’t use it. It’s poorly-specified (differs across browsers), doesn’t escape enough characters, and doesn’t handle non-ASCII characters properly. Newer (new enough so it’s safe to use) JavaScript versions provide two more escaping functions: “encodeURI” and “encodeURIComponent”. These three functions differ only in what characters they escape, and how they treat non-ASCII characters.

All of these results are empirically verified in Mozilla Firefox 3.

escape does not escape:

* + - . / @ _

encodeURI does not escape:

! # $ & ' ( ) * + , - . / : ; = ? @ _ ~

encodeURIComponent does not escape:

! ' ( ) * - . _ ~

(Also all three do not escape ASCII alphabetic or numeric characters).

RFC 3986 defines the “unreserved” characters as alphanumeric, plus - . _ ~ – these are the only characters that should not be escaped. Which makes you wonder why the JavaScript functions differ so much!
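If you want to re-check the lists above in your own environment, a short loop over the printable ASCII range will do it. This is only a sketch; unescapedPunctuation is a name invented for the example, and the outputs shown are what the ECMAScript specification prescribes.

```javascript
// Hypothetical helper: list the ASCII punctuation an encoder leaves untouched.
function unescapedPunctuation(encode) {
  let kept = "";
  for (let code = 0x20; code < 0x7f; code++) {
    const ch = String.fromCharCode(code);
    if (/[A-Za-z0-9]/.test(ch)) continue; // alphanumerics are never escaped
    if (encode(ch) === ch) kept += ch;
  }
  return kept;
}

console.log(unescapedPunctuation(escape));             // *+-./@_
console.log(unescapedPunctuation(encodeURI));          // !#$&'()*+,-./:;=?@_~
console.log(unescapedPunctuation(encodeURIComponent)); // !'()*-._~
```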

escape

As you can see, escape doesn’t encode ‘+’, which could cause trouble if the decoder decides to treat ‘+’ as a space (as it may do if it’s expecting an HTML form). It also doesn’t encode ‘/’, which means you can get away with escaping a path, but it will cause problems if you are encoding a string containing an actual ‘/’ that isn’t a path delimiter! It does encode ‘~’, which should not be encoded according to the RFC.

Finally, it epically fails dealing with non-ASCII characters. escape(“\xd3”) gives “%D3”. This seems correct, but it isn’t. Remember, these things are not actually ASCII values – they are UTF-8 byte values. The UTF-8 value of U+00D3 is not “\xd3”, but “\xc3\x93”. So the correct URI-encoding for “\xd3” is “%C3%93”. For character codes above ‘\xff’, it fails even harder. escape(“\u2126”) gives “%u2126”. Now there is nowhere in the URI syntax which says to do that! Once again, it should be UTF-8-encoded first, so the correct URI-encoding for “\u2126” is “%E2%84%A6”.
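The difference is easy to reproduce side by side. The object below just collects the four results discussed above; the property names are arbitrary.

```javascript
// Compare the legacy escape() with encodeURIComponent() on non-ASCII input.
const results = {
  escapeLatin1: escape("\u00d3"),                // "%D3"    (raw code point, not UTF-8)
  componentLatin1: encodeURIComponent("\u00d3"), // "%C3%93" (UTF-8 bytes, correct)
  escapeOhm: escape("\u2126"),                   // "%u2126" (not valid URI syntax)
  componentOhm: encodeURIComponent("\u2126"),    // "%E2%84%A6" (correct)
};

console.log(results);
```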

So escape is crap. Don’t use it.

Fortunately, encodeURI and encodeURIComponent both behave correctly with respect to non-ASCII characters, so I won’t talk about that behaviour (it matches the expected output above).

encodeURI

As you can see, encodeURI is very relaxed about what it encodes. It deliberately does not encode any URI delimiters (such as : / ? & and =). The reason for this is that it is designed so that you shove a whole bunch of strings together into a URI-like thing, and then call encodeURI to encode as much as you can, still preserving the delimiter characters.

As far as I can tell, this makes it totally useless. It’s only useful if you’ve already constructed a malformed URI. You should never be shoving strings together like this without escaping their components first. By the time you’re ready to call encodeURI, you’ve already blurred the distinction between what is a delimiter and what is an actual character.

To continue our wiki example, you would use it like this:

qs = encodeURI("?page=" + pagename + "&action=view");

If pagename is “my rôle”, it will correctly encode the URI as “?page=my%20r%C3%B4le&action=view”. But if pagename is “black&white”, it will be just as useless as no encoding at all, because ‘&’ is not escaped.
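Both cases can be checked directly. The wrapper function buildWithEncodeURI is a name made up for this sketch; it simply wraps the line of code above.

```javascript
// Hypothetical wrapper around the encodeURI approach shown above.
function buildWithEncodeURI(pagename) {
  return encodeURI("?page=" + pagename + "&action=view");
}

console.log(buildWithEncodeURI("my r\u00f4le"));
// ?page=my%20r%C3%B4le&action=view   (fine)

console.log(buildWithEncodeURI("black&white"));
// ?page=black&white&action=view      (the '&' in the name is left as a delimiter)
```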

The only use for this function is if you’re positive your strings don’t contain delimiter characters. But IMHO if you’re making that assumption, you’re asking for trouble.

Basically, (IMHO) you should never use encodeURI, because you should never construct the sort of string it’s expecting.

encodeURIComponent

Of course, the proper solution is to use encodeURIComponent, which escapes just about everything. The important trick is to use it on all the components, before concatenating them together. So to fix our example, you would do this:

qs = "?page=" + encodeURIComponent(pagename) + "&action=view";

Now this will correctly encode any string you throw at it. A ‘&’ character in pagename will correctly become “%26”, while the ‘&’ in “&action=view” will remain a ‘&’ delimiter.
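The same idea generalises to any number of key/value pairs: encode each key and each value on its own, and only then add the delimiters. A minimal sketch, where buildQuery is a hypothetical helper and not a built-in:

```javascript
// Hypothetical helper: build a query string from an object of raw strings.
function buildQuery(params) {
  const parts = Object.entries(params).map(
    // Encode each component separately, then join with literal delimiters.
    ([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value)
  );
  return parts.length ? "?" + parts.join("&") : "";
}

console.log(buildQuery({ page: "black&white", action: "view" }));
// ?page=black%26white&action=view
console.log(buildQuery({ page: "my r\u00f4le", action: "view" }));
// ?page=my%20r%C3%B4le&action=view
```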

Note that if you’re encoding a path with ‘/’ characters in it, and you want to keep them unescaped, you need to split on ‘/’, encode all the path segments, and then recombine! This sucks, and I suspect it’s a motivation for using encodeURI. But don’t be tempted! Just write your own function to do it. (It would be nice if there was an encodeURIPath which is the same as encodeURIComponent but doesn’t escape ‘/’ characters).
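Such a function is only a few lines. Here is one possible sketch of the wished-for encodeURIPath (the name comes from the paragraph above; no such built-in exists):

```javascript
// Hypothetical helper: encode each path segment but keep '/' as the delimiter.
function encodeURIPath(path) {
  return path
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
}

console.log(encodeURIPath("/wiki/black&white/my page"));
// /wiki/black%26white/my%20page
```

Note that this only works if you start from a path whose ‘/’ characters really are delimiters. A segment that itself contains a ‘/’ has to be encoded separately before the path is joined.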

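It is odd that encodeURIComponent doesn’t escape ! ' ( ) or * (all of these are reserved characters in RFC 3986). The likely explanation is that the JavaScript functions follow the older RFC 2396, which listed them as unreserved “mark” characters. However, it doesn’t seem to do much harm, so I won’t complain.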
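If a server you talk to does insist on strict RFC 3986 encoding, a thin wrapper can escape the remaining five characters. This is a sketch only; encodeRFC3986Component is a name chosen for the example.

```javascript
// Hypothetical helper: encodeURIComponent, plus the characters it leaves alone
// that RFC 3986 treats as reserved: ! ' ( ) *
function encodeRFC3986Component(str) {
  return encodeURIComponent(str).replace(
    /[!'()*]/g,
    (ch) => "%" + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

console.log(encodeRFC3986Component("black (& white)"));
// black%20%28%26%20white%29
console.log(encodeRFC3986Component("!'()*"));
// %21%27%28%29%2A
```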

Decoding

JavaScript provides decoding functions for each encoding function.

escape <=> unescape

encodeURI <=> decodeURI

encodeURIComponent <=> decodeURIComponent

Each of the decoding functions comes with the same pitfalls as their encoding counterparts. For instance, decodeURI is just as useless as encodeURI, because if you use it on a full URI, it will give you a malformed URI. If you use it on a URI component, it won’t have decoded certain characters.

decodeURIComponent is the correct solution. It is guaranteed to decode all %xx sequences, but as with encodeURIComponent, you have to break up the URI into components first, or the output will be meaningless.
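In code, “break it up first” means splitting on the delimiters while the string is still encoded, and decoding each piece afterwards. The sketch below assumes a simple query string with no ‘+’-for-space convention; parseQuery is a hypothetical helper.

```javascript
// Hypothetical helper: split on delimiters first, decode each component second.
function parseQuery(qs) {
  const pairs = [];
  const body = qs.startsWith("?") ? qs.slice(1) : qs;
  if (body === "") return pairs;
  for (const part of body.split("&")) {
    const eq = part.indexOf("=");
    const rawKey = eq === -1 ? part : part.slice(0, eq);
    const rawValue = eq === -1 ? "" : part.slice(eq + 1);
    pairs.push([decodeURIComponent(rawKey), decodeURIComponent(rawValue)]);
  }
  return pairs;
}

console.log(parseQuery("?page=black%26white&action=view"));
// [ [ 'page', 'black&white' ], [ 'action', 'view' ] ]

// decodeURI leaves escaped delimiters alone, so it cannot decode a component:
console.log(decodeURI("black%26white")); // black%26white
```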

In Python

On the server side, you’ll be using some other language, and you’ll have the same problem. Since I primarily use Python, I’ll mention it here. Every language has its own library, and every library has slightly different rules. Read the documentation carefully before blindly sending your strings off to battle!

In Python, all of this is handled with the urllib module. This provides two quoting functions, quote and quote_plus.

By default, quote escapes all characters except alphanumeric and ‘_’, ‘-’, ‘.’ and ‘/’. So it’s basically encodeURIComponent from JavaScript, except it also doesn’t escape ‘/’ – this means it can be used to encode paths. Also note that it does escape ‘~’, which it should not.

The good thing though, is that it lets you override what it doesn’t escape (except alphanum, _, - and .). So quote(…, safe="~") gives you a version that does escape ‘/’, but doesn’t escape ‘~’. I’d recommend you use this most of the time. Only if you are escaping a path should you use the default, and I’d still recommend you allow ‘~’ to be unescaped: quote(…, safe="/~").

There is also quote_plus, which converts spaces into ‘+’ symbols instead. You may want to do this for aesthetics, but be aware that this is not considered a space in URI syntax, so you will need to fix this up on the other end (and if you’re talking to JavaScript, remember none of JavaScript’s functions see this as a space).
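On the JavaScript end, the fix-up is a one-liner: turn each ‘+’ back into “%20” before decoding. A literal plus sign in the original data arrives as “%2B”, so it is unaffected. The helper name decodeFormComponent is invented for this sketch.

```javascript
// Hypothetical helper: decode a component that uses '+' for spaces
// (as produced by quote_plus or an HTML form).
function decodeFormComponent(component) {
  return decodeURIComponent(component.replace(/\+/g, "%20"));
}

console.log(decodeURIComponent("my+page"));  // my+page  (the '+' is not a space here)
console.log(decodeFormComponent("my+page")); // my page
console.log(decodeFormComponent("1%2B1"));   // 1+1      (a real plus sign survives)
```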

Unquoting is straightforward. Choose unquote_plus if your URLs use ‘+’ for spaces, and remember that HTML forms do this automatically. (ie. if you are reading a URL from an HTML form, for some reason by hand, you would use this).

Lastly, Python’s urllib doesn’t yet know how to deal with Unicode strings, so these functions do not handle non-ASCII characters properly. Recall that in JavaScript, escape(“\xd3”) gives “%D3”, and it should have given “%C3%93”. Well in Python, urllib.quote(“\xd3”) also gives “%D3”, but I consider this to be “okay”. The reasoning is that in JavaScript, strings are considered Unicode strings, and should be treated as such. In Python (pre 3.0), strings are considered just 8-bit byte strings, so it is valid to encode this as the byte 0xd3, not the character U+00D3.

This simply means if you need to encode a Unicode string, you should manually encode it as UTF-8 first, using the encode("utf-8") method.

urllib.quote(u"\xd3".encode("utf-8")) gives "%C3%93" (correct).

Similarly, if you decode a URI, technically it will come to you as a UTF-8-encoded byte string.

urllib.unquote("%C3%93") gives "\xc3\x93".

If you want to treat this as a Unicode string, just decode it using the decode("utf-8") method.

urllib.unquote("%C3%93").decode("utf-8") gives u"\xd3".
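For comparison, the same two-step view (percent-decode to bytes, then interpret the bytes as UTF-8) can be spelled out in JavaScript, where decodeURIComponent normally hides it. This is a sketch under the assumption that the input contains only ASCII characters and %xx escapes; percentDecodeToBytes is a made-up name.

```javascript
// Hypothetical helper: turn %xx escapes into raw byte values.
// Assumes every non-escaped character is plain ASCII.
function percentDecodeToBytes(component) {
  const bytes = [];
  for (let i = 0; i < component.length; i++) {
    const m = /^%([0-9A-Fa-f]{2})/.exec(component.slice(i, i + 3));
    if (m) {
      bytes.push(parseInt(m[1], 16));
      i += 2; // skip the two hex digits
    } else {
      bytes.push(component.charCodeAt(i));
    }
  }
  return Uint8Array.from(bytes);
}

const bytes = percentDecodeToBytes("%C3%93");        // bytes 0xc3, 0x93
const text = new TextDecoder("utf-8").decode(bytes); // "\u00d3"

console.log(Array.from(bytes), text === decodeURIComponent("%C3%93")); // [ 195, 147 ] true
```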

Also note the existence of the “urlencode” function, which encodes a dictionary into a query string, using quote_plus on all the strings, largely automating this whole process. Similarly, the cgi module provides a lot of functionality for parsing these things.
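JavaScript environments have since gained a comparable higher-level tool, URLSearchParams, which (like urlencode) uses the HTML form rules and therefore writes spaces as ‘+’. It was not available when this post was written, so the sketch below is an aside rather than part of the original advice.

```javascript
// Serialise key/value pairs using the HTML form rules (spaces become '+').
const query = new URLSearchParams({ page: "black & white", action: "view" }).toString();
console.log(query); // page=black+%26+white&action=view

// Parsing with the same class reverses both the '+' and the %xx escapes.
const roundTrip = new URLSearchParams(query).get("page");
console.log(roundTrip); // black & white
```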

Double-encoding and double-decoding

It’s a pretty common problem to accidentally double-encode a URI. That means you’ve got two places in the code where they get escaped (or perhaps you encode it, then pass it to a library function which also encodes it). You end up with a string like this:

“black%2526white”

(Note the ‘%’ in “%26” was encoded to “%25”). The only way to prevent this is careful documentation and reasoning about the properties of the strings (ie. “this string is a raw string”, “this string is a URI component”). Check to see if your library expects a raw string or a URI component, for example.
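A short sketch of how the double-encoded string comes about, using variable names that record which kind of string each one holds:

```javascript
const rawName = "black&white";                         // raw string
const encodedOnce = encodeURIComponent(rawName);       // URI component: black%26white
const encodedTwice = encodeURIComponent(encodedOnce);  // bug: black%2526white

console.log(encodedOnce, encodedTwice);

// A single decode on the other end now yields the encoded form, not the raw name.
console.log(decodeURIComponent(encodedTwice)); // black%26white
```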

A harder issue to catch is double-decoding. This can appear harmless unless you have good test cases. Consider some code which accidentally decodes a URI twice. The URI component “black%26white” is decoded to “black&white” then decoded again to “black&white” – it looks fine.

However, there are special cases where this won’t be fine – specifically cases with % signs in them. The URI component “26%2524” is an encoding of the string “26%24”. If you accidentally decode it twice, you will get “26$”. So it is a bug if you decode something twice, even if it rarely shows through. Once again, careful documentation.
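The following sketch runs both cases. It also shows a third possibility in JavaScript: a double decode can throw, because a decoded ‘%’ with no hex digits after it is not a valid escape.

```javascript
// Looks harmless: the second decode changes nothing.
const harmless = decodeURIComponent(decodeURIComponent("black%26white")); // "black&white"

// Silently wrong: the raw string was "26%24", not "26$".
const decodedOnce = decodeURIComponent("26%2524"); // "26%24"
const decodedTwice = decodeURIComponent(decodedOnce); // "26$"

// Outright failure: "100%25" decodes to "100%", which is not decodable again.
let threwURIError = false;
try {
  decodeURIComponent(decodeURIComponent("100%25"));
} catch (e) {
  threwURIError = e instanceof URIError;
}

console.log(harmless, decodedOnce, decodedTwice, threwURIError);
```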

Summary

As indicated by the length of this post, URI encoding is a harsh mistress. In general, you should carefully read and test any library functions you use to do encoding or decoding, and think about all the characters that may be encoded or decoded. Think about non-ASCII / Unicode characters and how they will be treated.

If you can get away with it, find higher-level functions such as Python’s urllib.urlencode which takes a lot of the work out of it. But even if your web framework does all of this for you, it’s a good thing to know what’s going on.

References

RFC 3986

Wikipedia: URI scheme

17 Responses to “URI Encoding Done Right”

  1. […] you may recall from my “URI Encoding Done Right” post, I said that non-ASCII characters in a URI are supposed to be first encoded using UTF-8 […]

  2. Thanx for your post, it was really difficult finding a good explanation for the javascript URI encoding for Python.
    I’ve just had a little problem:
    while trying to get unicode('á') – aacute
    it gave me an error:

    Traceback (most recent call last):
    File "", line 1, in
    unicode('á')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 0: ordinal not in range(128)

    so I had to do a dirty trick like
    exec("tmp=u'" + text + "'")
    where text='á'
    then tmp resulted u'\xe1' which was what I was expecting
    Is there a “cleaner” way to get unicode from characters like that one?

    • so I had to do a dirty trick like
      exec("tmp=u'" + text + "'")
      where text='á'
      then tmp resulted u'\xe1' which was what I was expecting

      Eww! That looks really nasty. It didn’t work that way for me. I got tmp = u'\xc3\xa1', which is wrong.

      I’m not entirely sure what you’re trying to do.

      If you want the character ‘á’ as a Unicode character, just write u'á'.

      Note that 'á' by itself (in Python 2, on a UTF-8 terminal) is just auto-encoded to the UTF-8 bytes for that character.
      >>> 'á'
      '\xc3\xa1'

      So you could call '\xc3\xa1'.decode('utf-8') to get the result of u'\xe1'.

      Calling unicode('á') is the same as 'á'.decode('ascii'), which is asking “what are the ASCII characters for the bytes '\xc3' and '\xa1'?” – both of which are not valid ASCII characters because they are above 127.

      So basically, u'á' == u'\xe1' == '\xc3\xa1'.decode('utf-8')

      Any of the above will get you the same value.

      Thanks for commenting.

  3. I didn’t mean to put the cool guy icon there :), the error was … : ordinal not in range(128 and then )

  4. Excellent site, keep up the good work

  5. Thanks, great tip!

  6. Thanks for this post, I was not finding much that was useful in regards to the encodeURI function. It is irritating that it doesn’t encode the “&” symbol.

    • But, as I said in the blogpost, you should not be using encodeURI. It’s designed to take an entire URI which has not been escaped, and escape it. That’s fundamentally flawed — the whole point of escaping URIs is to distinguish between URI delimiters and actual component characters — once you’ve put unescaped data into a URI then that is an impossible distinction.

      encodeURI can’t encode the “&” symbol, since it might be used as a delimiter.

      You should only ever use encodeURIComponent, and do it before you combine the components into a full URI.

  7. Nice post, this topic is definitely pretty messy, and you broke it down really well! I know you explicitly excluded frameworks, but whenever I find myself manually constructing a query string in a django project I like to use their implementation of `urlencode`, which handles the character encoding issue for you: `from django.utils.http import urlencode`.

    Also, here’s a little post on the related subject of HTML escaping ampersands: http://mrcoles.com/blog/how-use-amersands-html-encode/

  8. Thanks for writing this. Exactly the clarity on the matter I was looking for. Much appreciated!

  9. […] Also see my full post about URI encoding, URI Encoding Done Right. […]

  10. The last 5 days have consisted primarily of trial, error, and chucking keyboards at the wall. Within 5 minutes after reading this, all of my issues were gone! Thank you so much.

  11. Thank you, I’ve just been searching for info approximately this topic for ages and yours is the greatest I’ve found out so far.
    However, what about the bottom line? Are you sure concerning
    the source?

    • What bottom line? I’m unclear what your question is. If you’re concerned that I cited Wikipedia, that’s just for easy reading. I have read the RFC in detail when researching this post.

  12. EncodeURIComponent
    —————————
    Where you are saying “However, it doesn’t seem to do much harm, so I won’t complain.”. However I am facing issue to the character ‘(‘ & ‘)’. Can you please guide if you know the solution to encode these characters. I know its rare situation but i faced this situation now.

    • Hi seraj,

      Sure, if you really need to encode a particular character, you can do it yourself using the string replace method. Note that replace with a plain string pattern only changes the first occurrence, so use a regular expression with the g flag. For example, to encode every ‘(’ character in string s, use s.replace(/\(/g, '%28').

      To encode all characters JavaScript knows to encode, as well as parentheses, do this:
      encodeURIComponent(s).replace(/\(/g, '%28').replace(/\)/g, '%29')

      For example:
      > encodeURIComponent('black (& white)').replace(/\(/g, '%28').replace(/\)/g, '%29')
      'black%20%28%26%20white%29'

      Hope this helps.
