How do you escape a complete URI?

In JavaScript, Python, Unicode, URI, Web development on February 12, 2012 by Matt Giuca

This question comes up quite a lot: “I have a URI. How do I percent-encode it?” In this post, I want to explore this question fully, because it is a hard question, and one that, on the face of it, is nonsense. The short answer to this question is: “You can’t. If you haven’t percent-encoded it yet, you’re too late.” But there are some specific reasons why you might want to do this, so read on.

Let’s get at the meaning of this question. I’ll use the term URI (Uniform Resource Identifier), but I’m mostly talking about URLs (a URI is just a bit more general). A URI is a complete identifier of a resource, such as:

http://example.com/admin/login?name=Helen&gender=f

A URI component is any of the individual atomic pieces of the URI, such as "example.com", "admin", "login", "name", "Helen", "gender" or "f". The components are the parts that the URI syntax has no interest in parsing; they represent plain text strings. Now the problem comes when we encounter a URI such as this:

http://example.com/admin/login?name=Helen Ødegård&gender=f

This isn't a legal URI because the value of the "name" query argument, "Helen Ødegård", is not percent-encoded -- the space (U+0020, meaning that this is the Unicode character with hexadecimal value 20), as well as the non-ASCII characters 'Ø' (U+00D8) and 'å' (U+00E5) are forbidden in any URI. So the answer to "can't we fix this?" is "yes" -- we can retroactively percent-encode the URI so it appears like this:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

A digression: note that the Unicode characters were first encoded to bytes with UTF-8: 'Ø' (U+00D8) encodes to the byte sequence hex C3 98, which then percent-encodes to %C3%98. This is not actually part of the standard: none of the URI standards (the most recent being RFC 3986) specify how a non-ASCII character is to be converted into bytes. I could also have encoded them using Latin-1: "Helen%20%D8deg%E5rd," but then I couldn't support non-European scripts. This is a mess, but it isn't the subject of this article, and the world mostly gets along fine by using UTF-8, which I'll assume we're using for the rest of this article.
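The two-stage process described above (characters to UTF-8 bytes, then bytes to "%XX") can be sketched as follows. The helper name percentEncodeChar is made up for illustration, and the sketch assumes an environment that provides TextEncoder (modern browsers and Node.js).

```javascript
// Hypothetical helper: percent-encode one character by hand.
// Step 1: turn the character into UTF-8 bytes. Step 2: write each byte as %XX.
function percentEncodeChar(ch) {
  const bytes = new TextEncoder().encode(ch); // TextEncoder always produces UTF-8
  return Array.from(
    bytes,
    (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeChar('Ø')); // %C3%98
console.log(percentEncodeChar('å')); // %C3%A5

// The built-in component encoder makes the same UTF-8 assumption.
console.log(encodeURIComponent('Helen Ødegård')); // Helen%20%C3%98deg%C3%A5rd
```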

Okay, so that's solved, but will it work in all cases? How about this URI:

http://example.com/admin/login?redirect=http://example.com/news#funny&name=Helen&gender=f

Clearly, a human looking at this can tell that the value of the "redirect" argument is "http://example.com/news#funny", which means that the "#" (U+0023) needs to be percent-encoded as "%23":

http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen&gender=f

But how did we know to encode the "#"? What if whoever typed this URI genuinely meant for there to be a query of "redirect=http://example.com/news" and a fragment of "funny&name=Helen&gender=f"? It is wrong for us to meddle with the given URI, assuming that the "#" was intended to be a literal character and not a delimiter. The answer to "can we fix it?" is "no". Fixing the above URI would only introduce bugs. The answer is "if you wanted that '#' to be interpreted literally, you should have encoded it before you stuck it in the URI."

The idea that you can:

  1. Take a bunch of URI components (as bare strings),
  2. Concatenate them together using URI delimiters (such as "?", "&" and "#"),
  3. Percent-encode the URI.

is nonsense, because once you have done step #2, you cannot possibly know (in general) which characters were part of the original URI components, and which are delimiters. Instead, error-free software must:

  1. Take a bunch of URI components (as bare strings),
  2. Percent-encode each individual URI component,
  3. Concatenate them together using URI delimiters (such as "?", "&" and "#").
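The correct algorithm might look something like the sketch below. The function buildUri and its parameter layout are invented for this illustration; the only point is the order of operations: encode every component first, and add the delimiters last.

```javascript
// Hypothetical helper: encode each component, THEN join with delimiters.
function buildUri(origin, pathSegments, params, fragment) {
  const path = pathSegments.map((s) => encodeURIComponent(s)).join('/');
  const query = Object.entries(params)
    .map(([name, value]) => encodeURIComponent(name) + '=' + encodeURIComponent(value))
    .join('&');
  let uri = origin + '/' + path;
  if (query) uri += '?' + query;
  if (fragment !== undefined) uri += '#' + encodeURIComponent(fragment);
  return uri;
}

const loginUri = buildUri('http://example.com', ['admin', 'login'], {
  redirect: 'http://example.com/news#funny',
  name: 'Helen Ødegård',
  gender: 'f',
});
console.log(loginUri);
// http://example.com/admin/login?redirect=http%3A%2F%2Fexample.com%2Fnews%23funny&name=Helen%20%C3%98deg%C3%A5rd&gender=f
```

Because the "#" inside the redirect value was encoded before the URI was assembled, there is no ambiguity left for the receiver to resolve.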

This is why I previously recommended never using JavaScript's encodeURI function, and instead to use encodeURIComponent. The encodeURI function "assumes that the URI is a complete URI" -- it is designed to perform step #3 in the bad algorithm above, which by definition, is meaningless. The encodeURI function will not encode the "#" character, because it might be a delimiter, so it would fail to interpret the above example in its intended meaning.

The encodeURIComponent function, on the other hand, is designed to be called on the individual components of the URI before they are concatenated together -- step #2 of the correct algorithm above. Calling that function on just the component "http://example.com/news#funny" would produce:

http%3A%2F%2Fexample.com%2Fnews%23funny

which is a bit of overkill (the ":" and "/" characters do not strictly need to be encoded in a query parameter), but perfectly valid -- when the data reaches the other end it will be decoded back into the original string.
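A quick way to see the difference between the two functions on this component (the variable names are only for illustration):

```javascript
const redirect = 'http://example.com/news#funny';

// encodeURI treats ':', '/' and '#' as possible delimiters and leaves them alone.
const asWholeUri = encodeURI(redirect);
console.log(asWholeUri); // http://example.com/news#funny

// encodeURIComponent treats the whole string as opaque text.
const asComponent = encodeURIComponent(redirect);
console.log(asComponent); // http%3A%2F%2Fexample.com%2Fnews%23funny

// Decoding the component at the other end restores the original string.
console.log(decodeURIComponent(asComponent) === redirect); // true
```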

So having said all of that, is there any legitimate need to break the rule and percent-encode a complete URI?

URI cleaning

Well, yes there is. (I have been bullish in the past that there isn't, such as in my answer to this question on Stack Overflow, so this post is me reconsidering that position a bit.) It happens all the time: in your browser's address bar. If you type this URL into the address bar:

http://example.com/admin/login?name=Helen Ødegård&gender=f

it is not typically an error. Most browsers will automatically "clean up" the URL and send an HTTP request to the server with the line:

GET /admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f HTTP/1.1

(Unfortunately, IANA will redirect this immediately, but if you inspect the packets or try it on a server you control, then check the logs, you will see this is true.) Most browsers don't show you that they're cleaning up the URIs -- they attempt to display them as nicely as possible (in fact, if you type in the escaped version, Firefox will automatically convert it so you can see the space and Unicode characters in the address bar).

Does this mean that we can relax and encode our URIs after composing them? No. This is an application of Postel's Law ("Be liberal in what you accept, and conservative in what you send.") The browser's address bar, being a human interface mechanism, is helpfully attempting to take an invalid URI and make it valid, with no guarantee of success. It wouldn't help on my second example ("redirect=http://example.com/news#funny"). I think this is a great idea, because it lets users type spaces and Unicode characters into the address bar, and it isn't necessarily a bad idea for other software to do it too, particularly where user interfaces are concerned. As long as the software is not relying on it.

In other words, software should not use this technique to construct URIs internally. It should only ever use this technique to attempt to "clean up" URIs that have been supplied from an external source.

So that is the point of JavaScript's encodeURI function. I don't like to call this "encoding" because that implies it is taking something unencoded and converting it into an encoded form. I prefer to call this "URI cleaning". That name is suggestive of the actual process: taking a complete URI and cleaning it up a bit.

Unfortunately (as pointed out by Tim Cuthbertson), encodeURI is not quite good for this purpose -- it encodes the '%' character, meaning it will double-escape any URI that already has percent-escaped content. More on that later.

We can formalise this process by describing a new type of object called a "super URI." A super URI is a sequence of Unicode characters with the following properties:

  1. Any character that is in any way valid in a URI is interpreted as normal for a URI,
  2. Any other character is interpreted as a URI would interpret the sequence of characters resulting from percent-encoding the UTF-8 encoding of the character (or some other character encoding scheme).

Now it becomes clear what we are doing: URI cleaning is simply the process of transforming a super URI into a normal URI. In this light, rather than saying that the string:

http://example.com/admin/login?name=Helen Ødegård&gender=f

is "some kind of malformed URI," we can say it is a super URI, which is equivalent to the normal URI:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that super URIs have nothing to do with leaving delimiter characters unencoded -- a delimiter such as "#" that is meant literally must still be percent-escaped in the super URI. They are only about not percent-encoding invalid characters. We can consider super URIs to be human-readable syntax, while proper URIs are required for data transmission. This means that we can also take a proper URI and convert it into a more human-readable super URI for display purposes (as web browsers do). That is the purpose of JavaScript's decodeURI function. Note that, again, I don't consider this to be "decoding," rather, "pretty printing." It doesn't promise not to show you percent-encoded characters. It only decodes characters whose meaning cannot change when decoded: those that are illegal in normal URIs, and the unreserved ones.

It is probably a good idea for most applications that want to "pretty print" a URI to not decode control characters (U+0000 -- U+001F and U+007F), to avoid printing garbage and newlines. Note that decodeURI does decode these characters, so it is probably unwise to use it for display purposes without some post-processing.

Update: My "super URI" concept is similar to the formally specified IRI (Internationalized Resource Identifier) -- basically, a URI that can have non-ASCII characters. However, my "super URIs" also allow other ASCII characters that are illegal in URLs.

Which characters?

Okay, so exactly which characters should be escaped for this URI cleaning operation? I thought I'd take the opportunity to break down the different sets of characters described by the URI specification. I will address two versions of the specification: RFC 2396 (published in 1998) and RFC 3986 (published in 2005). 2396 is obsoleted by 3986, but since a lot of encoding functions (including JavaScript's) were invented before 2005, it gives us a good historical explanation for their behaviour.

RFC 2396

This specification defines two sets of characters: reserved and unreserved.

  • The reserved characters are: $&+,/:;=?@
  • The unreserved characters are: ALPHA and NUM and !'()*-._~

Where ALPHA and NUM are the ASCII alphabetic and numeric characters, respectively. (They do not include non-ASCII characters.)

There is a semantic difference between reserved and unreserved characters. Reserved characters may have a syntactic meaning in the URI syntax, and so if one of them is to appear as a literal character in a URI component, it may need to be escaped. (This will depend upon context -- a literal '?' in a path component will need to be escaped, whereas a '?' in a query does not need to be escaped.) Unreserved characters do not have a syntactic meaning in the URI syntax, and never need to be escaped. A corollary to this is that escaping or unescaping an unreserved character does not change its meaning ("Z" means the same as "%5A"; "~" means the same as "%7E"), but escaping or unescaping a reserved character might change its meaning ("?" may have a different meaning to "%3F").
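As a rough illustration of that corollary, the snippet below uses the URL class found in browsers and Node.js. That class follows the WHATWG URL rules rather than RFC 2396, so treat it as an approximation of the point, not as a reference parser.

```javascript
// Unreserved characters: the escaped and bare forms decode to the same text.
console.log(decodeURIComponent('%5A')); // Z
console.log(decodeURIComponent('%7E')); // ~

// Reserved characters: a bare '?' ends the path, an escaped one does not.
const bare = new URL('http://example.com/a?b');
const escaped = new URL('http://example.com/a%3Fb');
console.log(bare.pathname, bare.search); // /a ?b
console.log(escaped.pathname);           // /a%3Fb
console.log(escaped.search === '');      // true
```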

The URI component encoding process should percent-encode all characters that are not unreserved. It is safe to escape unreserved characters as well, but not necessary and generally not preferable.

Together, these two sets comprise the valid URI characters, along with two other characters: '%', used for encoding, and '#', used to delimit the fragment (the '#' and fragment were not considered to be part of the URI). I would suggest that both '%' and '#' be treated as reserved characters. All other characters are illegal. The complete set of illegal characters, under this specification, follows:

  • The ASCII control characters (U+0000 -- U+001F and U+007F)
  • The space character
  • The characters: "<>[\]^`{|}
  • Non-ASCII characters (U+0080 -- U+10FFFD)

The URI cleaning process should percent-encode precisely this set of characters: no more and no less.

RFC 3986

The updated URI specification from 2005 makes a number of changes, both to the way characters are grouped, and to the sets themselves. The reserved and unreserved sets are now as follows:

  • The reserved characters are: !#$&'()*+,/:;=?@[]
  • The unreserved characters are: ALPHA and NUM and -._~

This version features '#' as a reserved character, because fragments are now considered part of the URI proper. There are two more important additions to the reserved set. Firstly, the characters "!'()*" have been moved from unreserved to reserved, because they are "typically unsafe to decode." This means that, while these characters are still technically legal in a URI, their encoded form may be interpreted differently to their bare form, so encoding a URI component should encode these characters. Note that this is different than banning them from URIs altogether (for example, a "javascript:" URI is allowed to contain bare parentheses, and that scheme simply chooses not to distinguish between "(" and "%28"). Secondly, the characters '[' and ']' have been moved from illegal to reserved. As of 2005, URIs are allowed to contain square brackets. This unfortunate change was made to allow IPv6 addresses in the host part of a URI. However, note that they are only allowed in the host, and not anywhere else in the URI.

The reserved characters were also split into two sets, gen-delims and sub-delims:

  • The gen-delims are: #/:?@[]
  • The sub-delims are: !$&'()*+,;=

The sub-delims are allowed to appear anywhere in a URI (although, as reserved characters, their meaning may be interpreted differently if they are unescaped). The gen-delims are the important top-level syntactic markers used to delimit the fields of the URI. The gen-delims are assigned meaning by the URI syntax, while the sub-delims are assigned meaning by the scheme. This means that, depending on the scheme, sub-delims may be considered unreserved. For example, a program that encodes a JavaScript program into a "javascript:" URI does not need to encode the sub-delims, because JavaScript will interpret them the same whether they are encoded or not (such a program would need to encode illegal characters such as space, and gen-delims such as '?', but not sub-delims). The gen-delims may also be considered unreserved in certain contexts -- for example, in the query part of a URI, the '?' is allowed to appear bare and will generally mean the same thing as "%3F". However, it is not guaranteed to compare equal: under the Percent-Encoding Normalization rule, encoded and bare versions of unreserved characters must be considered equivalent, but this is not the case for reserved characters.
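The RFC 3986 groupings can be written down as a small lookup. The function classify and its return labels are invented for this sketch; anything that falls through every group is one of the illegal characters.

```javascript
const GEN_DELIMS = '#/:?@[]';
const SUB_DELIMS = "!$&'()*+,;=";
const UNRESERVED_PUNCTUATION = '-._~';

// Hypothetical helper: classify a single character under RFC 3986.
function classify(ch) {
  if (/^[A-Za-z0-9]$/.test(ch) || UNRESERVED_PUNCTUATION.includes(ch)) return 'unreserved';
  if (GEN_DELIMS.includes(ch)) return 'gen-delim';
  if (SUB_DELIMS.includes(ch)) return 'sub-delim';
  if (ch === '%') return 'percent'; // only legal as the start of an escape
  return 'illegal';
}

console.log(classify('~')); // unreserved
console.log(classify('?')); // gen-delim
console.log(classify('(')); // sub-delim
console.log(classify('[')); // gen-delim (it was illegal under RFC 2396)
console.log(classify(' ')); // illegal
```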

Taking the square brackets out of the illegal set leaves us with the following illegal characters:

  • The ASCII control characters (U+0000 -- U+001F and U+007F)
  • The space character
  • The characters: "<>\^`{|}
  • Non-ASCII characters (U+0080 -- U+10FFFD)

A modern URI cleaning function must encode only the above characters. This means that any URI cleaning function written before 2005 (hint: encodeURI) will encode square brackets! That's bad, because it means that a URI with an IPv6 address:

http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login?name=Helen Ødegård&gender=f

would be cleaned as:

http://%5B2001:db8:85a3:8d3:1319:8a2e:370:7348%5D/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

which refers to the domain name "[2001:db8:85a3:8d3:1319:8a2e:370:7348]" (not the IPv6 address). Mozilla's reference on encodeURI contains a work-around that ensures that square brackets are not encoded. (Note that this still double-escapes '%' characters, so it isn't good for URI cleaning.)
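The bracket problem is easy to reproduce in any JavaScript console (the variable names here are just for the demonstration):

```javascript
const ipv6SuperUri =
  'http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login?name=Helen Ødegård&gender=f';

// encodeURI follows RFC 2396, where '[' and ']' were illegal, so it escapes them.
const cleanedByEncodeURI = encodeURI(ipv6SuperUri);
console.log(cleanedByEncodeURI);
// http://%5B2001:db8:85a3:8d3:1319:8a2e:370:7348%5D/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

console.log(cleanedByEncodeURI.includes('[')); // false
```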

So what exactly should I do?

If you are building a URI programmatically, you must encode each component individually before composing them.

  • Escape the following characters: space and !"#$%&'()*+,/:;<=>?@[\]^`{|} and U+0000 -- U+001F and U+007F and greater.
  • Do not escape ASCII alphanumeric characters or -._~ (although it doesn't matter if you do).
  • If you have specific knowledge about how the component will be used, you can relax the encoding of certain characters (for example, in a query, you may leave '?' bare; in a "javascript:" URI, you may leave all sub-delims bare). Bear in mind that this could impact the equivalence of URIs.
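A minimal sketch of a component encoder that follows the first two rules above. The name encodeComponentStrict is made up; it simply leans on encodeURIComponent and then escapes the five characters that function leaves bare.

```javascript
// Hypothetical helper: RFC 3986-style component encoding.
// encodeURIComponent leaves !'()* bare, so escape those afterwards.
function encodeComponentStrict(str) {
  return encodeURIComponent(str).replace(
    /[!'()*]/g,
    (ch) => '%' + ch.charCodeAt(0).toString(16).toUpperCase()
  );
}

console.log(encodeComponentStrict("it's (a) *test*!"));
// it%27s%20%28a%29%20%2Atest%2A%21
console.log(encodeComponentStrict('A-Z_a.z~0'));
// A-Z_a.z~0
```

Note that encodeURIComponent throws a URIError on strings containing unpaired surrogates, and this sketch inherits that behaviour.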

If you are parsing a URI component, you should unescape any percent-encoded sequence (this is safe, as '%' characters are not allowed to appear bare in a URI).
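On the parsing side, the order is the mirror image of construction: split on the delimiters first, and only then unescape each piece. The parseQuery function below is a hypothetical sketch of that idea; it assumes "&" and "=" as the query delimiters and does not treat "+" as a space, as HTML form encoding would.

```javascript
// Hypothetical helper: split the raw query on delimiters, then decode each part.
function parseQuery(query) {
  const pairs = [];
  for (const part of query.split('&')) {
    if (part === '') continue;
    const eq = part.indexOf('=');
    const rawName = eq === -1 ? part : part.slice(0, eq);
    const rawValue = eq === -1 ? '' : part.slice(eq + 1);
    pairs.push([decodeURIComponent(rawName), decodeURIComponent(rawValue)]);
  }
  return pairs;
}

console.log(parseQuery('redirect=http%3A%2F%2Fexample.com%2Fnews%23funny&name=Helen%20%C3%98deg%C3%A5rd'));
// [ [ 'redirect', 'http://example.com/news#funny' ], [ 'name', 'Helen Ødegård' ] ]
```

Decoding before splitting would turn an escaped "%26" back into "&" and reintroduce exactly the ambiguity that escaping was meant to remove.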

If you are "cleaning up" a URI that someone has given you:

  • Escape the following characters: space and "<>\^`{|} and U+0000 -- U+001F and U+007F and greater.
  • You may (but shouldn't) escape ASCII alphanumeric characters or -._~ (if you really want to; it will do no harm).
  • You must not escape the following characters: !#$%&'()*+,/:;=?@[]
  • For an advanced URI cleaning, you may also fix any other syntax errors in an appropriate way (for example, a '[' in the path segment may be encoded, as may a '%' in an invalid percent sequence).
  • An advanced URI cleaner may be able to escape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.
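A basic cleaner along these lines can be sketched with a single regular expression. The name cleanUri is invented here, and the sketch covers only the first three rules: it escapes the illegal characters as UTF-8 and touches nothing else, so it makes no attempt at the "advanced" repairs.

```javascript
// The illegal characters under RFC 3986: controls, space, "<>\^`{|}, DEL and non-ASCII.
// The "u" flag makes the pattern match whole code points rather than UTF-16 halves.
const ILLEGAL_IN_URI = /[\u0000-\u0020"<>\\^`{|}\u007F-\u{10FFFF}]/gu;

// Hypothetical helper: percent-encode illegal characters, leave everything else alone.
function cleanUri(superUri) {
  return superUri.replace(ILLEGAL_IN_URI, (ch) => encodeURIComponent(ch));
}

console.log(cleanUri('http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen Ødegård&gender=f'));
// http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen%20%C3%98deg%C3%A5rd&gender=f

console.log(cleanUri('http://[2001:db8::1]/a b'));
// http://[2001:db8::1]/a%20b
```

The existing "%23" is left as it is and the square brackets survive. An unpaired surrogate in the input would make encodeURIComponent throw a URIError.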

If you are "pretty printing" a URI and want to display escaped characters as bare, where possible:

  • Unescape the following characters: space and "-.<>\^_`{|}~ and ASCII alphanumeric characters and U+0080 and greater.
  • It is probably not wise to unescape U+0000 -- U+001F and U+007F, as they are control characters that could cause display problems (and there may be other Unicode characters with similar problems.)
  • You must not unescape the following characters: !#$%&'()*+,/:;=?@[]
  • An advanced URI printer may be able to unescape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.
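A pretty printer following the first three rules might be sketched like this. The name prettyPrintUri is hypothetical, and the sketch assumes the escapes are UTF-8; a run of escapes that is not valid UTF-8 is left untouched.

```javascript
// Characters that must stay escaped: controls, DEL, '%' and the reserved characters.
const KEEP_ESCAPED = /[\u0000-\u001F\u007F!#$%&'()*+,\/:;=?@\[\]]/;

// Hypothetical helper: decode escapes for display, except where decoding is unsafe.
function prettyPrintUri(uri) {
  return uri.replace(/(?:%[0-9A-Fa-f]{2})+/g, (run) => {
    let decoded;
    try {
      decoded = decodeURIComponent(run);
    } catch (e) {
      return run; // not valid UTF-8: leave this run of escapes alone
    }
    return Array.from(decoded, (ch) =>
      KEEP_ESCAPED.test(ch)
        ? '%' + ch.codePointAt(0).toString(16).toUpperCase().padStart(2, '0')
        : ch
    ).join('');
  });
}

console.log(prettyPrintUri('http://example.com/news%23funny?name=Helen%20%C3%98deg%C3%A5rd'));
// http://example.com/news%23funny?name=Helen Ødegård
```

One side effect of this approach is that escapes which stay escaped are rewritten with uppercase hex digits ("%2f" becomes "%2F"), which is an equivalent form.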

These four activities roughly correspond to JavaScript's encodeURIComponent, decodeURIComponent, encodeURI and decodeURI functions, respectively. In the next section, we look at how they differ.

Some implementations

JavaScript

As I stated earlier, never use escape. First, it is not properly specified. In Firefox and Chrome, it encodes all characters other than the ASCII alphanumeric characters and the following: *+-./@_. This makes it unsuitable for URI construction and cleaning. It encodes the unreserved character '~' (which is harmless, but unnecessary), and it leaves the reserved characters '*', '+', '/' and '@' bare, which can be problematic. Worse, it encodes Latin-1 characters with Latin-1 (instead of UTF-8) -- not technically a violation of the spec, but likely to be misinterpreted, and even worse, it encodes characters above U+00FF with the malformed syntax "%uxxxx". Avoid.
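These quirks can be seen directly in a console, assuming an engine that still ships the legacy escape global (current browsers and Node.js do):

```javascript
console.log(escape('*+-./@_')); // *+-./@_   (left bare, reserved or not)
console.log(escape('~'));       // %7E       (unreserved, but escaped anyway)
console.log(escape('\u00D8'));  // %D8       (Latin-1 byte, not UTF-8)
console.log(escape('\u0100'));  // %u0100    (not a valid percent-escape at all)

// For comparison, the UTF-8 based encoder:
console.log(encodeURIComponent('\u00D8')); // %C3%98
```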

JavaScript's "fixed" URI encoding functions behave according to RFC 2396, and assume that Unicode characters are to be encoded with UTF-8. This means that they are lacking the 2005 changes:

  • encodeURIComponent does not escape the previously-unreserved characters '!', "'", "(", ")" and "*". Mozilla's reference includes a work-around for this.
  • decodeURIComponent still works fine.
  • encodeURI erroneously escapes the previously-illegal characters '[' and ']'. Mozilla's reference includes a work-around for this.
  • decodeURI erroneously unescapes '[' and ']' (although there doesn't seem to be a practical case where this is a problem).
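The four observations above can be checked with a few one-liners (the object name checks is only for the demonstration):

```javascript
// Each entry compares built-in behaviour against what RFC 2396 rules predict.
const checks = {
  componentLeavesMarksBare: encodeURIComponent("!'()*") === "!'()*",
  decodeComponentHandlesThem: decodeURIComponent('%21%27%28%29%2A') === "!'()*",
  encodeURIEscapesBrackets: encodeURI('[]') === '%5B%5D',
  decodeURIUnescapesBrackets: decodeURI('%5B%5D') === '[]',
  decodeURIKeepsHashEscaped: decodeURI('%23') === '%23',
};
console.log(checks); // every entry is expected to be true
```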

Edit: Unfortunately, encodeURI and decodeURI have a single, critical flaw: they escape and unescape (respectively) percent signs ('%'), which means they can't be used to clean a URI. (Thanks to Tim Cuthbertson for pointing this out.) For example, assume we wanted to clean the URI:

http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen Ødegård&gender=f

This URI has the '#' escaped already, because no URI cleaner can turn a '#' into a "%23", but it doesn't have the space or Unicode characters escaped. Passing this to encodeURI produces:

http://example.com/admin/login?redirect=http://example.com/news%2523funny&name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that the "%23" has been double-escaped so it reads "%2523" -- completely wrong! We can fix this by extending Mozilla's work-around to also correct against double-escaped percent characters:

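function fixedEncodeURI(str) {
    // Restore the brackets first, then undo the double-escaping of '%'.
    // (In the other order, an already-escaped "%5B" would end up as a bare "[".)
    return encodeURI(str).replace(/%5[Bb]/g, '[').replace(/%5[Dd]/g, ']').replace(/%25/g, '%');
}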

Note that decodeURI is similarly broken. The fixed version follows:

function fixedDecodeURI(str) {
    return decodeURI(str.replace(/%25/g, '%2525').replace(/%5[Bb]/g, '%255B').replace(/%5[Dd]/g, '%255D'));
}

Edit: Fixed fixedEncodeURI and fixedDecodeURI so they work on lowercase escape codes. (Thanks to Tim Cuthbertson for pointing this out.)

Python

Python 2's urllib.quote and urllib.unquote functions perform URI component encoding and decoding on byte strings (non-Unicode).

  • urllib.quote works as I specify above, except that it does escape '~', and does not escape '/'. This can be overridden by supplying safe='~'.
  • urllib.unquote works as expected, returning a byte string.

Note that these do not work properly at all on Unicode strings -- you should first encode the string using UTF-8 before passing it to urllib.quote.

In Python 3, the quote and unquote functions have been moved into the urllib.parse module, and upgraded to work on Unicode strings (by me -- yay!). By default, these will encode and decode strings as UTF-8, but this can be changed with the encoding and errors parameters (see urllib.parse.quote and urllib.parse.unquote).

I don't know of any Python built-in functions for doing URI cleaning, but urllib.quote can easily be used for this purpose by passing safe="!#$%&'()*+,/:;=?@[]~" (the set of reserved characters, as well as '%' and '~'; note that alphanumeric characters, and '-', '.' and '_' are always safe in Python).

Mozilla Firefox

Firefox 10's URL bar performs URL cleaning, allowing the user to type in URLs with illegal characters, and automatically converting them to correct URLs. It escapes the following characters:

  • space, "'<>` and U+0000 -- U+001F and U+007F and greater. (Note that this includes the double and single quote.)
  • Note that the control characters for NUL, tab, newline and carriage return don't actually transmit.

I would say this is erroneous: on a minor note, it should not be escaping the single quote, as that is a reserved character. It also fails to escape the following illegal characters: \^{|}, sending them to the server bare.

Firefox also "prettifies" any URI, decoding most of the percent-escape sequences for the characters that it knows how to encode.

Google Chrome

Chrome 16's URL bar also performs URL cleaning. It is rather similar to Firefox, but encodes the following characters:

  • space, "<> and U+0000 -- U+001F and U+007F and greater. (Note that this includes only the double quote.)

So Chrome also fails to escape the illegal characters \^`{|} (including the backtick, which Firefox escapes correctly), but unlike Firefox, it does not erroneously escape the single quote.
