
Do browsers encode URLs correctly?

In Web development on June 29, 2008 by Matt Giuca

This post is an open question which I have only just discovered; I haven’t fully tested or researched it yet.

As you may recall from my “URI Encoding Done Right” post, I said that non-ASCII characters in a URI are supposed to be first encoded using UTF-8 into a byte stream, then percent-encoded. I got this information from RFC 3986 – URI Generic Syntax Section 2.5, which states:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as “A”, the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as “%C3%80”, and the character KATAKANA LETTER A would be represented as “%E3%82%A2”.

This document is the current official reference on URI syntax, and it seems to pretty clearly state that UTF-8 is the way to go. (Perhaps though, I am confused about what “a new URI scheme” means).
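To make the rule concrete, here is a minimal sketch using Python’s standard urllib (purely illustrative; its quote function happens to apply the same encode-to-UTF-8-then-percent-encode rule by default):

    from urllib.parse import quote

    # quote() percent-encodes the UTF-8 bytes of its input by default,
    # which matches the recommendation quoted above.
    print(quote("A"))    # "A"         (unreserved character, left alone)
    print(quote("À"))    # "%C3%80"    (U+00C0 -> UTF-8 bytes C3 80)
    print(quote("ア"))   # "%E3%82%A2" (U+30A2 -> UTF-8 bytes E3 82 A2)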

Well, I noticed that Mozilla Firefox (3.0) does not always encode URIs using UTF-8. When you submit a form, the contents of the input boxes are encoded in the current page’s encoding. For instance, say you serve a page with the encoding “iso-8859-1” which has a text box. If the user types “À” (LATIN CAPITAL LETTER A WITH GRAVE) into the box and clicks submit, the data should be encoded as “%C3%80”, as given in the example above (and it would be, were the content type “utf-8”). However, in Firefox at least, it is encoded as “%C0”, which is the ISO-8859-1 code for “À”.
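A rough way to reproduce that result (a sketch of the observed output, not Firefox’s actual code) is to percent-encode the ISO-8859-1 bytes of the character instead of its UTF-8 bytes:

    from urllib.parse import quote

    # Percent-encoding operates on bytes, so the result depends entirely
    # on which character encoding produced those bytes.
    print(quote("À", encoding="iso-8859-1"))  # "%C0", not the UTF-8 "%C3%80" shown above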

Even more bizarrely, if you enter a character outside the page character set’s repertoire, Firefox first escapes it as an XML character reference, then percent-encodes that! So if you type “ア” (KATAKANA LETTER A, a character not found in ISO-8859-1) and click submit, it should be encoded as “%E3%82%A2”, but is in fact encoded as “%26%2312450%3B”, which decodes to “&#12450;” (12450 being the decimal value of KATAKANA LETTER A’s Unicode code point; this will render correctly if displayed as HTML).
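As far as I can tell, the fallback amounts to something like this sketch (my reconstruction of the observed behaviour, not Firefox source): the character is replaced by a decimal numeric character reference, and that ASCII string is then percent-encoded.

    from urllib.parse import quote

    ch = "ア"                   # not representable in ISO-8859-1
    ref = "&#%d;" % ord(ch)     # "&#12450;" (decimal numeric character reference)
    print(quote(ref, safe=""))  # "%26%2312450%3B" -- what Firefox actually submits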

This first behaviour (encoding as whatever the current character set is) seems logical. If the server is dealing entirely in that character set, then it will work. But if I am interpreting RFC 3986 correctly, then it’s problematic, because a cooperating server will always decode the URI as UTF-8, in which case Firefox will be producing invalid input for pages not served as UTF-8.
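To see why that input is invalid, try decoding what Firefox sends as UTF-8 (a quick sketch with Python’s urllib): the byte 0xC0 can never begin a valid UTF-8 sequence.

    from urllib.parse import unquote_to_bytes

    raw = unquote_to_bytes("%C0")       # b'\xc0', the ISO-8859-1 byte Firefox submitted
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)  # a cooperating UTF-8 server cannot decode it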

The second behaviour (escaping as an XML character reference) seems completely wrong. If the server is going to echo the text back as raw HTML, then it will display correctly, but surely most applications are going to do some processing of the input, and they should be HTML-escaping it anyway, which makes this a pretty misguided idea.

So I’d like to know whether anyone can offer a justification for this behaviour. I’ll also investigate some other browsers’ behaviour. There’s a discussion on the subject here: The Secret of Character Encoding. This page states:

There is no official way of determining the character encoding of such a request, since the percent encoding operates on a byte level, so it is usually assumed that it is the same as the encoding the page containing the form was submitted in.

It seems like this page’s author has not read RFC 3986! [Ed: The author has since contacted me, and we agree, even after 3986, there is no mandate to use UTF-8. It is just a suggestion for new protocols.]

It should be noted that prior to 2005, when 3986 was written, the syntax was governed by RFC 2396 (now obsolete), which is not specific about encodings.

I think the bottom line is, the browser is “wrong” because if you look at the HTTP request it is sending, it doesn’t actually state the charset at all (if it’s a GET request then it has no Content-Type header). Hence the character set is implicit in the previous page response. That doesn’t make sense because the request could be sent to a completely different site. It’s just the browser assuming that since the server is sending pages in a particular charset, it would like to receive URLs in that same charset, and I think that’s an invalid assumption.

[Ed: I've since done a follow-up post: Browser URI encoding: The best we can do, where I make a few more corrections to this post.]


10 Responses to “Do browsers encode URLs correctly?”

  1. Firefox also does weird things if you type non-ASCII characters into the address bar. If there are only extended (and supplement?) Latin characters it uses ISO-8859-1 (which is %-escaped).

    If you use other characters as well (e.g. Hiragana) it uses UTF-8.

    Opera and IE (even IE6) will always use UTF-8. I don’t really know how one is supposed to handle FF’s semi-random behavior.

  2. Hm. That’s odd, because for me (currently using Firefox 2), all the characters I type directly into the address bar are UTF-8 encoded. Only form submission is a problem.

    Also, I didn’t get around to mentioning this in the post, but of course as the web app author, the way to deal with this is simply to serve all your pages with charset=utf-8.

  3. Just tried it with FF3. It does indeed work there. To tell the truth I simply didn’t expect that they would fix it with this release. After all the bug report was already a couple of years old.

    My best guess is that they used the OS’s default character encoding with FF2 (with some obscure UTF-8 fallback). It really didn’t work with FF2. I clearly remember that öäü didn’t work, but あいう was fine, for example. After all, it was an unavoidable hindrance for one of the articles I wanted to write… that and FF2’s outrageously bad font selection in SVGs (which was also fixed with FF3).

    Too bad that they introduced a bunch of new SVG rendering issues with FF3. Boo. Hiss.

  4. But I tried it with FF2 yesterday and it worked. Maybe you were thinking of FF1?

    (By the way I assume we are talking about directly typing characters into the address bar, as opposed to the form submission thing – which appears to still be an issue).

  5. >But I tried it with FF2 yesterday and it worked.
    >Maybe you were thinking of FF1?

    No. FF 2.0.0.14 on Windows XP/2K. I’d guess that the OS’ default encoding might be UTF-8 on Linux or Mac.

  6. [...] This is a follow-up post to last week’s post Do browsers encode URLs correctly?, in which I asked whether Firefox encodes form data correctly when submitted with the [...]

  7. Thanks for posting

  8. Hey, I read a lot of blogs on a daily basis, and for the most part people lack substance, but I just wanted to make a quick comment to say GREAT blog! I’ll be checking in regularly now. Keep up the good work! :)

    I’m Out! :)

  9. For URL encoding, to convert a URL to a safe, encoded format, see:

    encode url

    • The site you linked to has a serious error. In fact, it is precisely the error that my article (and my follow-up article http://unspecified.wordpress.com/2008/07/08/browser-uri-encoding-the-best-we-can-do/) set out to correct.

      The table at the end of that article shows each “ASCII” character and its corresponding URI encoding. Firstly, those are not “ASCII” characters. They are the Windows-1252 code page (http://en.wikipedia.org/wiki/Windows-1252). (The characters up to %7F are ASCII; everything after that is Windows-1252 only.) Secondly, you assume that the URI will be encoded with Windows-1252 — not a good assumption since this code page only makes sense on Microsoft Windows computers. Most browsers use UTF-8 to encode URIs, so it is more likely, for example, that the “€” will encode as %E2%82%AC, rather than %80.

      To resolve this correctly, it is more complex than saying “character x maps to URI y” because the URI syntax doesn’t specify how to encode characters to percent-encoded values. It specifies how to encode bytes to percent-encoded values. ASCII characters should always encode to their corresponding ASCII values. Non-ASCII values are first converted with an unspecified encoding to a byte sequence (could be Latin-1, could be Windows-1252, most likely will be UTF-8) and *then* the bytes will be percent-encoded.
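      To make that concrete, here is a quick sketch with Python’s urllib (purely for illustration): the percent-encoded form of “€” depends entirely on which byte encoding is applied first.

          from urllib.parse import quote

          # character -> bytes (via some encoding) -> percent-escapes,
          # so "€" has no single percent-encoded form
          print(quote("€", encoding="utf-8"))         # "%E2%82%AC" (what most browsers send)
          print(quote("€", encoding="windows-1252"))  # "%80" (only if Windows-1252 is assumed)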
