This post is an open question, which I have just discovered. I haven’t fully tested it or researched it.
As you may recall from my “URI Encoding Done Right” post, I said that non-ASCII characters in a URI are supposed to be first encoded using UTF-8 into a byte stream, then percent-encoded. I got this information from RFC 3986 – URI Generic Syntax Section 2.5, which states:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as “A”, the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as “%C3%80”, and the character KATAKANA LETTER A would be represented as “%E3%82%A2”.
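A quick way to check the RFC’s examples is with Python’s `urllib.parse.quote`, which percent-encodes via UTF-8 by default (a sketch to verify the examples, not part of the original post):

```python
from urllib.parse import quote

# quote() first encodes the string to UTF-8 bytes, then percent-encodes
# every octet outside the unreserved set, as RFC 3986 Section 2.5 suggests.
print(quote("A"))   # unreserved character, left as-is: A
print(quote("À"))   # LATIN CAPITAL LETTER A WITH GRAVE: %C3%80
print(quote("ア"))  # KATAKANA LETTER A: %E3%82%A2
```

All three outputs match the RFC’s worked examples exactly.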
This document is the current official reference on URI syntax, and it seems to state pretty clearly that UTF-8 is the way to go. (Perhaps, though, I am confused about what “a new URI scheme” means.)
Well, I noticed that Mozilla Firefox (3.0) does not always encode URIs using UTF-8. When you submit a form, the contents of the input boxes are encoded in the current page’s encoding. For instance, say you serve a page with the encoding “iso-8859-1” which has a text box. If the user types “À” (LATIN CAPITAL LETTER A WITH GRAVE) into the box and clicks submit, the data should be encoded as “%C3%80”, as given in the example (and it would be, were the content type “utf-8”). However, in Firefox at least, it is encoded as “%C0”, which is the ISO-8859-1 code for “À”.
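The same helper can reproduce the legacy behaviour, since `quote` accepts an explicit `encoding` parameter (the scenario is mine; the parameter is real):

```python
from urllib.parse import quote

# What a cooperating (RFC 3986) encoder produces for "À":
print(quote("À"))                         # %C3%80 (UTF-8 octets)

# What Firefox produces when the page was served as ISO-8859-1:
print(quote("À", encoding="iso-8859-1"))  # %C0 (a single legacy octet)
```

Same character, two incompatible wire forms, depending only on the charset of the page that contained the form.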
Even more bizarre: if you give a character outside the character set’s repertoire, Firefox first escapes it as an XML character reference, then percent-encodes that! So if you type “ア” (KATAKANA LETTER A, a character not found in ISO-8859-1) and click submit, it should be encoded as “%E3%82%A2”, but is in fact encoded as “%26%2312450%3B” – which decodes to “& #12450;” (12450 being the decimal value of KATAKANA LETTER A’s Unicode code point; this will render correctly if displayed as HTML).
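The two-step fallback is easy to simulate: build the decimal XML character reference by hand, then percent-encode the reference itself (a sketch of the observed behaviour, not Firefox’s actual code):

```python
from urllib.parse import quote

char = "ア"  # KATAKANA LETTER A, not representable in ISO-8859-1

# Step 1: escape the character as a decimal XML character reference.
ref = "&#%d;" % ord(char)     # "&#12450;"

# Step 2: percent-encode the reference string itself.
# safe="" so that even '/' would be encoded; '&' -> %26, '#' -> %23,
# the digits are unreserved, ';' -> %3B.
print(quote(ref, safe=""))    # %26%2312450%3B
```

Amusingly, Python’s codec machinery codifies the same fallback: `quote("ア", safe="", encoding="iso-8859-1", errors="xmlcharrefreplace")` yields the identical `%26%2312450%3B`.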
This first behaviour (encoding as whatever the current character set is) seems logical. If the server is dealing entirely in that character set, then it will work. But if I am interpreting RFC 3986 correctly, then it’s problematic, because a cooperating server will always decode the URI as UTF-8, in which case Firefox will be producing invalid input for pages not served as UTF-8.
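The mismatch is easy to demonstrate from the server’s side: a server that follows the RFC and decodes as UTF-8 gets garbage from the ISO-8859-1-encoded request (a sketch, assuming the server uses something like Python’s `unquote`):

```python
from urllib.parse import unquote

# The bytes Firefox sends for "À" from an ISO-8859-1 page:
raw = "%C0"

# A server decoding per RFC 3986 assumes UTF-8. A lone 0xC0 octet is
# not valid UTF-8, so the default errors="replace" yields U+FFFD.
print(unquote(raw))                         # '\ufffd' (replacement char)

# Only a server that somehow knows the page charset recovers the text:
print(unquote(raw, encoding="iso-8859-1"))  # 'À'
```

The “somehow knows” is exactly the problem: nothing in the request carries that information.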
The second behaviour (escaping as an XML character reference) seems completely wrong. If the server echoes the text back as raw HTML, it will display correctly, but surely most applications are going to do some processing of the input, and they should be HTML-escaping it anyway, which makes this a thoroughly misguided design.
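Here is the double-escaping problem in miniature: a well-behaved application HTML-escapes user input before echoing it, so the character reference is turned into literal text (a sketch using the standard library):

```python
import html
from urllib.parse import unquote

# What the server receives after percent-decoding Firefox's request:
text = unquote("%26%2312450%3B")   # "&#12450;"

# A well-behaved application escapes user input before rendering it,
# which neutralises the '&' and breaks the character reference:
escaped = html.escape(text)
print(escaped)   # "&amp;#12450;" -- the browser now displays the
                 # literal text "&#12450;", not the katakana the user typed
```

So the only applications that display this input correctly are the ones with an injection vulnerability.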
So I’d like to know if anyone knows of a justification for this behaviour. I’ll also investigate some other browsers’ behaviour. There’s a discussion here on the subject: The Secret of Character Encoding. This page states:
There is no official way of determining the character encoding of such a request, since the percent encoding operates on a byte level, so it is usually assumed that it is the same as the encoding the page containing the form was submitted in.
It seems like this page’s author has not read RFC 3986! [Ed: The author has since contacted me, and we agree, even after 3986, there is no mandate to use UTF-8. It is just a suggestion for new protocols.]
It should be noted that prior to 2005, when 3986 was written, the syntax was governed by RFC 2396 (now obsolete), which is not specific about encodings.
I think the bottom line is that the browser is “wrong”: if you look at the HTTP request it is sending, it doesn’t actually state the charset at all (a GET request has no Content-Type header). Hence the character set is implicit in the previous page response. That doesn’t make sense, because the request could be sent to a completely different site. The browser is simply assuming that since the server sends pages in a particular charset, it would like to receive URLs in that same charset, and I think that’s an invalid assumption.
[Ed: I’ve since done a follow-up post: Browser URI encoding: The best we can do, where I make a few more corrections to this post.]