Browser URI encoding: The best we can do

In Web development on July 8, 2008 by Matt Giuca

This is a follow-up post to last week’s post Do browsers encode URLs correctly?, in which I asked whether Firefox encodes form data correctly when submitted with the application/x-www-form-urlencoded content type.

I’ve done a bit more research into URI encoding since then. Firstly, I think I was a bit over-zealous in stating that “UTF-8 is the only way to go”. Clearly, URI encoding is an octet encoding, not a character encoding, so it’s still perfectly valid to have a non-UTF-8 encoded URI.
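To make the octets-versus-characters distinction concrete, here is a quick sketch in Python (mine, not from any spec): percent-encoding operates on octets, so the same character produces different escapes depending on which character encoding was applied first.

```python
from urllib.parse import quote, unquote

# "é" is one character, but its octet sequence depends on the encoding:
# UTF-8 encodes it as two octets, Latin-1 as one.
print(quote("é", encoding="utf-8"))    # %C3%A9
print(quote("é", encoding="latin-1"))  # %E9

# Both are perfectly valid percent-encoded URIs; decoding correctly
# requires knowing which character encoding produced the octets.
print(unquote("%E9", encoding="latin-1"))  # é
```

This is exactly why a non-UTF-8 URI is still "valid": the URI syntax only sees `%E9`, and has no opinion about what character that octet represents.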

As for what browsers should do for form submission, I think I was reading the wrong place. I should have been reading the HTML specification – after all, that is what governs how HTML forms are submitted, not the generic URI syntax. Once again, it’s very vague, because we’re dealing with technology originally designed only for ASCII. I still cannot find anyone willing to say “you must use UTF-8” – so basically UAs are free to do whatever they want, which is incredibly annoying for everybody involved.

For the record, the two important parts of the HTML 4.01 specification are as follows. The appendix section B.2.1 (informatively) suggests that while URLs in documents should not contain non-ASCII characters, UAs encountering non-ASCII characters should encode them with UTF-8. It also warns about implementations which rely on the page encoding rather than UTF-8. However, note that this deals with URLs in links, not form submission, so it doesn’t count here. I bring it up only because it’s an official suggestion of a strategy for URI-encoding.

The HTML 4.01 spec does not, however, mention what to do with non-ASCII characters in form submissions. Well, actually, it explicitly forbids them, in section 17.13.1:

Note. The “get” method restricts form data set values to ASCII characters. Only the “post” method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.

This is … kind of … unacceptable. We need to be able to write apps which accept non-ASCII characters. So basically all the browsers are allowing “illegal” activity, for a very good reason. And that means no standards!

So as I found out in the previous post, Firefox uses the document’s charset to encode the URI. I’ve since empirically discovered that at least two other major browsers (Safari and Internet Explorer 7) do the same. I’ve also discovered the form’s “accept-charset” attribute, which gives document authors a lot more explicit control. All three browsers will respect this attribute and use it to encode URIs. If unspecified, it falls back to the document’s charset.

For example, if you specify accept-charset="utf-8" on a form element, it will be submitted with that encoding, regardless of the document encoding. I therefore strongly recommend you use this on all forms, even if your documents are already encoded in UTF-8, so that if a document is ever transcoded, the browser behaviour doesn’t change.
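You can reproduce the effect with Python’s urllib (a sketch of the behaviour, not the browsers’ actual code): the application/x-www-form-urlencoded output for the same form data differs depending on the charset in effect, which is exactly why pinning accept-charset matters.

```python
from urllib.parse import urlencode

data = {"name": "Müller"}

# What a browser submitting from a UTF-8 page (or a form with
# accept-charset="utf-8") would send:
print(urlencode(data, encoding="utf-8"))    # name=M%C3%BCller

# What the same form sends from a Latin-1 page with no accept-charset:
print(urlencode(data, encoding="latin-1"))  # name=M%FCller
```

A server expecting one and receiving the other will silently decode the wrong characters.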

The good news as far as standardization is concerned is that the upcoming HTML5 *may* explicitly sanction this behaviour. While this section of the spec is currently empty, it points you to a “Web Forms 2.0” draft, which states:

The submission character encoding is selected from the form’s accept-charset attribute. UAs must use the encoding that most completely covers the characters found in the form data set of the encodings specified. If the attribute is not specified, then the client should use either the page’s character encoding, or, if that cannot encode all the characters in the form data set, UTF-8. Character encodings that are not mostly supersets of US-ASCII must not be used (this includes UTF-16 and EBCDIC) even if specified in the accept-charset attribute.
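The selection rule quoted above can be sketched roughly like this – my own interpretation using a simple first-fit over the listed charsets with a UTF-8 fallback, not normative code (the draft actually asks for the encoding that “most completely covers” the data, which is subtler):

```python
def pick_submission_charset(accept_charsets, form_values):
    """Pick the first listed charset that can encode every character
    in the form data set; fall back to UTF-8, which covers everything."""
    text = "".join(form_values)
    for charset in accept_charsets:
        try:
            text.encode(charset)  # raises if any character is unencodable
            return charset
        except (UnicodeEncodeError, LookupError):
            continue
    return "utf-8"

print(pick_submission_charset(["iso-8859-1"], ["café"]))    # iso-8859-1
print(pick_submission_charset(["iso-8859-1"], ["日本語"]))  # utf-8
```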

Okay! I would like to see this get standardized. Note that it also states:

Authors must not specify an encoding other than UTF-8 or US-ASCII in the accept-charset attribute when the method used is get and the action indicates an HTTP resource.

So this means ISO-8859-1 (Latin-1) is out. Hence you should always use accept-charset="utf-8".

So to answer the question: “Do browsers encode URIs correctly?”, the answer is, “they do the best they can”. So I retract my initial accusations that Firefox is doing something wrong, and point my finger at the document authors – be explicit and no harm shall befall you! Also at the W3C – hurry up and get HTML5 standardised so there are some official guidelines on this matter!

