You should XML-escape your URIs

In Web development on June 2, 2011 by Matt Giuca

A quick post about a sneaky little edge case I thought of. I don’t know if this is common practice, but when I encode a URI, I usually consider it “safe for XML”. It isn’t.

In other words, once I have taken a string like “dog's bed & breakfast” and URI-encoded it to “dog%27s%20bed%20%26%20breakfast”, I would consider it safe to slap into an XML or HTML attribute value, such as <a href="dog%27s%20bed%20%26%20breakfast">…</a>. But it isn’t.
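As a rough illustration of that assumption, here is a minimal Python sketch. Python and its standard `urllib.parse.quote` are my choice for the example, not something the original workflow prescribes, and the variable names are made up for illustration.

```python
from urllib.parse import quote

raw = "dog's bed & breakfast"

# quote() percent-encodes everything except letters, digits, "_.-~" and "/".
encoded = quote(raw)
print(encoded)

# Naive string concatenation, shown only to illustrate the "URI-encoded means
# XML-safe" assumption that the rest of the post questions.
link = '<a href="' + encoded + '">...</a>'
print(link)
```

For this particular string the result happens to be fine, because the only ampersand was data and got percent-encoded.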

There’s one little problem: URIs allow (and frequently use) the ampersand character (“&”) to separate query string arguments. XML attribute values disallow a bare ampersand; it is only permitted as the start of an entity or character reference such as “&amp;”. It isn’t a problem with the above string, because the ampersand was part of the string, and so it was escaped as “%26”, which is a perfectly legal XML attribute value. But any URI with multiple query string parameters that you put unescaped into an attribute value is technically invalid XML. For instance, if you had the query parameters {"name": "henry", "age": "27"}, that encodes to the query string “name=henry&age=27”. The XML element <a href="http://example.com/?name=henry&age=27">…</a> is invalid, because it contains a bare ampersand in the attribute value. Browsers, however, don’t seem to mind, and will process the above link properly.
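To see a strict XML parser reject that link, something like the following Python sketch can be used. It relies on the standard library's `urlencode` and `xml.etree.ElementTree`; the helper `is_well_formed` is a hypothetical name introduced just for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Two ordinary query parameters, joined by urlencode with a bare "&".
query = urlencode({"name": "henry", "age": "27"})
markup = '<a href="http://example.com/?' + query + '">link</a>'


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts xml_text (illustrative helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(query)
print(is_well_formed(markup))
```

A strict parser refuses the document outright, whereas browsers parsing HTML are more forgiving, which is why this mistake tends to go unnoticed.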

The problem happens on the edge cases. Consider the query parameters {"name": "henry", "lt;": "27"}. They encode to the query string “name=henry&lt;=27”, and if you put that unescaped into XML, you get <a href="http://example.com/?name=henry&lt;=27">…</a>, which is well-formed, and completely different to what you intended (the attribute value decodes to “http://example.com/?name=henry<=27”, which parses as query parameters {"name": "henry<=27"}). Even if your URI encoder escapes the “;” (which it should, as “;” is a reserved character), you’ll still get <a href="http://example.com/?name=henry&lt%3B=27">…</a>, once again invalid, but both Firefox and Chrome still manage to parse &lt as “<”.
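The silent corruption can be reproduced with a short Python sketch. One assumption to flag: the query string is written by hand here, because Python's own `urlencode` percent-encodes the “;”, so this models an encoder that leaves it alone. The lenient browser handling of a semicolon-less “&lt” is not something a strict parser like `ElementTree` will reproduce, so only the first case is shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

# Hand-built query string with a parameter literally named "lt;"
# (assumes an encoder that does not percent-encode the semicolon).
uri = "http://example.com/?name=henry&lt;=27"

# Dropped into the attribute without XML-escaping.
markup = '<a href="' + uri + '">link</a>'

# The parser accepts it, but reads "&lt;" as an entity reference for "<".
href = ET.fromstring(markup).get("href")
params = parse_qs(urlsplit(href).query)

print(href)
print(params)
```

No error is raised anywhere; the “lt;” parameter simply disappears and its value is glued onto “name”.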

So even if you have URI-encoded (which you should do), you still need to XML-encode before putting it into the attribute value: <a href="http://example.com/?name=henry&amp;lt;=27">…</a>. A minor point: if you are using single-quoted XML attribute values, you also need to make sure that any single quotes in your URI (which it is technically allowed to have) are XML-encoded as well.
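One way to do that XML-encoding step in Python is `xml.sax.saxutils.quoteattr`, which escapes the value and picks a safe quote style. This is a sketch using the standard library, not the only option; the example URIs and variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

uri = "http://example.com/?name=henry&lt;=27"

# quoteattr() escapes "&", "<" and ">" and returns the value already wrapped
# in quotes, so it can be placed directly after "href=".
attr = quoteattr(uri)
markup = "<a href=" + attr + ">link</a>"

# Parsing the result gives back the original URI unchanged.
round_tripped = ET.fromstring(markup).get("href")
print(attr)
print(round_tripped == uri)

# A URI containing a single quote: quoteattr wraps it in double quotes,
# so the apostrophe cannot end the attribute value early.
tricky = "http://example.com/?q=dog's"
tricky_markup = "<a href=" + quoteattr(tricky) + ">link</a>"
tricky_round_tripped = ET.fromstring(tricky_markup).get("href")
```

If you build attributes with your own quoting instead, you have to escape whichever quote character you chose as the delimiter yourself.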

Also see my full post about URI encoding, URI Encoding Done Right.
