Articles

Why you should never mutate a JavaScript Date

In JavaScript on August 2, 2013 by Matt Giuca Tagged: ,

Poor JavaScript. Everybody’s friend, but nobody loves it. Perhaps because some of its bits are just made of stupid. Consider this JavaScript snippet involving the Date class:

> date = new Date(2013, 0, 1)
Tue Jan 01 2013 00:00:00 GMT+1100 (EST)
> date.setUTCMonth(10)

First, I create a new date at midnight on January 1, 2013 (yes, months start from 0 and dates start from 1, it’s insane). Then, I set the month to November. What is the resulting value of date? Surely it’s November 01 2013? Nope, it’s:

Sun Dec 02 2012 00:00:00 GMT+1100 (EST)

Wat? The month is one year earlier, one month later, and one day later than I had expected. But it gets worse:

> date.setUTCMonth(10)
Fri Nov 02 2012 00:00:00 GMT+1100 (EST)

That’s right: setUTCMonth is not even idempotent! Calling it a second time gives a different result than the first. What on Earth is going on? Two things, actually.

Firstly, you may have noticed that the dates are in GMT+1100 (my local time zone, Australian Eastern Daylight Time). The constructor and string display both use the computer’s local time, which is a big mistake. For one thing, it means that software behaves differently in different parts of the world. (Remember this bug I found in the SOPA blackout snippet?) Software should only ever deal with UTC except when displaying times to the user. So in eastern Australia, “new Date(2013, 0, 1)” is actually Dec 31 2012 13:00:00 UTC. You can work around this, but JavaScript isn’t helpful: there is a Date.UTC function, but it doesn’t return a Date object. To construct a date from UTC time, call “new Date(Date.UTC(year, month, day, hours, minutes, seconds))”. Okay, so that’s confusing, but not altogether wrong.

Secondly, and more awfully, we are running into the intrinsic problem of Gregorian dates, that day 31 is valid for some months but not others. Every date library has to deal with this issue, but here JavaScript just sucks. When you start with December 31 and set the month to November, you get November 31. Since that date doesn’t exist, JavaScript “helpfully” gives you December 1. (Fancy that: a setter that doesn’t always end up setting the field to the value you gave it!) Dec 01 2012 13:00:00 UTC is Dec 02 2012 00:00:00 in my local time zone. Calling setUTCMonth a second time sets the month to November, keeping the date value at 1. Seriously, setting the month when the date is not valid for that month should be an error; at least then you would notice the problem. Here JavaScript falls into the classic API design trap of giving unhelpful garbage instead of useful errors.

So what should you do to avoid this? I would recommend never using any setters on the Date class. It is fundamentally dangerous to change the year, month or date, because in all three cases you can get silent corruption of other fields. Treat Date objects as immutable, and make new ones when you need to. Use the “new Date(Date.UTC(…))” form I gave above.

Articles

Calculating the gluPerspective matrix and other OpenGL matrix maths

In Graphics, Maths on June 21, 2012 by Matt Giuca

All 3D graphics programming requires that we manipulate matrices, for transformation and projection. Since graphics toolkits give us the matrix formulae directly, it is very easy to go on using them without truly understanding how they work. When we use OpenGL, the first thing we are told is to “just call” a bunch of matrix manipulation functions, like glTranslateglScalegluPerspective, etc, and it’s easy to go on building 3D worlds without understanding how these matrices work. But I like a deep understanding of how my software works, so I wanted to work out the maths behind these matrices. In particular, as we move towards OpenGL 3.0 and 4.0 and OpenGL ES 2.0, these newer standards are removing all of the built-in matrix stuff (for example, all of the above three functions have been removed) and you have to do all of this stuff manually.

I haven’t been able to find a good analysis of how to derive the matrices. In particular, the perspective projection matrix has often baffled me. So, I worked it all out and this post is a complete working of how to calculate several transformation and projection matrices, including perspective projection. This post is very heavy on mathematics, but I’ll try not to assume much, and if you already understand the basics of matrix transformations, you can just skip ahead.

The OpenGL coordinate system

This section covers the basics of coordinates, vectors and the OpenGL transformation pipeline. You can skip ahead if you already know this stuff.

In OpenGL, each vertex is represented by a point in 3-dimensional space, given the coordinate triple (x, y, z). Traditionally, you would specify a vertex using glVertex3d (or, in 2D graphics, with glVertex2d, which sets z to 0). Unfortunately, it’s quite confusing what the three axes actually mean in 3D space, and it is more confusing because a) it tends to be different from one platform to another (e.g., DirectX uses a different coordinate system to OpenGL which uses a different one to Blender, for example), and b) it can change based on how you set things up (e.g., by rotating the camera, you can change which direction is “up”). So for now, let us assume the default OpenGL camera and we’ll discuss how to transform the camera later.

OpenGL uses a right-handed coordinate system by default, meaning that if you hold out your right hand, extending your thumb, index and middle fingers so that they form three perpendicular axes (as shown in the image below), your thumb shows the direction of the positive X axis, your index finger the positive Y axis, and your middle finger the positive Z axis. Clearly, once your fingers are in place, you can orient your hand any way you like; the default OpenGL orientation is that X extends to the right, Y extends upwards, and Z extends towards the camera.

The right-handed coordinate system used by OpenGL

The right-handed coordinate system used by OpenGL

What I find strange about the OpenGL coordinate system is the orientation of Z. Since the positive Z axis runs towards the camera, it means that every visible object in the scene needs to have a negative z coordinate, which seems a bit un-intuitive. I would find a left-handed coordinate system (where Z extends away from the camera) to be more intuitive. You can use the camera or projection matrices to obtain a left-handed coordinate system, but it is probably simpler to stick to the defaults to avoid confusion.

The unit of measurement in OpenGL is “whatever you like”. In 2D graphics, we are used to measuring things in pixels, but in 3D graphics, that doesn’t make sense since the size of something in pixels depends upon how far away it is. So OpenGL lets you use arbitrary “units” for describing sizes and positions. One good thing about this, even for 2D graphics, is that it makes the display resolution independent — adjusting the screen resolution won’t alter the size of objects on the screen, it will simply adjust the level of detail.

What OpenGL is really about, at its core, is working out how a bunch of vertices go from floating around in 3D space and end up on your screen (a 2D space). The fact that vertices are used to connect polygons together is largely irrelevant (most of the hard work is in transforming vertices; once they are transformed, the polygons can be rendered in 2D). There are several considerations here:

  • Artists and programmers do not like defining every single vertex in its final position in 3D space. That would be crazy, because then whenever an object moved, we would have to adjust all of the vertices. Rather, we want to define every vertex relative to the centre of the particular model (such as a car or person), and then transform the whole model into its position in the world.
  • We need to be able to move the camera freely around the world. To do that, we need to be able to transform all of the vertices in the entire world based on the position and orientation of the camera. (You can think of the camera in OpenGL as being stationary at point (0, 0, 0) facing the -Z axis; rather than moving the camera, we must translate and rotate the entire world into position relative to that fixed camera.)
  • The screen is a 2D surface. Rendering a 3D image onto a 2D surface typically requires some sort of perspective, to make objects that are further away seem smaller, and objects that are closer seem larger. Also, 2D screens are usually not square, so in transforming to a 2D display, we must take into account the aspect ratio of the screen (the ratio of the width and height). On screens with a wider aspect ratio (such as widescreen 16:9 displays), we need to show more objects to the left and right than on screens with thinner aspect ratio (such as full screen 4:3 displays).
  • The screen is made up of pixels, and so when we render the display, we need to know each vertex’s position in pixels (not arbitrary OpenGL units). So we must transform the arbitrary units into pixels so that our scene fills up the whole screen at whatever resolution the display is using.

These considerations roughly form a series of steps that each vertex must go through in its journey from 3D model space to the screen. We call this the vertex pipeline, and OpenGL takes care of most of it automatically (with a little help from the programmer). Newer versions of OpenGL force the programmer to program much of the pipeline his/herself. To describe the pipeline, we must come up with a name for the “coordinate space” at each step of the way. These are outlined in the official OpenGL FAQ:

  • Object Coordinates are the coordinates of the object relative to the centre of the model (e.g., the coordinates passed to glVertex). The Model transformation is applied to each vertex in a particular object or model to transform it into World Coordinates.
  • World Coordinates are the coordinates of the object relative to the world origin. The View transformation is applied to each vertex in the world to transform it into Eye Coordinates. (Note: In OpenGL, the Model and View transformations are combined into the single matrix GL_MODELVIEW, so OpenGL itself doesn’t have a concept of world coordinates — vertices go straight from Object to Eye coordinates.)
  • Eye Coordinates are the coordinates of the object relative to the camera. The Projection transform is applied to each vertex in the world to transform it into Clip Coordinates. (Note: In OpenGL, the Projection transformation is handled by the GL_PROJECTION matrix.)
  • Clip Coordinates are a weird intermediate step which we will discuss in detail later. They are transformed by the Divide step into Normalised Device Coordinates, which I will abbreviate as Viewport coordinates.
  • Viewport Coordinates (or Normalised Device Coordinates in the OpenGL specification) are the coordinates of the vertex in their final on-screen position. The x and y coordinates are relative to the screen, where (-1, -1) is the lower-left corner of the screen and (1, 1) is the upper-right corner of the screen. The z coordinate is still relevant at this point because it will be used to test and update the depth buffer (briefly discussed in the next section); a z of -1 is the nearest point in the depth buffer, whereas 1 is the furthest point. (Yes, this really confuses me too: the Z axis flips around at this point such that the +Z axis extends away from the camera, while the -Z axis points towards the camera!) Any vertex outside the cube (-1, -1, -1) to (1, 1, 1) is clipped, meaning it will not be seen on the screen. The Viewport transformation is applied to the (x, y) part of each vertex to transform it into Screen Coordinates.
  • Screen Coordinates are the (x, y) coordinates of the vertex in screen pixels. By default, (0, 0) represents the lower-left of the screen, while (w, h) represents the upper-right of the screen, where w and h is the screen resolution in pixels.

In the “old” (pre-shader) OpenGL versions (OpenGL 1.0 and 2.0 or OpenGL ES 1.1), you would control the Object to Eye coordinate transformation by specifying the GL_MODELVIEW matrix (typically using a combination of glTranslate, glScale and glRotate), and the Eye to Clip coordinate transformation by specifying the GL_PROJECTION matrix (typically using glOrtho or gluPerspective). In the “new” (mandatory shader) OpenGL versions (OpenGL 3.0 onwards or OpenGL ES 2.0 onwards), you write a shader to do whatever you want, and it outputs Clip coordinates, typically using the same model, view and projection matrices as the old system. In both systems, you can control the Viewport to Screen transformation using glViewport (but the default value is usually fine); everything else is automatically taken care of by OpenGL. In this article, we are most interested in the transformation from Object coordinates to Viewport coordinates.

The depth buffer

I very briefly digress to introduce the OpenGL depth buffer. You can skip ahead if you already know this stuff.

While OpenGL draws pretty colourful objects to our screens, what is less obvious is that it is also drawing the same objects to a less glamorous greyscale off-screen buffer known as the depth buffer. When it draws to the depth buffer, it doesn’t use pretty lighting and texturing, but instead sets the “colour” of each pixel to the corresponding z coordinate at that position. For example, I have rendered the Utah teapot (glutSolidTeapot), and the images below show the resulting colour buffer (the normal screen display) and the depth buffer. Notice that the depth buffer uses darker pixels for the parts of the teapot that are closer to the screen (such as the spout), while the parts further away (such as the handle) fade off into the white background.

The colour buffer

The colour buffer

The depth buffer

The depth buffer

The closest physical analogy I can think of is if you imagine you are in a room where all objects are completely black, but there is a very thick white fog. In this room, the further away something is, the lighter it appears because there is more fog in between your eye and the object. If you have ever seen an autostereogram, such as Magic Eye, they often print “answers” in the form of a depth map (such as this shark), and these are essentially the depth buffer from a 3D scene.

The depth buffer has several uses. The most obvious is that every time OpenGL draws a pixel, it first checks the depth buffer to see if there is already something closer (darker) than the pixel it is about to draw. If there is, it doesn’t draw the pixel. That is why you usually don’t have to worry about the order you draw objects in OpenGL, because it automatically avoids drawing a polygon over the top of any polygon that is already closer to the screen. If you didn’t have the depth buffer, you would have to carefully draw each object in order from back to front! (That is why you need to remember to call glEnable(GL_DEPTH_TEST) in any 3D program, because the depth buffer is turned off by default!)

It is important to note that the depth buffer does have a finite range, which is implementation-specific. Each depth buffer pixel may, for example, be a 16-bit integer or a 32-bit float. The z values are mapped on to the depth buffer space by the projection matrix. This is why, when you call gluPerspective, the choice of the zNear and zFar values is so important: they control the depth buffer range. In the above image, I set zNear to 1 and zFar to 5, with the teapot filling up most of that space. Had I set zFar to 1000, I could see a lot further back, but I would be limiting the precision of the depth test because the depth values need to be shared over a much wider z space.

Simple matrix transformations

This section covers the mathematics of the scale and translate matrices. You can skip ahead if you already know this stuff.

Let’s say we want to apply a scale transformation of (100, 30, 10) to an object. Mathematically, this is very simple: just take each of its vertices and multiply the x value by 100, the y value by 30, and the z value by 10, like this:

\begin{array}{lcl}x' &=& x \times 100\\y' &=& y \times 30\\z' &=& z \times 10\end{array}

To do this efficiently, we can express the transformation as a matrix. Modern graphics cards can do matrix multiplication extremely quickly, so we take advantage of this wherever possible. We can express the scale as a 3×3 matrix:

\begin{bmatrix}100 & 0 & 0\\ 0 & 30 & 0\\ 0 & 0 & 10\end{bmatrix}

If we express a vector as a 1×3 matrix:

\begin{bmatrix}x\\ y\\ z\end{bmatrix}

Then we can apply the scale simply by multiplying the scale matrix by the vector:

\begin{bmatrix}100 & 0 & 0\\ 0 & 30 & 0\\ 0 & 0 & 10\end{bmatrix} \begin{bmatrix}x\\ y\\ z\end{bmatrix} = \begin{bmatrix}x \times 100\\ y \times 30\\ z \times 10\end{bmatrix}

See the Wikipedia article Matrix multiplication for an explanation of how this works. In general, we can think of a matrix as a function that transforms a vector in a very restrictive way: the resulting value for each coordinate is the sum of multiplying each of the input coordinates by some constant. Each row of the matrix specifies the formula for one of the resulting coordinates, and reading along the row gives the constant to multiply each input coordinate. In general, the 3×3 matrix:

\begin{bmatrix}a & b & c\\ d & e & f\\ g & h & i\end{bmatrix}

can be thought of as a function transforming (x, y, z) to (x’, y’, z’), where:

\begin{array}{lcl}x' &=& ax + by + cz\\ y' &=& dx + ey + fz\\ z' &=& gx + hy + iz\end{array}

So if we have a function of the above form, we can always represent it as a matrix, which is very efficient. In OpenGL, you can use glLoadMatrix or glMultMatrix to directly load a 4×4 matrix into the GL state (remember to first set the current matrix to the model-view matrix using glMatrixMode(GL_MODELVIEW)). It is important to note that OpenGL represents matrices in column-major order which is unintuitive: your array must group together the cells of each column,  not each row.

But now let’s say we want to apply a translate transform of (2, -5, 9) to an object. This time, mathematically we just want to take each vertex of the object and add 2 to the x value, subtract 5 from the y value, and add 9 to the z value:

\begin{array}{lcl}x' &=& x + 2\\ y' &=& y - 5\\ z' &=& z + 9\end{array}

We can’t express this as a 3×3 matrix, because it is not a linear transformation (the result isn’t just multiples of the input — there is constant addition as well). But we love matrices, so we’ll invent some more maths until it works: we need to introduce homogeneous coordinates. Rather than representing each vertex as a 3-dimensional vector (x, y, z), we’ll add a fourth dimension, so now each vertex is a 4-dimensional vector (x, y, z, w). The last coordinate, w, is always 1. It isn’t a “real” dimension; it’s just there to make matrices more helpful. You’ll notice that internally, all OpenGL vertices are actually 4-dimensional. There is a function glVertex4d that lets you specify the w coordinate, but usually, glVertex2d or glVertex3d is sufficient: both of those automatically set w to 1.

If we assume that w = 1 for all vertices, we can now express the translation as a linear transformation: just multiply the constants by w and assume it is 1:

\begin{array}{lcl}x' &=& x + w \times 2\\ y' &=& y + w \times -5\\ z' &=& z + w \times 9\\ w' &=& w\end{array}

Clever! Now we can express the translate as a 4×4 matrix:

\begin{bmatrix}1 & 0 & 0 & 2\\ 0 & 1 & 0 & -5\\ 0 & 0 & 1 & 9\\ 0 & 0 & 0 & 1\end{bmatrix}

This is why all vectors in OpenGL are actually 4D, and all matrices in OpenGL are actually 4×4.

The orthographic projection matrix

If we assume that all of the vertices are transformed into Eye Coordinates by the model-view matrix, what does the scene look like if we just render it without setting the projection matrix (i.e., if we leave the projection matrix in its default identity state, so it doesn’t affect the vertices at all)? We will render everything in the cube from (-1, -1, -1) to (1, 1, 1). Recall that the Viewport transformation transforms the (x, y) coordinates into screen space such that (-1, -1) is in the lower-left and (1, 1) is in the upper right, and OpenGL clips any vertex with a z coordinate that is not in between -1 and 1. The geometry will be rendered without perspective — objects will appear to be the same size no matter how close they are, and the z coordinate will only be used for depth buffer tests and clipping.

This is an orthographic projection and it isn’t normally used for 3D graphics because it produces unrealistic images. However, it is frequently used in CAD tools and other diagrams, often in the form of an isometric projection. Such a projection is also ideal for drawing 2D graphics with OpenGL — for example when drawing the user interface on top of a 3D game, you will want to switch into an orthographic projection and turn off the depth buffer.

There is also a weirdness because OpenGL is supposed to use a right-handed coordinate system: the Z axis is inverted (recall that the -Z axis points away from the camera, but in the depth buffer, smaller z values are considered closer). To fix this, we can invert the Z coordinate by setting the GL_PROJECTION matrix to the following:

\begin{bmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & -1 & 0\\ 0 & 0 & 0 & 1\end{bmatrix}

Now objects with smaller z coordinates will appear behind objects with larger z coordinates. That’s the basic principle of the orthographic projection matrix, but we can customise it by modifying the cube that defines our screen and clipping space. The glOrtho function modifies the projection matrix to an orthographic projection with a custom cube: its arguments are left, right, bottom, top, nearVal, farVal. To make things extra confusing, the nearVal and farVal are inverse Z coordinates; for example, nearVal = 3 and farVal = 5 implies that the Z clipping is between -3 and -5. I think this negative Z axis thing was a really bad idea.

Some simple calculations show that in order to linearly map the range [left, right] on to [-1, 1], the range [bottom, top] on to [-1, 1], and the range [-nearVal, -farVal] on to [-1, 1], we use the formula:

\begin{array}{lcl}x_\mathit{viewport} &=& \frac{2x}{\mathit{right} - \mathit{left}} - \frac{\mathit{right} + \mathit{left}}{\mathit{right} - \mathit{left}}\\ y_\mathit{viewport} &=& \frac{2y}{\mathit{top} - \mathit{bottom}} - \frac{\mathit{top} + \mathit{bottom}}{\mathit{top} - \mathit{bottom}}\\ z_\mathit{viewport} &=& \frac{-2z}{\mathit{farVal} - \mathit{nearVal}} - \frac{\mathit{farVal} + \mathit{nearVal}}{\mathit{farVal} - \mathit{nearVal}}\end{array}

Using the same trick as with the translation matrix (assuming that w is always 1), we can express this with the orthographic projection matrix:

\begin{bmatrix}\frac{2}{\mathit{right} - \mathit{left}} & 0 & 0 & -\frac{\mathit{right} + \mathit{left}}{\mathit{right} - \mathit{left}}\\ 0 & \frac{2}{\mathit{top} - \mathit{bottom}} & 0 & -\frac{\mathit{top} + \mathit{bottom}}{\mathit{top} - \mathit{bottom}}\\ 0 & 0 & \frac{-2}{\mathit{farVal} - \mathit{nearVal}} & -\frac{\mathit{farVal} + \mathit{nearVal}}{\mathit{farVal} - \mathit{nearVal}}\\ 0 & 0 & 0 & 1\end{bmatrix}

Which is precisely the matrix produced by glOrtho.

The perspective projection matrix

At last, we come to the perspective projection matrix, the projection matrix used for realistic 3D displays. The perspective projection is realistic because it causes objects further in the distance to appear smaller. Like the orthographic projection, perspective is clipped (you need to specify near and far z coordinates, and any geometry outside of that range will not be drawn). However, the screen and clipping space is not a cube, but a square frustum (a square-based pyramid with the apex cut off). Indeed, if we set the near clipping plane to 0, the screen clipping region is a square-based pyramid with the camera as the apex (but we usually need a positive near clipping plane for the depth buffer to work).

The gluPerspective function is used to set up the perspective projection matrix, and in this section, we analyse the maths behind the matrix it creates. It takes four parameters:

  • fovy (“field-of-view y“): the vertical viewing angle. This is the angle, in degrees, from the top of the screen to the bottom.
  • aspect: The aspect ratio, which controls the horizontal viewing angle relative to the vertical. We could imagine a variable fovx = aspect × fovy. For a non-stretched perspective, aspect should be the ratio of the screen width to height. For example, on a 640×480 display, aspect should be 4/3.
  • zNear: The inverted z coordinate of the near clipping plane.
  • zFar: The inverted z coordinate of the far clipping plane.

As with glOrtho, the near and far clipping planes are inverted: zNear and zFar are always positive, even though z coordinates of objects in front of the camera are negative.

Let’s work out exactly how to go from 3D Eye Coordinates (again, assuming a fourth vector coordinate w which is always 1) to 2D screen coordinates with a depth buffer zvalue, in a perspective projection. First, we compute the 2D screen coordinates. The following diagram shows a side view of the scene in Eye Coordinate space. The camera frustum is pictured, bounded by the angle fovy and the Z clipping planes -zNear and -zFar. A blue triangle is visualised, representing a point at coordinate (y, z), where z is some arbitrary depth, and y lies on the edge of the frustum at that depth. Keep in mind that z must be a negative value.

Frustum (side view)

Side view of the camera frustum

Simple trigonometry tells us the relationship between y, z and fovy: in the blue right-angled triangle, y is the opposite edge, -z is the adjacent edge, and fovy/2 is the angle. Therefore, if we know z and fovy, we can compute y:

\begin{array}{lcl}\mathsf{tan}(\frac{\mathit{fovy}}{2}) &=& \frac{y}{-z}\\  y &=& \mathsf{tan}(\frac{\mathit{fovy}}{2}) \times -z\end{array}

We can simplify the following maths if we define:

f = 1 / \mathsf{tan}(\frac{\mathit{fovy}}{2})

then:

y = \frac{-z}{f}

At depth z, a Y eye coordinate value of y should map on to the viewport coordinate y = 1, as it is at the very top of the screen. Similarly, a Y eye coordinate value of -y should map onto viewport coordinate y = -1, and all intermediate Y values should map linearly across that range (because perspective projection is linear for all values at the same depth). That means for any (y, z) eye coordinate, we can divide by \frac{-z}{f} to compute the corresponding viewport coordinate. Thus, for any (y, z) eye coordinate, we can compute the corresponding viewport coordinate:

y_\mathit{viewport} = fy / -z

In the X axis, we find that at depth z, if y is the Y eye coordinate on the top edge of the frustum, the corresponding X eye coordinate on the right edge of the frustum is y × aspect. Therefore, using the same logic as above, for any (x, z) eye coordinate, we can compute the corresponding x viewport coordinate:

x_\mathit{viewport} = \frac{fx}{\mathit{aspect}} / -z

Finally, we need to move the z coordinate into the depth buffer space, such that -zNear maps onto -1, -zFar maps onto 1, and all Z eye coordinates in between are moved into the range [-1, 1]. We could map onto this space any way we like (linear might be a simple and obvious choice), but we don’t actually want a linear mapping: that would imply that all Z coordinates in that range have equal importance. We actually want to give coordinates close to the camera a much larger portion of the depth buffer than coordinates far from the camera. That is because near objects will be held to close scrutiny by the viewer, and thus need to be represented with fine detail in the depth buffer, whereas with far objects, it does not matter so much if they overlap. A good function is the reciprocal function, of the form z_\mathit{viewport} \propto 1/z.

The general form of this function is:

z_\mathit{viewport} = m/z + c

To satisfy our constraints on zNear and zFar, we obtain the equations:

\begin{array}{lcl}m/-\mathit{zNear} + c &=& -1\\  m/-\mathit{zFar} + c &=& 1\end{array}

We solve for m and c:

\begin{array}{lcl}m &=& (-1 - c) \times -\mathit{zNear}\\  m &=& (1 - c) \times -\mathit{zFar}\\  (-1 - c) \times -\mathit{zNear} &=& (1 - c) \times -\mathit{zFar}\\  \mathit{zNear} + c\times\mathit{zNear} &=& -\mathit{zFar} + c\times\mathit{zFar}\\  c \times (\mathit{zNear} - \mathit{zFar}) &=& -(\mathit{zFar} + \mathit{zNear})\\  c &=& -\frac{\mathit{zFar} + \mathit{zNear}}{\mathit{zNear} - \mathit{zFar}}\\  m &=& (1 + \frac{\mathit{zFar} + \mathit{zNear}}{\mathit{zNear} - \mathit{zFar}}) \times -\mathit{zFar}\\  m &=& \frac{(\mathit{zNear} - \mathit{zFar}) + (\mathit{zFar} + \mathit{zNear})}{\mathit{zNear} - \mathit{zFar}} \times -\mathit{zFar}\\  m &=& \frac{2 \times \mathit{zNear}}{\mathit{zNear} - \mathit{zFar}} \times -\mathit{zFar}\\  m &=& -\frac{2 \times \mathit{zFar} \times \mathit{zNear}}{\mathit{zNear} - \mathit{zFar}}\end{array}

Hence, we have the equation:

z_\mathit{viewport} = \frac{2 \times zFar \times zNear}{zNear - zFar} / -z - \frac{zFar + zNear}{zNear - zFar}

The following chart gives an idea of how the perspective projection maps Z eye coordinate values onto depth buffer values, with zNear = 1 and zFar = 5:

Mapping Z eye coordinate values onto the depth buffer

Mapping Z eye coordinate values onto the depth buffer

Notice how the mapping is heavily skewed to favour objects near the front of the screen. Half of the depth buffer space is given to objects in the front 16% of the frustum space.

So our equations for transforming from eye coordinates (x, y, z) to viewport coordinates (x_\mathit{viewport}, y_\mathit{viewport}, z_\mathit{viewport}) are:

\begin{array}{lcl}  x_\mathit{viewport} &=& \frac{fx}{\mathit{aspect}} / -z\\  y_\mathit{viewport} &=& fy / -z\\  z_\mathit{viewport} &=& \frac{2 \times zFar \times zNear}{zNear - zFar} / -z - \frac{zFar + zNear}{zNear - zFar}  \end{array}

But there’s a problem: Mathematically, these equations work fine, but they are non-linear transformations — all three involve division by z — so they cannot be represented by a matrix (recall that matrices represent functions that sum a constant multiple of each coordinate). So we have to cheat a little bit in order to use a matrix: we notice that the division by -z is consistent in all three equations, so we factor it out:

z_\mathit{viewport} = \big(\frac{2 \times zFar \times zNear}{zNear - zFar} + \frac{zFar + zNear}{zNear - zFar} \times z\big) / -z

Note that by factoring out the division, the third equation gained a multiplication by z. If we ignore the “/ -z” part of each equation, we see that it is now a linear equation suitable for encoding as a matrix. Thus, we invent “Clip Coordinates” — basically the above equations with the -z factored out. Rather than setting w to 1 as usual, we set it to -z, for reasons that will soon become clear:

\begin{array}{lcl}  x_\mathit{clip} &=& \frac{fx}{\mathit{aspect}}\\  y_\mathit{clip} &=& fy\\  z_\mathit{clip} &=& \frac{zFar + zNear}{zNear - zFar} \times z + \frac{2 \times zFar \times zNear}{zNear - zFar}\\  w_\mathit{clip} &=& -z  \end{array}

We can now redefine the viewport coordinates in terms of the clip coordinates, using the fact that the w clip coordinate has been set to -z:

\begin{array}{lcl}  x_\mathit{viewport} &=& x_\mathit{clip} / w_\mathit{clip}\\  y_\mathit{viewport} &=& y_\mathit{clip} / w_\mathit{clip}\\  z_\mathit{viewport} &=& z_\mathit{clip} / w_\mathit{clip}  \end{array}

The beauty of separating these two steps is that the first step can now be achieved with a matrix, while the second step is extremely simple. In OpenGL, the projection matrix (GL_PROJECTION) transforms Eye Coordinates to Clip Coordinates (not Viewport Coordinates). We can rewrite the above eye-to-clip equations as a matrix (assuming that the w eye coordinate is always 1):

\begin{bmatrix}  \frac{f}{\mathit{aspect}} & 0 & 0 & 0\\  0 & f & 0 & 0\\  0 & 0 & \frac{zFar + zNear}{zNear - zFar} & \frac{2 \times zFar \times zNear}{zNear - zFar}\\  0 & 0 & -1 & 0  \end{bmatrix}

That is, indeed, the projection matrix created by gluPerspective. Because we can’t express the clip-to-viewport transformation as a matrix, OpenGL does it for us, probably using specialised hardware to make it extremely efficient. So in the end, we can achieve the perspective transformation using a matrix and an extra bit of specialised code, and the whole thing is really really fast.

One final point: As I said earlier, the newer versions of OpenGL (version 3.0, or ES 2.0) don’t have any built-in matrix support. There is no GL_MODELVIEW or GL_PROJECTION matrix. These OpenGL versions require that you write shaders to do all of this work yourself (the shaders run on the GPU). But the principle is the same: you would create a projection matrix just like the one above, and multiply it by each vertex in the vertex shader. The output of the vertex shader is clip coordinates (not viewport coordinates) — OpenGL still performs the “divide by w” step automatically.

Articles

The importance of language-level abstract Unicode strings

In Language design, Language implementation, Unicode on April 19, 2012 by Matt Giuca Tagged: ,

Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

Unicode is a beautiful thing. Where in the past, we had to be content with text sequences of merely 128 different characters, we can now use over 110,000 characters from almost every script in use today or throughout history. But there’s a problem: despite the fact that Unicode has been around and relatively unchanged for the past 16 years, most modern programming languages do not give programmers a proper abstraction over Unicode strings, which means that we must carefully deal with a range of encoding issues or suffer a plethora of bugs. And there are still languages coming out today that fall into this trap. This blog post is a call to language designers: your fundamental string data type should abstract over Unicode encoding schemes, such that your programmer sees only a sequence of Unicode characters, thus preventing most Unicode bugs.

Over the past few years, I have encountered a surprising amount of resistance to this idea, with counter-arguments including performance reasons, compatibility reasons and issues of low-level control. In this post, I will argue that having a consistent, abstract Unicode string type is more important, and that programmers can be given more control in the special cases that require it.

To give a quick example of why this is a problem, consider the Hello World program on the front page of the Go programming language website. This demo celebrates Go’s Unicode support by using the Chinese word for “world”, “世界”:

fmt.Println("Hello, 世界")

But while Go ostensibly has Unicode strings, they are actually just byte strings, with a strong mandate that they be treated as UTF-8-encoded Unicode text. The difference is apparent if we wrap the string in the built-in len function:

fmt.Println(len("Hello, 世界"))

This code, perhaps surprisingly, prints 13, where one might have expected it to print 9. This is because while there are 9 characters in the string, there are 13 bytes if the string is encoded as UTF-8: in Go, the string literal “Hello, 世界” is equivalent to “Hello, \xe4\xb8\x96\xe7\x95\x8c”. Similarly, indexing into the string uses byte offsets, and retrieves single bytes instead of whole characters. This is a dangerous leaky abstraction: it appears to be a simple sequence of characters until certain operations are performed. Worse, English-speaking programmers have a nasty habit of only testing with ASCII input, which means that any bugs where the programmer has presumed to be dealing with characters (but is actually dealing with bytes) will go undetected.

In a language that properly abstracts over Unicode strings, the above string should have a length of 9. Programmers should only be exposed to encoding details when they explicitly ask for it.

The folly of leaky abstraction

The fundamental problem lies in the difference between characters and code units. Characters are the abstract entities that correspond to code points. Code units are the pieces that represent a character in the underlying encoding, and they vary depending on the encoding. For example, consider the character OSMANYA LETTER BA or ‘𐒁’ (with code point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we would say this character is represented by four code units in UTF-8. In UTF-16, the same character is represented by two code units: D801 DC81; code units are 16-bit numbers. The problem with many programming languages is that their string length, indexing and, often, iteration operations are in code units rather than in characters (or code points).

(An aside: I find it interesting that the Unicode Standard itself defines a “Unicode string” as “an ordered sequence of code units.” That’s code units, not code points, suggesting that the standard itself disagrees with my thesis that a string should be thought of as merely a sequence of code points. This section of the standard seems concerned with how the strings should be represented, not the programming interface for accessing such strings, so I feel that it doesn’t directly contradict this post. This post is about the string interface, not the underlying representation.)

There are three main types of culprit languages that exhibit this issue:

  1. Languages with byte-oriented string operations and no specified encoding. These leave Unicode support up to programmers and third-party library authors, resulting in a general mish-mash of encoding across different code bases. These include C, Python 2, Ruby, PHP, Perl, Lua, and almost all pre-Unicode (1992) programming languages.
  2. Languages with code-unit-oriented string operations and no specified encoding. These are probably the worst offenders, as the language manages the encoding, but leaks abstraction details that the programmer has no control over (for example, different builds of the compiler may have different underlying encoding schemes, affecting the behaviour of programs). These include certain builds of Python 3.2 and earlier, and Mercury.
  3. Languages with code-unit-oriented string operations, but a specific encoding scheme. These languages at least have well-defined behaviour, but still expose the programmer to implementation details, which can result in bugs. Most modern languages fit into this category, including Go (UTF-8), JVM-based languages such as Java and Scala (UTF-16), .NET-based languages such as C# (UTF-16) and JavaScript (UTF-16).

Edit: A number of comments indicate that some of the other languages have optional Unicode support. In Perl 5, you can write “use utf8″ to turn on proper Unicode strings. In Ruby 1.9, you can attach encodings to strings to make them behave better. In Python 2, as I’ll get to later, you have a separate Unicode string type.

Above, I talked about the leaky abstraction of UTF-8 in Go, but it is equally important to recognise the leaky abstraction of UTF-16 in many modern language frameworks; in particular, the Java platform (1996) and the .NET platform (2001). Both treat strings as a sequence of UTF-16 code units (and the corresponding “char” data type as a single UTF-16 code unit). This is disappointingly close to the ideal. In Java, for instance, you can essentially treat a string to be a sequence of characters. ‘e’ (U+0065) is a character, and ‘汉’ (U+6C49) is a character — it feels like the language is abstracting over Unicode characters, until you realise that it isn’t. ‘𐒁’ (U+10481) is not a valid Java character. It is an astral character (one that cannot be represented in a single UTF-16 code unit) — it needs to be represented by the two-code-unit sequence D801 DC81. So it is possible to store this character in a Java string, but not without understanding the underlying encoding details. For example, the Java string “𐒁” has length 2, and worse, if you ask for “charAt(0)” you will get the character U+D801; “charAt(1)” gives U+DC81. Any code which directly manipulates the characters of the string is likely to fail in the presence of astral characters. Java has the excuse of history: Java was first released in January 1996; Unicode 2.0 was released in July 1996, and before that, there were no astral characters (so for six months, Java had it right!). The .NET languages, such as C#, behave exactly the same, and don’t have the excuse of time. JavaScript is the same again. This is perhaps worse than the UTF-8 languages, because even non-English-speaking programmers are likely to forget about astral characters: there are no writing systems in use today that use astral characters. But astral characters include historical scripts of interest to historians, emoticons which may be used by software, and large numbers of mathematical symbols which are commonly used by mathematicians, including myself, so it is important that software gets them right.

I must stress how harmful the second class of languages are in terms of string encoding. In 2011, the Mercury language switched from a type 1 language (strings are just byte sequences) to a type 2 language — the language now specifies all of the basic string operations in terms of code units, but does not specify the encoding scheme. Worse, Mercury has several back-ends — if compiling to C or Erlang, it uses UTF-8, whereas compiling to Java or C# results in UTF-16-encoded strings. Now programmers must contend with the fact that string.length(“Hello, 世界”) might be 13 or 9 depending on the back-end. (Fortunately, character-oriented alternatives, such as count_codepoints, are provided.) This is a major blow to code portability, which I would recommend language designers avoid at all costs.

Python too has traditionally had this problem: the interpreter can be compiled in “UCS-2″ or “UCS-4″ modes, which represent strings as UTF-16 or UTF-32 respectively, and expose those details to the programmer (the UCS-2 build behaves much like the other UTF-16 languages, while the UCS-4 build behaves exactly as I want, with one character per code unit). Fortunately, Python is about to correct this little wart entirely with the introduction of PEP 393 in upcoming version 3.3. This version will remove UCS-2/UCS-4 build, so all future versions will behave as the UCS-4 build did, but with a nice optimisation: all strings with only Latin-1 characters are encoded in Latin-1; all strings with only basic multilingual plane (BMP) characters are encoded in UCS-2 (UTF-16); all strings with astral characters are encoded in UTF-32. (It’s more complicated than that, but that’s the gist of it.) This ensures the correct semantics, but allows for compact string representation in the overwhelmingly common case of BMP-only strings.

Somewhat surprisingly, Bash (1989) appears to support character-oriented Unicode strings. The only other languages I have found which properly abstract over Unicode strings are from the functional world: Haskell 98 (1998) and Scheme R6RS (2007). Haskell 98 specifies that “the character type Char is an enumeration whose values represent Unicode characters” (with a link to the Unicode 5.0 specification). Scheme R6RS was the first version of Scheme to mention Unicode, and got it right on the first go. It specifies that “characters are objects that represent Unicode scalar values,” and goes into details explicitly stating that a character value is in the range [0, D7FF16] ∪ [E00016, 10FFFF16], then defines strings as sequences of characters.

With only four languages that I know of properly providing character-oriented string operations by default (are there any more?), this is a pretty poor track record. Sadly, many modern languages are being built on top of either the JVM or .NET framework, and so naturally absorb the poor character handling of those platforms. Still, it would be nice to see some more languages that behave correctly.

Some real-world problems

So far, the discussion has been fairly academic. What are some of the actual problems that have come about as a result of the leaky abstraction?

Edit: I mention a heap of technologies here, in a way that could be interpreted to be derisive. I don’t mean any offense towards the creators of these technologies, but rather, I intend to point out how difficult it can be to work with Unicode on a language that doesn’t provide the right abstractions.

A memorable example for me was a program I worked on which sent streams of text from a Python 2 server (UTF-8 byte strings) to a JavaScript client (UTF-16 strings). We had chosen the arbitrary message size of 512 bytes, and were chopping up text arbitrarily on those boundaries, and sending them off encoded in UTF-8 to the client, where they were concatenated back together. This almost worked, but we discovered a strange problem: some non-ASCII characters were being messed up some of the time. It occurred if you happened to have a multi-byte character split across a 512-byte boundary. For example, “ü” (U+00FC) is encoded as hex C3 BC. If byte 511 of a message was C3 and byte 512 was BC, the first packet would be sent ending in the byte C3 (which JavaScript wouldn’t know how to convert to UTF-16), and the second packet would be sent beginning with the byte BC (which JavaScript also wouldn’t know how to convert to UTF-16). So the result is “��”. In general, if a language exposes its underlying code units, then strings may not be split, converted to a different encoding, and re-concatenated.

An obvious example of an operation that goes wrong in these situations is the length operator. If you use UTF-8, Chinese characters are going to each report a length of 3, while in UTF-16, astral characters will each report a length of 2. Think this doesn’t matter? What about Twitter? On Twitter, you have to type messages into 140 characters. That’s 140 characters, not 140 bytes. Imagine if Twitter ate up three “characters” every time you hit a key (if you spoke Chinese, for example). Fortunately, Twitter does it correctly, even for astral characters. I just tweeted the following:

𝐓𝐰𝐢𝐭𝐭𝐞𝐫 𝐬𝐮𝐩𝐩𝐨𝐫𝐭𝐬 𝐚𝐬𝐭𝐫𝐚𝐥 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐬 — 𝐞𝐚𝐜𝐡 𝐨𝐧𝐞 𝐜𝐨𝐮𝐧𝐭𝐬 𝐚𝐬 𝐚 𝐬𝐢𝐧𝐠𝐥𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫. 𝐈𝐭’𝐬 𝐧𝐨𝐭 𝐭𝐡𝐞 𝐬𝐢𝐳𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐭𝐰𝐞𝐞𝐭 𝐭𝐡𝐚𝐭 𝐜𝐨𝐮𝐧𝐭𝐬, 𝐢𝐭’𝐬 𝐡𝐨𝐰 𝐲𝐨𝐮 𝐮𝐬𝐞 𝐢𝐭.

Notice that the text is unusually bold — these are no ordinary letters. They are MATHEMATICAL BOLD letters, which are astral characters. For example, the ‘𝐓’ is U+1D413 MATHEMATICAL BOLD CAPITAL T. The tweet contains exactly 140 characters — 28 ASCII, 3 in the basic multilingual plane, and 109 astral. In UTF-16, it comprises 249 code units, while in UTF-8, it comprises 473 bytes. Yet it was accepted in Twitter’s 140 character limit, because they count characters. I would guess that Twitter’s counting routines are written in both Scala and JavaScript, which means that in both versions, the engineers would need to have specially considered surrogate pairs as a single character. Had Twitter been written in a more Unicode-aware language, they could just have called the length method and not given it another thought.

Let’s quickly look at the way Tomboy (a note taking app written in C# on the Mono/.NET platform) loads notes from XML files. I discovered a bug in Tomboy where the entire document would be messed up if you type an astral character, save, quit, and load. It turns out that one function is keeping an int “offset” which counts the number of “characters” (actually code units) read, and passes the offset to another function, which “returns the location at a particular character offset.” This one really means character. Needless to say, one astral character means that the character offset is too high, resulting in a complete mess. Why is it the programmer’s responsibility to deal with these concepts?

Other software I’ve found that doesn’t handle astral characters correctly includes Bugzilla (ironically, I discovered this when it failed to save my report of the above Tomboy bug) and FileFormat.info (an otherwise-fantastic resource for Unicode character information). The latter shows what can happen when a UTF-16 string is encoded to UTF-8 without a great deal of care, presumably due to an underlying language exposing a UTF-16 string representation.

A common theme here is programmer confusion. While these problems can be solved with enough effort, the leaky abstraction forces the programmer to expend effort that could be spent elsewhere. Importantly, it makes documentation more difficult. Any language with byte- or code-unit-oriented strings forces programmers to clarify what they mean every time they say “length” or “n characters”. Only when the language completely hides the encoding details are programmers free to use the phrase “n characters” unambiguously.

What about performance?

The most common objection to character-oriented strings is that it impacts the performance of strings. If you have an abstract Unicode string that uses UTF-8 or UTF-16, your indexing operation (charAt) goes from O(1) to O(n), as you must iterate over all of the variable-length code points. It is easier to let the programmer suffer an O(1) charAt that returns code units rather than characters. Alternatively, if you have an abstract Unicode string that uses UTF-32, you’ll be fine from a performance standpoint, but you’ve made strings take up two or four times as much space.

But let’s think about what this really means. The performance argument really says, “it is better to be fast than correct.” That isn’t something I’ve read in any software engineering book. Conventional wisdom is the opposite: 1. Be correct — in all cases, not just the common ones, 2. Be fast (here it is OK to optimise for the common cases). While performance is an important consideration, is unbelievable to me that designers of high-level languages would prioritise speed over correctness. (Again, you could argue that a code unit interface is not “incorrect,” merely “low level,” but hopefully this post shows that it frequently results in incorrect code.)

I don’t know exactly what the best implementation is, and I’m not here to tell you how to implement it. But for what it’s worth, I don’t consider UTF-32 to be a tremendous burden in this day and age (in moving to 64-bit platforms, we doubled the size of all pointers — why can’t we also double the size of strings?) Alternatively, Python’s PEP 393 shows that we can have correct semantics and optimise for common cases as well. In any case, we are talking about software correctness here — can we get the semantics right first, then worry about optimisation?

But it isn’t supposed to be abstract — that’s what libraries are for

The second most common objection to character-oriented strings is “but the language just provides the basic primitives, and the programmer can build whatever they like on top of that.” The problem with this argument is that it ignores the ecosystem of the language. If the language provides, for example, a low-level byte-string data type, and programmers can choose to use it as a UTF-8 string (or any other encoding of their choice), then there is going to be a problem at the seams between code written by different programmers. If you see a library function that takes a byte-string, what should you pass? A UTF-8 string? A Latin-1 string? Perhaps the library hasn’t been written with Unicode awareness at all, and any non-ASCII string will break. At best, the programmer will have written a comment explaining how you should encode Unicode characters (and this is what you need to do in C if you plan to support Unicode), but most programmers will forget most of the time. It is far better to have everybody on the same page by having the main string type be Unicode-aware right out of the box.

The ecosystem argument can be seen very clearly if we compare Python 2 and 3. Python 3 is well known for better Unicode support than its predecessor, but fundamentally, it is a rather simple difference: Python 2’s byte-string type is called “str”, while its Unicode string type is called “unicode”. By contrast, Python 3’s byte-string type is called “bytes”, while its Unicode string type is called “str”. They behave pretty much the same, just with different names. But in practice, the difference is immense. Whenever you call the str() function, it produces a Unicode string. Quoted literals are Unicode strings. All Python 3 functions, regardless of who wrote them, deal with Unicode, unless the programmer had a good reason to deal with bytes, whereas in Python 2, most functions dealt with bytes and often broke when given a Unicode string. The lesson of Python 3 is: give programmers a Unicode string type, make it the default, and encoding issues will mostly go away.

What about binary strings and different encodings?

Of course, programming languages, even high-level ones, still need to manipulate data that is actually a sequence of 8-bit bytes, that doesn’t represent text at all. I’m not denying that — I just see it as a totally different data structure. There is no reason a language can’t have a separate byte array type. UTF-16 languages generally do. Java has separate types “byte” and “char”, with separate semantics (for example, byte is a numeric type, whereas char is not). In Java, if you want a byte array, just say “byte[]”. Python 3, similarly, has separate “bytes” and “str” types. JavaScript recently added a Uint8Array type for this purpose (since JavaScript previously only had UTF-16 strings). Having a byte array type is not an excuse for not having a separate abstract string type — in fact, it is better to have both.

Similarly, a common argument is that text might come into the program in all manner of different encodings, and so abstracting away the string representation means programmers can’t handle the wide variety of encodings out there. That’s nonsense — of course even high-level languages need encoding functions. Python 3 handles this ideally — the string (str) data type has an ‘encode’ method that can encode an abstract Unicode string into a byte string using an encoding of your choice, while the byte string (bytes) data type has a ‘decode’ method that can interpret a byte string using a specified encoding, returning an abstract Unicode string. If you need to deal with input in, say, Latin-1, you should not allow the text to be stored in Latin-1 inside your program — then it will be incompatible with any other non-Latin-1 strings. The correct approach is to take input as a byte string, then immediately decode it into an abstract Unicode string, so that it can be mixed with other strings in your program (which may have non-Latin characters).

What about non-Unicode characters?

There’s one last thorny issue: it’s all well and good to say “strings should just be Unicode,” but what about characters that cannot be represented in Unicode? “Like what,” you may ask. “Surely all of the characters from other encodings have made their way into Unicode by now.” Well, unfortunately, there is a controversial topic called Han unification, which I won’t go into details about here. Essentially, some languages (notably Japanese) have borrowed characters from Chinese and changed their appearance over thousands of years. The Unicode consortium officially considers these borrowed characters as being the same as the original Chinese, but with a different appearance (much like a serif versus a sans-serif font). But many Japanese speakers consider them to be separate to the Chinese characters, and Japanese character encodings reflect this. As a result, Japanese speakers may want to use a different character set than Unicode.

This is unfortunate, because it means that a programming language designed for Japanese users needs to support multiple encodings and expose them to the programmer. As Ruby has a large Japanese community, this explains why Ruby 1.9 added explicit encodings to strings, instead of going to all-Unicode as Python did. I personally find that the advantages of a single high-level character sequence interface outweigh the need to change character sets, but I wanted to mention this issue anyway, by way of justifying Ruby’s choice of string representation.

What about combining characters?

I’ve faced the argument that my proposed solution doesn’t go far enough in abstracting over a sequence of characters. I am suggesting that strings be an abstract sequence of characters, but the argument has been put to me that characters (as Unicode defines them) are not the most logical unit of text — combining character sequences (CCSes) are.

For example, consider the string “𐒁̸”, which consists of two characters: ‘𐒁’ (U+10481 OSMANYA LETTER BA) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY). What is the length of this string?

  • Is it 6? That’s the number of UTF-8 bytes in the string (4 + 2).
  • Is it 3? That’s the number of UTF-16 code units in the string (2 + 1).
  • Is it 2? That’s the number of characters (code points) in the string.
  • Is it 1? That’s the number of combining character sequences in the string.

My argument is that the first two answers are unacceptable, because they expose implementation details to the programmer (who is then likely to pass them on to the user) — I am in favour of the third answer (2). The argument put to me is that the third answer is also unacceptable, because most people are going to think of “𐒁̸” as a single “character”, regardless of what the Unicode consortium thinks. I think that either the third or fourth answers are acceptable, but the third is a better compromise, because we can count characters in constant time (just use UTF-32), whereas it is not possible to count CCSes in constant time.

I know, I know, I said above that performance arguments don’t count. But I see a much less compelling case for counting CCSes over characters than for counting characters over code units.

Firstly, users are already exposed to combining characters, but they aren’t exposed to code units. We typically expect users to type the combining characters separately on a keyboard. For instance, to type the CCS “e̸”, the user would type ‘e’ and then type the code for ‘◌̸’. Conversely, we never expect users to type individual UTF-8 or UTF-16 code units — the representation of characters is entirely abstract from the user. Therefore, the user can understand when Twitter, for example, counts “e̸” as two characters. Most users would not understand if Twitter were to count “汉” as three characters.

Secondly, dealing with characters is sufficient to abstract over the encoding implementation details. If a language deals with code units, a program written for a UTF-8 build may behave differently if compiled with a UTF-16 string representation. A language that deals with characters is independent of the encoding. Dealing with CCSes doesn’t add any further implementation independence. Similarly, dealing with characters is enough to allow strings to be spliced, re-encoded, and re-combined on arbitrary boundaries; dealing with CCSes isn’t required for this either.

Thirdly, since Unicode allows arbitrarily long combining character sequences, dealing with CCSes could introduce subtle exploits into a lot of code. If Twitter counted CCSes, a user could write a multi-megabyte tweet by adding hundreds of combining characters onto a single base character.

In any case, I’m happy to have a further debate about whether we should be counting characters or CCSes — my goal here is to convince you that we should not be counting code units. Lastly, I’ll point out that Python 3 (in UCS-4 build, or from version 3.3 onwards) and Haskell both consider the above string to have a length of 2, as does Twitter. I don’t know of any programming languages that count CCSes. Twitter’s page on counting characters goes into some detail about this.

Conclusion

My point is this: the Unicode Consortium has provided us with a wonderful standard for representing and communicating characters from all of the world’s scripts, but most modern languages needlessly expose the details of how the characters are encoded. This means that all programmers need to become Unicode experts to make high-quality internationalised software, which means it is far less likely that programs will work when non-ASCII characters, or even worse, astral characters, are used. Since most English programmers don’t test with non-ASCII characters, and almost nobody tests with astral characters, this makes it pretty unlikely that software will ever see such inputs before it is released.

Programming language designers need to become Unicode experts, so that regular programmers don’t have to. The next generation of languages should provide character-oriented string operations only (except where the programmer explicitly asks for text to be encoded). Then the rest of us can get back to programming, instead of worrying about encoding issues.

Articles

What the heck is a character, anyway?

In Unicode on April 7, 2012 by Matt Giuca

Note: This article may not display properly if your browser has no font installed with some special Unicode characters.

I’m writing another rather lengthy Unicode post, but before I finish it, I want to clear up a terminology problem I ran into, as I seem not to be the only one. This post is about clearing up the definition of the word character, with respect to the Unicode standard.

Let me briefly define the problem. It’s pretty obvious that ‘a’ (U+0061 LATIN SMALL LETTER A) is a character, as is ‘∞’ (U+221E INFINITY). But things get more complicated when we consider combining characters. When I type ‘∞’ (U+221E INFINITY) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY), I get the single thing “∞̸”. Is this thing a character? If so, what do we call the two things that we combined to make this character?

This problem becomes quite important when we need to count the characters in a string (or index into the string by character). How many characters are in the string “∞̸”, one or two? It depends on whether you use the word character to refer to the ‘∞’ and the ‘◌̸’, or to refer to the combined “∞̸”. There’s some ambiguity here. How does the Unicode standard resolve it?

I have commonly seen the following terminology used to resolve this conflict: code point refers to the individual things being combined (the ‘∞’ and the ‘◌̸’), while character refers to the combined thing (the “∞̸”). So the above string has two code points, but only one character. This interpretation is wrong.

Firstly, this is quite unsatisfactory terminology because it misuses the term code point. Even if we ignore combining characters, code point is still not a synonym for character. A code point is just an integer, usually expressed in U+xxxx notation. Consider plain old ASCII. It is often said that “a character is a byte,” but this is not strictly true, even in ASCII land: ‘a’ is a character; 0x61 is a byte. Just because there is a one-to-one correspondence between characters and (7-bit) bytes does not mean that character and byte are synonyms. ASCII describes the encoding between characters and bytes. Similarly, in Unicode land, character and code point are not synonyms: ‘∞’ is a character; U+221E is a code point. Unicode describes the encoding between characters and code points. So please do not say that ‘∞’ is a code point. This leaves us with a gap in our terminology: if “∞̸” is a character, then what term, if not code point, do we use to describe the ‘∞’ and ‘◌̸’ that combined to produce it?

The correct answer (as I read the spec — it’s not exactly clear) is that “∞̸” is not a character. Even though in colloquial speech we may call it that, in Unicode, a character is something that has a corresponding code point. Every character in Unicode has a code point: ‘∞’ is a character, having code point U+221E; ‘◌̸’ is a (combining) character, having code point U+0338. Since “∞̸” doesn’t have a code point of its own, it isn’t a character. So what is it?

It’s a combining character sequence. The Unicode glossary defines a combining character sequence as:

A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character…

So please do not call “∞̸” a character. If it consists of more than one code point, then it is more than a single character.

Note that, formally, we should use the term abstract character for what I call a character above. The term “character” is just short-hand for “abstract character” — the “abstract” prefix does not imply a combining character sequence; it is just used to distinguish the word “character” from its various other colloquial meanings. I should also point out that when I say an abstract character is something that has a code point, it doesn’t have to be a unique code point. In rare cases, an abstract character can have several equivalent encodings. For example, the abstract character ‘Å’ can be encoded as U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, or as U+212B ANGSTROM SIGN (for historical reasons). The important point is that it can be encoded with a single code point.

This is all the more confusing because it is possible to represent certain abstract characters using combining characters. For example, the characters ‘A’ (U+0041 LATIN CAPITAL LETTER A) and ‘◌̊’ (U+030A COMBINING RING ABOVE) together form the combining character sequence “Å”, which is canonically equivalent to the abstract character ‘Å’ (even though it might display a little bit differently in your browser). In these cases, the abstract character ‘Å’ can be said to be encoded as the code point sequence U+0041 U+030A. But this doesn’t mean in general that you can call a combining sequence a “character” — rather, it means that in certain cases, there is a character that is equivalent to a combining sequence. (This particular example is given in Figure 2-8 of Chapter 2 of The Unicode Standard 6.0 [PDF].) The reason I have used a less common example, “∞̸”, is that there is no equivalent single character (to my knowledge), and therefore “∞̸” can only be called a combining character sequence, and not a character.

In summary:

  • A character is something that can be encoded with a single code point (integer). When using combining characters, each of the individual combining elements is a separate character.
  • A code point is an integer, which identifies a character.
  • A combining character sequence is the result of combining a base (normal) character with one or more combining characters.

Articles

How do you escape a complete URI?

In JavaScript, Python, Unicode, URI, Web development on February 12, 2012 by Matt Giuca

This question comes up quite a lot: “I have a URI. How do I percent-encode it?” In this post, I want to explore this question fully, because it is a hard question, and one that, on the face of it, is nonsense. The short answer to this question is: “You can’t. If you haven’t percent-encoded it yet, you’re too late.” But there are some specific reasons why you might want to do this, so read on.

Let’s get at the meaning of this question. I’ll use the term URI (Uniform Resource Identifier), but I’m mostly talking about URLs (a URI is just a bit more general). A URI is a complete identifier of a resource, such as:

http://example.com/admin/login?name=Helen&gender=f

A URI component is any of the individual atomic pieces of the URI, such as “example.com”, “admin”, “login”, “name”, “Helen”, “gender” or “f”. The components are the parts that the URI syntax has no interest in parsing; they represent plain text strings. Now the problem comes when we encounter a URI such as this:

http://example.com/admin/login?name=Helen Ødegård&gender=f

This isn’t a legal URI because the last query argument “Helen Ødegård” is not percent-encoded — the space (U+0020, meaning that this is the Unicode character with hexadecimal value 20), as well as the non-ASCII characters ‘Ø’ (U+00D8) and ‘å’ (U+00E5) are forbidden in any URI. So the answer to “can’t we fix this?” is “yes” — we can retroactively percent-encode the URI so it appears like this:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

A digression: note that the Unicode characters were first encoded to bytes with UTF-8: ‘Ø’ (U+00D8) encodes to the byte sequence hex C3 98, which then percent-encodes to %C3%98. This is not actually part of the standard: none of the URI standards (the most recent being RFC 3986) specify how a non-ASCII character is to be converted into bytes. I could also have encoded them using Latin-1: “Helen%20%D8deg%E5rd,” but then I couldn’t support non-European scripts. This is a mess, but it isn’t the subject of this article, and the world mostly gets along fine by using UTF-8, which I’ll assume we’re using for the rest of this article.

Okay, so that’s solved, but will it work in all cases? How about this URI:

http://example.com/admin/login?redirect=http://example.com/news#funny&name=Helen&gender=f

Clearly, a human looking at this can tell that the value of the “redirect” argument is “http://example.com/news#funny”, which means that the “#” (U+0023) needs to be percent-encoded as “%23″:

http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen&gender=f

But how did we know to encode the “#”? What if whoever typed this URI genuinely meant for there to be a query of “redirect=http://example.com/news” and a fragment of “funny&name=Helen&gender=f”. It is wrong for us to meddle with the given URI, assuming that the “#” was intended to be a literal character and not a delimiter. The answer to “can we fix it?” is “no“. Fixing the above URI would only introduce bugs. The answer is “if you wanted that ‘#’ to be interpreted literally, you should have encoded it before you stuck it in the URI.”

The idea that you can:

  1. Take a bunch of URI components (as bare strings),
  2. Concatenate them together using URI delimiters (such as “?”, “&” and “#”),
  3. Percent-encode the URI.

is nonsense, because once you have done step #2, you cannot possibly know (in general) which characters were part of the original URI components, and which are delimiters. Instead, error-free software must:

  1. Take a bunch of URI components (as bare strings),
  2. Percent-encode each individual URI component,
  3. Concatenate them together using URI delimiters (such as “?”, “&” and “#”).

This is why I previously recommended never using JavaScript’s encodeURI function, and instead to use encodeURIComponent. The encodeURI function “assumes that the URI is a complete URI” — it is designed to perform step #3 in the bad algorithm above, which by definition, is meaningless. The encodeURI function will not encode the “#” character, because it might be a delimiter, so it would fail to interpret the above example in its intended meaning.

The encodeURIComponent function, on the other hand, is designed to be called on the individual components of the URI before they are concatenated together — step #2 of the correct algorithm above. Calling that function on just the component “http://example.com/news#funny” would produce:

http%3A%2F%2Fexample.com%2Fnews%23funny

which is a bit of overkill (the “:” and “/” characters do not strictly need to be encoded in a query parameter), but perfectly valid — when the data reaches the other end it will be decoded back into the original string.

So having said all of that, is there any legitimate need to break the rule and percent-encode a complete URI?

URI cleaning

Well, yes there is. (I have been bullish in the past that there isn’t, such as in my answer to this question on Stack Overflow, so this post is me reconsidering that position a bit.) It happens all the time: in your browser’s address bar. If you type this URL into the address bar:

http://example.com/admin/login?name=Helen Ødegård&gender=f

it is not typically an error. Most browsers will automatically “clean up” the URL and send an HTTP request to the server with the line:

GET /admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f HTTP/1.1

(Unfortunately, IANA will redirect this immediately, but if you inspect the packets or try it on a server you control, then check the logs, you will see this is true.) Most browsers don’t show you that they’re cleaning up the URIs — they attempt to display them as nicely as possible (in fact, if you type in the escaped version, Firefox will automatically convert it so you can see the space and Unicode characters in the address bar).

Does this mean that we can relax and encode our URIs after composing them? No. This is an application of Postel’s Law (“Be liberal in what you accept, and conservative in what you send.”) The browser’s address bar, being a human interface mechanism, is helpfully attempting to take an invalid URI and make it valid, with no guarantee of success. It wouldn’t help on my second example (“redirect=http://example.com/news#funny”). I think this is a great idea, because it lets users type spaces and Unicode characters into the address bar, and it isn’t necessarily a bad idea for other software to do it too, particularly where user interfaces are concerned. As long as the software is not relying on it.

In other words, software should not use this technique to construct URIs internally. It should only ever use this technique to attempt to “clean up” URIs that have been supplied from an external source.

So that is the point of JavaScript’s encodeURI function. I don’t like to call this “encoding” because that implies it is taking something unencoded and converting it into an encoded form. I prefer to call this “URI cleaning”. That name is suggestive of the actual process: taking a complete URI and cleaning it up a bit.

Unfortunately (as pointed out by Tim Cuthbertson), encodeURI is not quite good for this purpose — it encodes the ‘%’ character, meaning it will double-escape any URI that already has percent-escaped content. More on that later.

We can formalise this process by describing a new type of object called a “super URI.” A super URI is a sequence of Unicode characters with the following properties:

  1. Any character that is in any way valid in a URI is interpreted as normal for a URI,
  2. Any other character is interpreted as a URI would interpret the sequence of characters resulting from percent-encoding the UTF-8 encoding of the character (or some other character encoding scheme).

Now it becomes clear what we are doing: URI cleaning is simply the process of transforming a super URI into a normal URI. In this light, rather than saying that the string:

http://example.com/admin/login?name=Helen Ødegård&gender=f

is “some kind of malformed URI,” we can say it is a super URI, which is equivalent to the normal URI:

http://example.com/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that super URIs have nothing to do with not percent-encoding of delimiter characters — delimiters such as “#” must still be percent-escaped in the super URI. They are only about not percent-encoding invalid characters. We can consider super URIs to be human-readable syntax, while proper URIs are required for data transmission. This means that we can also take a proper URI and convert it into a more human-readable super URI for display purposes (as web browsers do). That is the purpose of JavaScript’s decodeURI function. Note that, again, I don’t consider this to be “decoding,” rather, “pretty printing.” It doesn’t promise not to show you percent-encoded characters. It only decodes characters that are illegal in normal URIs.

It is probably a good idea for most applications that want to “pretty print” a URI to not decode control characters (U+0000 — U+001F and U+007F), to avoid printing garbage and newlines. Note that decodeURI does decode these characters, so it is probably unwise to use it for display purposes without some post-processing.

Update: My “super URI” concept is similar to the formally specified IRI (Internationalized Resource Identifier) — basically, a URI that can have non-ASCII characters. However, my “super URIs” also allow other ASCII characters that are illegal in URLs.

Which characters?

Okay, so exactly which characters should be escaped for this URI cleaning operation? I thought I’d take the opportunity to break down the different sets of characters described by the URI specification. I will address two versions of the specification: RFC 2396 (published in 1998) and RFC 3986 (published in 2005). 2396 is obsoleted by 3986, but since a lot of encoding functions (including JavaScript’s) were invented before 2005, it gives us a good historical explanation for their behaviour.

RFC 2396

This specification defines two sets of characters: reserved and unreserved.

  • The reserved characters are: $&+,/:;=?@
  • The unreserved characters are: ALPHA and NUM and !'()*-._~

Where ALPHA and NUM are the ASCII alphabetic and numeric characters, respectively. (They do not include non-ASCII characters.)

There is a semantic difference between reserved and unreserved characters. Reserved characters may have a syntactic meaning in the URI syntax, and so if one of them is to appear as a literal character in a URI component, it may need to be escaped. (This will depend upon context — a literal ‘?’ in a path component will need to be escaped, whereas a ‘?’ in a query does not need to be escaped.) Unreserved characters do not have a syntactic meaning in the URI syntax, and never need to be escaped. A corollary to this is that the escaping or unescaping an unreserved character does not change its meaning (“Z” means the same as “%5A”; “~” means the same as “%7E”), but escaping or unescaping a reserved character might change its meaning (“?” may have a different meaning to “%3F”).

The URI component encoding process should percent-encode all characters that are not unreserved. It is safe to escape unreserved characters as well, but not necessary and generally not preferable.

Together, these two sets comprise the valid URI characters, along with two other characters: ‘%’, used for encoding, and ‘#’, used to delimit the fragment (the ‘#’ and fragment were not considered to be part of the URI). I would suggest that both ‘%’ and ‘#’ be treated as reserved characters. All other characters are illegal. The complete set of illegal characters, under this specification, follows:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: “<>[\]^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

The URI cleaning process should percent-encode precisely this set of characters: no more and no less.

RFC 3986

The updated URI specification from 2005 makes a number of changes, both to the way characters are grouped, and to the sets themselves. The reserved and unreserved sets are now as follows:

  • The reserved characters are: !#$&'()*+,/:;=?@[]
  • The unreserved characters are: ALPHA and NUM and -._~

This version features ‘#’ as a reserved character, because fragments are now considered part of the URI proper. There are two more important additions to the restricted set. Firstly, the characters “!'()*” have been moved from unreserved to reserved, because they are “typically unsafe to decode.” This means that, while these characters are still technically legal in a URI, their encoded form may be interpreted differently to their bare form, so encoding a URI component should encode these characters. Note that this is different than banning them from URIs altogether (for example, a “javascript:” URI is allowed to contain bare parentheses, and that scheme simply chooses not to distinguish between “(” and “%28″). Secondly, the characters ‘[‘ and ‘]’ have been moved from illegal to reserved. As of 2005, URIs are allowed to contain square brackets. This unfortunate change was made to allow IPv6 addresses in the host part of a URI. However, note that they are only allowed in the host, and not anywhere else in the URI.

The reserved characters were also split into two sets, gen-delims and sub-delims:

  • The gen-delims are: #/:?@[]
  • The sub-delims are: !$&'()*+,;=

The sub-delims are allowed to appear anywhere in a URI (although, as reserved characters, their meaning may be interpreted differently if they are unescaped). The gen-delims are the important top-level syntactic markers used to delimit the fields of the URI. The gen-delims are assigned meaning by the URI syntax, while the sub-delims are assigned meaning by the scheme. This means that, depending on the scheme, sub-delims may be considered unreserved. For example, a program that encodes a JavaScript program into a “javascript:” URI does not need to encode the sub-delims, because JavaScript will interpret them the same whether they are encoded or not (such a program would need to encode illegal characters such as space, and gen-delims such as ‘?’, but not sub-delims). The gen-delims may also be considered unreserved in certain contexts — for example, in the query part of a URI, the ‘?’ is allowed to appear bare and will generally mean the same thing as “%3F”. However, it is not guaranteed to compare equal: under the Percent-Encoding Normalization rule, encoded and bare versions of unreserved characters must be considered equivalent, but this is not the case for reserved characters.

Taking the square brackets out of the illegal set leaves us with the following illegal characters:

  • The ASCII control characters (U+0000 — U+001F and U+007F)
  • The space character
  • The characters: “<>\^`{|}
  • Non-ASCII characters (U+0080 — U+10FFFD)

A modern URI cleaning function must encode only the above characters. This means that any URI cleaning function written before 2005 (hint: encodeURI) will encode square brackets! That’s bad, because it means that a URI with an IPv6 address:

http://[2001:db8:85a3:8d3:1319:8a2e:370:7348]/admin/login?name=Helen Ødegård&gender=f

would be cleaned as:

http://%5B2001:db8:85a3:8d3:1319:8a2e:370:7348%5D/admin/login?name=Helen%20%C3%98deg%C3%A5rd&gender=f

which refers to the domain name “[2001:db8:85a3:8d3:1319:8a2e:370:7348]” (not the IPv6 address). Mozilla’s reference on encodeURI contains a work-around that ensures that square brackets are not encoded. (Note that this still double-escapes ‘%’ characters, so it isn’t good for URI cleaning.)

So what exactly should I do?

If you are building a URI programmatically, you must encode each component individually before composing them.

  • Escape the following characters: space and !”#$%&'()*+,/:;<=>?@[\]^`{|} and U+0000 — U+001F and U+007F and greater.
  • Do not escape ASCII alphanumeric characters or -._~ (although it doesn’t matter if you do).
  • If you have specific knowledge about how the component will be used, you can relax the encoding of certain characters (for example, in a query, you may leave ‘?’ bare; in a “javascript:” URI, you may leave all sub-delims bare). Bear in mind that this could impact the equivalence of URIs.

If you are parsing a URI component, you should unescape any percent-encoded sequence (this is safe, as ‘%’ characters are not allowed to appear bare in a URI).

If you are “cleaning up” a URI that someone has given you:

  • Escape the following characters: space and “<>\^`{|} and U+0000 — U+001F and U+007F and greater.
  • You may (but shouldn’t) escape ASCII alphanumeric characters or -._~ (if you really want to; it will do no harm).
  • You must not escape the following characters: !#$%&'()*+,/:;=?@[]
  • For an advanced URI cleaning, you may also fix any other syntax errors in an appropriate way (for example, a ‘[‘ in the path segment may be encoded, as may a ‘%’ in an invalid percent sequence).
  • An advanced URI cleaner may be able to escape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

If you are “pretty printing” a URI and want to display escaped characters as bare, where possible:

  • Unescape the following characters: space and “-.<>\^_`{|}~ and ASCII alphanumeric characters and U+0080 and greater.
  • It is probably not wise to unescape U+0000 — U+001F and U+007F, as they are control characters that could cause display problems (and there may be other Unicode characters with similar problems.)
  • You must not unescape the following characters: !#$%&'()*+,/:;=?@[]
  • An advanced URI printer may be able to unescape some reserved characters in certain contexts. Bear in mind that this could impact the equivalence of URIs.

These four activities roughly correspond to JavaScript’s encodeURIComponent, decodeURIComponent, encodeURI and decodeURI functions, respectively. In the next section, we look at how they differ.

Some implementations

JavaScript

As I stated earlier, never use escape. First, it is not properly specified. In Firefox and Chrome, it encodes all characters other than the following: *+-./@_. This makes it unsuitable for URI construction and cleaning. It encodes the unreserved character ‘~’ (which is harmless, but unnecessary), and it leaves the reserved characters ‘*’, ‘+’, ‘/’ and ‘@’ bare, which can be problematic. Worse, it encodes Latin-1 characters with Latin-1 (instead of UTF-8) — not technically a violation of the spec, but likely to be misinterpreted, and even worse, it encodes characters above U+00FF with the malformed syntax “%uxxxx”. Avoid.

JavaScript’s “fixed” URI encoding functions behave according to RFC 2396, and assuming Unicode characters are to be encoded with UTF-8. This means that they are lacking the 2005 changes:

  • encodeURIComponent does not escape the previously-unreserved characters ‘!’, “‘”, “(“, “)” and “*”. Mozilla’s reference includes a work-around for this.
  • decodeURIComponent still works fine.
  • encodeURI erroneously escapes the previously-illegal characters ‘[‘ and ‘]’. Mozilla’s reference includes a work-around for this.
  • decodeURI erroneously unescapes ‘[‘ and ‘]’ (although there doesn’t seem to be a practical case where this is a problem).

Edit: Unfortunately, encodeURI and decodeURI have a single, critical flaw: they escape and unescape (respectively) percent signs (‘%’), which means they can’t be used to clean a URI. (Thanks to Tim Cuthbertson for pointing this out.) For example, assume we wanted to clean the URI:

http://example.com/admin/login?redirect=http://example.com/news%23funny&name=Helen Ødegård&gender=f

This URI has the ‘#’ escaped already, because no URI cleaner can turn a ‘#’ into a “%23″, but it doesn’t have the space or Unicode characters escaped. Passing this to encodeURI produces:

http://example.com/admin/login?redirect=http://example.com/news%2523funny&name=Helen%20%C3%98deg%C3%A5rd&gender=f

Note that the “%23″ has been double-escaped so it reads “%2523″ — completely wrong! We can fix this by extending Mozilla’s work-around to also correct against double-escaped percent characters:

function fixedEncodeURI(str) {
    return encodeURI(str).replace(/%25/g, '%').replace(/%5[Bb]/g, '[').replace(/%5[Dd]/g, ']');
}

Note that decodeURI is similarly broken. The fixed version follows:

function fixedDecodeURI(str) {
    return decodeURI(str.replace(/%25/g, '%2525').replace(/%5[Bb]/g, '%255B').replace(/%5[Dd]/g, '%255D'));
}

Edit: Fixed fixedEncodeURI and fixedDecodeURI so they work on lowercase escape codes. (Thanks to Tim Cuthbertson for pointing this out.)

Python

Python 2’s urllib.quote and urllib.unquote functions perform URI component encoding and decoding on byte strings (non-Unicode).

  • urllib.quote works as I specify above, except that it does escape ‘~’, and does not escape ‘/’. This can be overridden by supplying safe=’~’.
  • urllib.unquote works as expected, returning a byte string.

Note that these do not work properly at all on Unicode strings — you should first encode the string using UTF-8 before passing it to urllib.quote.

In Python 3, the quote and unquote functions have been moved into the urllib.parse module, and upgraded to work on Unicode strings (by me — yay!). By default, these will encode and decode strings as UTF-8, but this can be changed with the encoding and errors parameters (see urllib.parse.quote and urllib.parse.unquote).

I don’t know of any Python built-in functions for doing URI cleaning, but urllib.quote can easily be used for this purpose by passing safe=”!#$%&'()*+,/:;=?@[]~” (the set of reserved characters, as well as ‘%’ and ‘~'; note that alphanumeric characters, and ‘-‘, ‘.’ and ‘_’ are always safe in Python).

Mozilla Firefox

Firefox 10’s URL bar performs URL cleaning, allowing the user to type in URLs with illegal characters, and automatically converting them to correct URLs. It escapes the following characters:

  • space, “‘<>` and U+0000 — U+0001F and U+007F and greater. (Note that this includes the double and single quote.)
  • Note that the control characters for NUL, tab, newline and carriage return don’t actually transmit.

I would say this is erroneous: on a minor note, it should not be escaping the single quote, as that is a reserved character. It also fails to escape the following illegal characters: \^{|}, sending them to the server bare.

Firefox also “prettifies” any URI, decoding most of the percent-escape sequences for the characters that it knows how to encode.

Google Chrome

Chrome 16’s URL bar also performs URL cleaning. It is rather similar to Firefox, but encoding the following characters:

  • space, “<> and U+0000 — U+0001F and U+007F and greater. (Note that this includes only the double quote.)

So Chrome also fails to escape the illegal characters \^`{|} (including the backtick, which Firefox escapes correctly), but unlike Firefox, it does not erroneously escape the single quote.

Articles

SOPA Strike Blackout JavaScript snippet

In JavaScript on January 18, 2012 by Matt Giuca

Websites around the world are preparing to go dark tomorrow to protest the United States’ proposed SOPA act that will present a legal threat to websites worldwide in an overblown measure to stop piracy. The website sopastrike.com, run by Fight For The Future, is doing a good job of organising the protest (initially proposed by Reddit) by suggesting that webmasters place a JavaScript snippet in their HTML pages that will redirect visitors to http://sopastrike.com/strike/ for a twelve-hour time window between 8AM and 8PM US Eastern Standard Time. Unfortunately, there is a time zone bug in this code which will render the strike wholly or partially ineffective for non-American visitors.

If you are planning to use this code to “strike” your website, please read this post and update your code accordingly. I have contacted Fight For The Future about this bug. TL;DR, here is the correct code you should use:

<script type="text/javascript">var a=new Date,b=a.getUTCHours();if(0==a.getUTCMonth()&&2012==a.getUTCFullYear()&&((18==a.getUTCDate()&&13<=b)||(19==a.getUTCDate()&&0>=b)))window.location="http://sopastrike.com/strike";</script>

The original code (found on http://sopastrike.com) reads as follows (do not use this code):

var a=new Date,b=a.getHours()+a.getTimezoneOffset()/60;if(18==a.getDate()&&0==a.getMonth()&&2012==a.getFullYear()&&13=b)window.location="http://sopastrike.com/strike";

This will only work properly in North and South America (UTC-1 or further west). It will fail to redirect during some or all of the time window in Europe, Africa, Asia and Oceania (UTC or further east), and won’t work at all in Eastern Australia (where I live) or New Zealand. :(

The stated goal of the code is to activate the “strike” redirect between 8:00AM and 7:59PM EST (which is between 1300, January 18 and 0059, January 19, UTC). But the code does a bad hand-conversion from local time to UTC — it converts only the hour and not the date. The precise logic is:

“If the date in local time is 18 January, 2012 and the time in UTC is between 1300 and 2459 inclusive, then redirect.”

So wherever you are in the world, if it’s past midnight at the end of January 18, the strike doesn’t happen.

Folks in the United Kingdom at GMT (UTC) will miss the final hour of the strike, because at midnight, getDate() will tick over to 19. Anywhere east of the UK will see increasingly fewer hours of the strike. In AEDT (UTC+11), the strike is scheduled to start at precisely midnight, January 19, local time. So it won’t work for us at all, or anybody east of Australia.

As JavaScript provides explicit UTC conversion methods, I fixed the code by using those for all of the date fields. The following code has been tested at various relevant times in US EST (UTC-5) and AEDT (UTC+11) (by adjusting my computer’s clock and time zone), and works correctly in Firefox and Chrome (I have not tested it in Internet Explorer, but W3Schools says that all major browsers support the UTC methods):

<script type="text/javascript">var a=new Date,b=a.getUTCHours();if(0==a.getUTCMonth()&&2012==a.getUTCFullYear()&&((18==a.getUTCDate()&&13<=b)||(19==a.getUTCDate()&&0>=b)))window.location="http://sopastrike.com/strike";</script>

Note that this code is a bit more complex because it has to explicitly handle the final hour which is on January 19 in UTC. The precise logic for my code is this:

“If the month in UTC is January, 2012, and either the date in UTC is 18 and the time in UTC is 1300 or later, or the date in UTC is 19 and the time in UTC is 0059 or earlier, then redirect.”

I will be blacking out my WordPress blog during this time (not using this JavaScript; WordPress has a built-in SOPA protest option, which is awesome). See you on the other side!

Articles

2011 in review

In Uncategorized on January 1, 2012 by Matt Giuca

The WordPress.com stats helper monkeys prepared a 2011 annual report for this blog.

Here’s an excerpt:

The concert hall at the Syndey Opera House holds 2,700 people. This blog was viewed about 27,000 times in 2011. If it were a concert at Sydney Opera House, it would take about 10 sold-out performances for that many people to see it.

Click here to see the complete report.

Follow

Get every new post delivered to your Inbox.