What the heck is a character, anyway?

In Unicode on April 7, 2012 by Matt Giuca

Note: This article may not display properly if your browser lacks a font that covers some of the special Unicode characters used here.

I’m writing another rather lengthy Unicode post, but before I finish it, I want to clear up a terminology problem I ran into, as I seem not to be the only one. This post is about clearing up the definition of the word character, with respect to the Unicode standard.

Let me briefly define the problem. It’s pretty obvious that ‘a’ (U+0061 LATIN SMALL LETTER A) is a character, as is ‘∞’ (U+221E INFINITY). But things get more complicated when we consider combining characters. When I type ‘∞’ (U+221E INFINITY) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY), I get the single thing “∞̸”. Is this thing a character? If so, what do we call the two things that we combined to make this character?

This problem becomes quite important when we need to count the characters in a string (or index into the string by character). How many characters are in the string “∞̸”, one or two? It depends on whether you use the word character to refer to the ‘∞’ and the ‘◌̸’, or to refer to the combined “∞̸”. There’s some ambiguity here. How does the Unicode standard resolve it?
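To make the counting question concrete, here is a small sketch of how one language happens to answer it. Python's built-in `len` on a `str` counts code points, and encoding the same string gives yet other counts. The variable names are mine, chosen for illustration only.

```python
# "\u221e\u0338" is INFINITY followed by COMBINING LONG SOLIDUS OVERLAY.
s = "\u221e\u0338"

code_points = len(s)                           # a Python str is a sequence of code points
utf8_bytes = len(s.encode("utf-8"))            # 3 bytes for U+221E plus 2 bytes for U+0338
utf16_units = len(s.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes

print(code_points, utf8_bytes, utf16_units)    # 2 5 2
```

None of these counts is 1, even though the string renders as a single symbol.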

I have commonly seen the following terminology used to resolve this conflict: code point refers to the individual things being combined (the ‘∞’ and the ‘◌̸’), while character refers to the combined thing (the “∞̸”). So the above string has two code points, but only one character. This interpretation is wrong.

Firstly, this is quite unsatisfactory terminology because it misuses the term code point. Even if we ignore combining characters, code point is still not a synonym for character. A code point is just an integer, usually expressed in U+xxxx notation. Consider plain old ASCII. It is often said that “a character is a byte,” but this is not strictly true, even in ASCII land: ‘a’ is a character; 0x61 is a byte. Just because there is a one-to-one correspondence between characters and (7-bit) bytes does not mean that character and byte are synonyms. ASCII describes the encoding between characters and bytes. Similarly, in Unicode land, character and code point are not synonyms: ‘∞’ is a character; U+221E is a code point. Unicode describes the encoding between characters and code points. So please do not say that ‘∞’ is a code point. This leaves us with a gap in our terminology: if “∞̸” is a character, then what term, if not code point, do we use to describe the ‘∞’ and ‘◌̸’ that combined to produce it?
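The distinction between a character and its code point can be sketched in Python, where `ord` maps a one-character string to its code point (an integer) and `chr` maps back. This is only an illustration using the standard library; the variable names are my own.

```python
import unicodedata

ch = "\u221e"                      # the character INFINITY
code_point = ord(ch)               # the integer that identifies it
label = f"U+{code_point:04X}"      # the usual U+xxxx notation
name = unicodedata.name(ch)        # the character's Unicode name

ascii_bytes = "a".encode("ascii")  # the character 'a' encoded as a single byte

print(label, name)                 # U+221E INFINITY
print(hex(ascii_bytes[0]))         # 0x61
```

In both cases the character is one thing and the number it is encoded as is another; `ord` and `encode` are the mappings between them.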

The correct answer (as I read the spec — it’s not exactly clear) is that “∞̸” is not a character. Even though in colloquial speech we may call it that, in Unicode, a character is something that has a corresponding code point. Every character in Unicode has a code point: ‘∞’ is a character, having code point U+221E; ‘◌̸’ is a (combining) character, having code point U+0338. Since “∞̸” doesn’t have a code point of its own, it isn’t a character. So what is it?

It’s a combining character sequence. The Unicode glossary defines a combining character sequence as:

A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character…

So please do not call “∞̸” a character. If it consists of more than one code point, then it is more than a single character.
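As a rough sketch of that definition, the following Python groups a string into a base character plus any combining characters that follow it. It is a simplification under stated assumptions: it treats any character whose General Category starts with "M" as combining, and it ignores the special handling of ZWJ and ZWNJ in the full definition. The function names `is_combining` and `combining_sequences` are hypothetical helpers, not part of any library.

```python
import unicodedata

def is_combining(ch):
    """True if ch is a combining mark (General Category Mn, Mc, or Me)."""
    return unicodedata.category(ch).startswith("M")

def combining_sequences(text):
    """Split text into a list of base-plus-combining-marks strings.

    Simplified: a combining mark with no preceding base becomes its own entry.
    """
    sequences = []
    for ch in text:
        if sequences and is_combining(ch):
            sequences[-1] += ch    # attach the mark to the preceding sequence
        else:
            sequences.append(ch)   # start a new sequence with a base character
    return sequences

parts = combining_sequences("a\u221e\u0338b")
print(len(parts))                  # 3
print([len(p) for p in parts])     # [1, 2, 1]
```

The middle entry is one combining character sequence made of two characters, which matches the count given above.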

Note that, formally, we should use the term abstract character for what I call a character above. The term “character” is just short-hand for “abstract character” — the “abstract” prefix does not imply a combining character sequence; it is just used to distinguish the word “character” from its various other colloquial meanings. I should also point out that when I say an abstract character is something that has a code point, it doesn’t have to be a unique code point. In rare cases, an abstract character can have several equivalent encodings. For example, the abstract character ‘Å’ can be encoded as U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, or as U+212B ANGSTROM SIGN (for historical reasons). The important point is that it can be encoded with a single code point.
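The ‘Å’ example can be checked with Python's standard `unicodedata` module. This sketch only shows that the two code points are distinct as data and are unified by normalization; the variable names are mine.

```python
import unicodedata

a_ring = "\u00c5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"    # ANGSTROM SIGN

print(unicodedata.name(a_ring))
print(unicodedata.name(angstrom))

# The two strings hold different code points, so they compare unequal.
same_string = a_ring == angstrom

# NFC normalization maps ANGSTROM SIGN to its canonical equivalent, U+00C5.
same_after_nfc = unicodedata.normalize("NFC", angstrom) == a_ring

print(same_string, same_after_nfc)   # False True
```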

This is all the more confusing because it is possible to represent certain abstract characters using combining characters. For example, the characters ‘A’ (U+0041 LATIN CAPITAL LETTER A) and ‘◌̊’ (U+030A COMBINING RING ABOVE) together form the combining character sequence “Å”, which is canonically equivalent to the abstract character ‘Å’ (even though it might display a little bit differently in your browser). In these cases, the abstract character ‘Å’ can be said to be encoded as the code point sequence U+0041 U+030A. But this doesn’t mean in general that you can call a combining sequence a “character” — rather, it means that in certain cases, there is a character that is equivalent to a combining sequence. (This particular example is given in Figure 2-8 of Chapter 2 of The Unicode Standard 6.0 [PDF].) The reason I have used a less common example, “∞̸”, is that there is no equivalent single character (to my knowledge), and therefore “∞̸” can only be called a combining character sequence, and not a character.
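The canonical equivalence described here can be observed with Unicode normalization in Python. The sketch below composes and decomposes ‘Å’, then tries to compose “∞̸”. That last result depends on the Unicode data shipped with your Python version; with the versions I am aware of, no single-code-point form exists, so the string keeps its two code points.

```python
import unicodedata

sequence = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

composed = unicodedata.normalize("NFC", sequence)      # composes to U+00C5
decomposed = unicodedata.normalize("NFD", "\u00c5")    # decomposes back to two code points

# INFINITY + COMBINING LONG SOLIDUS OVERLAY has no precomposed equivalent,
# so NFC leaves it as a two-code-point sequence.
not_infinity = unicodedata.normalize("NFC", "\u221e\u0338")

print(len(sequence), len(composed), len(decomposed), len(not_infinity))  # 2 1 2 2
```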

In summary:

  • A character is something that can be encoded with a single code point (integer). When using combining characters, each of the individual combining elements is a separate character.
  • A code point is an integer, which identifies a character.
  • A combining character sequence is the result of combining a base (normal) character with one or more combining characters.

2 Responses to “What the heck is a character, anyway?”

  1. [...] point U+10481 — I go into details on the difference between characters and code points in this blog post). In UTF-8, it is represented by the four byte sequence F0 90 92 81; code units are bytes, so we [...]

  2. Your interpretation is somewhat wrong. Unicode defines code points, abstract characters, *and* coded characters. It uses ‘character’ as a synonym for the latter, but outside the formal language of the Unicode standard, it is not wrong to call ‘abstract characters’ characters too, especially when they happen to be user-perceived characters. So, according to Unicode, both INFINITY and COMBINING LONG SOLIDUS OVERLAY are abstract characters, and when you encode them with the corresponding code points (U+221E and U+0338) these become coded characters. Now, the ‘∞̸’ thingy as a whole *is* an abstract character, and, since it doesn’t have a name, I may call it NOT INFINITE SIGN (note that Unicode does assign names to some abstract characters even though they can’t be represented by a single coded character, see NamedSequences in UCD). As a backup for my claim that sequences of combining characters are still considered to be a representation of a *single* abstract character, see §2.4 or §3.4, D11, 4th note. I also recommend that you read the summary of the definitions at http://utf8everywhere.org/#faq.glossary.

    Regarding “counting characters”: due to the above complexity, it’s not something that you ever need to do. The correct Unicode term for what you would really like to count (e.g. for cursor movement) is Grapheme Clusters. And indexing into the text… heh? Text access is inherently sequential.