Note: This article may not display properly if your browser has no font installed with some special Unicode characters.
I’m writing another rather lengthy Unicode post, but before I finish it, I want to clear up a terminology problem I ran into, as I seem not to be the only one. This post is about clearing up the definition of the word character, with respect to the Unicode standard.
Let me briefly define the problem. It’s pretty obvious that ‘a’ (U+0061 LATIN SMALL LETTER A) is a character, as is ‘∞’ (U+221E INFINITY). But things get more complicated when we consider combining characters. When I type ‘∞’ (U+221E INFINITY) followed by ‘◌̸’ (U+0338 COMBINING LONG SOLIDUS OVERLAY), I get the single thing “∞̸”. Is this thing a character? If so, what do we call the two things that we combined to make this character?
This problem becomes quite important when we need to count the characters in a string (or index into the string by character). How many characters are in the string “∞̸”, one or two? It depends on whether you use the word character to refer to the ‘∞’ and the ‘◌̸’, or to refer to the combined “∞̸”. There’s some ambiguity here. How does the Unicode standard resolve it?
I have commonly seen the following terminology used to resolve this conflict: code point refers to the individual things being combined (the ‘∞’ and the ‘◌̸’), while character refers to the combined thing (the “∞̸”). So the above string has two code points, but only one character. This interpretation is wrong.
Firstly, this is quite unsatisfactory terminology because it misuses the term code point. Even if we ignore combining characters, code point is still not a synonym for character. A code point is just an integer, usually expressed in U+xxxx notation. Consider plain old ASCII. It is often said that “a character is a byte,” but this is not strictly true, even in ASCII land: ‘a’ is a character; 0x61 is a byte. Just because there is a one-to-one correspondence between characters and (7-bit) bytes does not mean that character and byte are synonyms. ASCII describes the encoding between characters and bytes. Similarly, in Unicode land, character and code point are not synonyms: ‘∞’ is a character; U+221E is a code point. Unicode describes the encoding between characters and code points. So please do not say that ‘∞’ is a code point. This leaves us with a gap in our terminology: if “∞̸” is a character, then what term, if not code point, do we use to describe the ‘∞’ and ‘◌̸’ that combined to produce it?
The correct answer (as I read the spec — it’s not exactly clear) is that “∞̸” is not a character. Even though in colloquial speech we may call it that, in Unicode, a character is something that has a corresponding code point. Every character in Unicode has a code point: ‘∞’ is a character, having code point U+221E; ‘◌̸’ is a (combining) character, having code point U+0338. Since “∞̸” doesn’t have a code point of its own, it isn’t a character. So what is it?
It’s a combining character sequence. The Unicode glossary defines a combining character sequence as:
A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character…
So please do not call “∞̸” a character. If it consists of more than one code point, then it is more than a single character.
Note that, formally, we should use the term abstract character for what I call a character above. The term “character” is just short-hand for “abstract character” — the “abstract” prefix does not imply a combining character sequence; it is just used to distinguish the word “character” from its various other colloquial meanings. I should also point out that when I say an abstract character is something that has a code point, it doesn’t have to be a unique code point. In rare cases, an abstract character can have several equivalent encodings. For example, the abstract character ‘Å’ can be encoded as U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, or as U+212B ANGSTROM SIGN (for historical reasons). The important point is that it can be encoded with a single code point.
This is all the more confusing because it is possible to represent certain abstract characters using combining characters. For example, the characters ‘A’ (U+0041 LATIN CAPITAL LETTER A) and ‘◌̊’ (U+030A COMBINING RING ABOVE) together form the combining character sequence “Å”, which is canonically equivalent to the abstract character ‘Å’ (even though it might display a little bit differently in your browser). In these cases, the abstract character ‘Å’ can be said to be encoded as the code point sequence U+0041 U+030A. But this doesn’t mean in general that you can call a combining sequence a “character” — rather, it means that in certain cases, there is a character that is equivalent to a combining sequence. (This particular example is given in Figure 2-8 of Chapter 2 of The Unicode Standard 6.0 [PDF].) The reason I have used a less common example, “∞̸”, is that there is no equivalent single character (to my knowledge), and therefore “∞̸” can only be called a combining character sequence, and not a character.
- A character is something that can be encoded with a single code point (integer). When using combining characters, each of the individual combining elements is a separate character.
- A code point is an integer, which identifies a character.
- A combining character sequence is the result of combining a base (normal) character with one or more combining characters.