Unicode

Unicode Code Points
Unicode Planes
- Non-character Code Points
Special Characters

Jakob Jenkov
Last update: 2022-08-06

Unicode is a standard character set which is able to represent characters from most of the world's writing systems. Each character is identified by a Unicode code point. A code point is an integer value that uniquely identifies the given character. Unicode text can be stored using different encodings, like UTF-8 or UTF-16. Each encoding specifies, according to its own scheme, how a character's code point is represented as one or more bytes.

Unicode Code Points

As mentioned earlier, each Unicode character is represented by a Unicode code point, which is an integer value.

Code Point Number Interval

Code point values range from 0 to 10FFFF in hexadecimal notation (0 to 1,114,111 in decimal).

Code Point Textual Notation

When referring to a Unicode code point in writing, we write U+ followed by the hexadecimal representation of the code point, padded with leading zeros to at least four digits. For instance, the uppercase character A is written as U+0041. This notation is only used when referring to code points in text, though. It is not how the characters are stored as bytes.
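As a small illustration of the notation, here is a Python sketch that converts between a character and its U+ notation. The helper names `to_u_notation` and `from_u_notation` are made up for this example; only the built-in functions `ord()` and `chr()` come from Python itself.

```python
def to_u_notation(char: str) -> str:
    """Return the U+ notation for a single character."""
    code_point = ord(char)              # character -> integer code point
    return f"U+{code_point:04X}"        # uppercase hex, at least 4 digits


def from_u_notation(notation: str) -> str:
    """Return the character for a string such as 'U+0041'."""
    code_point = int(notation[2:], 16)  # drop the "U+" and parse the hex digits
    return chr(code_point)              # integer code point -> character


print(to_u_notation("A"))               # U+0041
print(from_u_notation("U+0041"))        # A
print(to_u_notation(chr(0x10FFFF)))     # U+10FFFF (the highest code point)
```

In Python, `chr()` raises a `ValueError` for values above 0x10FFFF, which matches the upper end of the code point interval.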

Unicode Text Consists of Code Point Sequences

To create text using Unicode characters, you use a sequence of Unicode code points. For instance, the sequence U+0041 U+0042 U+0043 makes up the text ABC.
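The ABC example can be reproduced in a few lines of Python. This is only a sketch; the variable names are chosen for illustration.

```python
# The code point sequence U+0041 U+0042 U+0043 as integers.
code_points = [0x0041, 0x0042, 0x0043]

# Turn each code point into a character and join them into a text.
text = "".join(chr(cp) for cp in code_points)
print(text)                        # ABC

# Going the other way: split the text back into its code points.
round_trip = [ord(ch) for ch in text]
print(round_trip == code_points)   # True
```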

Code Point Binary Encoding

At the byte level, Unicode characters (code points) are encoded differently from their textual notation. The uppercase character A does not need 6 bytes (the 6 ASCII characters in U+0041) when encoded as raw bytes. The exact number of bytes used depends on whether you are encoding using UTF-8, UTF-16 or some other encoding. Currently, UTF-8 is the most commonly used encoding for Unicode in text documents, JSON, HTML etc.
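To make the difference between encodings concrete, the following Python sketch encodes the character A with three encodings and reports how many bytes each one uses. The helper `byte_lengths` is a hypothetical name for this example, and the big-endian variants (`utf-16-be`, `utf-32-be`) are used so that no byte order mark is added to the output.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(text: str) -> dict:
    """Return the number of bytes needed for text in each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


for name in ENCODINGS:
    encoded = "A".encode(name)
    # Print the encoding name, the byte count and the bytes in hex.
    print(name, len(encoded), encoded.hex())

# utf-8 1 41
# utf-16-be 2 0041
# utf-32-be 4 00000041
```

None of the three encodings needs the 6 bytes that the textual notation U+0041 would take.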

Unicode Planes

Unicode code points are divided into sections called Unicode planes. The planes are numbered from 0 to 10 in hexadecimal (0 to 16 in decimal), meaning there are 17 Unicode planes in total. Each plane contains 65,536 code points (10000 in hexadecimal).

You can see which Unicode plane a given code point belongs to by writing the code point as 6 hexadecimal digits and looking at the first 2 digits. If a code point has fewer than 6 hexadecimal digits, add zeros in front of the number until it is 6 digits long.

As an example, the Unicode code point U+0041 becomes U+000041, of which the first two hexadecimal digits are 00. Thus the code point U+0041 belongs to Unicode plane 0.

By the same logic, the code point U+10FFFF is already 6 hexadecimal digits long, and thus does not need any zeros added in front of it. The first two hexadecimal digits are 10, which is 16 in decimal. Thus, the code point U+10FFFF belongs to Unicode plane 16.

Here are the Unicode planes listed with their hexadecimal prefix and their code point intervals (in hexadecimal too).

Hex Prefix	Code Point Interval
00	U+000000 - U+00FFFF
01	U+010000 - U+01FFFF
02	U+020000 - U+02FFFF
03	U+030000 - U+03FFFF
04	U+040000 - U+04FFFF
05	U+050000 - U+05FFFF
06	U+060000 - U+06FFFF
07	U+070000 - U+07FFFF
08	U+080000 - U+08FFFF
09	U+090000 - U+09FFFF
0A	U+0A0000 - U+0AFFFF
0B	U+0B0000 - U+0BFFFF
0C	U+0C0000 - U+0CFFFF
0D	U+0D0000 - U+0DFFFF
0E	U+0E0000 - U+0EFFFF
0F	U+0F0000 - U+0FFFFF
10	U+100000 - U+10FFFF
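
The plane lookup described above can be sketched in Python in two ways: by padding the code point to 6 hexadecimal digits and taking the first two, or by shifting away the lowest 16 bits. The function names `plane_prefix` and `plane_of` are made up for this example.

```python
def plane_prefix(code_point: int) -> str:
    """Return the 2-digit hex prefix, e.g. '00' for U+0041."""
    return f"{code_point:06X}"[:2]   # pad to 6 hex digits, keep the first 2


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16          # drop the last 4 hex digits (16 bits)


print(plane_prefix(0x0041), plane_of(0x0041))      # 00 0
print(plane_prefix(0x10FFFF), plane_of(0x10FFFF))  # 10 16
```

Both approaches give the same answer, because the last 4 hexadecimal digits correspond exactly to the lowest 16 bits of the code point.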

Non-character Code Points

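The last 2 code points of each Unicode plane (the ones ending in FFFE and FFFF, for instance U+00FFFE and U+00FFFF in plane 0) are non-characters. In addition, the interval U+FDD0 - U+FDEF in plane 0 consists of non-characters. Non-character code points are reserved for internal use and will never have characters assigned to them.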
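
A check for non-character code points could look like the Python sketch below. Besides the last 2 code points of each plane, it also covers the interval U+FDD0 - U+FDEF, which the Unicode standard reserves as non-characters too. The function name `is_noncharacter` is an assumption made for this example.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if the code point is a Unicode non-character."""
    # The last two code points of every plane end in FFFE or FFFF.
    ends_plane = (code_point & 0xFFFE) == 0xFFFE
    # An additional block of 32 non-characters in plane 0.
    in_fdd0_block = 0xFDD0 <= code_point <= 0xFDEF
    return ends_plane or in_fdd0_block


print(is_noncharacter(0x00FFFE))   # True
print(is_noncharacter(0x10FFFF))   # True
print(is_noncharacter(0x0041))     # False

# 17 planes * 2 code points + 32 code points in the U+FDD0 block
print(sum(is_noncharacter(cp) for cp in range(0x110000)))   # 66
```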

Special Characters

Unicode contains some special code points which do not represent textual characters. These non-textual code points are typically located in certain intervals of the Unicode value space. For instance:

Interval	Description
`U+000000 - U+00001F`	Control characters
`U+00007F - U+00009F`	Control characters
`U+00D800 - U+00DFFF`	Surrogates (used in pairs by UTF-16)
`U+00E000 - U+00F8FF`	Private use area
`U+0F0000 - U+0FFFFD`	Private use area
`U+100000 - U+10FFFD`	Private use area
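
Instead of hard-coding the intervals, you can ask Python's standard `unicodedata` module for the general category of a code point. The sketch below maps the categories `Cc` (control), `Cs` (surrogate) and `Co` (private use) to short descriptions. The `LABELS` table and the `describe` function are made up for this example.

```python
import unicodedata

# General categories for the special intervals listed above.
LABELS = {
    "Cc": "control character",
    "Cs": "surrogate",
    "Co": "private use",
}


def describe(code_point: int) -> str:
    """Return a short description of a special code point."""
    category = unicodedata.category(chr(code_point))
    return LABELS.get(category, "other")


print(describe(0x00000A))   # control character (line feed)
print(describe(0x00007F))   # control character (delete)
print(describe(0x00DC00))   # surrogate
print(describe(0x00E000))   # private use
print(describe(0x0F0000))   # private use
print(describe(0x000041))   # other (the letter A)
```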

Some Unicode code points are not standalone characters. Instead, these combining characters are combined with the preceding Unicode character to alter it. For instance, a character with an accent over it can be represented by the code point of the base character followed by the code point of the accent. Rather than being displayed as two characters, the two code points are combined into the first character with the accent displayed on top of it.
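
Here is a small Python sketch of an accent that follows its base character. It uses the letter e followed by U+0301 (combining acute accent) and the standard `unicodedata` module. The variable names are chosen for illustration.

```python
import unicodedata

# Base character "e" followed by the combining acute accent U+0301.
decomposed = "e\u0301"
print(len(decomposed))                   # 2 (two code points, one visible character)

# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301"))   # 230

# NFC normalization merges the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))                     # 1
print(composed == "\u00e9")              # True
```

Both forms are displayed the same way, but they are different code point sequences, so comparing them directly without normalization gives False.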

Private use areas have no characters assigned to them by the Unicode standard. You can assign your own characters to these code points in your own context (should you need to). Their meaning is then defined only by agreement between the parties exchanging the text, not by the standard.

