Unicode

Jakob Jenkov
Last update: 2022-08-06

Unicode is a character encoding standard that can represent characters from most of the world's languages and writing systems. Each character is identified by a Unicode code point, an integer value that uniquely identifies that character. Code points can in turn be stored using different encodings, like UTF-8 or UTF-16. Each encoding specifies, according to its own scheme, how a character's code point is represented as one or more bytes.

Unicode Code Points

As mentioned earlier, each unicode character is represented by a unicode code point which is an integer value.

Code Point Number Interval

The code point values range from 0 to 10FFFF in hexadecimal notation, which is 0 to 1,114,111 in decimal.

Code Point Textual Notation

When referring to a Unicode code point in writing, we write U+ followed by the hexadecimal representation of the code point, padded with leading zeros to at least 4 digits. For instance, the uppercase character A is represented as U+0041. This notation is only used when referring to code points in text, though. It is not how the code point is stored as bytes.
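As a small illustration of the notation, the sketch below converts between a character and its U+ form in Python. The helper names `to_u_notation` and `from_u_notation` are made up for this example and are not part of any library.

```python
def to_u_notation(ch: str) -> str:
    """Return the U+ notation for a single character, e.g. 'A' -> 'U+0041'."""
    # ord() gives the code point as an integer; 04X pads to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Return the character named by a string such as 'U+0041'."""
    # Skip the 'U+' prefix and parse the rest as a hexadecimal number.
    return chr(int(text[2:], 16))


print(to_u_notation("A"))         # U+0041
print(from_u_notation("U+0041"))  # A
```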

Unicode Text Consists of Code Point Sequences

To create a text using unicode characters you use a sequence of unicode code points. For instance, the sequence U+0041 U+0042 U+0043 makes up the text ABC.
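The ABC example above can be reproduced in a few lines of Python. This is only a sketch; the variable names are chosen for illustration.

```python
# The three code points from the example, written as hexadecimal integers.
code_points = [0x0041, 0x0042, 0x0043]

# chr() turns a code point into a one-character string; join them into a text.
text = "".join(chr(cp) for cp in code_points)
print(text)  # ABC

# Going the other way: list the code point of every character in the text.
print([f"U+{ord(ch):04X}" for ch in text])  # ['U+0041', 'U+0042', 'U+0043']
```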

Code Point Binary Encoding

At the byte level, Unicode characters (code points) are encoded differently from their textual notation. The uppercase character A does not need 6 bytes (the 6 ASCII characters in U+0041) when encoded as raw bytes. The exact number of bytes used depends on whether you are encoding using UTF-8, UTF-16 or some other encoding. For instance, A takes 1 byte in UTF-8 and 2 bytes in UTF-16. Currently, UTF-8 is the most commonly used encoding for Unicode in text documents, JSON, HTML etc.
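To see how the byte count varies with the encoding, the following sketch encodes a few sample characters with Python's built-in `str.encode`. The helper `encoded_sizes` and the choice of sample characters are assumptions made for this example. The codec name `utf-16-be` is used so that no byte order mark is added to the output.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in ["A", "é", "€", "😀"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} byte(s) in UTF-16")
```

Running this shows that A needs a single byte in UTF-8, while a character outside plane 0 such as U+1F600 needs 4 bytes in both encodings.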

Unicode Planes

Unicode code points are divided into sections called Unicode planes. The planes are indexed from 00 to 10 in hexadecimal (0 to 16 in decimal), so there are 17 planes in total. Each plane covers 65,536 code points (10000 in hexadecimal).

You can see which Unicode plane a given code point belongs to by writing the code point out as 6 hexadecimal digits and looking at the first 2 digits. If a code point has fewer than 6 hexadecimal digits, add zeros in front of the number until it is 6 digits long.

As an example, the Unicode code point U+0041 would become U+000041, of which the first two hexadecimal digits are 00. Thus the code point U+0041 belongs to Unicode plane 0.

By the same logic, the code point U+10FFFF is already 6 hexadecimal digits long, and thus does not need any zeros added in front of it. The first two hexadecimal digits are 10, which is 16 in decimal. Thus, the code point U+10FFFF belongs to Unicode plane 16.
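The rule above can be sketched in Python in two equivalent ways: by reading the first two of six hexadecimal digits, or by shifting the code point 16 bits to the right. The function names `plane_of` and `plane_from_hex_digits` are hypothetical and only used for this illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the plane index (0 to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Dropping the lowest 16 bits (4 hex digits) leaves the plane index.
    return code_point >> 16


def plane_from_hex_digits(code_point: int) -> int:
    """Same result, following the 'first 2 of 6 hex digits' rule literally."""
    six_digits = f"{code_point:06X}"   # e.g. 0x41 -> '000041'
    return int(six_digits[:2], 16)     # '00' -> 0, '10' -> 16


print(plane_of(0x0041))                 # 0
print(plane_of(0x10FFFF))               # 16
print(plane_from_hex_digits(0x10FFFF))  # 16
```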

Here are the Unicode planes listed with their hexadecimal prefix and their code point intervals (in hexadecimal too).

Hex Prefix    Code Point Interval
00            U+000000 - U+00FFFF
01            U+010000 - U+01FFFF
02            U+020000 - U+02FFFF
03            U+030000 - U+03FFFF
04            U+040000 - U+04FFFF
05            U+050000 - U+05FFFF
06            U+060000 - U+06FFFF
07            U+070000 - U+07FFFF
08            U+080000 - U+08FFFF
09            U+090000 - U+09FFFF
0A            U+0A0000 - U+0AFFFF
0B            U+0B0000 - U+0BFFFF
0C            U+0C0000 - U+0CFFFF
0D            U+0D0000 - U+0DFFFF
0E            U+0E0000 - U+0EFFFF
0F            U+0F0000 - U+0FFFFF
10            U+100000 - U+10FFFF

Non-character Code Points

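The last 2 code points of each Unicode plane, meaning those ending in FFFE and FFFF, are non-characters. In plane 0, for instance, these are U+00FFFE and U+00FFFF. In addition, the code points U+FDD0 to U+FDEF in plane 0 are non-characters.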
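A minimal sketch of a check for the plane-final non-characters described above. The function name `is_plane_final_noncharacter` is an assumption for this example, and the check deliberately covers only the last 2 code points of each plane.

```python
def is_plane_final_noncharacter(code_point: int) -> bool:
    """True if the code point is one of the last 2 code points of its plane."""
    # The lowest 16 bits give the position inside the plane;
    # the last two positions are FFFE and FFFF.
    return (code_point & 0xFFFF) >= 0xFFFE


print(is_plane_final_noncharacter(0xFFFE))    # True
print(is_plane_final_noncharacter(0x10FFFF))  # True
print(is_plane_final_noncharacter(0x0041))    # False

# 17 planes with 2 such code points each gives 34 in total.
print(sum(is_plane_final_noncharacter(cp) for cp in range(0x110000)))  # 34
```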

Special Characters

Unicode contains some special characters which do not represent textual characters. These non-textual characters are typically located in certain intervals of the unicode value space. For instance:

Interval               Description
U+000000 - U+00001F    Control characters
U+00007F - U+00009F    Control characters
U+00D800 - U+00DFFF    Surrogates (used in pairs by UTF-16)
U+00E000 - U+00F8FF    Private use area
U+0F0000 - U+0FFFFF    Private use area
U+100000 - U+10FFFF    Private use area

Some Unicode code points are not standalone characters. Instead, these combining characters are combined with the preceding Unicode character to alter it. For instance, a character with an accent over it could be represented by the code point of the base character followed by the code point of the accent. Rather than displaying this as two characters, the two code points are rendered as the first character with the accent displayed on top of it.
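The accent example can be tried out with Python's standard `unicodedata` module. This sketch assumes the letter e followed by U+0301 (combining acute accent) as the sample sequence. It also shows that NFC normalization can merge the two code points into the single precomposed character U+00E9.

```python
import unicodedata

# Base character 'e' followed by U+0301, the combining acute accent.
decomposed = "e\u0301"
print(len(decomposed))  # 2 (two code points, displayed as one accented letter)

# combining() returns a non-zero combining class for combining characters
# and 0 for ordinary characters.
print(unicodedata.combining("\u0301") != 0)  # True
print(unicodedata.combining("e") != 0)       # False

# NFC normalization composes the pair into one precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))               # 1
print(f"U+{ord(composed):04X}")    # U+00E9
```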

Private use areas have no characters assigned to them by the Unicode standard. You can use them to assign characters in your own context (should you need to). What such a code point means is then a matter of agreement between the parties exchanging the text.
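One way to tell these special code points apart programmatically is their Unicode general category, which Python exposes through `unicodedata.category`. The sketch below is only an illustration. The sample code points and the `describe` helper are assumptions chosen for this example.

```python
import unicodedata

# General category codes for the kinds of code points discussed above.
CATEGORY_NAMES = {
    "Cc": "control character",
    "Cs": "surrogate",
    "Co": "private use",
}


def describe(code_point: int) -> str:
    """Return a short description based on the code point's general category."""
    category = unicodedata.category(chr(code_point))
    return CATEGORY_NAMES.get(category, f"other ({category})")


for cp in (0x0000, 0x009F, 0xDC00, 0xE000, 0xF0000, 0x100000, 0x0041):
    print(f"U+{cp:06X}: {describe(cp)}")
```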
