Unicode
Jakob Jenkov
Unicode is a character encoding standard which is able to represent characters from many different languages from around the world. Each character is represented by a unicode code point. A code point is an integer value that uniquely identifies the given character. Unicode code points can be encoded using different encodings, like UTF-8 or UTF-16. These encodings specify how each character's code point is represented as one or more bytes, and each encoding does so according to its own scheme.
Unicode Code Points
As mentioned earlier, each unicode character is represented by a unicode code point which is an integer value.
Code Point Number Interval
Code point values range from 0 to 10FFFF (in hexadecimal notation).
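The interval can be checked with a few lines of Python. This is only a minimal sketch: the names `MAX_CODE_POINT` and `is_valid_code_point` are made up for this illustration, while `sys.maxunicode` and `chr()` are standard Python.

```python
import sys

# Highest code point defined by Unicode (hexadecimal 10FFFF).
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(value: int) -> bool:
    """Return True if value lies inside the Unicode code point interval."""
    return 0 <= value <= MAX_CODE_POINT


# Python 3 uses the same upper limit for its strings.
print(sys.maxunicode == MAX_CODE_POINT)   # True
print(is_valid_code_point(0x0041))        # True
print(is_valid_code_point(0x110000))      # False

# chr() refuses values outside the interval.
try:
    chr(MAX_CODE_POINT + 1)
except ValueError as error:
    print("rejected:", error)
```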
Code Point Textual Notation
When referring to a unicode code point in writing, we write U+ and then the hexadecimal representation of the code point. For instance, the uppercase character A is represented as U+0041.
This notation is only used when referring to the code points in text, though.
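As a small illustration of the notation, the following Python sketch converts between a character and its U+ form. The helper names `to_u_notation` and `from_u_notation` are hypothetical and exist only for this example.

```python
def to_u_notation(ch: str) -> str:
    """Return the U+ notation for a single character."""
    # ord() gives the code point as an integer.
    # The value is written in uppercase hex, padded to at least 4 digits.
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Return the character referred to by a string such as 'U+0041'."""
    # Drop the 'U+' prefix and parse the remaining digits as hexadecimal.
    return chr(int(text[2:], 16))


print(to_u_notation("A"))          # U+0041
print(from_u_notation("U+0041"))   # A
```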
Unicode Text Consists of Code Point Sequences
To create a text using unicode characters you use a sequence of unicode code points. For instance, the sequence U+0041 U+0042 U+0043 makes up the text ABC.
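The same idea can be sketched in Python, where a string is a sequence of code points. The variable names below are chosen for this example only.

```python
# The three code points from the example, written as hexadecimal integers.
code_points = [0x0041, 0x0042, 0x0043]

# chr() turns each code point into a one-character string; join them into a text.
text = "".join(chr(cp) for cp in code_points)
print(text)  # ABC

# Going the other way: list the code points that make up a text.
round_trip = [ord(ch) for ch in text]
print(round_trip == code_points)  # True
```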
Code Point Binary Encoding
On the byte encoding level, unicode characters (code points) are encoded differently than in their textual notation. The uppercase character A does not need 6 bytes (the 6 ASCII characters in U+0041) when encoded as raw bytes. The exact number of bytes used depends on whether you are encoding using UTF-8, UTF-16 or some other encoding. Currently, UTF-8 is the most commonly used encoding for Unicode in text documents, JSON, HTML etc.
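To make the difference concrete, here is a minimal Python sketch that counts the bytes a character takes up in a few encodings. The function name `encoded_lengths` is an assumption of this example, and the big-endian codec variants are used so that no byte order mark is added to the output.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes ch occupies in a few Unicode encodings."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(ch.encode(name)) for name in encodings}


# U+0041 (A), U+20AC (euro sign) and U+1F600 (an emoji outside plane 0).
for ch in ("A", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# The raw UTF-8 bytes of A: a single byte with the value 0x41.
print("A".encode("utf-8"))  # b'A'
```

The uppercase A takes 1 byte in UTF-8, 2 bytes in UTF-16 and 4 bytes in UTF-32, which is far from the 6 ASCII characters of its textual notation.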
Unicode Planes
Unicode code points are divided into sections which are called unicode planes. These unicode planes are indexed from 0 to 10 (in hexadecimal notation), meaning there are 17 unicode planes in total.
You can see which unicode plane a given code point belongs to by writing the code point out as 6 hexadecimal digits, and looking at the first 2 digits. If a code point is too small to take up 6 hexadecimal digits, add zeros in front of the number until it is 6 digits long.
As an example, the unicode code point U+0041 would become U+000041, of which the first two hexadecimal digits are 00. Thus the unicode code point U+0041 belongs to unicode plane 0.
By the same logic, the code point U+10FFFF is already 6 hexadecimal digits long, and thus does not need any zeros added in front of it. The first two hexadecimal digits are 10, which translates to 16 in decimal. Thus, the code point U+10FFFF belongs to unicode plane 16.
Here are the Unicode planes listed with their hexadecimal prefix and their code point intervals (in hexadecimal too).
Hex Prefix | Code Point Interval |
---|---|
00 | U+000000 - U+00FFFF |
01 | U+010000 - U+01FFFF |
02 | U+020000 - U+02FFFF |
03 | U+030000 - U+03FFFF |
04 | U+040000 - U+04FFFF |
05 | U+050000 - U+05FFFF |
06 | U+060000 - U+06FFFF |
07 | U+070000 - U+07FFFF |
08 | U+080000 - U+08FFFF |
09 | U+090000 - U+09FFFF |
0A | U+0A0000 - U+0AFFFF |
0B | U+0B0000 - U+0BFFFF |
0C | U+0C0000 - U+0CFFFF |
0D | U+0D0000 - U+0DFFFF |
0E | U+0E0000 - U+0EFFFF |
0F | U+0F0000 - U+0FFFFF |
10 | U+100000 - U+10FFFF |
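Looking at the first 2 of 6 hexadecimal digits is the same as shifting the code point 16 bits to the right. The following Python sketch uses that observation; the names `plane_of` and `plane_interval` are hypothetical helpers introduced for this example.

```python
def plane_of(code_point: int) -> int:
    """Return the index (0 to 16) of the plane a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Dropping the lowest 16 bits (4 hex digits) leaves the 2-digit hex prefix.
    return code_point >> 16


def plane_interval(plane: int):
    """Return the first and last code point of the given plane."""
    if not 0 <= plane <= 0x10:
        raise ValueError(f"not a Unicode plane: {plane}")
    first = plane << 16
    return first, first | 0xFFFF


print(f"{0x0041:06X}", plane_of(0x0041))      # 000041 0
print(f"{0x10FFFF:06X}", plane_of(0x10FFFF))  # 10FFFF 16

first, last = plane_interval(0x0A)
print(f"U+{first:06X} - U+{last:06X}")        # U+0A0000 - U+0AFFFF
```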
Non-character Code Points
The last 2 code points of each unicode plane (the ones ending in FFFE and FFFF, for instance U+00FFFE and U+00FFFF) are non-characters.
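A check for these plane-ending code points could be sketched in Python as below. The function name `is_plane_end_noncharacter` is made up for this example, and the check deliberately covers only the last two code points of each plane. The Unicode standard also designates U+FDD0 to U+FDEF as non-characters, which this sketch does not test for.

```python
def is_plane_end_noncharacter(code_point: int) -> bool:
    """Return True for the last two code points of any of the 17 planes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # The lowest 16 bits give the position inside the plane.
    # Only FFFE and FFFF are at or above FFFE.
    return (code_point & 0xFFFF) >= 0xFFFE


print(is_plane_end_noncharacter(0x00FFFE))  # True
print(is_plane_end_noncharacter(0x10FFFF))  # True
print(is_plane_end_noncharacter(0x00FFFD))  # False
```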
Special Characters
Unicode contains some special characters which do not represent textual characters. These non-textual characters are typically located in certain intervals of the unicode value space. For instance:
Interval | Description |
---|---|
U+000000 - U+00001F | Control characters |
U+00007F - U+00009F | Control characters |
U+00D800 - U+00DFFF | Surrogates (used in pairs by UTF-16) |
U+00E000 - U+00F8FF | Private use area |
U+0F0000 - U+0FFFFF | Private use area |
U+100000 - U+10FFFF | Private use area |
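Python's standard `unicodedata` module exposes the general category of each code point, which can be used to spot these special code points. The sketch below is only an illustration: the `CATEGORY_NAMES` mapping and the `special_kind` function are names invented for this example, while `Cc`, `Cs` and `Co` are the Unicode general categories for control, surrogate and private use code points.

```python
import unicodedata

# Unicode general categories for the special intervals discussed above.
CATEGORY_NAMES = {
    "Cc": "control character",
    "Cs": "surrogate",
    "Co": "private use",
}


def special_kind(code_point: int):
    """Return a description of a special code point, or None if it is not one."""
    category = unicodedata.category(chr(code_point))
    return CATEGORY_NAMES.get(category)


for cp in (0x001F, 0x007F, 0xD800, 0xE000, 0xF0000, 0x100000, 0x0041):
    print(f"U+{cp:06X}", special_kind(cp))
```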
Some unicode code points are not themselves characters. Instead they are combined with the preceding unicode character to alter the character. For instance, a character with an accent over it could be represented by first the character code point followed by the accent code point. Rather than displaying this as two characters, these two code points would be combined into the first character with the accent displayed on top of it.
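As an example of such a combination, the letter e followed by U+0301 (the combining acute accent) is displayed as a single accented character. The Python sketch below uses the standard `unicodedata` module to show this. The variable names are chosen for this example.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
decomposed = "e\u0301"
print(len(decomposed))                        # 2
print([f"U+{ord(ch):04X}" for ch in decomposed])

# A non-zero combining class marks a code point that attaches to the preceding character.
print(unicodedata.combining("\u0301") != 0)   # True
print(unicodedata.combining("e") != 0)        # False

# NFC normalization replaces the pair with the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))                          # 1
print(f"U+{ord(composed):04X}")               # U+00E9
```

Both forms are displayed the same way, but they are different code point sequences, so comparing them as strings without normalizing first reports them as unequal.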
Private use areas have no characters assigned to them by the unicode standard. Private use areas can be used to assign characters in your own context (should you need to), by agreement between the parties that exchange the text.