Unicode is a standard for representing textual characters from many of the world's languages. Each character is identified by a Unicode code point, an integer value that uniquely identifies that character. Code points can be stored as bytes using different encodings, such as UTF-8 or UTF-16. Each encoding specifies, according to its own scheme, how a code point is represented as one or more bytes.
Unicode Code Points
As mentioned earlier, each Unicode character is represented by a Unicode code point, which is an integer value.
Code Point Number Interval
Code point values range from 0 to 10FFFF in hexadecimal notation (0 to 1,114,111 in decimal).
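As a small illustration, assuming Python 3 (the language choice is ours, not something the text prescribes), the built-in `ord` and `chr` functions convert between a character and its integer code point. The helper `is_valid_code_point` and the constant `MAX_CODE_POINT` are hypothetical names introduced only for this sketch.

```python
# Convert between characters and their integer code points.
code_point = ord("A")       # 65 in decimal, 0x41 in hexadecimal
character = chr(0x41)       # back to the character "A"

MAX_CODE_POINT = 0x10FFFF   # highest valid code point

def is_valid_code_point(value: int) -> bool:
    """Return True if value lies inside the Unicode code point range."""
    return 0 <= value <= MAX_CODE_POINT

# chr() accepts the top of the range but raises ValueError just above it.
last = chr(MAX_CODE_POINT)
try:
    chr(MAX_CODE_POINT + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```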
Code Point Textual Notation
When referring to a Unicode code point in writing, we write U+ followed by the hexadecimal representation of the code point, padded with zeros to at least four digits. For instance, the uppercase character A is represented as U+0041. This notation is only used when referring to code points in text, though.
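The notation is easy to produce and parse in code. The following is a minimal Python sketch; `to_u_notation` and `from_u_notation` are hypothetical helper names made up for this example, not part of any library.

```python
def to_u_notation(character: str) -> str:
    """Format a single character as U+ followed by at least 4 hex digits."""
    return f"U+{ord(character):04X}"

def from_u_notation(text: str) -> str:
    """Parse a string such as 'U+0041' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {text!r}")
    return chr(int(text[2:], 16))

upper_a = to_u_notation("A")              # "U+0041"
emoji = to_u_notation("\U0001F600")       # "U+1F600" (5 digits, no padding needed)
```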
Unicode Text Consists of Code Point Sequences
To create text using Unicode characters, you use a sequence of Unicode code points. For instance, the code point sequence U+0041 U+0042 U+0043 makes up the text ABC.
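To make the idea of a code point sequence concrete, here is a short Python sketch that builds a string from the code points U+0041, U+0042 and U+0043 and then takes it apart again. The variable names are illustrative only.

```python
# A text is a sequence of code points.
code_points = [0x0041, 0x0042, 0x0043]

# Build the text by turning each code point into its character.
text = "".join(chr(cp) for cp in code_points)

# Go the other way: list the code points that make up the text.
round_trip = [ord(ch) for ch in text]
```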
Code Point Binary Encoding
At the byte level, Unicode characters (code points) are encoded differently from their textual notation. The uppercase character A does not need 6 bytes (the 6 ASCII characters in U+0041) when encoded as raw bytes. The exact number of bytes used depends on whether you are encoding with UTF-8, UTF-16 or some other encoding. Currently, UTF-8 is the most commonly used encoding for Unicode in text documents, JSON, HTML etc.
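The difference between encodings can be seen by encoding the same character several ways. This Python sketch uses the standard `str.encode` method with the big-endian codec names so that no byte order mark is added; the euro sign is included only as an example of a character that needs more than one byte in UTF-8.

```python
text = "A"  # code point U+0041

utf8 = text.encode("utf-8")        # 1 byte:  0x41
utf16 = text.encode("utf-16-be")   # 2 bytes: 0x00 0x41
utf32 = text.encode("utf-32-be")   # 4 bytes: 0x00 0x00 0x00 0x41

# The same character occupies a different number of bytes per encoding.
sizes = {"utf-8": len(utf8), "utf-16": len(utf16), "utf-32": len(utf32)}

# A character outside ASCII takes several bytes in UTF-8.
euro_utf8 = "\u20ac".encode("utf-8")   # U+20AC -> 0xE2 0x82 0xAC
```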
Unicode code points are divided into sections called Unicode planes. The planes are numbered from 0 to 10 in hexadecimal (0 to 16 in decimal), so there are 17 Unicode planes in total.
You can see which Unicode plane a given code point belongs to by writing the code point as 6 hexadecimal digits and looking at the first 2 digits. If a code point is too small to take up 6 hexadecimal digits, add zeros in front of the number until it is 6 digits long.
As an example, the Unicode code point U+0041 would become U+000041, of which the first two hexadecimal digits are 00. Thus the Unicode code point U+0041 belongs to Unicode plane 0.
By the same logic, the code point U+10FFFF is already 6 hexadecimal digits long, and thus does not need any zeros added in front of it. The first two hexadecimal digits are 10, which is 16 in decimal. Thus, the code point U+10FFFF belongs to Unicode plane 16.
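The zero-padding procedure can be sketched in Python as follows. Taking the first two of six hexadecimal digits is the same as shifting the code point right by 16 bits, so the sketch shows both forms. The function names `plane_of`, `plane_of_by_digits` and `plane_interval` are hypothetical and exist only for this illustration.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16   # drop the lowest 4 hexadecimal digits

def plane_of_by_digits(code_point: int) -> int:
    """Same result, following the zero-padding procedure literally."""
    six_digits = f"{code_point:06X}"   # e.g. 0x41 -> "000041"
    return int(six_digits[:2], 16)     # first two hex digits -> plane

def plane_interval(plane: int) -> tuple[int, int]:
    """Return the first and last code point of the given plane."""
    start = plane << 16
    return start, start + 0xFFFF
```

For example, `plane_of(0x0041)` gives 0 and `plane_of(0x10FFFF)` gives 16, matching the two cases worked through above.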
Here are the Unicode planes listed with their hexadecimal prefix and their code point intervals (in hexadecimal too).
Plane (Hex Prefix) |Code Point Interval
00 |U+000000 - U+00FFFF
01 |U+010000 - U+01FFFF
02 |U+020000 - U+02FFFF
03 |U+030000 - U+03FFFF
04 |U+040000 - U+04FFFF
05 |U+050000 - U+05FFFF
06 |U+060000 - U+06FFFF
07 |U+070000 - U+07FFFF
08 |U+080000 - U+08FFFF
09 |U+090000 - U+09FFFF
0A |U+0A0000 - U+0AFFFF
0B |U+0B0000 - U+0BFFFF
0C |U+0C0000 - U+0CFFFF
0D |U+0D0000 - U+0DFFFF
0E |U+0E0000 - U+0EFFFF
0F |U+0F0000 - U+0FFFFF
10 |U+100000 - U+10FFFF
Non-character Code Points
The last 2 code points of each Unicode plane (those ending in FFFE and FFFF, such as U+00FFFE and U+00FFFF) are non-characters.
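This rule is simple to test for: the lowest four hexadecimal digits of such a code point are FFFE or FFFF. The Python sketch below checks only the plane-ending rule stated above, and `is_plane_end_noncharacter` is a hypothetical name chosen for the example.

```python
def is_plane_end_noncharacter(code_point: int) -> bool:
    """True if the code point is one of the last two in its plane."""
    if not 0 <= code_point <= 0x10FFFF:
        return False
    # Keep only the position inside the plane (lowest 16 bits).
    return (code_point & 0xFFFF) >= 0xFFFE

# Collect the plane-ending non-characters of all 17 planes.
plane_end_noncharacters = [
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
]
```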
Unicode contains some special code points which do not represent textual characters. These are typically located in certain intervals of the Unicode value space. For instance:
Code Point Interval |Description
U+000000 - U+00001F |Control characters
U+00007F - U+00009F |Control characters
U+00D800 - U+00DFFF |Surrogates (reserved for UTF-16)
U+00E000 - U+00F8FF |Private use area
U+0F0000 - U+0FFFFD |Private use area
U+100000 - U+10FFFD |Private use area
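You can look up which of these groups a code point falls into without hard-coding the intervals. The sketch below assumes Python's standard `unicodedata` module, whose `category` function returns the general category of a character: `Cc` for control characters, `Cs` for surrogates and `Co` for private use. The `special_kind` helper and the label strings are hypothetical, chosen for this example.

```python
import unicodedata

# General categories of the special intervals listed above.
CATEGORY_LABELS = {
    "Cc": "control character",
    "Cs": "surrogate",
    "Co": "private use",
}

def special_kind(code_point: int):
    """Return a label for special code points, or None for anything else."""
    category = unicodedata.category(chr(code_point))
    return CATEGORY_LABELS.get(category)

kinds = {
    "U+0000": special_kind(0x0000),
    "U+009F": special_kind(0x009F),
    "U+D800": special_kind(0xD800),
    "U+E000": special_kind(0xE000),
    "U+0041": special_kind(0x0041),   # ordinary letter, so None
}
```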
Some Unicode code points are not displayed as characters by themselves. Instead, they combine with the preceding character to alter it. For instance, a character with an accent over it could be represented by the base character's code point followed by the accent's code point. Rather than being displayed as two characters, the two code points are rendered as the first character with the accent on top of it.
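A short Python sketch of this, again assuming the standard `unicodedata` module: the letter e followed by U+0301 (combining acute accent) is two code points but displays as one accented character. NFC normalization, used here purely as an illustration, merges the pair into the single precomposed code point U+00E9.

```python
import unicodedata

# Base letter "e" followed by the combining acute accent U+0301.
decomposed = "e\u0301"

# Two code points in the string, even though it displays as one character.
decomposed_length = len(decomposed)

# A non-zero combining class marks U+0301 as a combining character.
accent_is_combining = unicodedata.combining("\u0301") != 0
letter_is_combining = unicodedata.combining("e") != 0

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
```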
Private use areas have no characters assigned to them by the Unicode standard. You can use them to assign characters within your own context, should you need to. The meaning of such code points is then a private agreement between those who produce and those who consume the text.