UTF-8

Jakob Jenkov
Last update: 2022-08-07

UTF-8 is a byte encoding used to encode unicode characters. UTF-8 uses 1, 2, 3 or 4 bytes to represent a unicode character. Remember, a unicode character is represented by a unicode code point. Thus, UTF-8 uses 1, 2, 3 or 4 bytes to represent a unicode code point.

UTF-8 is a very commonly used textual encoding on the web. Web browsers understand UTF-8. Many programming languages also allow you to use UTF-8 in code, and can import and export UTF-8 text easily. Several textual data formats and markup languages are often encoded in UTF-8, for instance JSON, XML, HTML, CSS and SVG.

UTF-8 Marker Bits and Code Point Bits

When translating a unicode code point to one or more UTF-8 encoded bytes, each of these bytes are composed of marker bits and code point bits. The marker bits tell how to interpret the given byte. The code point bits are used to represent the value of the code point. In the following sections the marker bits are written using 0's and 1's, and the code point bits are written using the characters Z, Y, X, W and V. Each character represents a single bit.

Unicode Code Point Intervals Used in UTF-8

For unicode code points in the hexadecimal value interval U+0000 to U+007F UTF-8 uses a single byte to represent the character. The code points in this interval represent the same characters as the ASCII characters, and use the same integer values (code points) to represent them. In binary digits, the single byte representing a code point in this interval looks like this:

0ZZZZZZZ

The marker bit has the value 0. The bits representing the code point value are marked with Z.
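As a concrete check, the letter A (U+0041) encodes as the single byte 0x41, identical to its ASCII value. Here is a minimal sketch (the class name Utf8Ascii is my own, for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Ascii {
    // In the U+0000 to U+007F interval, each UTF-8 byte equals the code point itself.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```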

For unicode code points in the interval U+0080 to U+07FF UTF-8 uses two bytes to represent the character. In binary digits, the two bytes representing a code point in this interval look like this:

110YYYYY 10ZZZZZZ

The marker bits are the 110 and 10 bits of the two bytes. The Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.
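For example, é (U+00E9) encodes as the two bytes 0xC3 0xA9. A minimal sketch of the two-byte case (the class name Utf8TwoBytes is my own, for illustration):

```java
public class Utf8TwoBytes {
    // Encode a code point in U+0080 to U+07FF as two UTF-8 bytes: 110YYYYY 10ZZZZZZ.
    public static byte[] encode(int codePoint) {
        byte first  = (byte) (0b1100_0000 | (0b0001_1111 & (codePoint >> 6))); // top 5 bits
        byte second = (byte) (0b1000_0000 | (0b0011_1111 & codePoint));        // low 6 bits
        return new byte[] { first, second };
    }
}
```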

For unicode code points in the interval U+0800 to U+FFFF UTF-8 uses three bytes to represent the character. In binary digits, the three bytes representing a code point in this interval look like this:

1110XXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 1110 and 10 bits of the three bytes. The X, Y and Z characters represent the bits used to represent the code point value. The first byte (most significant byte) is the byte to the left.
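For example, the euro sign € (U+20AC) encodes as the three bytes 0xE2 0x82 0xAC. A minimal sketch of the three-byte case (the class name Utf8ThreeBytes is my own, for illustration):

```java
public class Utf8ThreeBytes {
    // Encode a code point in U+0800 to U+FFFF as three UTF-8 bytes: 1110XXXX 10YYYYYY 10ZZZZZZ.
    public static byte[] encode(int codePoint) {
        return new byte[] {
            (byte) (0b1110_0000 | (0b0000_1111 & (codePoint >> 12))), // top 4 bits
            (byte) (0b1000_0000 | (0b0011_1111 & (codePoint >> 6))),  // middle 6 bits
            (byte) (0b1000_0000 | (0b0011_1111 & codePoint))          // low 6 bits
        };
    }
}
```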

For unicode code points in the interval U+10000 to U+10FFFF UTF-8 uses four bytes to represent the character. In binary digits, the four bytes representing a code point in this interval look like this:

11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ

The marker bits are the 11110 and 10 bits of the four bytes. The bits named V and W mark the code point plane the character is from. The rest of the bits marked with X, Y and Z represent the rest of the code point. The first byte (most significant byte) is the byte on the left.
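For example, the emoji 😀 (U+1F600) encodes as the four bytes 0xF0 0x9F 0x98 0x80. A minimal sketch of the four-byte case (the class name Utf8FourBytes is my own, for illustration):

```java
public class Utf8FourBytes {
    // Encode a code point in U+10000 to U+10FFFF as four UTF-8 bytes:
    // 11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ.
    public static byte[] encode(int codePoint) {
        return new byte[] {
            (byte) (0b1111_0000 | (0b0000_0111 & (codePoint >> 18))), // top 3 bits
            (byte) (0b1000_0000 | (0b0011_1111 & (codePoint >> 12))),
            (byte) (0b1000_0000 | (0b0011_1111 & (codePoint >> 6))),
            (byte) (0b1000_0000 | (0b0011_1111 & codePoint))          // low 6 bits
        };
    }
}
```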

Reading UTF-8

When reading UTF-8 encoded bytes into characters, you need to figure out if a given character (code point) is represented by 1, 2, 3 or 4 bytes. You do so by looking at the bit pattern of the first byte.

If the first byte has the bit pattern 0ZZZZZZZ (most significant bit is a 0) then the character code point is represented only by this byte.

If the first byte has the bit pattern 110YYYYY (3 most significant bits are 110) then the character code point is represented by two bytes.

If the first byte has the bit pattern 1110XXXX (4 most significant bits are 1110) then the character code point is represented by three bytes.

If the first byte has the bit pattern 11110VVV (5 most significant bits are 11110) then the character code point is represented by four bytes.
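The four checks above can be sketched as a small helper (the class name Utf8Length is my own, for illustration):

```java
public class Utf8Length {
    // Determine how many bytes a UTF-8 character uses, from its first byte.
    public static int sequenceLength(int firstByte) {
        if ((firstByte & 0b1000_0000) == 0b0000_0000) return 1; // 0ZZZZZZZ
        if ((firstByte & 0b1110_0000) == 0b1100_0000) return 2; // 110YYYYY
        if ((firstByte & 0b1111_0000) == 0b1110_0000) return 3; // 1110XXXX
        if ((firstByte & 0b1111_1000) == 0b1111_0000) return 4; // 11110VVV
        throw new IllegalArgumentException("Not a valid UTF-8 first byte: " + firstByte);
    }
}
```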

Once you know how many bytes are used to represent the given character code point, read all the actual code point carrying bits (the bits marked with V, W, X, Y and Z) into a single 32-bit data type (e.g. a Java int). The bits then make up the integer value of the code point. Here is how a 32-bit data type looks after reading a 4-byte UTF-8 character into it:

00000000 000VVVWW XXXXYYYY YYZZZZZZ

Notice how all the marker bits (the most significant bits with the patterns 11110 and 10) have been removed from all of the 4 bytes, before the remaining bits (the bits marked with V, W, X, Y and Z) are copied into the 32-bit data type.
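That merge step can be sketched like this, assuming the four bytes are already known to form one character (the class name Utf8Decode4 is my own, for illustration):

```java
public class Utf8Decode4 {
    // Strip the marker bits of a 4-byte UTF-8 character and merge the
    // remaining code point bits into a single int.
    public static int decode(byte b0, byte b1, byte b2, byte b3) {
        int codePoint = b0 & 0b0000_0111;            // VVV
        codePoint = (codePoint << 6) | (b1 & 0x3F);  // WWXXXX
        codePoint = (codePoint << 6) | (b2 & 0x3F);  // YYYYYY
        codePoint = (codePoint << 6) | (b3 & 0x3F);  // ZZZZZZ
        return codePoint;
    }
}
```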

Writing UTF-8

When writing UTF-8 text you need to translate unicode code points into UTF-8 encoded bytes. First, you must figure out how many bytes you need to represent the given code point. I have explained the code point value intervals at the top of this UTF-8 tutorial, so I will not repeat them here.

Second, you need to translate the bits representing the code point into the corresponding UTF-8 bytes. Once you know how many bytes are needed to represent the code point, you also know what bit pattern of marker bits and code point bits you need to use. Simply create the needed number of bytes with marker bits, and copy the correct code point bits into each of the bytes, and you are done.

Here is an example of translating a code point that requires 4 bytes in UTF-8. The code point has the abstract value (as bit pattern):

00000000 000VVVWW XXXXYYYY YYZZZZZZ

The corresponding 4 UTF-8 bytes will look like this:

11110VVV 10WWXXXX 10YYYYYY 10ZZZZZZ

Reading and Writing UTF-8 in Java

There are several ways to read and write UTF-8 encoded bytes in Java. In the following sections I will cover a few of them.

Read UTF-8 Into a Java String

If you need to read UTF-8 into a Java String, you can do like this:

byte[] utf8 = ...  // get the UTF-8 bytes from somewhere (file, URL etc)

String string = new String(utf8, StandardCharsets.UTF_8);

Get UTF-8 Bytes From Java String

You can obtain the characters of a Java String as UTF-8 encoded bytes, like this:

byte[] utf8 = string.getBytes(StandardCharsets.UTF_8);

A Utf8Buffer Class Which Can Write and Read UTF-8 Code Points

Here is a Utf8Buffer class which can both write and read UTF-8 as Java integer code points:

public class Utf8Buffer {

    public byte[] buffer;
    public int offset;
    public int length;
    public int endOffset;

    public int tempOffset;


    public Utf8Buffer(byte [] data, int offset, int length) {
        this.buffer = data;
        this.offset      = offset;
        this.tempOffset  = offset;
        this.length      = length;
        this.endOffset   = offset + length;
    }

    public void reset() {
        this.tempOffset = this.offset;
    }

    public void calculateLengthAndEndOffset() {
        this.length = this.tempOffset - this.offset;
        this.endOffset = this.tempOffset;
    }

    public int writeCodepoint(int codepoint) {
        if(codepoint < 0x00_00_00_80){
            // This is a one byte UTF-8 char
            buffer[this.tempOffset++] = (byte) (0xFF & codepoint);
            return 1;
        } else if (codepoint < 0x00_00_08_00) {
            // This is a two byte UTF-8 char. Value is 11 bits long (less than 12 bits in value).
            // Get highest 5 bits into first byte
            buffer[this.tempOffset]     = (byte) (0xFF & (0b1100_0000 | (0b0001_1111 & (codepoint >> 6))));
            buffer[this.tempOffset + 1] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & codepoint)));
            this.tempOffset+=2;
            return 2;
        } else if (codepoint < 0x00_01_00_00){
            // This is a three byte UTF-8 char. Value is 16 bits long (less than 17 bits in value).
            // Get the highest 4 bits into the first byte
            buffer[this.tempOffset]     = (byte) (0xFF & (0b1110_0000 | (0b0000_1111 & (codepoint >> 12))));
            buffer[this.tempOffset + 1] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & (codepoint >> 6))));
            buffer[this.tempOffset + 2] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & codepoint)));
            this.tempOffset+=3;
            return 3;
        } else if (codepoint < 0x00_11_00_00) {
            // This is a four byte UTF-8 char. Value is 21 bits long (less than 22 bits in value).
            // Get the highest 3 bits into the first byte
            buffer[this.tempOffset]     = (byte) (0xFF & (0b1111_0000 | (0b0000_0111 & (codepoint >> 18))));
            buffer[this.tempOffset + 1] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & (codepoint >> 12))));
            buffer[this.tempOffset + 2] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & (codepoint >> 6))));
            buffer[this.tempOffset + 3] = (byte) (0xFF & (0b1000_0000 | (0b0011_1111 & codepoint)));
            this.tempOffset+=4;
            return 4;
        }
        throw new IllegalArgumentException(
            "Unknown Unicode codepoint: "
            + codepoint);
    }

    public int nextCodepoint() {
        int firstByteOfChar = 0xFF & buffer[tempOffset];

        if(firstByteOfChar < 0b1000_0000) {    // 128
            //this is a single byte UTF-8 char (an ASCII char)
            tempOffset++;
            return firstByteOfChar;
        } else if(firstByteOfChar < 0b1100_0000) {    // 192
            // 10ZZZZZZ marker - a continuation byte cannot start a character
            throw new IllegalStateException(
                "Continuation byte cannot start a UTF-8 character: " + firstByteOfChar);
        } else if(firstByteOfChar < 0b1110_0000) {    // 224
            //this is a two byte UTF-8 char
            int nextCodepoint = 0b0001_1111 & firstByteOfChar; //0x1F
            nextCodepoint <<= 6;
            nextCodepoint |= 0b0011_1111 & (0xFF & buffer[tempOffset + 1]); //0x3F
            tempOffset +=2;
            return  nextCodepoint;
        } else if(firstByteOfChar < 0b1111_0000) {    // 240
            //this is a three byte UTF-8 char
            int nextCodepoint = 0b0000_1111 & firstByteOfChar; // 0x0F
            nextCodepoint <<= 6;
            nextCodepoint |= 0x3F & buffer[tempOffset + 1];
            nextCodepoint <<= 6;
            nextCodepoint |= 0x3F & buffer[tempOffset + 2];
            tempOffset +=3;
            return  nextCodepoint;
        } else if(firstByteOfChar < 0b1111_1000) {    // 248
            //this is a four byte UTF-8 char
            int nextCodepoint = 0b0000_0111 & firstByteOfChar; // 0x07
            nextCodepoint <<= 6;
            nextCodepoint |= 0x3F & buffer[tempOffset + 1];
            nextCodepoint <<= 6;
            nextCodepoint |= 0x3F & buffer[tempOffset + 2];
            nextCodepoint <<= 6;
            nextCodepoint |= 0x3F & buffer[tempOffset + 3];
            tempOffset +=4;
            return  nextCodepoint;
        }

        throw new IllegalStateException(
            "Codepoint not recognized from first byte: "
            + firstByteOfChar);
    }
}

Using the Utf8Buffer class could look like this:

Utf8Buffer utf8Buffer = new Utf8Buffer(new byte[1024], 0, 0);

utf8Buffer.writeCodepoint(0x7F);

// After writing, calculating length and end offset is necessary, and if you
// want to read back what was written, tempOffset must be set back to offset
// by calling reset().
utf8Buffer.calculateLengthAndEndOffset();
utf8Buffer.reset();

int nextCodePoint = utf8Buffer.nextCodepoint();

Searching Forwards in UTF-8

Searching forwards in UTF-8 is reasonably straightforward. You decode one character at a time and compare it to the character you are searching for. No big surprise here.
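A minimal sketch of a forward search, assuming valid UTF-8 input (the class name Utf8ForwardSearch is my own, for illustration):

```java
public class Utf8ForwardSearch {
    // Scan UTF-8 bytes forwards for a code point, decoding one character
    // at a time. Returns the byte offset of the first match, or -1.
    public static int indexOf(byte[] utf8, int target) {
        int offset = 0;
        while (offset < utf8.length) {
            int first = 0xFF & utf8[offset];
            int length;
            int codePoint;
            if (first < 0b1000_0000) {
                length = 1; codePoint = first;
            } else if (first < 0b1110_0000) {
                length = 2; codePoint = 0b0001_1111 & first;
            } else if (first < 0b1111_0000) {
                length = 3; codePoint = 0b0000_1111 & first;
            } else {
                length = 4; codePoint = 0b0000_0111 & first;
            }
            for (int i = 1; i < length; i++) {
                codePoint = (codePoint << 6) | (utf8[offset + i] & 0x3F);
            }
            if (codePoint == target) return offset;
            offset += length;   // jump to the first byte of the next character
        }
        return -1;
    }
}
```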

Searching Backwards in UTF-8

The UTF-8 encoding has the nice side effect that you can search backwards in UTF-8 encoded bytes. You can see from each byte if it is the beginning of a character or not by looking at the marker bits. The following marker bit patterns all imply that the byte is the beginning of a character:

0          Beginning of 1 byte character (also an ascii character)
110        Beginning of 2 byte character
1110       Beginning of 3 byte character
11110      Beginning of 4 byte character

The following marker bit pattern implies that the byte is not the first byte of a UTF-8 character:

10         Second, third or fourth byte of a UTF-8 character

Notice how you can always see from a marker bit pattern if a byte is the first byte of a character, or a second / third / fourth byte. Just keep searching backwards until you find the beginning of a character, then go forward and decode it, and check if it is the character you are looking for.
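Finding the beginning of a character while stepping backwards can be sketched like this, assuming valid UTF-8 input (the class name Utf8Backwards is my own, for illustration):

```java
public class Utf8Backwards {
    // Given a byte offset inside UTF-8 encoded bytes, step backwards until
    // reaching the first byte of the character containing that offset.
    // Continuation bytes all match the marker bit pattern 10ZZZZZZ.
    public static int startOfCharacter(byte[] utf8, int offset) {
        while ((utf8[offset] & 0b1100_0000) == 0b1000_0000) {
            offset--;   // continuation byte - keep moving backwards
        }
        return offset;  // this byte starts with 0, 110, 1110 or 11110
    }
}
```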
