Java Internationalization: Converting to and from Unicode

Jakob Jenkov
Last update: 2020-01-21

Internally in Java all strings are kept in Unicode (the String API exposes them as sequences of UTF-16 code units). Since not all text received from users or the outside world is in Unicode, your application may have to convert from non-Unicode encodings to Unicode. Additionally, when the application outputs text it may have to convert the internal Unicode format to whatever format the outside world needs.

Java has two mechanisms you can use to convert text to and from Unicode. These are:

  • The String class
  • The Reader and Writer classes and subclasses

I will explain both in the sections below.

UTF-8

First of all, I would like to clarify that Unicode consists of a set of "code points", each of which is a numerical value that corresponds to a given character. There are several ways to "encode" these code points (numerical values) into bytes. The two most common ones are UTF-8 and UTF-16. In this tutorial I will only show examples of converting to and from UTF-8, since this seems to be the most commonly used Unicode encoding.
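To make the difference between a code point and its encoded bytes concrete, here is a small sketch. The class and method names are my own, chosen only for illustration. It prints how many bytes the same three characters take in UTF-8 and in UTF-16:

```java
import java.nio.charset.StandardCharsets;

class CodePointEncodingDemo {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16
    // (big endian, so no byte order mark is written).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Latin capital A, the AE ligature and the euro sign.
        String[] samples = { "A", "\u00C6", "\u20AC" };
        for (String sample : samples) {
            System.out.printf("U+%04X -> UTF-8: %d bytes, UTF-16: %d bytes%n",
                    sample.codePointAt(0), utf8Length(sample), utf16Length(sample));
        }
    }
}
```

The code point stays the same in every case. Only the number of bytes used to store it changes with the encoding: 1, 2 and 3 bytes in UTF-8, and 2 bytes each in UTF-16 for these three characters.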

Converting to and from Unicode UTF-8 Using the String Class

You can use the String class to convert a byte array to a String instance. You do so using the constructor of the String class. Here is an example:

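import java.nio.charset.Charset;

// Ten zero bytes, used only as placeholder data
byte[] bytes = new byte[10];

// Decode the bytes as UTF-8 into a String
String str = new String(bytes, Charset.forName("UTF-8"));

System.out.println(str);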

This example first creates a byte array. The byte array does not actually contain any sensible data (it holds ten zero bytes, which decode to ten NUL characters), but for the sake of the example, that does not matter. The example then creates a new String, passing the byte array and the character set of the bytes in the array as parameters to the constructor. The String constructor will then decode the bytes from that character set into the String's internal Unicode representation.

You can convert the text of a String to bytes in a given character set using the getBytes() method. Here is an example:

byte[] bytes = str.getBytes(Charset.forName("UTF-8"));
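The two conversions above can be combined into a round trip. The following is only a sketch, and the class and method names are hypothetical. It uses StandardCharsets.UTF_8, a constant available since Java 7 that refers to the same charset as Charset.forName("UTF-8"):

```java
import java.nio.charset.StandardCharsets;

class StringRoundTripDemo {

    // Encodes the text as UTF-8 bytes.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Gr" + o-with-stroke + "d": four characters, one of them non-ASCII.
        String original = "Gr\u00F8d";

        byte[] bytes = encode(original);
        String decoded = decode(bytes);

        System.out.println(original.length());        // 4 characters
        System.out.println(bytes.length);             // 5 bytes in UTF-8
        System.out.println(original.equals(decoded)); // true
    }
}
```

The round trip is only lossless when the same character set is used in both directions. Decoding UTF-8 bytes with a different character set will typically garble the non-ASCII characters.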

You can also write Unicode characters directly in strings in the code, by escaping them with \u followed by the four-digit hexadecimal value of the character. Here is an example:

// The Danish letters Æ Ø Å
String myString = "\u00C6\u00D8\u00C5" ;
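As a quick check of what such an escaped string contains, the sketch below lists the code points of the string. The describe() helper is a name I made up for this illustration:

```java
class UnicodeEscapeDemo {

    // Formats each code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(codePoint -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String myString = "\u00C6\u00D8\u00C5";

        System.out.println(myString.length());  // 3
        System.out.println(describe(myString)); // U+00C6 U+00D8 U+00C5
    }
}
```

Each escape becomes exactly one character in the string, so the string has a length of 3, not 18.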

Converting to and from Unicode UTF-8 Using the Reader and Writer Classes

The Reader and Writer classes are stream-oriented classes that enable a Java application to read and write streams of characters. Both classes are explained in my Java IO tutorial. See the Reader and Writer texts there to read more.

Here is an example that uses an InputStreamReader to convert from a certain character set (UTF-8) to Unicode:

InputStream inputStream = new FileInputStream("c:\\data\\utf-8-text.txt");
Reader      reader      = new InputStreamReader(inputStream,
                                                Charset.forName("UTF-8"));

// read() returns one decoded character as an int, or -1 at end of stream
int data = reader.read();
while(data != -1){
    char theChar = (char) data;
    // process theChar here
    data = reader.read();
}

reader.close();

This example creates a FileInputStream and wraps it in an InputStreamReader. The InputStreamReader is told to interpret the bytes in the file as UTF-8 encoded characters. This is done using the second constructor parameter of the InputStreamReader class.
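The same read loop can be written with try-with-resources, so the reader is closed even if reading fails. This is only a sketch. The class name, the method names and the use of a temporary file (instead of the c:\data path used earlier) are my own assumptions, made so that the example can run on its own:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileReadDemo {

    // Reads the whole file as UTF-8 text, one character at a time.
    static String readUtf8(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file.toFile()), StandardCharsets.UTF_8)) {
            int data = reader.read();
            while (data != -1) {
                text.append((char) data);
                data = reader.read();
            }
        }
        return text.toString();
    }

    // Writes the text to a temporary file as UTF-8 bytes, reads it back
    // through readUtf8(), and deletes the file again.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("utf-8-text", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return readUtf8(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = roundTrip("\u00C6\u00D8\u00C5");
        System.out.println(text.length()); // 3 characters, stored as 6 bytes
    }
}
```

Reading one character at a time mirrors the loop above. For larger files, wrapping the InputStreamReader in a BufferedReader would usually be faster.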

Here is an example writing a stream of characters back out to UTF-8:

OutputStream outputStream = new FileOutputStream("c:\\data\\output.txt");
Writer       writer       = new OutputStreamWriter(outputStream,
                                                   Charset.forName("UTF-8"));

writer.write("Hello World");

writer.close();

This example creates an OutputStreamWriter which encodes the characters written through it as UTF-8 bytes.
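To check what the OutputStreamWriter actually puts in the file, you can read the raw bytes back afterwards. The sketch below does that. The class name, the method names and the temporary file are assumptions made for the sake of a runnable example:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileWriteDemo {

    // Writes the text to the file, encoded as UTF-8.
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Writes the text to a temporary file, returns the raw bytes that
    // ended up in the file, and deletes the file again.
    static byte[] writeAndReadBytes(String text) {
        try {
            Path file = Files.createTempFile("output", ".txt");
            try {
                writeUtf8(file, text);
                return Files.readAllBytes(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndReadBytes("Hello World").length); // 11
        System.out.println(writeAndReadBytes("\u00C6").length);      // 2
    }
}
```

"Hello World" contains only ASCII characters, so it takes one byte per character in UTF-8. The letter Æ takes two bytes.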
