Java Internationalization: Converting to and from Unicode
Jakob Jenkov
Internally, Java keeps all strings in Unicode (as UTF-16 code units). Since not all text received from users or the outside world is in Unicode, your application may have to convert from non-Unicode to Unicode. Additionally, when the application outputs text, it may have to convert the internal Unicode format to whatever format the outside world needs.
Java has a few different mechanisms you can use to convert text to and from Unicode. These mechanisms are:
- The String class
- The Reader and Writer classes and subclasses
I will explain both mechanisms in the sections below.
UTF-8
First of all, I would like to clarify that Unicode consists of a set of "code points", which are basically numerical values that each correspond to a given character. There are several ways to "encode" these code points (numerical values) into bytes. The two most common ones are UTF-8 and UTF-16. In this tutorial I will only show examples of converting to UTF-8, since this seems to be the most commonly used Unicode encoding.
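To make the difference between code points and their byte encodings concrete, here is a minimal sketch that counts how many bytes a few sample characters need in UTF-8 and in UTF-16. The class name EncodingSizes, its helper methods, and the sample characters are assumptions chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name and the samples are illustrative only.
class EncodingSizes {

    // Number of bytes needed to encode the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed in UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // The letter A, the Danish letter AE (U+00C6) and the euro sign (U+20AC).
        String[] samples = { "A", "\u00C6", "\u20AC" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The same code point can take a different number of bytes depending on the encoding: these three samples need 1, 2 and 3 bytes in UTF-8, but 2 bytes each in UTF-16.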
Converting to and from Unicode UTF-8 Using the String Class
You can use the String class to convert a byte array to a String instance. You do so using the constructor of the String class. Here is an example:
    import java.nio.charset.Charset;

    byte[] bytes = new byte[10];
    String str = new String(bytes, Charset.forName("UTF-8"));
    System.out.println(str);
This example first creates a byte array. The byte array does not actually contain any sensible data (its ten bytes are all zero), but for the sake of the example, that does not matter. The example then creates a new String, passing the byte array and the character set of the characters in the byte array as parameters to the constructor. The String constructor will then convert the bytes from the character set of the byte array to Unicode.
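As a sketch of why the character set passed to the constructor matters, the following example decodes the same two bytes once as UTF-8 and once as ISO-8859-1. The byte values (the UTF-8 encoding of the Danish letter Æ), the class name DecodeBytes, and the helper methods are assumptions made for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original tutorial.
class DecodeBytes {

    // Interprets the bytes as UTF-8 encoded text.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Interprets the same bytes as ISO-8859-1 (Latin-1) encoded text.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC3 0x86 is the UTF-8 encoding of the Danish letter AE (U+00C6).
        byte[] bytes = { (byte) 0xC3, (byte) 0x86 };

        System.out.println(decodeUtf8(bytes).length());   // prints 1
        System.out.println(decodeLatin1(bytes).length()); // prints 2
    }
}
```

Decoded as UTF-8, the two bytes become the single character Æ. Decoded as ISO-8859-1, each byte becomes its own character, which is the typical source of garbled text when the wrong character set is assumed.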
You can convert the text of a String to bytes in another encoding using the getBytes() method. Here is an example:
    byte[] bytes = str.getBytes(Charset.forName("UTF-8"));
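The two directions can be combined into a round trip: encode a String with getBytes() and decode the result with the String constructor. The sketch below, using a hypothetical RoundTrip class, suggests that the trip is lossless with UTF-8 but not with a character set that cannot represent the characters, such as US-ASCII, where getBytes() substitutes a replacement byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name and sample text are illustrative only.
class RoundTrip {

    // Encodes the text to bytes with the given charset and decodes it again.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String danish = "\u00C6\u00D8\u00C5"; // three Danish letters

        // UTF-8 can represent every Unicode code point, so nothing is lost.
        System.out.println(danish.equals(roundTrip(danish, StandardCharsets.UTF_8))); // prints true

        // US-ASCII cannot represent these letters; each one becomes '?'.
        System.out.println(roundTrip(danish, StandardCharsets.US_ASCII)); // prints ???
    }
}
```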
You can also write Unicode characters directly in strings in the code by escaping them with \u followed by the character's four-digit hexadecimal code. Here is an example:
    // The Danish letters Æ Ø Å
    String myString = "\u00C6\u00D8\u00C5";
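To check what such an escaped string actually contains, you can list its code points. The following is a small sketch; the class UnicodeEscapes and the method codePointsOf are hypothetical names introduced for this example.

```java
import java.util.stream.Collectors;

// Hypothetical demo class; not part of the original tutorial.
class UnicodeEscapes {

    // Formats every code point of the text in the usual U+XXXX notation.
    static String codePointsOf(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Three escaped characters: the Danish letters AE, O-slash and A-ring.
        String myString = "\u00C6\u00D8\u00C5";

        System.out.println(myString.length());      // prints 3
        System.out.println(codePointsOf(myString)); // prints U+00C6 U+00D8 U+00C5
    }
}
```

The compiler translates each escape into a single character, so the string has a length of 3 even though the source code spells it with 18 characters.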
Converting to and from Unicode UTF-8 Using the Reader and Writer Classes
The Reader and Writer classes are stream-oriented classes that enable a Java application to read and write streams of characters. Both classes are explained in my Java IO tutorial. Go to Reader or Writer to read more.
Here is an example that uses an InputStreamReader to convert from a certain character set (UTF-8) to Unicode:
    InputStream inputStream = new FileInputStream("c:\\data\\utf-8-text.txt");
    Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));

    int data = reader.read();
    while (data != -1) {
        char theChar = (char) data;
        data = reader.read();
    }

    reader.close();
This example creates a FileInputStream and wraps it in an InputStreamReader. The InputStreamReader is told to interpret the bytes in the file as UTF-8 encoded characters. This is done using the second constructor parameter of the InputStreamReader class.
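The same reading pattern can be tried without a file on disk. The sketch below reads UTF-8 bytes from an in-memory ByteArrayInputStream and uses try-with-resources so the reader is closed even if reading fails. The class ReadUtf8, the method readAll, and the sample bytes are assumptions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original tutorial.
class ReadUtf8 {

    // Reads the whole stream, decoding its bytes as UTF-8.
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream).
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int data;
            while ((data = reader.read()) != -1) {
                text.append((char) data);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // UTF-8 encoding of three Danish letters: two bytes per letter.
        byte[] utf8 = { (byte) 0xC3, (byte) 0x86, (byte) 0xC3, (byte) 0x98,
                        (byte) 0xC3, (byte) 0x85 };

        String text = readAll(new ByteArrayInputStream(utf8));
        System.out.println(text.length()); // prints 3
    }
}
```

Six bytes go in and three characters come out, because the InputStreamReader combines each two-byte UTF-8 sequence into one character.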
Here is an example writing a stream of characters back out to UTF-8:
    OutputStream outputStream = new FileOutputStream("c:\\data\\output.txt");
    Writer writer = new OutputStreamWriter(outputStream, Charset.forName("UTF-8"));

    writer.write("Hello World");
    writer.close();
This example creates an OutputStreamWriter which encodes the string written through it as UTF-8.
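To see the bytes an OutputStreamWriter produces, you can write to an in-memory ByteArrayOutputStream instead of a file. This is a minimal sketch; the class WriteUtf8 and the method encode are hypothetical names used only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original tutorial.
class WriteUtf8 {

    // Writes the text through an OutputStreamWriter and returns the UTF-8 bytes.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters to the stream.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode("Hello World").length); // prints 11
        System.out.println(encode("\u00C6").length);      // prints 2
    }
}
```

Plain ASCII text takes one byte per character in UTF-8, while the Danish letter takes two. Note that the bytes are only read after the writer has been closed, since an OutputStreamWriter may buffer output until it is flushed or closed.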