Java Internationalization: BreakIterator
Jakob Jenkov
The java.text.BreakIterator class is used to find character, word, sentence and line boundaries in text written in different languages. Because languages differ in where these boundaries fall, it is not enough to search for spaces, commas, full stops, semicolons, colons and so on. You need a reliable, locale-aware way to find these boundaries, and the BreakIterator class provides that.
Creating a BreakIterator
A single BreakIterator instance can only detect one of the following types of boundaries:
- Character boundaries
- Word boundaries
- Sentence boundaries
- Line boundaries
You create an instance that can recognize one of the above boundaries using the corresponding factory method in the BreakIterator class. The factory methods are:
    BreakIterator.getCharacterInstance();
    BreakIterator.getWordInstance();
    BreakIterator.getSentenceInstance();
    BreakIterator.getLineInstance();
Each of these methods returns a BreakIterator instance. Called without arguments, they use the default Locale. Each method also has an overload that takes a Locale as parameter. Here is a simple example:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getCharacterInstance(locale);
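As a rough illustration of how the four iterator types differ, the sketch below runs each of them over the same short string and collects the boundary indexes it reports. The class name BoundaryKinds and the helper method boundaries() are made up for this example and are not part of the Java API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BoundaryKinds {

    // Hypothetical helper: collects every boundary index the iterator reports.
    static List<Integer> boundaries(BreakIterator iterator, String text) {
        iterator.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int index = iterator.first();
             index != BreakIterator.DONE;
             index = iterator.next()) {
            result.add(index);
        }
        return result;
    }

    public static void main(String[] args) {
        String text   = "Hi there.";
        Locale locale = Locale.UK;

        // One iterator per boundary type; each instance detects one type only.
        System.out.println("character: "
            + boundaries(BreakIterator.getCharacterInstance(locale), text));
        System.out.println("word:      "
            + boundaries(BreakIterator.getWordInstance(locale), text));
        System.out.println("sentence:  "
            + boundaries(BreakIterator.getSentenceInstance(locale), text));
        System.out.println("line:      "
            + boundaries(BreakIterator.getLineInstance(locale), text));
    }
}
```

The same text yields a different list of indexes for each iterator type, which is why you have to pick the factory method that matches the kind of boundary you are after.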
Character Boundaries
When searching for character boundaries it is necessary to distinguish between user characters and Unicode characters.
A user character is what a user would write as a single character with a pen. User characters are also typically what the user sees on the screen.
It may take one or more Unicode characters to represent a user character. For example, an "e" followed by a combining acute accent is displayed as the single user character "é".
A character instance of the BreakIterator class finds character boundaries for user characters, not Unicode characters.
Here is a simple example that finds character boundaries in a string:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getCharacterInstance(locale);

    breakIterator.setText("Mary had a little Android device.");

    int boundaryIndex = breakIterator.first();
    while(boundaryIndex != BreakIterator.DONE) {
        System.out.println(boundaryIndex);
        boundaryIndex = breakIterator.next();
    }
This example creates a BreakIterator for the UK English locale, and sets the text to find character breaks in using the setText() method.
The first() method returns the first boundary, which is the start of the text. Each call to next() returns the following boundary, until it returns BreakIterator.DONE when there are no more boundaries. Both methods return an index into the string counted in Java char values, not a count of user characters. Thus, if a user character takes up more than one char, the index increases by the number of chars that user character takes up.
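To make the difference between user characters and char indexes concrete, here is a minimal sketch that cuts a string into user characters by taking the substring between each pair of consecutive boundaries. The class name UserCharacters and the method split() are invented for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class UserCharacters {

    // Hypothetical helper: returns one string per user character.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getCharacterInstance(locale);
        iterator.setText(text);

        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            // Everything between two boundaries is one user character.
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent), then 'x'.
        String text = "e\u0301x";
        for (String userCharacter : split(text, Locale.UK)) {
            System.out.println(userCharacter + " uses "
                + userCharacter.length() + " char(s)");
        }
    }
}
```

The string has a length() of 3, but the iterator reports only two user characters, the first of which spans two chars.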
Word Boundaries
When finding word boundaries you need to create a BreakIterator that is capable of finding word boundaries for the specific language needed. Here is how you do that:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getWordInstance(locale);
This code creates a BreakIterator instance that can find word boundaries in UK English texts.
Here is an example that finds word boundaries in an English text:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getWordInstance(locale);

    breakIterator.setText("Mary had a little Android device.");

    int boundaryIndex = breakIterator.first();
    while(boundaryIndex != BreakIterator.DONE) {
        System.out.println(boundaryIndex);
        boundaryIndex = breakIterator.next();
    }
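Again, the first() and next() methods return the char index of the found word boundary. Boundaries are reported on both sides of each word, so the spaces and punctuation between words show up as segments of their own.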
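The boundary indexes are usually only a means to an end. As a small sketch of how they might be turned into actual words, the code below takes the text between each pair of boundaries and keeps only the segments that start with a letter or digit. The class WordExtractor and its words() method are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordExtractor {

    // Hypothetical helper: returns the words of the text in order.
    static List<String> words(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        List<String> words = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            String segment = text.substring(start, end);
            // Segments made of whitespace or punctuation are skipped.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(
            words("Mary had a little Android device.", Locale.UK));
    }
}
```

For the example sentence this yields the six words Mary, had, a, little, Android and device, without the spaces and the final full stop.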
Counting Words in a Specific Language in Java
Here is a Java code example that shows how to count the occurrences of the words in a given string, according to the rules of a specific Locale:
    import java.text.BreakIterator;
    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class WordCounter {

        public static class WordCount {
            protected String word  = null;
            protected int    count = 0;
        }

        public static Map<String, WordCount> countWords(String text, Locale locale) {
            Map<String, WordCount> wordCounts = new HashMap<String, WordCount>();

            BreakIterator breakIterator = BreakIterator.getWordInstance(locale);
            breakIterator.setText(text);

            int wordBoundaryIndex = breakIterator.first();
            int prevIndex         = 0;
            while(wordBoundaryIndex != BreakIterator.DONE) {
                // The text between two boundaries is either a word or a
                // separator (whitespace, punctuation). Lowercase it using
                // the rules of the given locale.
                String word = text.substring(prevIndex, wordBoundaryIndex)
                                  .toLowerCase(locale);
                if(isWord(word)) {
                    WordCount wordCount = wordCounts.get(word);
                    if(wordCount == null) {
                        wordCount = new WordCount();
                        wordCount.word = word;
                    }
                    wordCount.count++;
                    wordCounts.put(word, wordCount);
                }
                prevIndex = wordBoundaryIndex;
                wordBoundaryIndex = breakIterator.next();
            }

            return wordCounts;
        }

        // A single character counts as a word only if it is a letter or digit.
        // Longer segments count as words unless they are blank.
        private static boolean isWord(String word) {
            if(word.length() == 1) {
                return Character.isLetterOrDigit(word.charAt(0));
            }
            return !"".equals(word.trim());
        }
    }
The countWords() method takes a string and a Locale. The Locale represents the language of the string, so the BreakIterator can be created for that specific language.
The method counts how many times each word occurs in the string, and returns the result as a Map<String, WordCount>. The keys in the map are the individual words in lowercase. The value for each key is a WordCount instance, which contains two variables: the word and the count of that word.
If you want the total number of words in the text, you have to sum the counts of all the individual words.
Notice how the isWord() method uses the Character.isLetterOrDigit() method to determine whether a single-character segment is a letter or digit, or something else (a semicolon, a quote etc.). Character.isLetterOrDigit() checks the character against the Unicode character categories, so it works not just for English but for other languages too. This and similar methods are described in more detail in the Character Methods text.
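Summing the counts could look roughly like the sketch below. To keep the example self-contained it does not reuse the WordCounter class above. Instead it uses a simplified variant that stores plain Integer counts in the map. The class WordTotals and its methods are hypothetical names for this illustration.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class WordTotals {

    // Simplified counter: maps each lowercased word to its number of occurrences.
    static Map<String, Integer> countWords(String text, Locale locale) {
        Map<String, Integer> counts = new HashMap<>();
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            String segment = text.substring(start, end);
            // Skip segments that are whitespace or punctuation.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                counts.merge(segment.toLowerCase(locale), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Total number of words = sum of the counts of the individual words.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int count : counts.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords("The cat and the hat.", Locale.UK);
        System.out.println(counts.size() + " distinct words, "
            + total(counts) + " words in total");
    }
}
```

Here "the" occurs twice, so the map holds four distinct words while the total is five.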
Sentence Boundaries
To locate sentence boundaries you need a BreakIterator instance that is capable of finding sentence boundaries. Here is how you do that:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getSentenceInstance(locale);
This code creates a BreakIterator for UK English.
Here is an example that finds the sentence boundaries in an English string:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getSentenceInstance(locale);

    breakIterator.setText(
        "Mary had a little Android device. " +
        "It had small batteries too.");

    int boundaryIndex = breakIterator.first();
    while(boundaryIndex != BreakIterator.DONE) {
        System.out.println(boundaryIndex);
        boundaryIndex = breakIterator.next();
    }
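If you need the sentences themselves rather than the indexes, one possible approach is to take the substring between each pair of boundaries and trim the whitespace that separates the sentences. The following is only a sketch. The class SentenceSplitter and its sentences() method are names invented for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitter {

    // Hypothetical helper: returns the sentences of the text in order.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            // A segment includes the whitespace after the sentence, so trim it.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mary had a little Android device. "
                    + "It had small batteries too.";
        for (String sentence : sentences(text, Locale.UK)) {
            System.out.println(sentence);
        }
    }
}
```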
Line Boundaries
You can find the places in a string where a line of text could be broken onto a new line without disturbing the reading of the text. To do this you need a BreakIterator capable of detecting potential line breaks. Note that it does not find actual line breaks in the text, but potential line breaks. Finding potential line breaks is useful in text editors that need to break text onto multiple lines when displaying it, even if the text contains no explicit line breaks.
Here is how you create such a BreakIterator:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getLineInstance(locale);
This example creates a BreakIterator capable of finding potential line breaks in UK English text.
Here is an example that finds potential line breaks in a string with English text:
    Locale locale = Locale.UK;
    BreakIterator breakIterator = BreakIterator.getLineInstance(locale);

    breakIterator.setText(
        "Mary had a little Android device.\n " +
        "It had small batteries too.");

    int boundaryIndex = breakIterator.first();
    while(boundaryIndex != BreakIterator.DONE) {
        System.out.println(boundaryIndex);
        boundaryIndex = breakIterator.next();
    }
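To show how potential line breaks might be used when displaying text, here is a minimal sketch of a greedy word wrap: it adds segments to the current line until the next one would exceed a maximum width, then starts a new line. The class LineWrapper and its wrap() method are hypothetical. The sketch also assumes that width can be measured as a number of chars, and it gives explicit line breaks in the text no special treatment. A real text editor would measure the rendered width of the text instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LineWrapper {

    // Hypothetical helper: greedy wrap at potential line breaks.
    // Width is counted in chars, which is a simplification.
    static List<String> wrap(String text, int maxWidth, Locale locale) {
        BreakIterator iterator = BreakIterator.getLineInstance(locale);
        iterator.setText(text);

        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();

        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            // A segment is the text up to the next potential line break,
            // typically a word plus the whitespace that follows it.
            String segment = text.substring(start, end);
            int visibleLength = segment.trim().length();
            if (line.length() > 0 && line.length() + visibleLength > maxWidth) {
                lines.add(line.toString().trim());
                line.setLength(0);
            }
            line.append(segment);
        }
        if (line.length() > 0) {
            lines.add(line.toString().trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("Mary had a little Android device.", 12, Locale.UK)) {
            System.out.println(line);
        }
    }
}
```

A segment that is longer than the maximum width on its own still ends up on a single, overlong line, because the sketch only breaks at the positions the BreakIterator reports.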