Java Internationalization: BreakIterator

Creating a BreakIterator
Character Boundaries
Word Boundaries
- Counting Words in a Specific Language in Java
Sentence Boundaries
Line Boundaries

Jakob Jenkov
Last update: 2014-06-23

The java.text.BreakIterator class is used to find character, word, sentence and line boundaries in text written in different languages. Since languages differ in how they delimit characters, words and sentences, it is not enough to search for spaces, commas, full stops, semicolons, colons etc. You need a locale-aware way to find these boundaries. The BreakIterator class provides that.

Creating a BreakIterator

A single BreakIterator instance can only detect one of the following types of boundaries:

Character boundaries
Word boundaries
Sentence boundaries
Line boundaries

You create an instance that can recognize one of the above boundaries using the corresponding factory method in the BreakIterator class. The factory methods are:

BreakIterator.getCharacterInstance();
BreakIterator.getWordInstance();
BreakIterator.getSentenceInstance();
BreakIterator.getLineInstance();

Each of these methods takes a Locale as parameter and returns a BreakIterator instance. Each method also has a no-argument version that uses the default Locale. Here is a simple example:

Locale locale = Locale.UK;

BreakIterator breakIterator =
    BreakIterator.getCharacterInstance(locale);
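As a rough sketch of how the four factory methods differ, the following example creates one iterator of each type for the same Locale and counts the boundaries each one reports for the same text. The class name BreakIteratorFactoryDemo, the helper countBoundaries() and the sample text are made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK.
class BreakIteratorFactoryDemo {

    // Counts every boundary the iterator reports for the text,
    // including the boundaries at the start and the end of the text.
    static int countBoundaries(BreakIterator iterator, String text) {
        iterator.setText(text);
        int count = 0;
        for (int i = iterator.first(); i != BreakIterator.DONE; i = iterator.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Locale locale = Locale.UK;
        String text = "Hi there. Bye.";

        // One iterator instance per boundary type.
        System.out.println("character: "
                + countBoundaries(BreakIterator.getCharacterInstance(locale), text));
        System.out.println("word:      "
                + countBoundaries(BreakIterator.getWordInstance(locale), text));
        System.out.println("sentence:  "
                + countBoundaries(BreakIterator.getSentenceInstance(locale), text));
        System.out.println("line:      "
                + countBoundaries(BreakIterator.getLineInstance(locale), text));
    }
}
```

The same text yields a different number of boundaries for each iterator type, which is why you have to pick the factory method that matches the kind of boundary you are looking for.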

Character Boundaries

When searching for character boundaries it is necessary to make a distinction between user characters and Unicode characters.

A user character is the character a user would write with a pen. User characters are also typically what the user sees on the screen.

A user character may be represented by one or more Unicode characters. For instance, an accented letter can be stored as a base letter followed by a separate combining accent.

A character instance of the BreakIterator class finds character boundaries for user characters, not Unicode characters.

Here is a simple example that finds character boundaries in a string:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getCharacterInstance(locale);

breakIterator.setText("Mary had a little Android device.");

int boundaryIndex = breakIterator.first();
while(boundaryIndex != BreakIterator.DONE) {
    System.out.println(boundaryIndex) ;
    boundaryIndex = breakIterator.next();
}

This example creates a BreakIterator for the UK English locale, and uses the setText() method to set the text in which to find character breaks.

The method first() returns the first boundary, which is the start of the text. Each call to next() returns the following boundary, or BreakIterator.DONE when there are no more boundaries. Both methods return the boundary as an index into the string, counted in Java char values (UTF-16 code units), not in user characters. Thus, if a user character takes up more than one char, the index increases by the number of chars that user character takes.
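To make the difference between user characters and string indexes concrete, here is a minimal sketch that uses a string containing an accented letter stored as two chars: the letter e followed by the combining acute accent U+0301. The class name CharacterBoundaryDemo and the helper boundaries() are assumptions made for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK.
class CharacterBoundaryDemo {

    // Collects all character boundaries as indexes into the string.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getCharacterInstance(locale);
        iterator.setText(text);

        List<Integer> result = new ArrayList<>();
        for (int i = iterator.first(); i != BreakIterator.DONE; i = iterator.next()) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then 'e' + combining acute accent (one user character), then 'b'.
        String text = "ae\u0301b";

        System.out.println("chars in string: " + text.length());
        System.out.println("boundaries:      " + boundaries(text, Locale.UK));
    }
}
```

The string holds four chars but only three user characters, so the boundary list should skip index 2: there is no boundary between the e and its accent.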

Word Boundaries

When finding word boundaries you need to create a BreakIterator that is capable of finding word boundaries for the specific language needed. Here is how you do that:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getWordInstance(locale);

This code creates a BreakIterator instance that can find word boundaries in UK English text.

Here is an example that finds word boundaries in an English text:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getWordInstance(locale);

breakIterator.setText("Mary had a little Android device.");

int boundaryIndex = breakIterator.first();
while(boundaryIndex != BreakIterator.DONE) {
    System.out.println(boundaryIndex) ;
    boundaryIndex = breakIterator.next();
}

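Again, the first() and next() methods return the index in the string of each word boundary found. Note that the text between two consecutive boundaries is not always a word. It can also be whitespace or a punctuation character.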
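The boundary indexes become more useful once you turn them into substrings. The following sketch extracts the words from a text by taking the text between each pair of consecutive boundaries and keeping only the pieces that start with a letter or digit. The class name WordExtractor and the method words() are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK.
class WordExtractor {

    static List<String> words(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        List<String> words = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                 start = end, end = iterator.next()) {
            // Skip segments that are whitespace or punctuation.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                words.add(text.substring(start, end));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Mary had a little Android device.", Locale.UK));
    }
}
```

Checking only the first character of each segment is a simplification. It is good enough for plain text like the sentence used here, but you may need a stricter test for other input.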

Counting Words in a Specific Language in Java

Here is a Java code example that shows how to count the occurrences of the words in a given string, according to the rules of a specific Locale:

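import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class WordCounter {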

    public static class  WordCount {
        protected String word  = null;
        protected int    count = 0;
    }

    public static Map<String, WordCount> countWords(String text, Locale locale) {
        Map<String, WordCount> wordCounts = new HashMap<String, WordCount>();

        BreakIterator breakIterator = BreakIterator.getWordInstance(locale) ;
        breakIterator.setText(text);

        int wordBoundaryIndex = breakIterator.first();
        int prevIndex         = 0;
        while(wordBoundaryIndex != BreakIterator.DONE){
            String word = text.substring(prevIndex, wordBoundaryIndex).toLowerCase(locale);
            if(isWord(word)) {
                WordCount wordCount = wordCounts.get(word);
                if(wordCount == null) {
                    wordCount = new WordCount();
                    wordCount.word = word;
                }
                wordCount.count++;
                wordCounts.put(word, wordCount);
            }
            prevIndex = wordBoundaryIndex;
            wordBoundaryIndex = breakIterator.next();
        }

        return wordCounts;
    }

    private static boolean isWord(String word) {
        if(word.length() == 1){
            return Character.isLetterOrDigit(word.charAt(0));
        }
        return !"".equals(word.trim());
    }
}

The countWords() method takes a string and a Locale. The Locale represents the language of the string. Thus, when a BreakIterator is created, it can be created for that specific language.

The method counts how many times each word occurs in the string, and returns that as a Map<String, WordCount>. The keys in the map are the individual words in lowercase. The value for each key is a WordCount instance, which contains two variables: the word and the count of that word. If you want the total number of words in the text, you have to sum the counts of all individual words.

Notice how the isWord() method uses the Character.isLetterOrDigit() method to determine if a character is a letter or digit, or something else (like a semicolon, quote etc.). Character.isLetterOrDigit() checks this according to the Unicode character properties, so it works not just for English but also for other languages. This and similar methods are described in more detail in the Character Methods text.
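If you only need the counts, a more compact variant is possible. The sketch below is not the WordCounter class shown above but an alternative that stores the counts directly as Integer values, and adds a helper that sums them to get the total number of words. The names SimpleWordCounter, countWords() and totalWords() are assumptions made for this illustration.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical alternative to the WordCounter class, not part of the JDK.
class SimpleWordCounter {

    static Map<String, Integer> countWords(String text, Locale locale) {
        Map<String, Integer> counts = new HashMap<>();

        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                 start = end, end = iterator.next()) {
            // Only count segments that start with a letter or digit.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                String word = text.substring(start, end).toLowerCase(locale);
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sums the per-word counts to get the total number of words.
    static int totalWords(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The cat and the hat.", Locale.UK);
        System.out.println(counts);
        System.out.println("total words: " + totalWords(counts));
    }
}
```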

Sentence Boundaries

To locate sentence boundaries you need a BreakIterator instance that is capable of finding sentence boundaries. Here is how you do that:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getSentenceInstance(locale);

This code creates a BreakIterator that can find sentence boundaries in UK English text.

Here is an example that finds the sentence boundaries in an English string:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getSentenceInstance(locale);

breakIterator.setText(
        "Mary had a little Android device. " +
        "It had small batteries too.");

int boundaryIndex = breakIterator.first();
while(boundaryIndex != BreakIterator.DONE) {
    System.out.println(boundaryIndex) ;
    boundaryIndex = breakIterator.next();
}
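As with words, the sentence boundaries can be used to cut the text into pieces. Here is a minimal sketch that returns the sentences of a text as a list of strings. The class name SentenceSplitter and the method sentences() are made-up names for this example, and trimming the trailing whitespace of each sentence is a choice made here, not something BreakIterator does.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK.
class SentenceSplitter {

    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                 start = end, end = iterator.next()) {
            // A sentence segment includes the whitespace that follows it,
            // so trim it before storing.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mary had a little Android device. "
                    + "It had small batteries too.";
        for (String sentence : sentences(text, Locale.UK)) {
            System.out.println(sentence);
        }
    }
}
```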

Line Boundaries

You can find the positions in a string where a line of text could be broken onto a new line without disturbing the reading of the text. To do this you need a BreakIterator capable of detecting potential line breaks. Note that it is not limited to the actual line breaks in the text; it finds the positions where a line break would be acceptable. Finding potential line breaks is useful in text editors that need to break text onto multiple lines when displaying it, even if the text contains no explicit line breaks. Here is how you create such a BreakIterator:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getLineInstance(locale);

This example creates a BreakIterator capable of finding potential line breaks in UK English text.

Here is an example that finds potential line breaks in a string with English text:

Locale locale = Locale.UK;
BreakIterator breakIterator =
        BreakIterator.getLineInstance(locale);

breakIterator.setText(
        "Mary had a little Android device.\n " +
        "It had small batteries too.");

int boundaryIndex = breakIterator.first();
while(boundaryIndex != BreakIterator.DONE) {
    System.out.println(boundaryIndex) ;
    boundaryIndex = breakIterator.next();
}
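To illustrate what potential line breaks are good for, here is a rough sketch of a simple word wrapper. It walks through the segments between potential line breaks and starts a new line whenever the next segment would not fit within a maximum width. The class name LineWrapper, the method wrap() and the greedy wrapping strategy are assumptions made for this illustration. The sketch measures width in chars and does not treat explicit line breaks in the text specially.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK.
class LineWrapper {

    static List<String> wrap(String text, int maxWidth, Locale locale) {
        BreakIterator iterator = BreakIterator.getLineInstance(locale);
        iterator.setText(text);

        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                 start = end, end = iterator.next()) {
            String segment = text.substring(start, end);

            // Start a new line if this segment does not fit on the current one.
            // Trailing whitespace of the segment is not counted.
            if (current.length() > 0
                    && current.length() + segment.trim().length() > maxWidth) {
                lines.add(current.toString().trim());
                current.setLength(0);
            }
            // A single segment longer than maxWidth is kept on its own line.
            current.append(segment);
        }
        if (current.length() > 0) {
            lines.add(current.toString().trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "Mary had a little Android device. "
                    + "It had small batteries too.";
        for (String line : wrap(text, 20, Locale.UK)) {
            System.out.println(line);
        }
    }
}
```

A real text editor would measure the rendered width of the text in a given font rather than count chars, but the way the potential line breaks are used would be similar.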

Next: Java Internationalization: Converting to and from Unicode
