Java Internationalization: Collator - Sorting Strings

Customized Collation Rules
Improved Performance Using CollationKeys
Normalizing Text Before Sorting

Jakob Jenkov
Last update: 2014-06-23

Each language can have its own rules for how strings and letters are sorted. The String.compareTo() method compares strings by the numeric Unicode values of their characters, so simply using it may not produce the correct order for all languages.

To sort a collection of strings according to the rules of a certain Locale, you use a java.text.Collator instance created for that specific Locale. Here is an example of how you create a Collator:

Locale   locale = Locale.UK;
Collator collator = Collator.getInstance(locale);

To compare two strings using the Collator instance you call the compare() method. The compare() method returns an int with the same meaning as the String.compareTo() method:

A negative number means that the first string passed as parameter should occur earlier in a sorted sequence than the second string passed as parameter (the first string is "smaller"). A 0 means that the two strings have the same order - e.g. if the strings are equal. A positive number means that the first string should occur later in a sorted sequence than the second string (the first string is "bigger").

Here is how you use the Collator.compare() method:

Locale   locale = Locale.UK;
Collator collator = Collator.getInstance(locale);

int result = collator.compare("A", "B");

The result variable will contain a negative number in the example above. The string A would appear before the string B when sorted according to UK string sorting rules.
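
Because Collator implements the Comparator interface, a Collator instance can also be passed directly to a sort method. The following is a minimal sketch rather than part of the original example; the class name CollatorSortExample, its helper methods and the sample words are made up for illustration. It contrasts the natural String ordering with the ordering of a Collator for Locale.UK:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical example class, not from the original tutorial.
class CollatorSortExample {

    // Sorts a copy of the input using String.compareTo() (character value order).
    static List<String> sortNaturally(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts a copy of the input using the collation rules of the given Locale.
    static List<String> sortWithCollator(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cherry", "apple", "Banana");
        System.out.println(sortNaturally(words));               // [Banana, apple, cherry]
        System.out.println(sortWithCollator(words, Locale.UK)); // [apple, Banana, cherry]
    }
}
```

The natural ordering places Banana first because uppercase letters have lower character values than lowercase letters, whereas the Collator orders the words alphabetically regardless of case.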

Customized Collation Rules

It is possible to customize the rules used to compare strings using the RuleBasedCollator. Here is a simple example:

String rules = "< b < a";

RuleBasedCollator ruleBasedCollator =
        new RuleBasedCollator(rules);

int result = ruleBasedCollator.compare("a", "b");

System.out.println(result);

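The first line of this example defines the rules used to compare the characters of strings. The example above defines that b comes before a. Therefore the last line of the example above will print out 1. The rest of the characters are sorted using the default order of the instantiated RuleBasedCollator. Note that the RuleBasedCollator constructor throws a checked java.text.ParseException if the rule string cannot be parsed, so the code must either catch this exception or declare it.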
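
Here is a self-contained sketch of the same comparison that also handles the ParseException. The class name CustomRuleExample and the helper method collatorFor() are assumptions made for this illustration, and wrapping the exception in an IllegalArgumentException is just one possible choice:

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical example class, not from the original tutorial.
class CustomRuleExample {

    // Builds a collator from a rule string. An invalid rule string is
    // rethrown as an unchecked exception to keep the calling code short.
    static RuleBasedCollator collatorFor(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = collatorFor("< b < a");

        System.out.println(collator.compare("a", "b")); // positive: a sorts after b
        System.out.println(collator.compare("b", "a")); // negative: b sorts before a
        System.out.println(collator.compare("a", "a")); // 0: the strings are equal
    }
}
```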

Grouping Characters

You can group characters by separating them with a comma in the rule string. Here is an example that groups uppercase and lowercase characters together:

String rules = "< c,C < b,B";

RuleBasedCollator ruleBasedCollator =
        new RuleBasedCollator(rules);

int result = ruleBasedCollator.compare("boss", "Carol");
System.out.println(result);

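The first line of this example defines that both uppercase and lowercase C's are to appear before both uppercase and lowercase B's when comparing strings. Since Carol starts with a C and boss starts with a b, Carol sorts before boss, and the example prints 1.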
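
In the rule syntax, the comma marks a tertiary difference, which is weaker than the primary difference marked by <. The sketch below illustrates what that means in practice by comparing c and C at two different strengths. The class name GroupingExample and the helper collatorWithStrength() are made up for this illustration; setStrength() and the strength constants are part of java.text.Collator:

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical example class, not from the original tutorial.
class GroupingExample {

    static final String RULES = "< c,C < b,B";

    // Creates a collator for RULES that compares at the given strength.
    static RuleBasedCollator collatorWithStrength(int strength) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            collator.setStrength(strength);
            return collator;
        } catch (ParseException e) {
            throw new IllegalStateException("RULES could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator tertiary = collatorWithStrength(Collator.TERTIARY);
        RuleBasedCollator primary  = collatorWithStrength(Collator.PRIMARY);

        // Both c and C sort before b, so "boss" comes after "Carol".
        System.out.println(tertiary.compare("boss", "Carol")); // positive

        // At tertiary strength c and C still differ, in the order listed.
        System.out.println(tertiary.compare("c", "C"));        // negative

        // At primary strength the grouped characters compare as equal.
        System.out.println(primary.compare("c", "C"));         // 0
    }
}
```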

Combinations of Characters Interpreted as One Character

You can specify that some combinations of characters are to be interpreted as one character when ordering strings. Here is an example:

String rules = "< ch < b < a < c";

RuleBasedCollator ruleBasedCollator =
        new RuleBasedCollator(rules);

int result = ruleBasedCollator.compare("boss", "carol");
System.out.println(result);

result = ruleBasedCollator.compare("boss", "charlie");
System.out.println(result);

The first line of this example defines that the character combination ch is to be interpreted as one character when sorting strings. Also, ch is to occur before b, which in turn occurs before a, which in turn occurs before c.

The output printed from this code is:

-1
1

The string boss is to occur before the string carol. However, the string boss is to occur after charlie, since ch appears before b according to the rules specified in the first line of the example.
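
The same rules can be used to sort a whole list, since a RuleBasedCollator is also a Comparator. This is a minimal sketch under the rules shown above; the class name ContractionSortExample, the helper sortWithRules() and the extra name andy are assumptions made for illustration:

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class, not from the original tutorial.
class ContractionSortExample {

    // Returns a sorted copy of the names, ordered by the given rule string.
    static List<String> sortWithRules(String rules, List<String> names) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(names);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("carol", "boss", "charlie", "andy");

        // ch < b < a < c, so charlie comes first and carol comes last.
        System.out.println(sortWithRules("< ch < b < a < c", names));
    }
}
```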

Full Rule Syntax

There are more rules you can use with the RuleBasedCollator class. You can see the full specification of rules in the RuleBasedCollator JavaDoc.

Improved Performance Using CollationKeys

If you need to sort the same strings several times, you can create a CollationKey for each string and sort based on the keys instead of the strings. Sorting based on the CollationKey is done using bitwise comparison. This is faster than the string-wise comparison the RuleBasedCollator normally uses.

Creating a CollationKey takes time. If you only sort the strings once, it is faster to just use the RuleBasedCollator as it is.

Here is an example of how to create CollationKey instances and sort strings using them:

String rules = "< c,C < b,B < a,A";

RuleBasedCollator ruleBasedCollator =
        new RuleBasedCollator(rules);

CollationKey[] collationKeys = new CollationKey[3];

collationKeys[0] = ruleBasedCollator.getCollationKey("boss");
collationKeys[1] = ruleBasedCollator.getCollationKey("carol");
collationKeys[2] = ruleBasedCollator.getCollationKey("andy");

Arrays.sort(collationKeys);

for(CollationKey collationKey : collationKeys) {
    System.out.println(collationKey.getSourceString());
}

The output printed from this code is:

carol
boss
andy
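
The key-based approach works with any Collator, not only a RuleBasedCollator. The sketch below wraps the steps above in a reusable method; the class name CollationKeySortExample and the method sortUsingKeys() are made up for illustration. Keep in mind that CollationKey instances should only be compared with keys created by the same Collator:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical example class, not from the original tutorial.
class CollationKeySortExample {

    // Converts each string to a CollationKey once, sorts the keys, and
    // returns the source strings in sorted order.
    static List<String> sortUsingKeys(Collator collator, List<String> strings) {
        CollationKey[] keys = new CollationKey[strings.size()];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = collator.getCollationKey(strings.get(i));
        }

        // CollationKey implements Comparable, so no Comparator is needed.
        Arrays.sort(keys);

        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.UK);
        List<String> fruits = Arrays.asList("pear", "Apple", "banana");
        System.out.println(sortUsingKeys(collator, fruits)); // [Apple, banana, pear]
    }
}
```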

Normalizing Text Before Sorting

In Unicode, some characters can be represented in multiple ways. Some have their own code point but can also be represented as a combination of other Unicode characters. For example, é can be written either as the single character U+00E9 or as the letter e followed by the combining acute accent U+0301. When characters can be represented in multiple ways, sorting them becomes harder. Therefore you should normalize the text before you sort it, or search in it for that matter. Normalizing the text makes sure that a given string of Unicode characters is always represented in the same way, a way which is search and sort friendly.

You normalize a string using the static normalize() method of the java.text.Normalizer class. Here is an example:

String normalizedText =
    Normalizer.normalize("Text to normalize",
        Normalizer.Form.NFD);

The first parameter to the normalize() method is the text to normalize. Any CharSequence can be used.

The second parameter is the normalization form to normalize the text to. When you normalize text you have to choose one of four different normalization forms: NFC, NFD, NFKC or NFKD. When you normalize text with the intent of sorting it, make sure that you normalize all text to the same normalization form. The meaning of each form is explained in the Java Normalizer Tutorial at Oracle.
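
To see why the normalization form matters, consider a character that has two representations. The following sketch is an illustration only; the class name NormalizeExample and the helper toNfc() are assumptions, and NFC is chosen here simply as one of the available forms:

```java
import java.text.Normalizer;

// Hypothetical example class, not from the original tutorial.
class NormalizeExample {

    // Normalizes the text to NFC (canonical decomposition followed by composition).
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed   = "\u00e9";  // é as a single code point
        String decomposed = "e\u0301"; // e followed by a combining acute accent

        // The two strings look the same but contain different characters.
        System.out.println(composed.equals(decomposed));               // false

        // After normalizing both to the same form they are equal.
        System.out.println(toNfc(composed).equals(toNfc(decomposed))); // true

        // NFD splits the composed character into its two parts.
        System.out.println(
            Normalizer.normalize(composed, Normalizer.Form.NFD).length()); // 2
    }
}
```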

