Java Internationalization: Collator - Sorting Strings
Jakob Jenkov |
Each language can have its own rules for how strings and letters are sorted. The String.compareTo() method compares strings by the numeric values of their characters, so simply using it may not give the expected order for all languages.
To sort a collection of strings according to the rules of a certain Locale, you use a java.text.Collator instance created for that specific Locale.
Here is an example of how you create a Collator:
    Locale locale = Locale.UK;
    Collator collator = Collator.getInstance(locale);
To compare two strings using the Collator instance you call the compare() method. The compare() method returns an int with the same meaning as the String.compareTo() method:
A negative number means that the first string passed as parameter should occur earlier in a sorted sequence than the second string passed as parameter (the first string is "smaller"). A 0 means that the two strings have the same order - e.g. if the strings are equal. A positive number means that the first string should occur later in a sorted sequence than the second string (the first string is "bigger").
Here is how you use the Collator.compare() method:
    Locale locale = Locale.UK;
    Collator collator = Collator.getInstance(locale);
    int result = collator.compare("A", "B");
The result variable will contain a negative number in the example above. The string A would appear before the string B when sorted according to UK string sorting rules.
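Since Collator implements the Comparator interface, an instance can be passed directly to the standard sort methods. The following is a minimal sketch of that idea; the class name CollatorSortExample and its two helper methods are made up for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollatorSortExample {

    // Returns a sorted copy using the natural String order (String.compareTo()).
    static List<String> sortNaturally(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    // Returns a sorted copy using the collation rules of the given Locale.
    static List<String> sortWithCollator(List<String> input, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(input);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("apple", "Banana", "cherry");

        // Uppercase 'B' has a lower character value than lowercase 'a'.
        System.out.println(sortNaturally(names));               // [Banana, apple, cherry]

        // The UK collator orders by letter first, so "apple" comes before "Banana".
        System.out.println(sortWithCollator(names, Locale.UK)); // [apple, Banana, cherry]
    }
}
```

The difference between the two results shows why String.compareTo() alone is often not what a user expects from an alphabetical list.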
Customized Collation Rules
It is possible to customize the rules used to compare strings using the RuleBasedCollator. Here is a simple example:
    String rules = "< b < a";
    // The constructor throws a checked java.text.ParseException if the rules are invalid.
    RuleBasedCollator ruleBasedCollator = new RuleBasedCollator(rules);
    int result = ruleBasedCollator.compare("a", "b");
    System.out.println(result);
The first line of this example defines the rules used to compare the characters of strings. The example above defines that b comes before a. Therefore the last line of the example above will print out 1. According to the RuleBasedCollator JavaDoc, characters not mentioned in the rules are placed at the end of the collation order.
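The RuleBasedCollator constructor declares a checked ParseException, so a complete program has to handle it. Here is a small sketch of one way to do that; the class CustomRulesExample and the helper createCollator() are hypothetical names, and wrapping the exception in an IllegalArgumentException is just one possible choice.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class CustomRulesExample {

    // Builds a collator from a rule string and converts the checked
    // ParseException into an unchecked exception (an illustrative choice).
    static RuleBasedCollator createCollator(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = createCollator("< b < a");

        // b is defined to come before a, so "a" is the "bigger" string.
        System.out.println(collator.compare("a", "b") > 0); // true
        System.out.println(collator.compare("b", "a") < 0); // true
    }
}
```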
Grouping Characters
You can group characters by separating them with a comma in the rule string. In the rule syntax a comma marks a tertiary difference, which is typically used for uppercase and lowercase variants of the same letter, whereas < marks a primary difference. Here is an example that groups uppercase and lowercase characters together:
    String rules = "< c,C < b,B";
    RuleBasedCollator ruleBasedCollator = new RuleBasedCollator(rules);
    int result = ruleBasedCollator.compare("boss", "Carol");
    System.out.println(result);
The first line of this example defines that both uppercase and lowercase C's are to appear before both uppercase and lowercase B's when comparing strings.
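To make the effect of the comma more concrete, the sketch below compares strings with the same rules at two different collator strengths, using the setStrength() method that RuleBasedCollator inherits from Collator. The class GroupingExample and its create() helper are illustrative names, not part of the Java API.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class GroupingExample {

    // Creates a collator where c/C sort before b/B, at the given strength.
    static RuleBasedCollator create(int strength) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator("< c,C < b,B");
            collator.setStrength(strength);
            return collator;
        } catch (ParseException e) {
            throw new IllegalStateException(e); // the rule string above is fixed
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator tertiary = create(Collator.TERTIARY);
        RuleBasedCollator primary  = create(Collator.PRIMARY);

        // Both C variants come before both B variants.
        System.out.println(tertiary.compare("boss", "Carol") > 0); // true

        // At tertiary strength, c and C still differ (c comes first).
        System.out.println(tertiary.compare("c", "C") < 0);        // true

        // At primary strength, the case difference is ignored.
        System.out.println(primary.compare("c", "C") == 0);        // true
    }
}
```

Lowering the strength to PRIMARY is one way to get case-insensitive comparisons for the characters grouped with a comma.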
Combinations of Characters Interpreted as One Character
You can specify that some combinations of characters are to be interpreted as one character when ordering strings. Here is an example:
    String rules = "< ch < b < a < c";
    RuleBasedCollator ruleBasedCollator = new RuleBasedCollator(rules);
    int result = ruleBasedCollator.compare("boss", "carol");
    System.out.println(result);
    result = ruleBasedCollator.compare("boss", "charlie");
    System.out.println(result);
The first line of this example defines that the character combination ch is to be interpreted as one character when sorting strings. Also, ch is to occur before b, which again occurs before a, which again occurs before c.
The output printed from this code is:
    -1
    1
The string boss is to occur before the string carol. However, the string boss is to occur after charlie, since ch appears before b according to the rules specified in the first line of the example.
Full Rule Syntax
There are more rules you can use with the RuleBasedCollator class. You can see the full specification of rules in the RuleBasedCollator JavaDoc.
Improved Performance Using CollationKeys
If you need to sort and re-sort the same strings several times, you can create a CollationKey for each string and sort based on that instead of the strings. Sorting based on the CollationKey is done using bitwise comparison. This is faster than the string-wise comparison the RuleBasedCollator uses normally.
Creating a CollationKey takes time. If you only sort the strings once, it is faster to just use the RuleBasedCollator as it is.
Here is an example of how to create CollationKey instances and sort strings using them:
    String rules = "< c,C < b,B < a,A";
    RuleBasedCollator ruleBasedCollator = new RuleBasedCollator(rules);
    CollationKey[] collationKeys = new CollationKey[3];
    collationKeys[0] = ruleBasedCollator.getCollationKey("boss");
    collationKeys[1] = ruleBasedCollator.getCollationKey("carol");
    collationKeys[2] = ruleBasedCollator.getCollationKey("andy");
    Arrays.sort(collationKeys);
    for(CollationKey collationKey : collationKeys) {
        System.out.println(collationKey.getSourceString());
    }
The output printed from this code is:
    carol
    boss
    andy
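CollationKey instances are not limited to RuleBasedCollator; any Collator can create them through getCollationKey(). As a rough sketch, the helper below sorts a list of strings through their keys using a Locale-based Collator. The class CollationKeySortExample and the method sortByKeys() are hypothetical names chosen for this example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollationKeySortExample {

    // Sorts the strings by comparing precomputed CollationKey instances.
    static List<String> sortByKeys(List<String> input, Collator collator) {
        // Create each key once, up front.
        List<CollationKey> keys = new ArrayList<>();
        for (String text : input) {
            keys.add(collator.getCollationKey(text));
        }

        // CollationKey implements Comparable<CollationKey>.
        Collections.sort(keys);

        // Map the sorted keys back to the strings they were created from.
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.UK);
        List<String> names = Arrays.asList("boss", "carol", "andy");
        System.out.println(sortByKeys(names, collator)); // [andy, boss, carol]
    }
}
```

Only keys created by the same Collator should be compared with each other, so in a real application the keys would be kept together with the Collator that produced them.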
Normalizing Text Before Sorting
In Unicode, some characters can be represented in multiple ways. Some have their own code point and can also be represented as a combination of other Unicode characters. For instance, é can be stored as the single character U+00E9 or as the letter e followed by the combining acute accent U+0301. When characters can be represented in multiple ways, sorting them becomes harder. Therefore you should normalize the text before you sort it, or search in it for that matter. Normalizing the text makes sure that a given string of Unicode characters is always represented in the same way - a way which is search and sort friendly.
You normalize a string using the static normalize() method of the java.text.Normalizer class. Here is an example:
String normalizedText = Normalizer.normalize("Text to normalize", Normalizer.Form.NFD);
The first parameter to the normalize() method is the text to normalize. Any CharSequence can be used.
The second parameter is the normalization form to normalize the text to. When you normalize text you have to choose one of four different normalization forms: NFC, NFD, NFKC or NFKD. When you normalize text with the intent of sorting it, make sure that you normalize all text to the same normalization form. The Java Normalizer Tutorial at Oracle explains what these forms mean.
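To illustrate why the normalization form matters, here is a small sketch that builds the letter é in two different ways and compares the results before and after normalization. The class NormalizeExample and its helper methods toNfc() and toNfd() are illustrative names only.

```java
import java.text.Normalizer;

public class NormalizeExample {

    // Normalizes to the composed form (NFC).
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Normalizes to the decomposed form (NFD).
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed   = "\u00e9";  // é as a single character
        String decomposed = "e\u0301"; // e followed by a combining acute accent

        // The two strings look the same on screen but are not equal as Strings.
        System.out.println(composed.equals(decomposed));               // false

        // After normalizing both to the same form they are equal.
        System.out.println(toNfc(composed).equals(toNfc(decomposed))); // true

        // NFD splits the single character into base letter plus accent.
        System.out.println(toNfd(composed).length());                  // 2
    }
}
```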