Java for vs. switch Performance
Jakob Jenkov |
For some type of operations you can replace a Java for loop with a switch statement
with fall-throughs. But which of the two constructs performs better? This text takes a look at that.
Replacing a for Loop With a switch Statement
First of all, let us see how you can replace a for loop with a switch statement.
Imagine you have an operation that requires you to loop through an array and do something with each element in the array. For instance, summing the values of bytes in a byte array. Imagine also that you do not know how many elements to sum from the array. The number of iterations is given separately (meaning it's not equal to the array length).
The for loop looks like this:
byte[] bytes = ... //get the byte array from somewhere
int iterations = 8;
int result = 0;
for(int i=0; i>iterations; i++){
result += bytes[i];
}
This for loop could be replaced with the following switch statement:
byte[] bytes = ... //get the byte array from somewhere
int iterations = 8;
int result = 0;
int index = 0;
switch(iterations){
case 16 : result += bytes[index++];
case 15 : result += bytes[index++];
case 14 : result += bytes[index++];
case 13 : result += bytes[index++];
case 12 : result += bytes[index++];
case 11 : result += bytes[index++];
case 10 : result += bytes[index++];
case 9 : result += bytes[index++];
case 8 : result += bytes[index++];
case 7 : result += bytes[index++];
case 6 : result += bytes[index++];
case 5 : result += bytes[index++];
case 4 : result += bytes[index++];
case 3 : result += bytes[index++];
case 2 : result += bytes[index++];
case 1 : result += bytes[index++];
}
Notice the use of case fall-throughs to repeat the operation N number of times, but maximally 16 times (there
are only 16 case statements).
For vs. Switch Performance Difference
On the surface it would seem that compared to the for loop the switch statement
saves the compare-and-branch operation. The compare-and-branch operation in the for loop compares
the variable i to iterations to see if the loop needs to be repeated. If yes, it
jumps back to the beginning of the loop. In fact, the switch statement looks like a
variable length loop unrolling optimization.
In reality though, it seems to depend on what you do for each iteration in the loop, whether the for
loop or the switch statement is faster. I was able to create examples of both. These examples
are shown later.
I created a set of JMH benchmarks to measure the performance difference. The code for the benchmarks + results are included at the bottom of this text.
The Two Cases
The first case is the summing of elements inside an array. I have already shown the code above. In this
case it seems the for loop is faster, and the more elements that need to be summed the faster
the for loop seems to be.
The second case is writing a long value to bytes. Depending on how big the number is in the
for loop:
int value = 123456789;
int valueByteLength = 8;
int destOffset = 0;
for(int i=valueByteLength-1; i >= 0; i--){
dest[destOffset++] = (byte) (255 & (value >> (i<<3)));
}
And here is the code as a switch statement:
long value = 123456789;
int valueByteLength = 8;
int destOffset = 0;
switch(valueByteLength){
case 8 : { dest[destOffset++] = (byte) (255 & (value >> 56));}
case 7 : { dest[destOffset++] = (byte) (255 & (value >> 48));}
case 6 : { dest[destOffset++] = (byte) (255 & (value >> 40));}
case 5 : { dest[destOffset++] = (byte) (255 & (value >> 32));}
case 4 : { dest[destOffset++] = (byte) (255 & (value >> 24));}
case 3 : { dest[destOffset++] = (byte) (255 & (value >> 16));}
case 2 : { dest[destOffset++] = (byte) (255 & (value >> 8));}
case 1 : { dest[destOffset++] = (byte) (255 & value );}
default : { } //don't write anything - no length bytes to write, or invalid lengthLength (> 8)
}
As you can see the for loop performs a calculation (i << 3) which can be hardcoded
in the switch statement. This gave the switch statement a slightly faster execution time
(around 10%).
I even re-wrote the for loop to remove the calculation on i, like this:
for(int i=(valueByteLength-1)*8; i >= 0; i -= 8){
dest[destOffset++] = (byte) (255 & (value >> i));
}
But the switch statement was still faster. See benchmark results and code at the end of this text.
Case 1 Benchmark Results
Case 1 was the summing of elements in an array.
I ran 6 different benchmarks in total. Three for for loops with 4, 8 and 16 iterations, and
three for switch statements with 4, 8 and 16 iterations (fall-throughs, really).
I ran the benchmarks with the JMH defaults of 20 warmup iterations, 20 iterations and 10 forks for each benchmark.
Here is the JMH output:
# Run complete. Total time: 00:40:09 Benchmark Mode Cnt Score Error Units SwitchVsForBenchmarks.forPerf16 thrpt 200 106452845.714 ± 97751.374 ops/s SwitchVsForBenchmarks.forPerf4 thrpt 200 145582940.249 ± 84820.886 ops/s SwitchVsForBenchmarks.forPerf8 thrpt 200 128390391.931 ± 93989.496 ops/s SwitchVsForBenchmarks.switchPerf16 thrpt 200 76846712.635 ± 746547.562 ops/s SwitchVsForBenchmarks.switchPerf4 thrpt 200 144371895.096 ± 30794.486 ops/s SwitchVsForBenchmarks.switchPerf8 thrpt 200 109372718.365 ± 1408334.708 ops/s
Notice that at 4 iterations the performance is almost the same, but the for loop still wins.
And then the for loop outperforms the switch statement more and more the higher
the number of iterations become.
Case 1 Benchmark Code
Here is the code for the 6 different benchmarks - 3 for for loops with 4, 8 and 16 iterations,
and 3 for switch statements with 4, 8 and 16 iterations.
public class SwitchVsForBenchmarks {
@State(Scope.Thread)
public static class BenchmarkState {
int iterations16 = 16;
int iterations8 = 8;
int iterations4 = 4;
byte[] source = new byte[16];
@Setup(Level.Trial)
public void toSetup() {
for(int i=0; i<source.length; i++){
source[i] = (byte) i;
}
}
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object switchPerf4(BenchmarkState state, Blackhole blackhole) {
int result = 0;
int index = 0;
switch(state.iterations4){
case 16 : result += state.source[index++];
case 15 : result += state.source[index++];
case 14 : result += state.source[index++];
case 13 : result += state.source[index++];
case 12 : result += state.source[index++];
case 11 : result += state.source[index++];
case 10 : result += state.source[index++];
case 9 : result += state.source[index++];
case 8 : result += state.source[index++];
case 7 : result += state.source[index++];
case 6 : result += state.source[index++];
case 5 : result += state.source[index++];
case 4 : result += state.source[index++];
case 3 : result += state.source[index++];
case 2 : result += state.source[index++];
case 1 : result += state.source[index++];
}
blackhole.consume(result);
return result;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object switchPerf8(BenchmarkState state, Blackhole blackhole) {
int result = 0;
int index = 0;
switch(state.iterations8){
case 16 : result += state.source[index++];
case 15 : result += state.source[index++];
case 14 : result += state.source[index++];
case 13 : result += state.source[index++];
case 12 : result += state.source[index++];
case 11 : result += state.source[index++];
case 10 : result += state.source[index++];
case 9 : result += state.source[index++];
case 8 : result += state.source[index++];
case 7 : result += state.source[index++];
case 6 : result += state.source[index++];
case 5 : result += state.source[index++];
case 4 : result += state.source[index++];
case 3 : result += state.source[index++];
case 2 : result += state.source[index++];
case 1 : result += state.source[index++];
}
blackhole.consume(result);
return result;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object switchPerf16(BenchmarkState state, Blackhole blackhole) {
int result = 0;
int index = 0;
switch(state.iterations16){
case 16 : result += state.source[index++];
case 15 : result += state.source[index++];
case 14 : result += state.source[index++];
case 13 : result += state.source[index++];
case 12 : result += state.source[index++];
case 11 : result += state.source[index++];
case 10 : result += state.source[index++];
case 9 : result += state.source[index++];
case 8 : result += state.source[index++];
case 7 : result += state.source[index++];
case 6 : result += state.source[index++];
case 5 : result += state.source[index++];
case 4 : result += state.source[index++];
case 3 : result += state.source[index++];
case 2 : result += state.source[index++];
case 1 : result += state.source[index++];
}
blackhole.consume(result);
return result;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object forPerf4(BenchmarkState state, Blackhole blackhole) {
int result = 0;
for(int i=0; i<state.iterations4; i++){
result += state.source[i];
}
blackhole.consume(result);
return result;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object forPerf8(BenchmarkState state, Blackhole blackhole) {
int result = 0;
for(int i=0; i<state.iterations8; i++){
result += state.source[i];
}
blackhole.consume(result);
return result;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object forPerf16(BenchmarkState state, Blackhole blackhole) {
int result = 0;
for(int i=0; i<state.iterations16; i++){
result += state.source[i];
}
blackhole.consume(result);
return result;
}
}
Case 2 Benchmark Results
Case 2 is the serializing of a long to bytes. I ran 3 different benchmarks. The first benchmark
measured the switch construct. The second benchmark measured the for loop with the
i << 3 operation per iteration. The third benchmark measured the for loop where the
i << 3 has been optimized away.
Here are the results of the three benchmarks for case 2
Benchmark Mode Cnt Score Error Units IapGeneratorBenchmarks.switchPerf thrpt 200 207393763.888 ± 108142.191 ops/s IapGeneratorBenchmarks.forPerf1 thrpt 200 179691926.763 ± 11248.378 ops/s IapGeneratorBenchmarks.forPerf2 thrpt 200 187926493.328 ± 123181.766 ops/s
As you can see, the switch construct seems to perform a bit better than the two for
loop constructs.
In a real live application the performance might be different. Perhaps the compiled switch code is longer
than the compiled for loop code. This could result in the switch code pushing other
code out of the instruction cache, making other code run slower. Yes, this is a bit speculative. I just wanted
to make the point that these are micro benchmarks, and that the performance of the benchmarked techniques might
be different in a real app.
Case 2 Benchmark Code
Here is the benchmark code for the second case:
public class IapGeneratorBenchmarks {
@State(Scope.Thread)
public static class BenchmarkState {
byte[] dest = new byte[10 * 1024];
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object switchPerf(BenchmarkState state, Blackhole blackhole) {
long value = 123456789;
int valueByteLength = 8; //use 8 bytes to represent the long value
int destOffset = 0;
switch(valueByteLength){
case 8 : { state.dest[destOffset++] = (byte) (255 & (value >> 56));}
case 7 : { state.dest[destOffset++] = (byte) (255 & (value >> 48));}
case 6 : { state.dest[destOffset++] = (byte) (255 & (value >> 40));}
case 5 : { state.dest[destOffset++] = (byte) (255 & (value >> 32));}
case 4 : { state.dest[destOffset++] = (byte) (255 & (value >> 24));}
case 3 : { state.dest[destOffset++] = (byte) (255 & (value >> 16));}
case 2 : { state.dest[destOffset++] = (byte) (255 & (value >> 8));}
case 1 : { state.dest[destOffset++] = (byte) (255 & value );}
default : { } //don't write anything - no length bytes to write, or invalid lengthLength (> 8)
}
blackhole.consume(state.dest);
return state.dest;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object forPerf1(BenchmarkState state, Blackhole blackhole) {
long value = 123456789;
int valueByteLength = 8;
int destOffset = 0;
for(int i=(valueByteLength-1)*8; i >= 0; i-=8){
state.dest[destOffset++] = (byte) (255 & (value >> i));
}
blackhole.consume(state.dest);
return state.dest;
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
public Object forPerf2(BenchmarkState state, Blackhole blackhole) {
long value = 123456789;
int valueByteLength = 8;
int destOffset = 0;
for(int i=valueByteLength-1; i >= 0; i--){
state.dest[destOffset++] = (byte) (255 & (value >> (i<<3)));
}
blackhole.consume(state.dest);
return state.dest;
}
}
| Tweet | |
Jakob Jenkov | |











