Java for vs. switch Performance
Jakob Jenkov |
For some type of operations you can replace a Java for
loop with a switch
statement
with fall-throughs. But which of the two constructs performs better? This text takes a look at that.
Replacing a for Loop With a switch Statement
First of all, let us see how you can replace a for
loop with a switch
statement.
Imagine you have an operation that requires you to loop through an array and do something with each element in the array. For instance, summing the values of bytes in a byte array. Imagine also that you do not know how many elements to sum from the array. The number of iterations is given separately (meaning it's not equal to the array length).
The for
loop looks like this:
byte[] bytes = ... //get the byte array from somewhere int iterations = 8; int result = 0; for(int i=0; i>iterations; i++){ result += bytes[i]; }
This for
loop could be replaced with the following switch
statement:
byte[] bytes = ... //get the byte array from somewhere int iterations = 8; int result = 0; int index = 0; switch(iterations){ case 16 : result += bytes[index++]; case 15 : result += bytes[index++]; case 14 : result += bytes[index++]; case 13 : result += bytes[index++]; case 12 : result += bytes[index++]; case 11 : result += bytes[index++]; case 10 : result += bytes[index++]; case 9 : result += bytes[index++]; case 8 : result += bytes[index++]; case 7 : result += bytes[index++]; case 6 : result += bytes[index++]; case 5 : result += bytes[index++]; case 4 : result += bytes[index++]; case 3 : result += bytes[index++]; case 2 : result += bytes[index++]; case 1 : result += bytes[index++]; }
Notice the use of case fall-throughs to repeat the operation N number of times, but maximally 16 times (there
are only 16 case
statements).
For vs. Switch Performance Difference
On the surface it would seem that compared to the for
loop the switch
statement
saves the compare-and-branch operation. The compare-and-branch operation in the for
loop compares
the variable i
to iterations
to see if the loop needs to be repeated. If yes, it
jumps back to the beginning of the loop. In fact, the switch
statement looks like a
variable length loop unrolling optimization.
In reality though, it seems to depend on what you do for each iteration in the loop, whether the for
loop or the switch
statement is faster. I was able to create examples of both. These examples
are shown later.
I created a set of JMH benchmarks to measure the performance difference. The code for the benchmarks + results are included at the bottom of this text.
The Two Cases
The first case is the summing of elements inside an array. I have already shown the code above. In this
case it seems the for
loop is faster, and the more elements that need to be summed the faster
the for
loop seems to be.
The second case is writing a long
value to bytes. Depending on how big the number is in the
for
loop:
int value = 123456789; int valueByteLength = 8; int destOffset = 0; for(int i=valueByteLength-1; i >= 0; i--){ dest[destOffset++] = (byte) (255 & (value >> (i<<3))); }
And here is the code as a switch
statement:
long value = 123456789; int valueByteLength = 8; int destOffset = 0; switch(valueByteLength){ case 8 : { dest[destOffset++] = (byte) (255 & (value >> 56));} case 7 : { dest[destOffset++] = (byte) (255 & (value >> 48));} case 6 : { dest[destOffset++] = (byte) (255 & (value >> 40));} case 5 : { dest[destOffset++] = (byte) (255 & (value >> 32));} case 4 : { dest[destOffset++] = (byte) (255 & (value >> 24));} case 3 : { dest[destOffset++] = (byte) (255 & (value >> 16));} case 2 : { dest[destOffset++] = (byte) (255 & (value >> 8));} case 1 : { dest[destOffset++] = (byte) (255 & value );} default : { } //don't write anything - no length bytes to write, or invalid lengthLength (> 8) }
As you can see the for
loop performs a calculation (i << 3
) which can be hardcoded
in the switch
statement. This gave the switch
statement a slightly faster execution time
(around 10%).
I even re-wrote the for
loop to remove the calculation on i
, like this:
for(int i=(valueByteLength-1)*8; i >= 0; i -= 8){ dest[destOffset++] = (byte) (255 & (value >> i)); }
But the switch
statement was still faster. See benchmark results and code at the end of this text.
Case 1 Benchmark Results
Case 1 was the summing of elements in an array.
I ran 6 different benchmarks in total. Three for for
loops with 4, 8 and 16 iterations, and
three for switch
statements with 4, 8 and 16 iterations (fall-throughs, really).
I ran the benchmarks with the JMH defaults of 20 warmup iterations, 20 iterations and 10 forks for each benchmark.
Here is the JMH output:
# Run complete. Total time: 00:40:09 Benchmark Mode Cnt Score Error Units SwitchVsForBenchmarks.forPerf16 thrpt 200 106452845.714 ± 97751.374 ops/s SwitchVsForBenchmarks.forPerf4 thrpt 200 145582940.249 ± 84820.886 ops/s SwitchVsForBenchmarks.forPerf8 thrpt 200 128390391.931 ± 93989.496 ops/s SwitchVsForBenchmarks.switchPerf16 thrpt 200 76846712.635 ± 746547.562 ops/s SwitchVsForBenchmarks.switchPerf4 thrpt 200 144371895.096 ± 30794.486 ops/s SwitchVsForBenchmarks.switchPerf8 thrpt 200 109372718.365 ± 1408334.708 ops/s
Notice that at 4 iterations the performance is almost the same, but the for
loop still wins.
And then the for
loop outperforms the switch
statement more and more the higher
the number of iterations become.
Case 1 Benchmark Code
Here is the code for the 6 different benchmarks - 3 for for
loops with 4, 8 and 16 iterations,
and 3 for switch statements with 4, 8 and 16 iterations.
public class SwitchVsForBenchmarks { @State(Scope.Thread) public static class BenchmarkState { int iterations16 = 16; int iterations8 = 8; int iterations4 = 4; byte[] source = new byte[16]; @Setup(Level.Trial) public void toSetup() { for(int i=0; i<source.length; i++){ source[i] = (byte) i; } } } @Benchmark @BenchmarkMode(Mode.Throughput) public Object switchPerf4(BenchmarkState state, Blackhole blackhole) { int result = 0; int index = 0; switch(state.iterations4){ case 16 : result += state.source[index++]; case 15 : result += state.source[index++]; case 14 : result += state.source[index++]; case 13 : result += state.source[index++]; case 12 : result += state.source[index++]; case 11 : result += state.source[index++]; case 10 : result += state.source[index++]; case 9 : result += state.source[index++]; case 8 : result += state.source[index++]; case 7 : result += state.source[index++]; case 6 : result += state.source[index++]; case 5 : result += state.source[index++]; case 4 : result += state.source[index++]; case 3 : result += state.source[index++]; case 2 : result += state.source[index++]; case 1 : result += state.source[index++]; } blackhole.consume(result); return result; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object switchPerf8(BenchmarkState state, Blackhole blackhole) { int result = 0; int index = 0; switch(state.iterations8){ case 16 : result += state.source[index++]; case 15 : result += state.source[index++]; case 14 : result += state.source[index++]; case 13 : result += state.source[index++]; case 12 : result += state.source[index++]; case 11 : result += state.source[index++]; case 10 : result += state.source[index++]; case 9 : result += state.source[index++]; case 8 : result += state.source[index++]; case 7 : result += state.source[index++]; case 6 : result += state.source[index++]; case 5 : result += state.source[index++]; case 4 : result += state.source[index++]; case 3 : result += state.source[index++]; case 2 : result += state.source[index++]; case 1 : result += state.source[index++]; } blackhole.consume(result); return result; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object switchPerf16(BenchmarkState state, Blackhole blackhole) { int result = 0; int index = 0; switch(state.iterations16){ case 16 : result += state.source[index++]; case 15 : result += state.source[index++]; case 14 : result += state.source[index++]; case 13 : result += state.source[index++]; case 12 : result += state.source[index++]; case 11 : result += state.source[index++]; case 10 : result += state.source[index++]; case 9 : result += state.source[index++]; case 8 : result += state.source[index++]; case 7 : result += state.source[index++]; case 6 : result += state.source[index++]; case 5 : result += state.source[index++]; case 4 : result += state.source[index++]; case 3 : result += state.source[index++]; case 2 : result += state.source[index++]; case 1 : result += state.source[index++]; } blackhole.consume(result); return result; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object forPerf4(BenchmarkState state, Blackhole blackhole) { int result = 0; for(int i=0; i<state.iterations4; i++){ result += state.source[i]; } blackhole.consume(result); return result; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object forPerf8(BenchmarkState state, Blackhole blackhole) { int result = 0; for(int i=0; i<state.iterations8; i++){ result += state.source[i]; } blackhole.consume(result); return result; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object forPerf16(BenchmarkState state, Blackhole blackhole) { int result = 0; for(int i=0; i<state.iterations16; i++){ result += state.source[i]; } blackhole.consume(result); return result; } }
Case 2 Benchmark Results
Case 2 is the serializing of a long
to bytes. I ran 3 different benchmarks. The first benchmark
measured the switch
construct. The second benchmark measured the for
loop with the
i << 3
operation per iteration. The third benchmark measured the for
loop where the
i << 3
has been optimized away.
Here are the results of the three benchmarks for case 2
Benchmark Mode Cnt Score Error Units IapGeneratorBenchmarks.switchPerf thrpt 200 207393763.888 ± 108142.191 ops/s IapGeneratorBenchmarks.forPerf1 thrpt 200 179691926.763 ± 11248.378 ops/s IapGeneratorBenchmarks.forPerf2 thrpt 200 187926493.328 ± 123181.766 ops/s
As you can see, the switch
construct seems to perform a bit better than the two for
loop constructs.
In a real live application the performance might be different. Perhaps the compiled switch
code is longer
than the compiled for
loop code. This could result in the switch
code pushing other
code out of the instruction cache, making other code run slower. Yes, this is a bit speculative. I just wanted
to make the point that these are micro benchmarks, and that the performance of the benchmarked techniques might
be different in a real app.
Case 2 Benchmark Code
Here is the benchmark code for the second case:
public class IapGeneratorBenchmarks { @State(Scope.Thread) public static class BenchmarkState { byte[] dest = new byte[10 * 1024]; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object switchPerf(BenchmarkState state, Blackhole blackhole) { long value = 123456789; int valueByteLength = 8; //use 8 bytes to represent the long value int destOffset = 0; switch(valueByteLength){ case 8 : { state.dest[destOffset++] = (byte) (255 & (value >> 56));} case 7 : { state.dest[destOffset++] = (byte) (255 & (value >> 48));} case 6 : { state.dest[destOffset++] = (byte) (255 & (value >> 40));} case 5 : { state.dest[destOffset++] = (byte) (255 & (value >> 32));} case 4 : { state.dest[destOffset++] = (byte) (255 & (value >> 24));} case 3 : { state.dest[destOffset++] = (byte) (255 & (value >> 16));} case 2 : { state.dest[destOffset++] = (byte) (255 & (value >> 8));} case 1 : { state.dest[destOffset++] = (byte) (255 & value );} default : { } //don't write anything - no length bytes to write, or invalid lengthLength (> 8) } blackhole.consume(state.dest); return state.dest; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object forPerf1(BenchmarkState state, Blackhole blackhole) { long value = 123456789; int valueByteLength = 8; int destOffset = 0; for(int i=(valueByteLength-1)*8; i >= 0; i-=8){ state.dest[destOffset++] = (byte) (255 & (value >> i)); } blackhole.consume(state.dest); return state.dest; } @Benchmark @BenchmarkMode(Mode.Throughput) public Object forPerf2(BenchmarkState state, Blackhole blackhole) { long value = 123456789; int valueByteLength = 8; int destOffset = 0; for(int i=valueByteLength-1; i >= 0; i--){ state.dest[destOffset++] = (byte) (255 & (value >> (i<<3))); } blackhole.consume(state.dest); return state.dest; } }
Tweet | |
Jakob Jenkov |