Java for vs. switch Performance

Replacing a for Loop With a switch Statement
For vs. Switch Performance Difference
The Two Cases
Case 1 Benchmark Results
Case 1 Benchmark Code
Case 2 Benchmark Results
Case 2 Benchmark Code

Jakob Jenkov
Last update: 2015-11-10

For some type of operations you can replace a Java for loop with a switch statement with fall-throughs. But which of the two constructs performs better? This text takes a look at that.

Replacing a for Loop With a switch Statement

First of all, let us see how you can replace a for loop with a switch statement.

Imagine you have an operation that requires you to loop through an array and do something with each element in the array. For instance, summing the values of bytes in a byte array. Imagine also that you do not know how many elements to sum from the array. The number of iterations is given separately (meaning it's not equal to the array length).

The for loop looks like this:

byte[] bytes = ... //get the byte array from somewhere

int iterations = 8;
int result = 0;

for(int i=0; i>iterations; i++){
    result += bytes[i];
}

This for loop could be replaced with the following switch statement:

byte[] bytes = ... //get the byte array from somewhere

int iterations = 8;
int result = 0;
int index  = 0;

switch(iterations){
    case 16 : result += bytes[index++];
    case 15 : result += bytes[index++];
    case 14 : result += bytes[index++];
    case 13 : result += bytes[index++];
    case 12 : result += bytes[index++];
    case 11 : result += bytes[index++];
    case 10 : result += bytes[index++];
    case  9 : result += bytes[index++];
    case  8 : result += bytes[index++];
    case  7 : result += bytes[index++];
    case  6 : result += bytes[index++];
    case  5 : result += bytes[index++];
    case  4 : result += bytes[index++];
    case  3 : result += bytes[index++];
    case  2 : result += bytes[index++];
    case  1 : result += bytes[index++];
}

Notice the use of case fall-throughs to repeat the operation N number of times, but maximally 16 times (there are only 16 case statements).

For vs. Switch Performance Difference

On the surface it would seem that compared to the for loop the switch statement saves the compare-and-branch operation. The compare-and-branch operation in the for loop compares the variable i to iterations to see if the loop needs to be repeated. If yes, it jumps back to the beginning of the loop. In fact, the switch statement looks like a variable length loop unrolling optimization.

In reality though, it seems to depend on what you do for each iteration in the loop, whether the for loop or the switch statement is faster. I was able to create examples of both. These examples are shown later.

I created a set of JMH benchmarks to measure the performance difference. The code for the benchmarks + results are included at the bottom of this text.

The Two Cases

The first case is the summing of elements inside an array. I have already shown the code above. In this case it seems the for loop is faster, and the more elements that need to be summed the faster the for loop seems to be.

The second case is writing a long value to bytes. Depending on how big the number is in the long value, a different number of bytes are used. Here is the code as a for loop:

int value = 123456789;
int valueByteLength = 8;
int destOffset = 0;

for(int i=valueByteLength-1; i >= 0; i--){
    dest[destOffset++] = (byte) (255 & (value >> (i<<3)));
}

And here is the code as a switch statement:

long value = 123456789;
int valueByteLength = 8;
int destOffset = 0;

switch(valueByteLength){
    case 8 : { dest[destOffset++] = (byte) (255 & (value >> 56));}
    case 7 : { dest[destOffset++] = (byte) (255 & (value >> 48));}
    case 6 : { dest[destOffset++] = (byte) (255 & (value >> 40));}
    case 5 : { dest[destOffset++] = (byte) (255 & (value >> 32));}
    case 4 : { dest[destOffset++] = (byte) (255 & (value >> 24));}
    case 3 : { dest[destOffset++] = (byte) (255 & (value >> 16));}
    case 2 : { dest[destOffset++] = (byte) (255 & (value >>  8));}
    case 1 : { dest[destOffset++] = (byte) (255 &  value );}
    default : { }  //don't write anything - no length bytes to write, or invalid lengthLength (> 8)
}

As you can see the for loop performs a calculation (i << 3) which can be hardcoded in the switch statement. This gave the switch statement a slightly faster execution time (around 10%).

I even re-wrote the for loop to remove the calculation on i, like this:

for(int i=(valueByteLength-1)*8; i >= 0; i -= 8){
    dest[destOffset++] = (byte) (255 & (value >> i));
}

But the switch statement was still faster. See benchmark results and code at the end of this text.

Case 1 Benchmark Results

Case 1 was the summing of elements in an array.

I ran 6 different benchmarks in total. Three for for loops with 4, 8 and 16 iterations, and three for switch statements with 4, 8 and 16 iterations (fall-throughs, really).

I ran the benchmarks with the JMH defaults of 20 warmup iterations, 20 iterations and 10 forks for each benchmark.

Here is the JMH output:

# Run complete. Total time: 00:40:09

Benchmark                            Mode  Cnt          Score         Error  Units
SwitchVsForBenchmarks.forPerf16     thrpt  200  106452845.714 ±   97751.374  ops/s
SwitchVsForBenchmarks.forPerf4      thrpt  200  145582940.249 ±   84820.886  ops/s
SwitchVsForBenchmarks.forPerf8      thrpt  200  128390391.931 ±   93989.496  ops/s
SwitchVsForBenchmarks.switchPerf16  thrpt  200   76846712.635 ±  746547.562  ops/s
SwitchVsForBenchmarks.switchPerf4   thrpt  200  144371895.096 ±   30794.486  ops/s
SwitchVsForBenchmarks.switchPerf8   thrpt  200  109372718.365 ± 1408334.708  ops/s

Notice that at 4 iterations the performance is almost the same, but the for loop still wins. And then the for loop outperforms the switch statement more and more the higher the number of iterations become.

Case 1 Benchmark Code

Here is the code for the 6 different benchmarks - 3 for for loops with 4, 8 and 16 iterations, and 3 for switch statements with 4, 8 and 16 iterations.

public class SwitchVsForBenchmarks {


    @State(Scope.Thread)
    public static class BenchmarkState {

        int iterations16 = 16;
        int iterations8  = 8;
        int iterations4  = 4;
        byte[] source    = new byte[16];

        @Setup(Level.Trial)
        public void toSetup() {
            for(int i=0; i<source.length; i++){
                source[i] = (byte) i;
            }
        }

    }


    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object switchPerf4(BenchmarkState state, Blackhole blackhole) {
        int result = 0;
        int index  = 0;
        switch(state.iterations4){
            case 16 : result += state.source[index++];
            case 15 : result += state.source[index++];
            case 14 : result += state.source[index++];
            case 13 : result += state.source[index++];
            case 12 : result += state.source[index++];
            case 11 : result += state.source[index++];
            case 10 : result += state.source[index++];
            case  9 : result += state.source[index++];
            case  8 : result += state.source[index++];
            case  7 : result += state.source[index++];
            case  6 : result += state.source[index++];
            case  5 : result += state.source[index++];
            case  4 : result += state.source[index++];
            case  3 : result += state.source[index++];
            case  2 : result += state.source[index++];
            case  1 : result += state.source[index++];

        }

        blackhole.consume(result);

        return result;
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object switchPerf8(BenchmarkState state, Blackhole blackhole) {
        int result = 0;
        int index  = 0;
        switch(state.iterations8){
            case 16 : result += state.source[index++];
            case 15 : result += state.source[index++];
            case 14 : result += state.source[index++];
            case 13 : result += state.source[index++];
            case 12 : result += state.source[index++];
            case 11 : result += state.source[index++];
            case 10 : result += state.source[index++];
            case  9 : result += state.source[index++];
            case  8 : result += state.source[index++];
            case  7 : result += state.source[index++];
            case  6 : result += state.source[index++];
            case  5 : result += state.source[index++];
            case  4 : result += state.source[index++];
            case  3 : result += state.source[index++];
            case  2 : result += state.source[index++];
            case  1 : result += state.source[index++];
        }

        blackhole.consume(result);

        return result;
    }


    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object switchPerf16(BenchmarkState state, Blackhole blackhole) {
        int result = 0;
        int index  = 0;
        switch(state.iterations16){
            case 16 : result += state.source[index++];
            case 15 : result += state.source[index++];
            case 14 : result += state.source[index++];
            case 13 : result += state.source[index++];
            case 12 : result += state.source[index++];
            case 11 : result += state.source[index++];
            case 10 : result += state.source[index++];
            case  9 : result += state.source[index++];
            case  8 : result += state.source[index++];
            case  7 : result += state.source[index++];
            case  6 : result += state.source[index++];
            case  5 : result += state.source[index++];
            case  4 : result += state.source[index++];
            case  3 : result += state.source[index++];
            case  2 : result += state.source[index++];
            case  1 : result += state.source[index++];
        }

        blackhole.consume(result);

        return result;
    }



    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object forPerf4(BenchmarkState state, Blackhole blackhole) {
        int result = 0;

        for(int i=0; i<state.iterations4; i++){
            result += state.source[i];
        }

        blackhole.consume(result);
        return result;
    }


    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object forPerf8(BenchmarkState state, Blackhole blackhole) {
        int result = 0;

        for(int i=0; i<state.iterations8; i++){
            result += state.source[i];
        }

        blackhole.consume(result);
        return result;
    }


    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object forPerf16(BenchmarkState state, Blackhole blackhole) {
        int result = 0;

        for(int i=0; i<state.iterations16; i++){
            result += state.source[i];
        }

        blackhole.consume(result);
        return result;
    }
}

Case 2 Benchmark Results

Case 2 is the serializing of a long to bytes. I ran 3 different benchmarks. The first benchmark measured the switch construct. The second benchmark measured the for loop with the i << 3 operation per iteration. The third benchmark measured the for loop where the i << 3 has been optimized away.

Here are the results of the three benchmarks for case 2

Benchmark                            Mode  Cnt          Score        Error  Units
IapGeneratorBenchmarks.switchPerf    thrpt  200  207393763.888 Â± 108142.191  ops/s
IapGeneratorBenchmarks.forPerf1      thrpt  200  179691926.763 Â±  11248.378  ops/s
IapGeneratorBenchmarks.forPerf2      thrpt  200  187926493.328 Â± 123181.766  ops/s

As you can see, the switch construct seems to perform a bit better than the two for loop constructs.

In a real live application the performance might be different. Perhaps the compiled switch code is longer than the compiled for loop code. This could result in the switch code pushing other code out of the instruction cache, making other code run slower. Yes, this is a bit speculative. I just wanted to make the point that these are micro benchmarks, and that the performance of the benchmarked techniques might be different in a real app.

Case 2 Benchmark Code

Here is the benchmark code for the second case:

public class IapGeneratorBenchmarks {

    @State(Scope.Thread)
    public static class BenchmarkState {

        byte[]    dest     = new byte[10 * 1024];

    }


    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object switchPerf(BenchmarkState state, Blackhole blackhole) {
        long value = 123456789;
        int valueByteLength = 8;  //use 8 bytes to represent the long value
        int destOffset = 0;

        switch(valueByteLength){
            case 8 : { state.dest[destOffset++] = (byte) (255 & (value >> 56));}
            case 7 : { state.dest[destOffset++] = (byte) (255 & (value >> 48));}
            case 6 : { state.dest[destOffset++] = (byte) (255 & (value >> 40));}
            case 5 : { state.dest[destOffset++] = (byte) (255 & (value >> 32));}
            case 4 : { state.dest[destOffset++] = (byte) (255 & (value >> 24));}
            case 3 : { state.dest[destOffset++] = (byte) (255 & (value >> 16));}
            case 2 : { state.dest[destOffset++] = (byte) (255 & (value >>  8));}
            case 1 : { state.dest[destOffset++] = (byte) (255 &  value );}
            default : { }  //don't write anything - no length bytes to write, or invalid lengthLength (> 8)
        }

        blackhole.consume(state.dest);
        return state.dest;
    }


    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object forPerf1(BenchmarkState state, Blackhole blackhole) {
        long value = 123456789;
        int valueByteLength = 8;
        int destOffset = 0;

        for(int i=(valueByteLength-1)*8; i >= 0; i-=8){
            state.dest[destOffset++] = (byte) (255 & (value >> i));
        }

        blackhole.consume(state.dest);
        return state.dest;
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public Object forPerf2(BenchmarkState state, Blackhole blackhole) {
        long value = 123456789;
        int valueByteLength = 8;
        int destOffset = 0;

        for(int i=valueByteLength-1; i >= 0; i--){
            state.dest[destOffset++] = (byte) (255 & (value >> (i<<3)));
        }

        blackhole.consume(state.dest);
        return state.dest;
    }
}

Next: Java ArrayList vs. OpenArrayList Performance

Tweet
	Jakob Jenkov