Java performance tuning is somewhat of a black art. It is hard to find good information about Java performance tuning, and there are several reasons for this.
First of all, performance tuning depends a lot on the system you are building and the hardware your system runs on. Over time your system, the Java virtual machine and the hardware executing your system evolve. As they evolve, so do the applicable Java performance techniques.
Second, we Java developers have been fed a lot of untrue stories about the Java compiler and the Java Virtual Machine. It is often said that the Java compiler or VM can do a better job of optimizing your code than you can.
However, this is not always entirely correct. Algorithms, data formats, data structures, memory usage patterns, IO usage patterns etc. matter! There are many situations where you can optimize your code better than the Java compiler and JVM, because you know more about what your system is trying to do, its data structures, data usage patterns etc. than Java does.
Please note: This Java performance tutorial is a multi-page tutorial. This page is only the introduction!
The Main Performance Techniques
There are many different performance optimization tricks within Java development, but they all tend to fall into one of the following categories:
- Reduce the work (operations) required to perform the task.
- Align code with the hardware (CPU, RAM, SSD etc.) (AKA mechanical sympathy)
- Parallelize tasks when possible, as much as possible, and when it makes sense.
Reducing the work (operations) required to perform a given task requires that you can come up with a faster way to perform that task. It also sometimes requires knowledge about what goes on under the hood in Java, and all the way down to the hardware layer. Sometimes you can come up with a more compact data structure, or a faster data structure, or a faster algorithm etc. which can speed up the task at hand.
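As a minimal sketch of reducing the work needed for a task, the example below counts how many candidate strings occur in a list of known strings. The class and method names are hypothetical and chosen only for illustration. The first version scans the whole list for every lookup, while the second builds a hash set once so that each lookup takes constant time on average.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical example class; the names are illustrative and not part of the tutorial's toolkit.
final class ReduceWorkExample {

    // Each contains() call scans the list, so the total work grows with
    // known.size() * candidates.size().
    static int countKnownSlow(List<String> known, List<String> candidates) {
        int count = 0;
        for (String candidate : candidates) {
            if (known.contains(candidate)) {
                count++;
            }
        }
        return count;
    }

    // The set is built once; each lookup is then a hash lookup instead of a scan.
    static int countKnownFast(List<String> known, List<String> candidates) {
        Set<String> knownSet = new HashSet<>(known);
        int count = 0;
        for (String candidate : candidates) {
            if (knownSet.contains(candidate)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("a", "b", "c");
        List<String> candidates = Arrays.asList("a", "x", "c", "c");
        System.out.println(countKnownSlow(known, candidates));
        System.out.println(countKnownFast(known, candidates));
    }
}
```

Both methods return the same count. Only the amount of work per lookup differs, and for very small lists the cost of building the set may outweigh the gain.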
Aligning the code with how the underlying hardware works includes tricks like aligning variables on 8 byte addresses (0, 8, 16 etc.), keeping data within the same cache line, storing related data close together to take advantage of sequential memory access, reducing branching to avoid branch mispredictions (and the CPU pipeline flushes they cause), or knowing the cache size of the SSD.
Reducing the distance between the CPU and its data is a prime example of aligning the code and data with how the underlying hardware works. Reducing the distance between the CPU and the data it works on usually speeds up the processing of the data. Accessing data inside the CPU registers is faster than in the L1 cache, which is faster than the L2 cache, which is faster than the L3 cache, which is faster than the main RAM, which is faster than on disk, which is most often faster than on a remote computer.
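The effect of keeping data close together and reading it in order can be sketched with a matrix stored in a flat array. The example below is an illustration under assumptions: the class name and the matrix size are made up, and the timings printed by `main` will vary with the JVM and hardware. Both methods compute the same sum, but the first walks memory in storage order while the second jumps `cols` elements on every step.

```java
import java.util.Arrays;

// Hypothetical example class; the names and sizes are illustrative only.
final class SequentialAccessExample {

    // Visits elements in the order they are laid out in the array,
    // so consecutive reads tend to hit the same cache line.
    static long sumRowMajor(int[] data, int rows, int cols) {
        long sum = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    // Visits the same elements, but strides through memory `cols` elements
    // at a time, which makes poorer use of the CPU caches on large arrays.
    static long sumColumnMajor(int[] data, int rows, int cols) {
        long sum = 0;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int rows = 2_000;
        int cols = 2_000;
        int[] data = new int[rows * cols];
        Arrays.fill(data, 1);

        long t0 = System.nanoTime();
        long a = sumRowMajor(data, rows, cols);
        long t1 = System.nanoTime();
        long b = sumColumnMajor(data, rows, cols);
        long t2 = System.nanoTime();

        // A single unwarmed run is only a rough indication, not a benchmark.
        System.out.println("row-major sum " + a + " in " + (t1 - t0) + " ns");
        System.out.println("column-major sum " + b + " in " + (t2 - t1) + " ns");
    }
}
```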
Breaking a task up into smaller tasks which can be executed in parallel, or simply executing multiple independent tasks in parallel, is also a technique that can improve performance significantly in some cases. Modern CPUs tend to get more and more parallel execution units (cores), so parallelization can be a big performance improvement - when it is possible to break down tasks in a simple way, or if they are already independent of each other. Parallelization could also be implemented as multiple computers working together to solve a problem, rather than a single one.
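Breaking a task into independent subtasks can be sketched with the standard `ExecutorService` API. The example below is one possible approach under assumptions, not the tutorial's own implementation: the class name, the chunking scheme and the worker count are made up for illustration. Each worker sums its own slice of the array into a local variable, and the partial sums are combined at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; the names are illustrative only.
final class ParallelSumExample {

    // Splits the array into one contiguous chunk per worker and sums the chunks in parallel.
    // Assumes workers >= 1.
    static long parallelSum(long[] values, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (values.length + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(values.length, w * chunk);
                final int to = Math.min(values.length, from + chunk);
                Callable<Long> task = () -> {
                    long sum = 0; // local to the worker, so no shared state is touched
                    for (int i = from; i < to; i++) {
                        sum += values[i];
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // waits for each partial sum
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] values = new long[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(values, workers));
    }
}
```

Creating a thread pool per call keeps the sketch short. A real system would normally reuse one pool, and for small inputs the coordination overhead can exceed the gain from parallel execution.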
Aspects Impacting Performance
There are a few recurring aspects of any system that impact its performance. These aspects are:
- Memory Management
- Algorithms
- Data Structures
- Concurrency
- Network Communication
- Scalability
Memory management, data structures and algorithms are typically linked closely together. A certain algorithm may require a certain data structure. A certain data structure may impact memory management.
Concurrency means how well the system can distribute its load over multiple threads and CPUs. Concurrency may also be linked to data structures, but not always. It depends on your system's concurrency model.
Network communication may impact your system's performance too. Some network protocols are faster than others, and you may sometimes be able to create a faster, custom network protocol for your own specific use case. Your system's communication patterns also impact performance - not just how messages are transported back and forth, but also how often, and whether communication is synchronous or asynchronous.
The scalability of a system means how well it performs when you scale up or scale out the hardware. Scaling up (vertical scaling) means buying a bigger computer with more memory, more CPUs, faster disks, NIC etc. Scaling out (horizontal scaling) means distributing the system across multiple machines.
Reusable Java Performance Principles
Even though a lot of performance optimization is specific to each individual application, there are still Java performance principles, techniques and patterns which can be reused across many different types of applications and situations. This Java performance tutorial will focus mostly on such reusable principles.
I may also from time to time dive into a specific system / use case to exemplify how this system was optimized. Such example cases can be quite enlightening, though many companies tend to hold their cards close to the chest when it comes to something that can be considered a competitive edge (and performance is).
Core Java Performance Principles
The most common core principles of Java performance tuning which the tips in this tutorial are based on, are:
- Memory is faster than disk - much faster - and memory is cheap.
- All storage (memory / disk) works fastest when read from or written to sequentially. Arbitrary access is slower.
- Object allocation and garbage collection are slow.
- Data formats and data structures make a big difference in speed.
- Asynchronous IO scales better than synchronous IO.
- Singlethreaded performance is a prerequisite for multithreaded performance.
- Shared memory (or disk) concurrency is bad because it usually leads to lots of contention when the system gets busy.
As you will probably notice, many of the performance tips in this tutorial are based on these same principles.
The principle about singlethreaded performance might come as a surprise to some. Parallel computing and parallel programming are all the rage these days (2015), so you might have been told that you should think about breaking down your problem into smaller problems which can be solved in parallel. Unfortunately there are not that many problems that can easily be parallelized.
Additionally, if your server works on many tasks at the same time (e.g. incoming HTTP requests), the other CPUs in your server may already be busy working on their own tasks. Parallelizing tasks gains you nothing then, as the CPUs are already busy. In fact, it may hurt performance (unless you have way more CPUs than you are using on average).
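Returning to the principles about object allocation and data structures listed above, the following is a minimal sketch of how the choice of data structure changes the amount of allocation. The class and method names are hypothetical. The first method stores values in a `List<Long>`, where autoboxing may create a separate `Long` object per element, while the second keeps the same values in a single contiguous `long[]`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; the names are illustrative only.
final class AllocationExample {

    // Stores boxed values: the list holds references to Long objects,
    // so most elements cost an extra allocation and an extra pointer hop.
    static long sumBoxed(int n) {
        List<Long> values = new ArrayList<>();
        for (long i = 0; i < n; i++) {
            values.add(i); // autoboxing from long to Long
        }
        long sum = 0;
        for (Long value : values) {
            sum += value; // unboxing
        }
        return sum;
    }

    // Stores primitives: one array allocation, values laid out sequentially in memory.
    static long sumPrimitive(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1_000_000));
        System.out.println(sumPrimitive(1_000_000));
    }
}
```

Both methods return the same result. How large the difference in speed is depends on the JVM, the garbage collector and the hardware, so it is worth measuring on your own system.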
Java Performance Credits
The principles and techniques presented in this Java performance tutorial are not all mine. Far from it, actually. These tutorials present work by Java performance masterminds who have learned and polished these techniques in real-life high performance systems. Here are some of the Java performance masterminds whose work has inspired or contributed to this Java performance tutorial (the order is random):
- Aleksey Shipilëv
- Martin Thompson
- Azul Systems
- Peter Lawrey
- Rick Hightower
- The Psy-Lob-Saw Blog
- The High Scalability Blog
- ... more coming ...
My own experience comes from a mix of Java performance experiments, as well as the design and development of VStack.co - a fully hosted application backend which I have cofounded with WorpCloud Ltd. The faster our system is, the more clients we can serve, and the bigger and more demanding those clients can be.
The Java Performance Toolkit and Benchmarks
The code and benchmarks presented in this Java performance tutorial will be made available on GitHub some time in the future. The code and benchmarks will live in separate GitHub repositories.
The Java performance toolkit is primarily intended to show implementations of the ideas presented in this tutorial. These implementations may not always be full featured - ready for use in a real application - but may sometimes just serve as proof of concept of some idea. However, feel free to use the toolkit in your apps if you want.
The Java performance benchmarks are intended to be runnable on your own hardware, so you can see how a given technique, implementation etc. runs on your specific hardware.
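Until those benchmarks are available, the idea of measuring a technique on your own hardware can be sketched with a very naive timing loop. This is an assumption-laden illustration, not the tutorial's benchmark code: the class name, the warm-up scheme and the example task are made up. For serious measurements, a dedicated harness such as JMH is a better choice, because simple loops like this one are easily distorted by JIT compilation and other JVM effects.

```java
import java.util.function.LongSupplier;

// Hypothetical example class; a naive sketch, not a substitute for a real benchmark harness.
final class NaiveBenchmark {

    // Written to on every run so the JIT cannot discard the task's result as unused.
    static volatile long sink;

    // Runs the task warmupRuns times untimed, then returns the lowest elapsed
    // time in nanoseconds over measuredRuns timed runs (measuredRuns >= 1).
    static long bestNanos(LongSupplier task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            sink = task.getAsLong(); // gives the JIT a chance to compile the hot code
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            sink = task.getAsLong();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        // Example task: sum the numbers 0..999999.
        LongSupplier task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        };
        System.out.println("best run: " + bestNanos(task, 20, 20) + " ns");
    }
}
```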
It is easy to get something wrong, whether in a performance idea, its implementation, or the benchmark measuring it. If you have any feedback or suggestions for the ideas, implementations or benchmarks presented here, I would very much appreciate it if you sent them to me. Techniques, JVMs and hardware evolve all the time. So should this Java performance tutorial, its implementations and its benchmarks.