RSync - Checksums

Is a Checksum Comparison Really Good Enough?
- Compensating for the Checksum Risk
Choosing Checksum Algorithms
- Rolling Checksum Algorithm
- Hash Algorithm

Jakob Jenkov
Last update: 2014-05-23

In the earlier texts I explained that RSync uses a checksum to detect if a block in one version of the file is equal to a block in another version.

A checksum is an algorithm that calculates a smaller value (in bytes) based on a larger block of data. Like a hash value.

Since the checksum is typically a lot smaller than the data itself, say a 16 byte checksum calculated on a 1024 byte block of data, of course multiple different blocks will result in the same checksum. The best the checksum can do is to make it very likely that two blocks are equal, if their checksums are equal.

In fact, calculating checksums from a block of data is subject to these two rules:

If the checksums are not equal, you are 100% sure that the two blocks of data cannot be equal. Equal blocks of data will produce the same checksum, using the same checksum algorithm.
If the checksums are equal, the blocks may be equal too, but this is not a 100% guarantee.

The larger the checksum is (the more bytes) compared to the block of data, the larger the probability is that the two blocks of data are equal, if their checksums match. For instance, a 1 byte checksum on a 1024 byte checksum will result in many blocks of data that will produce the same checksum. A 16 byte checksum will result in a lot less blocks of data that produces the same checksums.

Is a Checksum Comparison Really Good Enough?

A checksum algorithm is usually designed to make small changes to the block of data make large changes in the checksum. And, usually you either make smaller changes to a file, or you add completely new data to it. So, comparing checksums gives you a probability that the two blocks of data are equal, but not a 100% guarantee.

In reality, your files most often do not mutate completely randomly. First of all you often make smaller changes to them, which most checksum algorithms will catch. Second, not all file types can take on all possible mutations of a block of data. File formats often force some kind of structure into the file, which limits the number of mutations a block of data can assume.

At the end of the day, though, all you get is a high probability that two blocks of data are equal, but not a 100% guarantee. This is the tradeoff for a much faster file synchronization, than if you send across the whole file.

Compensating for the Checksum Risk

Most often you will use RSync for backing up files. Of course you don't want your files to be incorrectly backed up, because of checksum being equal for different blocks. But you still want the faster backup. So here is a method to compensate for that.

When you backup a file the first time, copy the complete file across the network. Actually, if you compare a non-empty file to an empty file (no pre-existing copy remotely yet) using RSync, the file will get copied across correctly.

Normally you backup files regularly. For instance, you backup files daily. What you can do then, is to create one full copy every sunday, and make RSync backups the rest of the week. That ensures that at least every week you have a 100% guarantee of a full, correct backup. The rest of the week you have a high probability of a correct backup. Of course you would have to keep the sunday files separate from the rest-of-the-week backups, to make sure you do not overwrite your one full, correct copy.

Choosing Checksum Algorithms

To have a reasonable likeliness of blocks being equal if their checksums are equal, choose a somewhat large checksum algorithm. By "large" I mean one that produces a larger value in terms of bytes. In my RSync implementation I have chosen a 16 byte hash algorithm as checksum algorithm.

A 16 byte hash algorithm is slow to calculate on every possible block of data in a file (remember the byte-for-byte iteration). Therefore the 16 byte hash algorithm is combined with a smaller, much faster rolling checksum algorithm.

RSync uses the fast, rolling checksum algorithm to weed out checksum mismatches quickly. As mentioned earlier, if the checksum of two blocks are not equal, the blocks are not equal either. The faster checksum algorithm will of course result in checksum matches more often than the 16 byte hash algorithm. Thus, when a rolling checksum match is found, the 16 bit hash value for that block is also calculated. If both the rolling checksum and the 16 byte hash value matches, it is considered a match.

Rolling Checksum Algorithm

The rolling checksum algorithm used in RSync consists of two values. An A value and a B value. In the original RSync algorithm, A and B are 16 bit values. In my own implementation they are each 32 bit values.

What is interesting about the faster rolling checksum algorithm is, that as the block for which the checksum is calculated rolls, the checksum can be updated by only subtracting the value of the byte just moving out of the block, and adding the value just moving into the block. This makes it really fast to compute the checksum for rolling blocks of data.

Here is how these values are initialliy calculated on a block of data:

data = array of data where block of data exists in.
i    = index into block. Goes from 0 to blockSize-1;

A = data[0] + data[1] + ... data[i];
B = blockSize * data[0] + (blockSize-1) * data[1]
    + ... +
    (blockSize - i) * data[i] ... + 1 * data[blockSize-1];

It is interesting to notice, that another way to calculate B is just to add A to it, whenever A is updated, like this:

A += data[i];
B += A;

Why is this true? Well, the bytes in data[] are added to A one at a time. First time you add a byte to A, A is equal to that byte's value. This is also equal to 1 * data[i];

As you add the next byte to A, and add A to B again, the first byte has now been added twice to B, and the second byte once. This is equal to 2 * data[0] + 1 * data[1].

As you continue doing this, until the block size, the first byte would have been added to B blockSize times, the second byte has been added blockSize - 1 times etc. which is equal to the first formula given for B.

When updating A and B, when the block 1 byte down the data, you update A and B like this:

start = start index of new block.
end   = end index of new block.  

A -= data[start-1];   //remove old, first byte.
A += data[end];

B -= blockSize * data[start-1];
B += A;

Hash Algorithm

RSync used MD4 as checksum originally, but has been changed to use MD5 later, according to the Wikpedia page on RSync. Java has a built-in imlementation that can be used for both of these algorithms.

Next: RSync - Network Protocol

Tweet
	Jakob Jenkov