Abstract
Adler32 is a common checksum used for checking the integrity of data in applications such as zlib*, a popular compression library. It is designed to be fast to execute in software, but in this paper we present a method to compute it with significantly better performance than the previous implementations. We show how the vector processing capabilities of Intel® Architecture Processors can be exploited to efficiently compute the Adler32 checksum.
Introduction
The Adler32 checksum (https://en.wikipedia.org/wiki/Adler-32) is similar to the Fletcher checksum, but it is designed to catch certain differences that Fletcher is not able to catch. It is used, among other places, in the zlib data compression library (https://en.wikipedia.org/wiki/Zlib), a popular general-purpose compression library.
While scalar implementations of Adler32 can achieve reasonable performance, this paper presents a way to further improve the performance by using the vector processing feature of Intel processors. This is an extension of the method we used to speed up the Fletcher checksum as described in (https://software.intel.com/en-us/articles/fast-computation-of-fletcher-checksums).
Implementation
If the input stream is considered to be an array of bytes (data), the checksum essentially consists of two 16-bit words (A and B), and the checksum can be defined as:
for (i=0; i<end; i++) { A = (A + data[i]) % 65521; B = (B + A) % 65521; }
Doing the modulo operation after every addition is expensive. A well-known way to speed this up is to do the addition using larger variables (for example, 32-bit dwords), and then to perform the modulo only when the variables are about to risk overflowing, for example:
for (i=0; i<5552; i++) { A = (A + data[i]); B = (B + A); } A = A % 65521; B = B % 65521;
The reason that up to 5552 bytes can be processed before needing to do the modulo is that if A and B are initially 65520 and the data is all 0xFF (255), after processing 5552 bytes, B (the larger of the two) will be 0xFFFBC598. But if one processes 5553 such bytes, the result would be greater than 232.
Within that loop, the calculation looks the same as in Fletcher, so the same approach can be used to vectorize the calculation. In this case, the body of the main loop would be an unrolled version of:
pmovzxbd xdata0, [data] ; Loads byte data into dword lanes paddd xa, xdata0 paddd xb, xa
One can see that this looks essentially identical to what one would do with scalar code, except that it is operating on vector registers and, depending on the hardware generation, could be processing 4, 8, or 16 bytes in parallel.
If “a[i]” represents the i’th lane of vector register “a” and N is the number of lanes, we can (as shown in the earlier paper) calculate the actual sums by:
The sums can be done using a series of horizontal adds (PHADDD), and the scaling can be done with PMULLD.
In pseudo-code, if the main loop is operating on eight lanes (either with eight lanes in one register or four lanes unrolled by a factor of two), this might look like:
While (size != 0) { s = min(size, 5222) end = data + s – 7 while (data < end) { compute vector sum data += 8 } end += 7 if (0 == (s & 7)) { size -= s; reduce from vector to scalar sum compute modulo continue while loop } // process final 1…7 bytes Reduce from vector to scalar sum Do final adds in scalar loop Compute modulo }
Performance
The following graph compares the cycles as a function of input buffer size for an optimized scalar implementation, and for both a Streaming SIMD Extension and Intel® Advanced Vector Extensions 2 (Intel® AVX2) based parallel version, as described in this paper.
One can clearly see that the vector versions have a significantly better performance than an optimized scalar one. This is true for all but the smallest buffers.
An Intel® Advanced Vector Extensions 512 version was not tested, but it should perform significantly faster than the Intel AVX2 version.
Versions of this code are in the process of being integrated and released as part of the Intel® Intelligent Storage Acceleration Library (https://software.intel.com/en-us/storage/ISA-L).
Conclusion
This paper illustrated a method for improved Adler32 checksum performance. By leveraging architectural features such as SIMD in the processors and combining innovative software techniques, large performance gains are possible.
Author
Jim Guilford is an architect in the Intel Data Center Group, specializing in software and hardware features relating to cryptography and compression.