Channel: Intel Developer Zone Articles

Accelerating texture compression with Intel® Streaming SIMD Extensions


Improving ETC1 and ETC2 texture compression

 

What is texture compression?

Texture compression has long been used in computer graphics to reduce memory consumption and save bandwidth in the graphics pipeline. It is supported by modern graphics APIs, such as OpenGL* ES and DirectX*. The process of compressing a texture is lossy. Existing algorithms must not only achieve the best speedups but also preserve as much of the original information as possible.

Popular compression algorithms are DXT, PVRTC, ETC and ASTC. These algorithms are designed with more emphasis on decompression speed than on compression speed. The reason is that, in a real-world scenario, graphics artists and engineers are expected to perform compression offline and only once, whereas decompression is done at runtime each time the application uses the textures.

But compression is important too! It is not uncommon for applications to perform runtime compression to save storage space or bandwidth. An example of this can be found in the popular browsers, Opera* and Chrome*. Runtime compression allows the browsers to compress graphic tiles in circumstances where RAM is scarce and data transfers to GPU are expensive (for example, on mobile devices).

ETC textures

ETC stands for Ericsson* Texture Compression and is an open standard supported in OpenGL and OpenGL ES. The technique allows lossy compression of images at a ratio of 4:1 (depending on input format and compression method).

The original format was ETC1 (published as iPACKMAN), which was based on a previous project named PACKMAN [1]. The goal of the format was to enable high-quality and low-complexity texture compression for mobile devices. The format was able to handle opaque textures (discarding the alpha channel or encoding it separately). The format was later extended to ETC2 and included alpha channel compression as well as new methods for improving RGB-image quality.

Open source encoders for these formats are hard to find. Vendors prefer to keep their solutions proprietary or closed source.

ETC1 textures are supported in Android* and benefit from GPU hardware decompression.

As of API level 8, developers can use the ETC1Util class in the android.opengl package to perform runtime compression in their applications.

ETC works on 4x4 pixel blocks. This allows random access during decompression and good data alignment for SIMD operations. The pixel ordering in a block is different from what you would otherwise expect to find in memory:


Pixel layout for ETC blocks according to Khronos OpenGL ES Specification
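To make the ordering concrete, here is a minimal sketch of the index mapping. It assumes the column-major layout described in the Khronos specification, where texels a through p run down each column before moving to the next one. The function names are ours and purely illustrative.

```c
#include <stdint.h>

/* Map (x, y) inside a 4x4 block to the ETC texel index (a..p = 0..15).
 * Assumption: texels are ordered column by column, top to bottom. */
int etc_texel_index(int x, int y) {
    return x * 4 + y;
}

/* Reorder 16 pixels from the usual row-major order into ETC order. */
void etc_reorder_block(const uint32_t row_major[16], uint32_t etc_order[16]) {
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            etc_order[etc_texel_index(x, y)] = row_major[y * 4 + x];
        }
    }
}
```

With this mapping, the second texel in ETC order is the pixel directly below the top-left one, not the one to its right.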

Every pixel can be represented in RGBA color space, with each channel having values from 0 to 255. The total size of a block is 512 bits (16 pixels of 32 bits each). The size is important because it allows the data to be unpacked and stored into 128-bit and 256-bit registers for computation.

The operations performed on the block by ETC encoders aim to determine two base colors around which to approximate the rest. Picking these colors is a complex problem, with an exhaustive search consisting of 2^34 possibilities for a single block.

Improve quality and speed with SSE and AVX instructions

The goal of this article is to present ways in which Intel® Streaming SIMD Extensions can be used to accelerate the encoding process for ETC textures. The speed benefit can be traded to try different combinations and approaches that will in turn improve the quality of the compressed image.

x86 and x86_64 CPUs offer SIMD extensions for operations on large registers:

  • 128-bit for SSE2 through 4.2
  • 256-bit for AVX and AVX2

The supported instructions consist of load/store, logical and arithmetical operations on integers (8-bit, 16-bit, 32-bit) and floating point numbers (single and double precision). These should be enough to cover most compute-intensive parts of the encoding algorithm.

ETC1

ETC1 operates on 4x4 pixel blocks: it splits the block either vertically or horizontally into two 4x2 or 2x4 sub-blocks and computes the average color value for each sub-block. Once it does that, it uses a special modifier table to approximate the colors in the sub-blocks as the average color plus or minus a table entry value. The table layout can be found in the official specification[2].

The approximation is done by generating all the possible color combinations from the average color and the table and calculating the deviation (or error) from the original pixel. The table index with the smallest error is then selected to encode the outgoing compressed pixel.
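The search described above can be sketched in scalar C as follows. This is an illustration rather than the code used for the measurements in this article. The modifier values are the ETC1 intensity table as we recall it from the specification[2] and should be checked against it. The function name, the interleaved RGB input layout, and the selector numbering (an index into a table row, not the packed bit encoding) are our assumptions.

```c
#include <limits.h>
#include <stdint.h>

/* ETC1 intensity modifiers, one row per table codeword (verify against [2]). */
static const int kEtc1Modifiers[8][4] = {
    {   -8,  -2,  2,   8 }, {  -17,  -5,  5,  17 },
    {  -29,  -9,  9,  29 }, {  -42, -13, 13,  42 },
    {  -60, -18, 18,  60 }, {  -80, -24, 24,  80 },
    { -106, -33, 33, 106 }, { -183, -47, 47, 183 },
};

static int clamp255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Squared RGB error between one pixel and (base + modifier), clamped. */
static long pixel_error(const uint8_t *px, const uint8_t *base, int mod) {
    long err = 0;
    for (int c = 0; c < 3; ++c) {
        int d = (int)px[c] - clamp255((int)base[c] + mod);
        err += (long)d * d;
    }
    return err;
}

/* pixels: 8 sub-block pixels, RGB interleaved (24 bytes). base: 3 bytes.
 * Returns the table codeword with the smallest total error and writes the
 * per-pixel modifier selectors (0..3, indices into the chosen row). */
int etc1_best_table(const uint8_t *pixels, const uint8_t *base,
                    uint8_t selectors[8], long *total_error) {
    int best_table = 0;
    long best_error = LONG_MAX;
    for (int t = 0; t < 8; ++t) {
        uint8_t sel[8] = {0};
        long table_error = 0;
        for (int p = 0; p < 8; ++p) {
            long best_px = LONG_MAX;
            for (int m = 0; m < 4; ++m) {
                long e = pixel_error(pixels + 3 * p, base, kEtc1Modifiers[t][m]);
                if (e < best_px) { best_px = e; sel[p] = (uint8_t)m; }
            }
            table_error += best_px;
        }
        if (table_error < best_error) {
            best_error = table_error;
            best_table = t;
            for (int p = 0; p < 8; ++p) selectors[p] = sel[p];
        }
    }
    *total_error = best_error;
    return best_table;
}
```

Each pixel picks its modifier independently, so the inner loops only add up per-pixel minima. This independence is what makes the error computation a good fit for SIMD.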

The two sub-block base colors (calculated as pixel averages in our case) are represented as 444 rather than 888. If their values are close, then the first color will be represented as 555, and the second will be computed from the first one by adding a 333 delta value. This mode of encoding the two base colors is called differential and is enabled by setting a special bit in the compressed data.
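A small sketch of the choice between the two encodings follows. It assumes the 3-bit delta is a two's complement value in the range -4 to 3, as in the ETC1 specification. The rounding used for quantization and all function names are our own illustrative choices.

```c
#include <stdbool.h>
#include <stdint.h>

/* Quantize an 8-bit channel to `bits` bits with rounding (bits = 4 or 5). */
static int quantize(int v8, int bits) {
    int max = (1 << bits) - 1;
    return (v8 * max + 127) / 255;
}

/* Try differential mode: first color as 555, second as a 333 delta.
 * Returns false when any channel delta falls outside [-4, 3]; in that
 * case the outputs are only partially filled and must be ignored. */
bool etc1_try_differential(const uint8_t c0[3], const uint8_t c1[3],
                           int base5[3], int delta3[3]) {
    for (int ch = 0; ch < 3; ++ch) {
        base5[ch] = quantize(c0[ch], 5);
        delta3[ch] = quantize(c1[ch], 5) - base5[ch];
        if (delta3[ch] < -4 || delta3[ch] > 3) return false;
    }
    return true;
}

/* Fallback (individual mode): both colors stored independently as 444. */
void etc1_individual(const uint8_t c0[3], const uint8_t c1[3],
                     int base4_0[3], int base4_1[3]) {
    for (int ch = 0; ch < 3; ++ch) {
        base4_0[ch] = quantize(c0[ch], 4);
        base4_1[ch] = quantize(c1[ch], 4);
    }
}
```

An encoder would typically call etc1_try_differential first and fall back to etc1_individual when it returns false.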

SSE registers can take advantage of ETC block data locality when performing compression. Values corresponding to each color channel can be stored and computed separately. For the current solution, to calculate the average base colors, we can use SSE to split the original 4x4 pixel block into four 2x2 blocks, process these subblocks, and then gather the data for the corresponding 2x4 or 4x2 ETC1 representation:


Computing four pixels at a time reduces the number of instructions and maps well to SSE registers. See code sample below.

This approach is able to match the quality of a generic non-SIMD implementation.

Below are some examples of test cases and results for 256x256 images.

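PSNR (Peak Signal to Noise Ratio) for RGB888 images is defined as:

$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right)$

With MSE representing the mean square error, calculated from the differences between the pixels of the original image and those of the image resulting after decompression:

$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (o_i - c_i)^2$

where $o_i$ and $c_i$ are the $N$ channel values of the original and decompressed images.

RMSE represents the root-mean-square deviation:

$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$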


Top Left: Original image, Top Right: ETC Compressed,
Bottom Right: Stats, Bottom Left: ETC Compressed using SSE2


Above: Average performance improvement of SSE ETC1 implementation compared to regular code


For large images (2048x2048) the graphical deterioration is less visible. This image was compressed in 0.1 seconds (120 MB/sec) and has a PSNR of 31 dB.

Results for other images behave in a similar manner, with differences between implementations varying by +/-0.1 in PSNR and RMSE and a speed improvement by a factor of 2 when using SIMD.


Difference between original image and compressed version.
Pixels were computed as 255 - |original - compressed|
The whiter the pixel the closer it is to the original value.

ETC2

ETC2 brings new improvements to ETC1’s compression: it adds both extra modes for RGB textures and alpha compression.

For RGB, it takes advantage of some invalid color combinations that can result during differential mode and introduces three novel compression techniques:

  • T
  • H
  • Planar

T and H mode require the block to be encoded using four RGB444 colors. The decompressed pixels are associated with one of the four colors. Converting a 444 color to 888 is done by duplicating the bits; for example, abcd efgh ijkl becomes abcdabcd efghefgh ijklijkl.
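The bit duplication can be written in a few lines of C. This is a minimal sketch; the packed layouts (0x0RGB for the input, 0x00RRGGBB for the output) and the function names are our assumptions.

```c
#include <stdint.h>

/* Expand one 4-bit channel to 8 bits by replication: abcd -> abcdabcd. */
static uint8_t expand4to8(uint8_t v4) {
    return (uint8_t)((v4 << 4) | v4);
}

/* Expand a packed RGB444 color (0x0RGB) to packed RGB888 (0x00RRGGBB). */
uint32_t rgb444_to_rgb888(uint16_t c) {
    uint32_t r = expand4to8((uint8_t)((c >> 8) & 0xF));
    uint32_t g = expand4to8((uint8_t)((c >> 4) & 0xF));
    uint32_t b = expand4to8((uint8_t)(c & 0xF));
    return (r << 16) | (g << 8) | b;
}
```

Replication maps 0x0 to 0x00 and 0xF to 0xFF, so the full 8-bit range is covered, which a plain left shift by four would not achieve.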

In both modes the four colors (paint colors) are computed using two base colors.

The difference between T and H is the relation between the four paint colors. In T mode one base color is used directly as a paint color, while the other base color is used to determine the other three. A delta value is selected from a lookup table and subtracted from or added to the second base color:

base0
base1 - delta
base1
base1 + delta

The delta is then encoded with the base colors into the compressed block.

In H mode the paint colors are computed as follows:

base0 - delta
base0 + delta
base1 - delta
base1 + delta

This encoding mode makes H suitable for color blocks that can be modulated in the intensity direction.
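Both sets of paint colors can be generated with the same helper. The sketch below assumes base colors already expanded to RGB888 and a delta already looked up from the distance table. The Rgb type and function names are hypothetical, and the paint colors are listed in the order used in this article, which is not necessarily the index order of the compressed bitstream.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b; } Rgb;

static uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Add the same signed offset to all three channels, clamping to 0..255. */
static Rgb shift_color(Rgb c, int d) {
    Rgb out = { clamp_u8(c.r + d), clamp_u8(c.g + d), clamp_u8(c.b + d) };
    return out;
}

/* T mode: base0 is used as is, base1 spawns three colors. */
void t_mode_paint_colors(Rgb base0, Rgb base1, int delta, Rgb paint[4]) {
    paint[0] = base0;
    paint[1] = shift_color(base1, -delta);
    paint[2] = base1;
    paint[3] = shift_color(base1, delta);
}

/* H mode: each base color spawns a darker and a brighter color. */
void h_mode_paint_colors(Rgb base0, Rgb base1, int delta, Rgb paint[4]) {
    paint[0] = shift_color(base0, -delta);
    paint[1] = shift_color(base0, delta);
    paint[2] = shift_color(base1, -delta);
    paint[3] = shift_color(base1, delta);
}
```

Because the offset is applied equally to R, G, and B, the paint colors move along the intensity direction, which is why these modes suit blocks with two distinct hues at varying brightness.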

Last but not least, Planar mode provides a way to compress blocks that slowly vary in one direction (as seen in gradients). It uses three RGB676 colors to represent the rest of the block through a series of weighted averages based on each pixel’s position.


The three colors are used to interpolate the rest of the block on decompression.
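As an illustration, the interpolation for a single channel can be sketched as below. The formula follows the Planar decoding rule of the ETC2 specification as we understand it, so it should be checked against [2]. The parameter names o, h, and v (origin, horizontal, and vertical colors, already expanded to 8 bits) are ours.

```c
#include <stdint.h>

/* Interpolate one channel at position (x, y), with x and y in 0..3.
 * o is the color at the block origin; h and v are the colors one block
 * width to the right of and one block height below the origin. */
uint8_t planar_channel(int o, int h, int v, int x, int y) {
    int n = x * (h - o) + y * (v - o) + 4 * o + 2; /* +2 rounds the /4 */
    if (n < 0) return 0;                           /* clamp low */
    n >>= 2;
    return (uint8_t)(n > 255 ? 255 : n);           /* clamp high */
}

/* Decode one channel of a full 4x4 block into row-major order. */
void planar_decode_channel(int o, int h, int v, uint8_t out[16]) {
    for (int y = 0; y < 4; ++y) {
        for (int x = 0; x < 4; ++x) {
            out[y * 4 + x] = planar_channel(o, h, v, x, y);
        }
    }
}
```

On the encoder side, the same function can be used to evaluate a candidate triple of colors by decoding the block and measuring the error against the original pixels.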

Implementing ETC2 for RGB

The most complex task in implementing the T and H modes for ETC2 is determining the base colors in an efficient way (speed-wise and quality-wise). The original paper detailing ETC2 suggests using the Linde–Buzo–Gray (LBG [3]) algorithm and radius compression to identify the two base colors and then iterating through all the available options, finding the best combination of paint colors based on T, H, and deltas.

For our implementation we used a more straightforward approach. We iterated through each compression mode (ETC1, Planar, T, and H) and selected the one that best matched. ETC1 was implemented as described in the previous section.

Planar colors were selected based on the definition, and then the mean square error was computed for each pixel in the block. SIMD instructions allow computing the base colors and MSE up to 60% faster compared to normal code. This mode is the fastest compute-wise, and it makes sense to try it first in order to use its MSE as a reference or threshold for the other compression modes.

As a fast implementation of T and H mode, we create a map containing the RGB444 colors corresponding to the RGB888 pixels in the block (T and H mode each approximate the block using four RGB444 colors). If the number of elements in the map is one, two, or three, we pass those colors as base colors in T mode. While iterating through the delta table to select the best value for each pixel in T mode, we identify all other pixels that match within an RGB444 radius and set their color as base0. For the rest of the pixels, we compute an average, set it as base1, and determine the best delta.

In H mode we split the colors into two clusters using a k-means algorithm and then compute the base colors as cluster averages, with the delta being selected in order to create the smallest error.

SSE can benefit T and H implementations by computing the mean square error per pixel. When adding and multiplying pixel values at this stage, it makes sense to expand each byte in order to avoid overflows. This means we can only store four, or at most eight, values in an SSE register. The resulting speedup is between 1.5x and 2x.
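The widening step can be sketched with SSE2 intrinsics as follows. This is a minimal example rather than the code behind the numbers above: it sums the squared differences of 16 bytes (for instance, one color channel of a block) by expanding them to 16-bit lanes first. The function names are ours, and a scalar fallback is included so that the snippet also builds on targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define ETC_HAVE_SSE2 1
#endif

/* Reference: sum of squared differences over 16 bytes. */
uint32_t squared_error16_scalar(const uint8_t *a, const uint8_t *b) {
    uint32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint32_t)(d * d);
    }
    return sum;
}

uint32_t squared_error16(const uint8_t *a, const uint8_t *b) {
#ifdef ETC_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Widen bytes to 16-bit lanes: eight values per register. */
    __m128i d_lo = _mm_sub_epi16(_mm_unpacklo_epi8(va, zero),
                                 _mm_unpacklo_epi8(vb, zero));
    __m128i d_hi = _mm_sub_epi16(_mm_unpackhi_epi8(va, zero),
                                 _mm_unpackhi_epi8(vb, zero));
    /* Square and add adjacent pairs into four 32-bit lanes. */
    __m128i s = _mm_add_epi32(_mm_madd_epi16(d_lo, d_lo),
                              _mm_madd_epi16(d_hi, d_hi));
    /* Horizontal sum of the four 32-bit lanes. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(s);
#else
    return squared_error16_scalar(a, b);
#endif
}
```

The differences fit in signed 16-bit lanes (-255 to 255), and _mm_madd_epi16 produces 32-bit results, so no intermediate value can overflow.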

AVX allows us to load an entire color channel in a single register (for example: 16 red values expanded to 16 bits each) and perform all the required operations on it. When using all these modes combined we can gain up to 3 dB in image quality at the expense of a roughly 10-50x performance penalty over ETC1. For comparison, the radius compression proposed by the authors of ETC2 can be 729 to 15,625 times slower, and the gain is 3.5 dB higher than ETC1 and 1 dB higher than LBG for finding T and H base colors[4].

By tuning the way we select the base colors, we can improve the speed to reach about 3x ETC1 time for an average improvement of 0.5 dB (up to 4 dB for gradients). This improvement makes ETC2 suitable for performing runtime compression in environments where speed is more important than quality.

Impact on the future

This solution has the potential to reduce memory consumption and save power on devices that use compressed textures. Using SIMD instructions to implement the texture encoders over regular code might further reduce the memory accesses required to perform the operation by storing most of the data in registers.

New encoding techniques such as Adaptive Scalable Texture Compression (ASTC) can be accelerated using Intel® Streaming SIMD Extensions and can make run-time compression a viable alternative on mobile and entry-level devices.

Instruction sets such as AVX-512 could allow more data to be handled at once and enable better precision on performance-oriented machines. While decoding takes place in the hardware, encoding can be achieved in a reasonable amount of time and with sufficient quality in software by taking advantage of CPU features such as SSE and AVX.

Code Samples


Computing average values for vertical and horizontal sub-blocks in ETC1
Input data is stored as __m128i registers: red[0], red[1], red[2], red[3]
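A minimal sketch of this computation for one channel is shown below. It assumes each of the four registers holds one row of the block as four 32-bit values; here they are loaded from a row-major array, and the function and type names are hypothetical. The 2x2 quadrant sums are computed first and then combined into the left/right and top/bottom halves. A scalar fallback keeps the snippet buildable on targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define ETC_HAVE_SSE2 1
#endif

typedef struct { uint32_t left, right, top, bottom; } HalfAverages;

/* red: 16 values of one channel, row-major, widened to 32 bits. */
HalfAverages etc1_half_averages(const uint32_t red[16]) {
    uint32_t q[4]; /* 2x2 quadrant sums: TL, TR, BL, BR */
#ifdef ETC_HAVE_SSE2
    __m128i r0 = _mm_loadu_si128((const __m128i *)(red + 0));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(red + 4));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(red + 8));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(red + 12));
    __m128i top = _mm_add_epi32(r0, r1); /* column sums of rows 0-1 */
    __m128i bot = _mm_add_epi32(r2, r3); /* column sums of rows 2-3 */
    /* Add each lane to its right neighbor: lanes 0 and 2 hold 2x2 sums. */
    top = _mm_add_epi32(top, _mm_srli_si128(top, 4));
    bot = _mm_add_epi32(bot, _mm_srli_si128(bot, 4));
    uint32_t t[4], b[4];
    _mm_storeu_si128((__m128i *)t, top);
    _mm_storeu_si128((__m128i *)b, bot);
    q[0] = t[0]; q[1] = t[2]; q[2] = b[0]; q[3] = b[2];
#else
    for (int i = 0; i < 4; ++i) {
        int base = (i / 2) * 8 + (i % 2) * 2; /* top-left of quadrant i */
        q[i] = red[base] + red[base + 1] + red[base + 4] + red[base + 5];
    }
#endif
    HalfAverages avg;
    avg.left   = (q[0] + q[2] + 4) / 8; /* 8 pixels per half, rounded */
    avg.right  = (q[1] + q[3] + 4) / 8;
    avg.top    = (q[0] + q[1] + 4) / 8;
    avg.bottom = (q[2] + q[3] + 4) / 8;
    return avg;
}
```

Computing the four quadrant sums once lets both split directions reuse them, so the encoder can evaluate the vertical and horizontal splits without revisiting the pixels.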

References

[1] http://www.jacobstrom.com/publications/StromAkenineGH05.pdf
[2] https://www.khronos.org/registry/gles/specs/3.2/es_spec_3.2.pdf
[3] http://mlsp.cs.cmu.edu/courses/fall2010/class14/LBG.pdf
[4] http://www.ep.liu.se/ecp/016/002/ecp01602.pdf

