Introduction
Motivation
Vector units in CPUs have become the de facto standard for accelerating media and other kernels that exhibit parallelism according to the single instruction, multiple data (SIMD) paradigm.1 These units enable a single register file to be treated as a combination of multiple registers, whose cumulative width equals that of the vector register file. A single instruction can therefore operate in parallel on all data in this vector register, resulting in significant speedups to applications that exhibit data access trends that fit this pattern. Starting from a 64-bit vector register file that may be treated as eight 8-bit registers in the architecture extended with MMX™ technology, SIMD on Intel® architecture processors has evolved to enable 256-bit register files that allow for 32 parallel 8-bit operations in Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) generations.
Kernels in media workloads fit this pattern of execution naturally, because the same operation (filtering for example) is uniformly applied across several pixels of a frame. Consequently, several popular open source projects leverage SIMD instructions for code acceleration. The x264 project for Advanced Video Coding (AVC) encoding2 and the x265 project for High Efficiency Video Coding (HEVC) encoding3 are the two widely used media libraries that extensively use multiple generations of SIMD instructions on Intel architecture processors, from MMX technology all the way up to Intel AVX2. As shown in Figure 1, x264 and x265 achieve two times and five times speedup respectively over their corresponding baselines that do not use any SIMD code. The x265 encoder gains more performance from Intel AVX2 when compared to x264, because the quantum of work done per frame is substantially larger for HEVC than for AVC.4
Figure 1. Performance benefit for x264 and x265 from Intel® Advanced Vector Extensions 2 for 1080p encoding with main profile using an Intel® Core™ i7-4500U Processor.
Focus of this whitepaper
The recently released Intel® Xeon® Scalable processors, part of the platform formerly code-named Purley, have introduced the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set.5 Intel AVX-512 instructions are capable of performing two times the number of operations in the same number of cycles as the previous generation Intel AVX2 instruction set. To accommodate this increased throughput, a larger fraction of the die is utilized, resulting in increased power being consumed, when compared to the previous-generation SIMD units. Therefore, certain Intel AVX-512 instructions are expected to cause a higher degradation to CPU clock frequency than others.6 While this reduction in frequency is offset by the increased throughput for the Intel AVX-512 instructions, media kernels continue to rely significantly on SIMD instructions in older generations (because not all kernels benefit from the increased width) and on straight-line C code that is not amenable to SIMD conversion, which may see reduced performance.
This whitepaper presents a case study based on our experience using the Intel AVX-512 SIMD instructions to accelerate the compute intensive kernels of x265. We describe how we offset the reduction in CPU frequency to ensure that the overall encoder achieves positive performance benefits. Through this process, we present recommendations of when we think Intel AVX-512 should be enabled with x265 for HEVC encoding. We also share our experience on when to choose Intel AVX-512 as a vehicle for accelerating media kernels.
Key takeaways
Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:
- When choosing specific kernels that can be accelerated with Intel AVX-512, the kernel's compute-to-memory ratio should be considered. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
- For desktop and workstation SKUs (like the Intel® Core™ i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations, because the reduction in CPU clock frequency is rather low.
- For server SKUs (like the Intel® Xeon® Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.
Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.
While the results and recommendations presented in this paper are not without the limitations of the evaluations and our experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.
The rest of the paper is organized as follows: The "Background" section presents the background relevant to the technical material presented in the paper. "Acceleration of x265 Kernels with Intel Advanced Vector Extensions 512" discusses the choices we made to accelerate specific kernels of x265 and discusses results for the main and main10 profiles. "Accelerating x265 Encoding with Intel Advanced Vector Extensions 512" presents the results for the overall encoder for the main and main10 profiles. Finally, Section 5 provides detailed recommendations for when Intel AVX-512 should be enabled when using x265 and generic recommendations for when Intel AVX-512 should be chosen when accelerating specific kernels. This section also describes future work.
Background
This section presents the relevant background of the concepts presented in this paper. Specifically, section "HEVC Video Encoding" provides the background on HEVC. "x265, an Open Source HEVC Encoder" discusses x265 with specific focus on the existing methods of performance optimizations that it employs. Section "Introduction to the Intel® Xeon® Scalable Processor Platform" presents the relevant background on Intel Xeon Scalable processors, and Section "SIMD Vectorization Using Intel Advanced Vector Extensions 512" discusses in more detail the Intel AVX-512 architecture.
HEVC video encoding
HEVC was ratified as an encoding standard by the JCT-VC (Joint Collaborative Team on Video Coding) in 2013 as a successor to the vastly popular AVC standard.4 The video encoding and decoding processes in HEVC revolve around identifying three units: a coding unit (CU) that represents each block in the picture, a prediction unit (PU) that represents the mode decision, including motion compensated prediction of the CU, and a transform unit (TU) that represents the way in which the generated residual error between the predicted and the actual block is coded.
Initially, a frame is divided into a sequence of its largest non-overlapping coding units, each called a coding tree unit (CTU). A CTU can then be split into multiple CUs with variable sizes of 64x64, 32x32, 16x16, and 8x8 to form a quad-tree. Each CU is then predicted from a set of candidate blocks, which may be in either the same frame or different frames. If the block used for the prediction is in the same frame, the block is said to be intra-predicted, while if it is in a different frame, it is said to be inter-predicted.
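The quad-tree partitioning described above can be sketched in a few lines. This is only an illustration, not x265 code: `split_ctu` and the `should_split` predicate are hypothetical names, and a real encoder decides whether to split through rate-distortion analysis rather than a fixed rule.

```python
from typing import Callable, List, Tuple

CU = Tuple[int, int, int]  # (x, y, size) of a square coding unit

def split_ctu(x: int, y: int, size: int,
              should_split: Callable[[int, int, int], bool],
              min_size: int = 8) -> List[CU]:
    """Recursively partition a CTU into CUs along a quad-tree."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        cus: List[CU] = []
        # Visit the four quadrants in raster order.
        for dy in (0, half):
            for dx in (0, half):
                cus.extend(split_ctu(x + dx, y + dy, half, should_split, min_size))
        return cus
    return [(x, y, size)]

# Hypothetical rule for illustration: keep splitting only the top-left block.
cus = split_ctu(0, 0, 64, lambda x, y, size: x == 0 and y == 0)
```

Whatever rule is used, the resulting CUs tile the 64x64 CTU exactly once, which is the property an encoder relies on when it evaluates each CU independently.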
Intra-predicted blocks are represented by a combination of the prediction block and a mode that denotes the angle of the prediction. The allowed modes for intra-prediction are labeled DC, planar, and angular modes representing various angles from the predicted block. Inter-predicted blocks are represented by a combination of the block used for prediction (the reference block) and the motion vector (MV) that represents the delta between the current and the reference block. Blocks that have zero MV are said to use the merge mode, while others use the AMP (Advanced Motion Prediction) mode. The skip mode is a special case of the merge mode when the predicted block is identical to the source, that is, there is no residual. The AMP modes may use PUs that are the same size as the CU (denoted as 2Nx2N PUs) or may further partition them (denoted as rectangular and asymmetric PUs) to compute the MVs. The residual generated as the difference between the original and the predicted picture is then quantized and coded using TUs that may vary from 32x32 down to 4x4 blocks, depending on the prediction mode.
The entire process of inter, intra, CU, PU, and TU selection is called Rate-Distortion Optimization (RDO). The goal of RDO is to ensure that distortion is minimized at the target bitrate or the bitrate is minimized at the target quality level as represented by distortion. Throughout the process of RDO, various combinations of CUs, PUs, and TUs are attempted by an encoder, for which it employs several kernels. In this paper, we chose to vectorize these specific kernels by converting them to use Intel AVX-512 instructions.
HEVC encoding also supports multiple profiles for encoding a video, with each profile representing a different number of bits used to represent each pixel. The main and main10 profiles are popular profiles of HEVC (their AVC counterparts are called main and high profiles respectively). Each component of a pixel is represented with a minimum of 8 bits in the main profile, resulting in values ranging from 0 to 255. The main10 profile uses 10 bits per component, allowing for a higher range of 0 to 1023 and enabling the representation of more details in the encoded video.
x265, an open source HEVC encoder
The x265 encoder is an open-source HEVC encoder that compresses video in compliance with the HEVC standard.7 This encoder has been integrated into several open-source frameworks including VLC*, HandBrake*,8 and FFmpeg9 and is the de facto open-source video encoder for HEVC. The x265 encoder has assembly optimizations for several platforms, including Intel architecture, ARM*, and PowerPC*.
The x265 encoder employs techniques for inter-frame and intra-frame parallelism to deal with the increased complexity of HEVC encoding.10 For inter-frame parallelism, x265 encodes multiple frames in parallel by using system-level software threads. For intra-frame parallelism, x265 relies on the Wavefront Parallel Processing (WPP) tool exposed by the HEVC standard. This feature enables encoding rows of CTUs of a given frame in parallel, while ensuring that the blocks required for intra-prediction from the previous row are completed before the given block starts to encode; as per the standard, this translates to ensuring that the next CTU on the previous row completes before starting the encode of a CTU on the current row. The combination of these features gives a tremendous boost in speed with no loss in efficiency compared to the publicly available reference encoder, HM.
Introduction to the Intel® Xeon® processor Scalable family platform
The Intel® Xeon® processor Scalable family, part of the Intel® platform formerly code-named Purley, is designed to deliver new levels of consistent and breakthrough performance. The platform is based on cutting-edge technology and provides compelling benefits across a broad variety of usage models including big data, artificial intelligence, high-performance computing, enterprise-class IT, cloud, storage, communication, and Internet of Things. Top enhancements include performance for a wide range of workloads with one and a half times the memory bandwidth, integrated network/fabric, and optional integrated accelerators. Our results in x265 indicate a significant gen-over-gen speedup of 50 to 67 percent for offline encodes when compared to the previous-generation Intel® Xeon® processor E5-2600.10 This boost comes primarily from the improved microarchitecture features available on Intel Xeon Scalable processors.
SIMD vectorization using Intel® AVX-512
The Intel AVX-512 vector blocks present a 512-bit register file, allowing 2X parallel data operations per cycle compared to that of Intel AVX2. Though the benefits of vectorizing kernels to use the Intel AVX-512 architecture seem obvious, several key questions must be answered specifically for media workloads before embarking on this task. First, is there sufficient parallelism inherently present in media kernels that they can leverage this increased parallelism? Second, is the fraction of the execution that exploits this parallelism sufficiently large such that we can expect average speedups as per Amdahl's law? Third, by enabling such vectorization, is there some effect on the execution of the serial and non-vector code?
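The second question can be made concrete with Amdahl's law. The sketch below is a generic illustration; the function name `amdahl_speedup` and the example values are assumptions chosen for readability, not measurements from this paper.

```python
def amdahl_speedup(vector_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `vector_fraction` of the run time is made
    `kernel_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / kernel_speedup)

# Illustrative values only: doubling the speed of half of the run time
# yields 1 / (0.5 + 0.25), well short of a 2x overall speedup.
overall = amdahl_speedup(0.5, 2.0)
```

The smaller the vectorized fraction, the closer the overall speedup stays to 1, which is why the share of encoder time spent in the accelerated kernels matters as much as the per-kernel gain.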
Acceleration of x265 Kernels with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
As a first step in acceleration, we selected the kernels from x265 to be accelerated with handwritten Intel AVX-512 instructions. While automated tools that generate vectorized SIMD code are available, we found that handwritten assembly outperforms auto-vectorizing tools, which convinced us to use this technique. This section details how the kernels were selected and the gains in cycle count we observed from these kernels for sample runs in the main and main10 profiles.
Selecting the kernels to accelerate
We selected over 1,000 kernels from the core compute of x265 to optimize with Intel AVX-512 instructions for the main and main10 profiles. These kernels were chosen based on their resource requirements. Some kernels may require frequent memory access, like the different block-copy and block-fill kernels, while others may involve intense computation, like the DCT, iDCT, and quantization kernels. There is also a third class of kernels that involve a combination of both in varying proportions. We found that ensuring that the buffers accessed by the assembly routines were 64-byte aligned reduces cache misses and in general helps Intel AVX-512 kernels. The complete lists of kernels optimized with Intel AVX-512 instructions for the main and main10 profiles are given in Appendix A1 and A2 respectively.
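The effect of 64-byte alignment can be illustrated with simple address arithmetic. This is a sketch that assumes a 64-byte cache line and a full 64-byte (512-bit) load; the helper names `crosses_cache_line` and `align_up` are hypothetical and do not come from x265.

```python
CACHE_LINE_BYTES = 64  # cache-line size assumed in this sketch
ZMM_BYTES = 64         # one 512-bit register holds 64 bytes

def crosses_cache_line(address: int, nbytes: int = ZMM_BYTES,
                       line: int = CACHE_LINE_BYTES) -> bool:
    """True if the byte range [address, address + nbytes) touches more than one cache line."""
    first_line = address // line
    last_line = (address + nbytes - 1) // line
    return first_line != last_line

def align_up(address: int, alignment: int = CACHE_LINE_BYTES) -> int:
    """Round `address` up to the next multiple of `alignment` (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)
```

A 64-byte load from a 64-byte-aligned address always stays inside one line, while the same load from any other address touches two, which is the case the alignment recommendation is meant to avoid.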
Framework to evaluate cycle-count improvements
The x265 encoder implements a sample test bench as a correctness and performance measurement tool for assembly kernels. It accepts valid arguments for a given kernel and invokes the C primitive and corresponding assembly kernel and compares both output buffers. It verifies all possible corner cases for the given input type by using a randomly distributed set of values. Each assembly kernel is called 100 times and checked against its C primitive output for ensuring the correctness. To measure performance improvement, the test bench measures the difference in the clock ticks (as reported by the rdtsc instruction) between the assembly kernel and the C kernel for 1,000 runs and reports the average between them.
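The structure of such a test bench can be sketched as follows. This is not the x265 test bench itself: the function names are hypothetical, a sum-of-absolute-differences routine stands in for a kernel, and `time.perf_counter_ns` stands in for the `rdtsc` tick counter, which is not directly available from Python.

```python
import random
import time
from typing import Callable, Sequence

Kernel = Callable[[Sequence[int], Sequence[int]], int]

def sad_ref(a: Sequence[int], b: Sequence[int]) -> int:
    """Reference primitive: sum of absolute differences."""
    total = 0
    for x, y in zip(a, b):
        total += abs(x - y)
    return total

def sad_opt(a: Sequence[int], b: Sequence[int]) -> int:
    """Stand-in for the optimized kernel under test."""
    return sum(abs(x - y) for x, y in zip(a, b))

def check_kernel(ref: Kernel, opt: Kernel, runs: int = 100,
                 n: int = 64, seed: int = 0) -> bool:
    """Compare both kernels on `runs` randomly generated 8-bit inputs."""
    rng = random.Random(seed)
    for _ in range(runs):
        a = [rng.randrange(256) for _ in range(n)]
        b = [rng.randrange(256) for _ in range(n)]
        if ref(a, b) != opt(a, b):
            return False
    return True

def average_ticks(kernel: Kernel, a: Sequence[int], b: Sequence[int],
                  runs: int = 1000) -> float:
    """Average elapsed time per call over `runs` calls (nanoseconds here, not rdtsc ticks)."""
    start = time.perf_counter_ns()
    for _ in range(runs):
        kernel(a, b)
    return (time.perf_counter_ns() - start) / runs
```

Running `check_kernel` before `average_ticks` mirrors the order described above: a kernel is only worth timing once its output matches the reference.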
Cycle-Count improvement for kernels in the main and main10 profiles
Figure 2 shows the cycle-count improvements for each of the 500 kernels in the main profile and the 600+ kernels in the main10 profile that were accelerated with Intel AVX-512. In each curve, the kernels are sorted in increasing order of their cycle-count gains over the corresponding Intel AVX2 implementation. Appendix A details the per-kernel gains over Intel AVX2 in cycle counts.
On average, we saw a 33 percent and 40 percent gain in the cycle count over the Intel AVX2 kernels for kernels in the main and main10 profiles, respectively. The reason for the higher gains in main10 is as follows. In the main10 profile, x265 uses 16 bits to represent each pixel, as opposed to the main profile, which uses 8 bits; although main10 technically only needs 10 bits, using 16 bits simplifies all data structures in the software. Therefore, the amount of work that has to be done for the same number of pixels is doubled. Due to the higher quantum of compute, kernels in the main10 profile gain more from Intel AVX-512 over Intel AVX2 than kernels in the main profile do. These results from cycle counts indicate that at the kernel level, there is much benefit in using Intel AVX-512 to accelerate x265. However, this does not account for the reduction in clock frequency incurred when using Intel AVX-512 instructions compared to using Intel AVX2 instructions. In the next section, we look at the effect on overall encoding time, which also accounts for this effect.
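The interaction between cycle-count gains and clock frequency can be sketched with a simple model. It assumes that a cycle-count gain of g means the Intel AVX-512 kernel needs (1 - g) times the cycles of the Intel AVX2 kernel; this interpretation, and the function names, are assumptions made for illustration.

```python
def relative_kernel_time(cycle_gain: float, freq_ratio: float) -> float:
    """Kernel time with AVX-512 divided by kernel time with AVX2.

    cycle_gain: fractional reduction in cycle count (assumed interpretation).
    freq_ratio: clock frequency under AVX-512 divided by frequency under AVX2.
    """
    if not 0.0 <= cycle_gain < 1.0:
        raise ValueError("cycle_gain must be in [0, 1)")
    if freq_ratio <= 0.0:
        raise ValueError("freq_ratio must be positive")
    return (1.0 - cycle_gain) / freq_ratio

def break_even_freq_ratio(cycle_gain: float) -> float:
    """Frequency ratio at which the kernel takes the same time either way."""
    return 1.0 - cycle_gain

# Average gains reported above: 33 percent (main) and 40 percent (main10).
main_break_even = break_even_freq_ratio(0.33)
main10_break_even = break_even_freq_ratio(0.40)
```

Under this model a kernel stays ahead as long as the frequency ratio remains above the break-even value. Code outside the accelerated kernels gains no cycles, so any frequency drop slows it down directly, which is why the whole-encoder results need to be measured separately.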
Accelerating x265 Encoding with Intel Advanced Vector Extensions 512
In this section, we look at the impact of using Intel AVX-512 kernels for real encoding use cases with x265. Section "Test Setup" describes our test setup including the videos chosen, the x265 presets used, and the system configurations of the test machines. Section "Encoding on Intel® Core™ Processors" presents results on a workstation machine with an Intel Core i9-7900X processor, while section "Encoding on Intel Xeon Scalable Processors" presents results on a typical high-end server CPU that has two Intel Xeon Platinum 8180 processors.
Test setup
Our tests mainly focused on encoding 1080p videos with the main profile and 4K videos with the main10 profile. We used four typical 1080p clips (crowdrun, ducks_take_off, park_joy, and old_town_cross), and three 4K clips (Netflix_Boat, Netflix_FoodMarket, and Netflix_Tango) for our tests.10 Appendix B gives a little more detail, along with screenshots of the videos used. We encode the 1080p clips in the main profile at the following bitrates (in Kbps): 1000, 3000, 5000, and 7000. For the 4K clips, the main10 encodes target the following bitrates (in Kbps): 8000, 10000, 12000, and 14000.
We encode these videos with a version of x265 that has all the kernels described in Section 3; these kernels were contributed as part of the default branch of x265. The kernels are disabled by default and may be enabled with the --asm avx512 option in the x265 command-line interface.
Figure 2. Cycle-count gains of the main and main10 profile Intel® Advanced Vector Extensions 512 kernels over the corresponding Intel® Advanced Vector Extensions 2 kernels.
We focused our experiments on four presets of x265 to represent the wide set of use cases that x265 presents: ultrafast, veryfast, medium, and veryslow. These presets represent a wide variety of trade-offs between encode efficiency and frames per second (FPS). The veryslow preset generates the most efficient encode but is the slowest; this preset is also the preferred choice for any offline encoding use cases such as OTT. The ultrafast preset is the quickest setting of x265 but generates the encode with the lowest efficiency. The veryfast and medium presets represent intermediate points in the trade-off between performance and encoder efficiency. Typically, the more efficient presets employ more tools of HEVC, resulting in more compute-per-pixel than the less efficient presets. This is important to call out as Intel AVX-512 kernels tend to give better speedup when the compute-per-pixel is higher, as shown from the results in the previous section.
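A test matrix like the one described in this section could be generated along the following lines. This is a sketch rather than the scripts used for the paper: the input and output file names are placeholders, and depending on the x265 build and the source material, further options (for example for bit depth) may be required.

```python
from itertools import product
from typing import List

PRESETS = ["ultrafast", "veryfast", "medium", "veryslow"]
BITRATES_4K_KBPS = [8000, 10000, 12000, 14000]

def x265_command(clip: str, preset: str, bitrate_kbps: int,
                 profile: str, avx512: bool) -> List[str]:
    """Build one x265 command line; file names are placeholders."""
    cmd = ["x265", "--input", clip,
           "--preset", preset,
           "--profile", profile,
           "--bitrate", str(bitrate_kbps),
           "--output", f"{clip}.{preset}.{bitrate_kbps}.hevc"]
    if avx512:
        # The AVX-512 kernels are disabled by default and enabled explicitly.
        cmd += ["--asm", "avx512"]
    return cmd

# One 4K clip, main10 profile, every preset and bitrate combination.
commands = [x265_command("Netflix_Boat.y4m", preset, bitrate, "main10", avx512=True)
            for preset, bitrate in product(PRESETS, BITRATES_4K_KBPS)]
```

Building each command once with `avx512=True` and once with `avx512=False` gives the paired runs needed to compare the Intel AVX-512 kernels against the default behavior.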
Encoding on Intel® Core™ Processors
Figure 3 shows the performance of encoding 1080p and 4K video in main and main10 profile with Intel AVX-512 kernels relative to using Intel AVX2 kernels on a workstation-like configuration with an Intel Core i9-7900X processor using a single instance of x265. The full details of the system configuration are described in Appendix C. The single instance results in high utilization of the CPU across all configurations, representing a typical use case for this system when performing HEVC encoding.
Figure 3. Encoder performance from using Intel® Advanced Vector Extensions 512 kernels on a single instance of x265, as measured on a workstation-like system with an Intel® Core™ i9-7900X processor.
From the results, we see that for all profiles and presets, enabling Intel AVX-512 kernels results in positive performance gains. On the Intel Core i9-7900X processor system, our measurements did not indicate any significant reduction in clock frequency. The cycle-count improvements from the kernels therefore translate directly into increased encoder performance. When we examined the relative encoder performance per encode, we found no command lines that demonstrated lower performance with Intel AVX-512 than with Intel AVX2.
We therefore recommend that for the Intel Core i9-7900X processor, and similar systems where the frequency reduction is minimal, Intel AVX-512 kernels be enabled for all encoding profiles and resolutions when using x265.
Encoding on Intel Xeon Scalable Processors
In this section, we present results from using x265 accelerated by Intel AVX-512 on a high-end server configuration with two Intel Xeon Platinum 8180 processors arranged in a dual-socket configuration with 28 hyperthreaded cores per CPU. For full details of the system configuration, refer to Appendix C.
x265 single instance performance using 8 threads and 16 threads
Figure 4 shows the performance of a single instance of x265 with kernels that use Intel AVX-512 for encoding 1080p videos in the main profile and 4K videos in the main10 profile relative to using kernels that only use Intel AVX2 instructions. Two configurations, one with 8 threads per instance and another with 16 threads per instance, are shown in the graph to understand the impact of increasing the number of active cores on the CPU; limiting the number of threads for each instance is done using the --pools option of the x265 library.
The figure shows that for a given thread configuration, the gains when encoding 4K content in the main10 profile are higher than for the 1080p content in the main profile. Also, for a given resolution and profile, the gains that we see from the presets that have more work-per-pixel (the more efficient presets, like veryslow) are higher than those from the faster presets; in fact, for 1080p content in the main profile, we see an average performance loss. These gains are consistent with previously observed results that demonstrate that the more work per pixel a specific configuration performs, the more it benefits from Intel AVX-512. Additionally, when we investigated the S-curves of these profiles (not shown here for brevity), we saw that several encoder command lines outside the 4K main10 veryslow setting lost performance over Intel AVX2.
We therefore recommend using Intel AVX-512-enabled kernels only when doing 4K encodes in the main10 profile with the veryslow preset. For other presets and encoder settings, the amount of work per pixel is insufficient for the cycle-count gains to offset the reduction in clock frequency.
One additional observation we can make from Figure 4 is that the performance gains are in general higher across the board when using 8 threads for the single instance of x265, compared to 16 threads. Upon further analysis, we observe that when more cores are activated with Intel AVX-512 instructions in the Intel Xeon Platinum 8180 processor, the frequency reduces further, resulting in lower gains from using Intel AVX-512 instructions. In a typical server, however, encoder vendors attempt to use all available CPU cores to get the maximum throughput out of the given server.
This use case is explored in Section 4.3.2 where we attempt to saturate the server with 4K main10 encodes to see if the lower frequency when more cores are activated may result in muting the gains.
Figure 4. Relative performance of a single instance of x265 when using Intel® Advanced Vector Extensions 512 kernels with 8 or 16 threads over Intel® Advanced Vector Extensions 2 kernels on a server configuration with two Intel® Xeon® Platinum 8180 processors.
Saturating Intel® Xeon® Platinum 8180 processors using multiple instances of x265
To study whether activating more cores results in performance loss for 4K encodes in the main10 profile, we saturated one and both CPUs of a dual-socket Intel Xeon Platinum 8180 processor-based server with four and eight instances of x265, respectively, with each instance using 16 threads. We measured the total FPS achieved by all x265 instances to encode the same clip at different bitrates when using kernels that use Intel AVX-512 and reported the number relative to when the Intel AVX2-enabled kernels were used. Figure 5 shows these results.
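The two quantities involved in this experiment can be sketched as follows: the aggregate throughput ratio across instances, and how the requested thread count compares with the logical CPUs available. The function names are hypothetical, and the sketch assumes two hardware threads per hyperthreaded core.

```python
from typing import Sequence

def relative_throughput(fps_avx512: Sequence[float],
                        fps_avx2: Sequence[float]) -> float:
    """Total FPS over all instances with AVX-512, relative to the AVX2 total."""
    total_avx2 = sum(fps_avx2)
    if total_avx2 <= 0.0:
        raise ValueError("AVX2 total FPS must be positive")
    return sum(fps_avx512) / total_avx2

def thread_demand_ratio(instances: int, threads_per_instance: int,
                        cores: int, threads_per_core: int = 2) -> float:
    """Requested software threads divided by available logical CPUs."""
    return (instances * threads_per_instance) / (cores * threads_per_core)

# Configuration from the text: 16 threads per instance, 28 cores per socket.
single_socket = thread_demand_ratio(4, 16, 28)  # four instances on one socket
dual_socket = thread_demand_ratio(8, 16, 56)    # eight instances on two sockets
```

A ratio at or above 1 means the instances request at least as many threads as there are logical CPUs, which is the saturated condition this experiment targets.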
Figure 5. Single-socket and dual-socket saturation of the Intel® Xeon® Platinum 8180 processor with x265 instances.
Figure 5 shows that even when saturating one or both CPUs, encoding 4K videos with main10 shows positive performance gains over using the Intel AVX2 counterparts. However, the gains are lower than the corresponding gains achieved with a single instance of x265 that uses fewer cores. Additionally, we observe that for lower efficiency presets such as veryfast and medium, the gains are muted due to the higher frequency drop with more active cores.
These results reiterate our recommendation that Intel AVX-512 kernels should only be enabled when encoding 4K content for the main10 profile for the veryslow preset. For other presets that have lower compute per pixel, enabling Intel AVX-512 kernels may result in a performance loss over using Intel AVX2 kernels.
Conclusions and Future Work
In this paper, we presented our experience with using the Intel AVX-512 instructions available in the newly introduced Intel Xeon Scalable processors to accelerate the open-source HEVC encoder x265. The specific challenges that we had to overcome included selecting the right kernels to accelerate with Intel AVX-512 such that the reduction in CPU frequency was offset by the benefits in cycle count, and choosing the right encoder configuration that enabled the right balance of compute per pixel to achieve positive gains in encoder performance.
Recommendations
Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:
- When choosing specific kernels that can be accelerated with Intel AVX-512, the kernel's compute-to-memory ratio should be considered. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
- For desktop and workstation SKUs (like the Intel Core i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations because the reduction in CPU clock frequency is rather low.
- For server SKUs (like the Intel Xeon Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.
Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.
While the results and recommendations presented in this paper are not without the limitations of the evaluations and our experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.
Future work
The task of accelerating x265 with Intel AVX-512 has opened several avenues for future work. The accelerated kernels are available through the public mailing list. Future extensions of this work to enable further acceleration from Intel AVX-512 include (1) performing a thorough analysis of the use of Intel AVX-512 for videos at other resolutions and presets available in x265, (2) enabling schemes to dynamically enable and disable Intel AVX-512 kernels by monitoring the CPU frequency, and (3) a fundamental re-architecting of the encoder to segregate the worker threads into different types of threads, only some of which may run Intel AVX-512, limiting the number of cores where the CPU frequency drop is observed. We will continue to develop and contribute these solutions to open source, and encourage the reader to also contribute to the project at http://x265.org.
Acknowledgements
This work was funded in part by a non-recurring engineering grant from Intel to MulticoreWare. We would like to thank the various developers and engineers at MulticoreWare for their extensive support throughout this work. In particular, we would like to thank Thomas A. Vaughan for his guidance and Min Chen for his expert comments on the assembly patches.
Appendix A
A1 – Main profile instructions per cycle (IPC) gains
Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain |
---|---|---|---|---|---|---|---|
sad | 0.16% | i422 chroma_vss | 32.70% | i420 chroma_vpp | 23.19% | luma_vss | 43.18% |
pixelavg_pp | 0.87% | luma_vss | 32.89% | addAvg | 23.37% | luma_vss | 43.35% |
i444 chroma_vps | 1.14% | sad_x3 | 33.01% | addAvg | 23.38% | i444 chroma_hpp | 43.43% |
i444 chroma_vps | 1.18% | luma_vps | 33.05% | i444 chroma_hps | 23.53% | ssd_s | 43.57% |
pixelavg_pp | 1.41% | i420 chroma_hpp | 33.08% | i420 chroma_hps | 23.77% | luma_hps | 43.68% |
convert_p2s | 1.95% | i444 chroma_hpp | 33.14% | var | 23.95% | luma_vss | 43.75% |
i420 chroma_vps | 2.45% | sad_x4 | 33.14% | i420 chroma_hpp | 24.03% | luma_hps | 43.84% |
i420 chroma_vps | 2.72% | i444 chroma_vss | 33.16% | i422 chroma_vpp | 24.11% | luma_hps | 43.94% |
i422 chroma_hps | 2.83% | i420 chroma_vss | 33.16% | i444 chroma_vss | 24.15% | luma_vsp | 44.06% |
i420 p2s | 3.21% | copy_ps | 33.33% | i420 chroma_vss | 24.15% | luma_vsp | 44.11% |
i444 p2s | 3.21% | i420 copy_ps | 33.33% | i420 chroma_vss | 24.15% | sub_ps | 44.11% |
sad_x3 | 3.29% | i444 chroma_vss | 33.34% | i420 chroma_vps | 24.20% | i444 chroma_hpp | 44.15% |
i420 chroma_vps | 3.62% | i422 chroma_vss | 33.34% | i444 chroma_vpp | 24.20% | convert_p2s | 44.33% |
sad_x4 | 4.50% | i420 chroma_vss | 33.34% | i420 chroma_vpp | 24.20% | i444 chroma_hpp | 44.35% |
sad | 4.62% | i422 copy_ps | 33.43% | sad | 24.21% | luma_vss | 44.42% |
i420 chroma_hps | 4.90% | i444 chroma_vss | 33.43% | i444 chroma_vps | 24.22% | luma_hps | 44.43% |
i420 chroma_hps | 5.19% | i422 chroma_vss | 33.43% | i420 chroma_vps | 24.22% | luma_hpp | 44.48% |
pixel_satd | 5.42% | i420 chroma_hpp | 33.55% | i444 chroma_hps | 24.25% | luma_vpp | 44.54% |
i444 chroma_vps | 5.43% | i422 chroma_hpp | 33.57% | i420 chroma_hpp | 24.42% | luma_vss | 44.61% |
i422 chroma_hps | 5.82% | dequant_normal | 33.60% | sad_x4 | 24.53% | cpy1Dto2D_shl | 44.61% |
i444 chroma_vps | 6.78% | sad_x4 | 33.62% | i444 chroma_hps | 24.57% | luma_vsp | 44.62% |
dct | 7.06% | i444 chroma_vss | 33.89% | i422 chroma_hps | 24.65% | luma_vsp | 44.66% |
i444 chroma_hps | 7.08% | i420 chroma_vss | 33.89% | psyCost_pp | 24.89% | luma_vss | 44.70% |
i444 chroma_hps | 7.26% | sad_x3 | 33.92% | i422 chroma_vps | 25.00% | luma_vpp | 44.74% |
i422 chroma_vss | 8.85% | i420 pixel_satd | 34.01% | i444 chroma_vss | 25.17% | luma_vsp | 44.85% |
luma_vss | 9.76% | i444 chroma_hps | 34.02% | i422 chroma_vss | 25.17% | i422 copy_sp | 45.20% |
i422 chroma_hps | 10.27% | luma_vps | 34.04% | i420 chroma_vss | 25.17% | getResidual32 | 45.24% |
i444 chroma_hps | 11.00% | i444 chroma_hpp | 34.20% | i422 chroma_vps | 25.66% | luma_vpp | 45.30% |
i444 chroma_hps | 11.14% | i420 pixel_satd | 34.20% | luma_vps | 25.82% | luma_hps | 45.35% |
sad | 11.26% | i420 chroma_hpp | 34.23% | i444 chroma_vps | 25.89% | i444 chroma_hpp | 45.41% |
i420 chroma_hps | 11.38% | i444 chroma_vss | 34.43% | i444 chroma_vps | 25.92% | luma_hpp | 45.49% |
pixel_sa8d | 11.55% | i422 chroma_vss | 34.43% | i420 chroma_hps | 25.95% | convert_p2s | 45.52% |
i444 chroma_hps | 11.91% | i420 chroma_vss | 34.43% | i420 chroma_vps | 26.07% | luma_hps | 45.58% |
luma_vpp | 11.96% | i422 chroma_vsp | 34.59% | convert_p2s | 26.25% | luma_vpp | 45.62% |
i422 chroma_hps | 12.10% | i444 chroma_vss | 34.71% | i422 chroma_vps | 26.42% | convert_p2s | 45.62% |
copy_pp | 12.54% | i444 chroma_vss | 34.76% | i444 chroma_vps | 26.56% | luma_vpp | 45.69% |
ssd_s | 12.58% | addAvg | 34.88% | i444 chroma_vss | 26.71% | cpy2Dto1D_shl | 45.75% |
i420 chroma_vps | 12.58% | addAvg | 35.14% | i422 chroma_vss | 26.71% | i422 addAvg | 45.76% |
i444 chroma_hps | 12.79% | sad | 35.43% | i420 chroma_vss | 26.71% | convert_p2s | 46.00% |
idct | 13.32% | ssd_ss | 35.45% | sad_x4 | 26.80% | i420 add_ps | 46.09% |
luma_vps | 13.78% | i444 chroma_vss | 35.51% | i422 chroma_hpp | 27.06% | add_ps | 46.10% |
i444 chroma_hps | 13.87% | i420 pixel_satd | 35.55% | i422 chroma_hps | 27.13% | luma_vsp | 46.14% |
sad | 13.88% | pixelavg_pp | 35.56% | luma_hpp | 27.15% | luma_hps | 46.29% |
copy_cnt | 14.25% | luma_vpp | 35.62% | i420 pixel_satd | 27.23% | luma_vss | 46.31% |
luma_vpp | 14.28% | luma_vpp | 36.21% | i444 chroma_vss | 27.24% | i444 chroma_vsp | 46.52% |
pixel_satd | 14.45% | i420 chroma_hpp | 36.45% | i422 chroma_vss | 27.24% | i422 chroma_vsp | 46.52% |
idct | 14.49% | i422 chroma_hpp | 36.65% | luma_hpp | 27.29% | i420 chroma_vsp | 46.52% |
pixel_satd | 14.92% | i422 chroma_hpp | 36.76% | luma_vps | 27.45% | luma_hps | 46.65% |
pixel_satd | 14.99% | sad | 36.76% | psyCost_pp | 27.62% | pixelavg_pp | 46.67% |
sad | 15.21% | i422 chroma_hpp | 36.81% | luma_vsp | 27.72% | luma_vss | 46.88% |
idct | 15.23% | copy_pp | 36.82% | i422 chroma_hps | 28.00% | i422 addAvg | 46.88% |
sad_x3 | 15.32% | pixelavg_pp | 36.84% | pixel_satd | 28.50% | luma_hps | 46.90% |
i444 chroma_vpp | 15.47% | convert_p2s | 36.87% | cpy2Dto1D_shl | 28.69% | luma_vsp | 46.97% |
i422 chroma_vpp | 15.47% | i420 p2s | 36.87% | luma_vps | 28.71% | i422 p2s | 47.10% |
i420 chroma_vpp | 15.47% | i444 p2s | 36.87% | i444 chroma_hpp | 28.78% | copy_pp | 47.11% |
pixel_satd | 15.52% | i444 chroma_hpp | 37.07% | i420 pixel_satd | 28.80% | luma_vss | 47.64% |
pixel_satd | 15.62% | luma_vpp | 37.11% | i422 pixel_satd | 28.81% | i444 chroma_hpp | 47.83% |
pixel_satd | 15.66% | luma_vss | 37.49% | i422 pixel_satd | 28.95% | i422 addAvg | 47.85% |
sad_x3 | 15.70% | addAvg | 37.76% | luma_vss | 29.26% | luma_hps | 48.46% |
pixel_satd | 15.75% | i444 chroma_vps | 37.90% | i444 chroma_vss | 29.29% | copy_ps | 48.57% |
i420 chroma_hps | 15.83% | i444 chroma_vss | 38.04% | i420 chroma_hps | 29.42% | sub_ps | 48.83% |
copy_pp | 15.93% | i444 chroma_vps | 38.05% | luma_vpp | 29.43% | luma_hpp | 48.97% |
luma_vpp | 16.10% | i444 chroma_vps | 38.23% | scale1D_128to64 | 29.50% | i422 add_ps | 49.02% |
nquant | 16.33% | sad | 38.42% | luma_vss | 29.59% | i444 chroma_vsp | 49.43% |
sad | 16.35% | i444 chroma_hpp | 38.45% | i444 chroma_vpp | 29.69% | i420 sub_ps | 49.46% |
i444 chroma_vpp | 16.39% | Weight_sp | 38.48% | i422 chroma_vpp | 29.69% | add_ps | 49.50% |
i420 chroma_hps | 16.60% | i444 chroma_hpp | 38.55% | i420 chroma_vpp | 29.69% | i422 sub_ps | 49.52% |
i444 chroma_vpp | 17.02% | sad | 38.56% | i422 chroma_hps | 29.71% | i420 addAvg | 49.74% |
i422 chroma_vpp | 17.02% | luma_hpp | 38.79% | i422 pixel_satd | 29.75% | convert_p2s | 49.75% |
i420 chroma_vpp | 17.02% | pixel_satd | 39.15% | i444 chroma_vpp | 29.82% | i422 p2s | 49.75% |
pixel_satd | 17.08% | luma_hpp | 39.21% | i422 chroma_vpp | 29.82% | i444 p2s | 49.75% |
luma_vps | 17.10% | i444 chroma_hpp | 39.30% | luma_vss | 29.91% | luma_vss | 49.84% |
luma_vps | 17.36% | i444 chroma_vps | 39.39% | i444 chroma_vss | 29.92% | luma_hpp | 50.00% |
i444 chroma_vss | 17.55% | addAvg | 39.51% | i422 chroma_vss | 29.92% | copy_sp | 50.11% |
i420 chroma_vss | 17.55% | i420 chroma_hpp | 39.55% | i420 chroma_vss | 29.92% | luma_vss | 50.22% |
pixel_satd | 17.59% | i422 pixel_satd | 39.57% | luma_vps | 30.19% | luma_hpp | 50.61% |
pixel_satd | 17.66% | i422 chroma_hpp | 39.61% | sad_x4 | 30.24% | luma_hpp | 51.19% |
i444 chroma_vss | 18.42% | convert_p2s | 39.78% | sad | 30.30% | i444 chroma_vsp | 51.23% |
i422 chroma_vss | 18.42% | i420 p2s | 39.78% | luma_vps | 30.37% | luma_hpp | 51.70% |
i420 chroma_vss | 18.42% | i422 p2s | 39.78% | luma_vps | 30.39% | nonPsyRdoQuant | 51.74% |
i444 chroma_vpp | 18.49% | i444 p2s | 39.78% | i444 chroma_vpp | 30.39% | i444 chroma_vsp | 52.08% |
i420 chroma_vpp | 18.49% | copy_sp | 39.93% | i422 chroma_vpp | 30.39% | copy_pp | 52.17% |
luma_vps | 18.50% | i420 addAvg | 40.02% | i420 chroma_vpp | 30.39% | i444 chroma_vsp | 52.22% |
luma_vpp | 18.51% | luma_hps | 40.04% | ssd_ss | 30.44% | i444 chroma_vsp | 52.28% |
sad_x3 | 18.99% | i444 chroma_hpp | 40.07% | i422 chroma_hpp | 30.45% | nonPsyRdoQuant | 52.32% |
copy_pp | 19.76% | addAvg | 40.64% | i420 pixel_satd | 30.53% | i422 copy_ss | 52.45% |
luma_vss | 19.80% | luma_vsp | 40.87% | i422 chroma_vpp | 30.54% | nonPsyRdoQuant | 52.56% |
pixel_satd | 19.89% | i444 chroma_vsp | 40.96% | i444 chroma_hpp | 30.54% | i444 chroma_vsp | 52.77% |
sad | 20.09% | i420 chroma_vsp | 40.96% | i422 chroma_hpp | 30.56% | i422 chroma_vsp | 52.77% |
sad_x3 | 20.26% | luma_vss | 41.01% | i444 chroma_hpp | 30.63% | blockfill_s | 52.93% |
i444 chroma_hps | 20.52% | i420 copy_sp | 41.12% | i420 chroma_hpp | 30.85% | i444 chroma_vsp | 53.30% |
i420 chroma_hps | 20.80% | copy_cnt | 41.14% | luma_vsp | 30.95% | i422 chroma_vsp | 53.30% |
psyCost_pp | 21.15% | luma_vsp | 41.16% | sad_x4 | 30.95% | i420 chroma_vsp | 53.30% |
i444 chroma_hps | 21.17% | Weight_pp | 41.23% | i422 chroma_vss | 30.99% | i422 chroma_vsp | 53.36% |
pixel_satd | 21.19% | luma_hps | 41.42% | i444 chroma_hps | 31.12% | i444 chroma_vsp | 54.34% |
pixel_satd | 21.21% | addAvg | 41.84% | i444 chroma_vpp | 31.17% | i422 chroma_vsp | 54.34% |
quant | 21.23% | i420 addAvg | 41.87% | i444 chroma_vpp | 31.20% | i420 chroma_vsp | 54.34% |
sad_x3 | 21.29% | luma_vsp | 41.99% | sad | 31.29% | psyRdoQuant | 54.44% |
i444 chroma_vpp | 21.42% | luma_hps | 42.05% | luma_vsp | 31.33% | luma_hpp | 54.62% |
i422 chroma_vpp | 21.42% | convert_p2s | 42.13% | sad_x3 | 31.34% | i444 chroma_vsp | 54.64% |
i420 chroma_vpp | 21.42% | i420 p2s | 42.13% | i422 pixel_satd | 31.46% | i420 chroma_vsp | 54.64% |
i420 chroma_vps | 21.60% | i422 p2s | 42.13% | luma_hps | 31.52% | luma_hpp | 54.78% |
pixel_satd | 21.61% | i444 p2s | 42.13% | i444 chroma_vpp | 31.57% | luma_hpp | 55.06% |
i444 chroma_vps | 21.69% | i444 chroma_vsp | 42.31% | pixelavg_pp | 31.62% | luma_hpp | 55.40% |
i422 chroma_hps | 21.99% | i422 chroma_vsp | 42.31% | luma_vps | 31.76% | copy_pp | 55.41% |
i420 addAvg | 22.01% | i420 chroma_vsp | 42.31% | i444 chroma_hps | 31.78% | psyRdoQuant | 55.70% |
luma_vsp | 22.09% | luma_vsp | 42.35% | sad_x3 | 31.95% | psyRdoQuant | 55.72% |
i444 chroma_vps | 22.27% | i420 chroma_hpp | 42.43% | i444 chroma_vss | 31.96% | var | 55.75% |
i422 chroma_vps | 22.41% | nonPsyRdoQuant | 42.51% | i420 chroma_vss | 31.96% | copy_ss | 56.00% |
sad_x4 | 22.44% | luma_hps | 42.54% | i422 chroma_vss | 32.01% | i444 chroma_vsp | 56.36% |
var | 22.51% | addAvg | 42.56% | i444 chroma_hpp | 32.12% | i422 chroma_vsp | 56.36% |
i444 chroma_vpp | 22.64% | luma_hps | 42.58% | var | 32.17% | i420 chroma_vsp | 56.36% |
i420 chroma_vpp | 22.64% | luma_vss | 42.82% | i420 chroma_hpp | 32.32% | i420 copy_ss | 56.63% |
sad_x4 | 22.84% | i422 addAvg | 42.93% | i444 chroma_hps | 32.44% | i444 chroma_vsp | 57.60% |
i444 chroma_vpp | 22.87% | luma_vpp | 42.97% | luma_vsp | 32.61% | i420 chroma_vsp | 57.60% |
i422 chroma_vpp | 22.87% | dequant_scaling | 42.98% | i444 chroma_vss | 32.67% | copy_pp | 58.33% |
sad_x4 | 23.09% | i444 chroma_vsp | 43.05% | i420 chroma_vss | 32.67% | copy_ss | 60.09% |
sad_x4 | 23.09% | i444 chroma_vsp | 43.05% | i444 chroma_vss | 32.69% | psyRdoQuant | 62.80% |
i444 chroma_vpp | 23.19% | i422 chroma_vsp | 43.05% | i422 chroma_vss | 32.69% | i444 chroma_vsp | 62.98% |
| | | | | i420 chroma_vss | 32.69% | i420 chroma_vsp | 62.98% |
A2 – Main10 profile IPC gains
Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain |
---|---|---|---|---|---|---|---|
convert_p2s | 1.26% | i422 chroma_hps | 39.92% | i422 chroma_vpp | 29.64% | i444 chroma_hpp | 49.20% |
i420 p2s | 1.26% | i422 p2s | 40.30% | i420 chroma_vpp | 29.64% | i444 chroma_hps | 49.45% |
i444 p2s | 1.26% | luma_hpp | 40.35% | i444 chroma_vsp | 29.82% | cpy2Dto1D_shl | 49.70% |
addAvg | 1.86% | i422 chroma_hpp | 40.52% | i422 chroma_vsp | 29.82% | luma_hvpp | 49.80% |
addAvg | 6.88% | copy_cnt | 40.55% | i420 chroma_vsp | 29.82% | luma_vss | 49.84% |
dct | 7.06% | luma_vpp | 40.58% | luma_vss | 29.91% | i420 chroma_hps | 49.85% |
sad_x3 | 7.65% | luma_vsp | 40.59% | i444 chroma_vss | 29.92% | convert_p2s | 49.87% |
sad | 7.74% | i444 chroma_vps | 40.60% | i422 chroma_vss | 29.92% | i420 p2s | 49.87% |
sad | 8.29% | i422 chroma_vps | 40.60% | i420 chroma_vss | 29.92% | i422 p2s | 49.87% |
i420 addAvg | 8.36% | i420 chroma_vps | 40.60% | i444 chroma_vps | 29.93% | i422 p2s | 49.87% |
sad_x3 | 8.77% | sad_x3 | 40.64% | i422 chroma_vps | 29.93% | i444 p2s | 49.87% |
luma_vss | 9.76% | nonPsyRdoQuant | 40.70% | i420 chroma_vps | 29.93% | luma_hps | 49.94% |
intra_pred_ang27 | 9.79% | add_ps | 40.71% | luma_vsp | 30.06% | i422 chroma_hps | 50.07% |
cpy2Dto1D_shl | 10.13% | sad_x4 | 40.73% | i444 chroma_vsp | 30.11% | i444 chroma_hpp | 50.13% |
sad_x3 | 10.81% | luma_vpp | 40.73% | i422 chroma_vsp | 30.11% | luma_vss | 50.22% |
sad_x4 | 10.96% | copy_pp | 40.81% | i420 chroma_vsp | 30.11% | luma_hpp | 50.25% |
i420 addAvg | 11.05% | i422 chroma_vps | 40.88% | pixel_satd | 30.30% | i420 chroma_vpp | 50.28% |
pixel_satd | 11.05% | luma_vss | 41.01% | i422 pixel_satd | 30.30% | luma_hps | 50.67% |
i420 pixel_satd | 11.05% | i444 chroma_vsp | 41.02% | i422 pixel_satd | 30.35% | addAvg | 50.67% |
i422 pixel_satd | 11.05% | i420 chroma_vsp | 41.02% | add_ps | 30.69% | i422 addAvg | 50.67% |
luma_vsp | 12.64% | i444 chroma_vsp | 41.05% | sad | 30.94% | luma_hpp | 50.75% |
copy_cnt | 13.29% | i420 chroma_vsp | 41.05% | dequant_normal | 31.10% | i420 chroma_hpp | 50.82% |
idct | 13.32% | sad | 41.06% | sad | 31.37% | copy_pp | 50.95% |
i444 chroma_vps | 14.44% | intra_pred_ang34 | 41.06% | pixel_satd | 31.43% | i422 addAvg | 50.99% |
i422 chroma_vps | 14.44% | convert_p2s | 41.09% | i420 pixel_satd | 31.43% | luma_hps | 51.17% |
i420 chroma_vps | 14.44% | i444 p2s | 41.09% | i422 pixel_satd | 31.43% | i422 chroma_hpp | 51.22% |
idct | 14.49% | nonPsyRdoQuant | 41.21% | i444 chroma_vpp | 31.60% | i444 chroma_hpp | 51.37% |
i444 chroma_vpp | 14.84% | sad_x4 | 41.22% | i422 chroma_vss | 31.76% | luma_hpp | 51.48% |
idct | 15.23% | i422 chroma_vpp | 41.25% | i444 chroma_vss | 31.96% | luma_hps | 51.57% |
luma_vsp | 15.24% | i420 chroma_vpp | 41.25% | i420 chroma_vss | 31.96% | copy_ss | 51.58% |
sad_x3 | 15.53% | i420 chroma_vpp | 41.36% | sad | 31.99% | luma_hpp | 51.63% |
addAvg | 15.60% | i444 chroma_vsp | 41.40% | psyCost_pp | 32.12% | luma_hps | 51.64% |
i422 chroma_vpp | 15.71% | luma_vpp | 41.43% | i420 chroma_hps | 32.32% | luma_hps | 51.65% |
i420 chroma_vpp | 15.71% | luma_hvpp | 41.46% | i422 addAvg | 32.46% | luma_hps | 51.70% |
addAvg | 15.90% | luma_vpp | 41.48% | i422 chroma_vss | 32.62% | luma_hps | 51.81% |
i422 chroma_vpp | 16.07% | i444 chroma_vsp | 41.51% | i444 chroma_vss | 32.67% | i422 chroma_hpp | 51.86% |
intra_pred_ang25 | 16.22% | luma_hvpp | 41.54% | i420 chroma_vss | 32.67% | luma_hps | 51.89% |
nquant | 16.33% | intra_pred_ang11 | 41.55% | i444 chroma_vss | 32.69% | addAvg | 51.89% |
sad_x4 | 16.42% | convert_p2s | 41.58% | i422 chroma_vss | 32.69% | i420 addAvg | 51.89% |
luma_vsp | 16.55% | sad_x4 | 41.71% | i420 chroma_vss | 32.69% | i422 addAvg | 51.89% |
i420 addAvg | 17.12% | sad_x4 | 41.71% | luma_vss | 32.89% | luma_hps | 51.93% |
sad_x4 | 17.33% | luma_vsp | 41.78% | i444 chroma_vsp | 33.14% | luma_hps | 51.99% |
i444 chroma_vss | 17.55% | sad_x4 | 41.83% | i422 chroma_vsp | 33.14% | i444 chroma_hpp | 52.16% |
i420 chroma_vss | 17.55% | i444 chroma_vsp | 42.01% | i444 chroma_vss | 33.16% | i422 copy_sp | 52.45% |
i444 chroma_vps | 17.88% | i444 chroma_vsp | 42.08% | i420 chroma_vss | 33.16% | i422 copy_ps | 52.45% |
i422 chroma_vps | 17.88% | i422 chroma_vsp | 42.08% | convert_p2s | 33.27% | i422 copy_ss | 52.45% |
i420 chroma_vps | 17.88% | nonPsyRdoQuant | 42.13% | i444 chroma_vss | 33.34% | i444 chroma_hps | 52.94% |
pixel_satd | 18.02% | pixelavg_pp | 42.17% | i422 chroma_vss | 33.34% | copy_ss | 53.20% |
i422 addAvg | 18.13% | i422 chroma_vpp | 42.20% | i420 chroma_vss | 33.34% | i420 chroma_hps | 53.22% |
i444 chroma_vss | 18.42% | i420 chroma_vpp | 42.20% | i444 chroma_vss | 33.43% | i422 chroma_hps | 53.27% |
i422 chroma_vss | 18.42% | luma_vps | 42.30% | i422 chroma_vss | 33.43% | i420 chroma_hpp | 53.48% |
i420 chroma_vss | 18.42% | sub_ps | 42.52% | pixelavg_pp | 33.45% | copy_pp | 53.53% |
addAvg | 19.50% | luma_vsp | 42.55% | pixel_satd | 33.45% | i422 chroma_hpp | 53.81% |
i444 chroma_vps | 19.54% | luma_hvpp | 42.65% | i420 pixel_satd | 33.45% | i422 chroma_hpp | 53.89% |
i422 chroma_vps | 19.54% | pixelavg_pp | 42.65% | addAvg | 33.46% | i444 chroma_hpp | 54.31% |
i420 chroma_vps | 19.54% | luma_vps | 42.72% | luma_vsp | 33.47% | ssd_ss | 54.69% |
sad_x3 | 19.75% | convert_p2s | 42.77% | sad_x4 | 33.51% | i422 chroma_hpp | 54.77% |
luma_vss | 19.80% | luma_vss | 42.82% | i444 chroma_vsp | 33.79% | i420 chroma_hpp | 55.18% |
i422 pixel_satd | 19.95% | luma_vsp | 43.05% | i422 chroma_vsp | 33.79% | luma_hpp | 55.53% |
pixel_satd | 20.02% | convert_p2s | 43.11% | i420 chroma_vsp | 33.79% | i444 chroma_hpp | 55.56% |
i420 pixel_satd | 20.02% | i444 chroma_hpp | 43.15% | i444 chroma_vss | 33.89% | i444 chroma_hpp | 55.78% |
i422 pixel_satd | 20.02% | luma_vsp | 43.17% | i420 chroma_vss | 33.89% | i444 chroma_hpp | 55.94% |
i444 chroma_vps | 20.09% | luma_vss | 43.18% | luma_vsp | 34.08% | luma_hpp | 55.96% |
i420 chroma_vps | 20.09% | luma_vsp | 43.22% | sub_ps | 34.13% | copy_sp | 56.00% |
i422 chroma_vss | 20.53% | luma_hvpp | 43.24% | i444 chroma_vsp | 34.18% | copy_ps | 56.00% |
sad_x4 | 20.69% | luma_vss | 43.35% | i420 chroma_vsp | 34.18% | i444 chroma_hpp | 56.07% |
i444 chroma_vps | 20.86% | luma_vsp | 43.36% | i444 chroma_vsp | 34.22% | luma_hpp | 56.16% |
i422 chroma_vps | 20.86% | i420 chroma_hpp | 43.38% | i422 chroma_vsp | 34.22% | i420 copy_sp | 56.63% |
i444 chroma_vpp | 20.98% | cpy1Dto2D_shl | 43.50% | i420 chroma_vsp | 34.22% | i420 copy_ps | 56.63% |
quant | 21.23% | luma_vsp | 43.50% | i444 chroma_vss | 34.43% | i420 copy_ss | 56.63% |
i422 chroma_vpp | 21.45% | luma_vpp | 43.51% | i422 chroma_vss | 34.43% | i422 chroma_hpp | 57.32% |
sad | 21.61% | copy_pp | 43.54% | i420 chroma_vss | 34.43% | i444 chroma_hps | 57.33% |
i444 chroma_vpp | 21.78% | luma_hvpp | 43.57% | pixel_satd | 34.59% | luma_hpp | 57.40% |
i444 chroma_vps | 22.06% | luma_vpp | 43.58% | i444 chroma_vss | 34.71% | i420 chroma_hps | 57.97% |
i420 chroma_vps | 22.06% | luma_hvpp | 43.60% | i444 chroma_vss | 34.76% | luma_hpp | 58.55% |
i444 chroma_vsp | 22.12% | luma_vss | 43.75% | intra_pred_ang10 | 34.76% | i444 chroma_hps | 59.21% |
i422 chroma_vsp | 22.12% | luma_vps | 43.77% | i444 chroma_vps | 34.80% | i420 chroma_hps | 59.46% |
i420 chroma_vsp | 22.12% | i444 chroma_vsp | 43.80% | i444 chroma_vps | 34.98% | blockfill_s | 59.53% |
i444 chroma_vsp | 22.14% | i420 chroma_vsp | 43.80% | luma_vps | 35.07% | luma_hpp | 59.56% |
i422 chroma_vsp | 22.14% | pixelavg_pp | 43.94% | i444 chroma_vps | 35.34% | i422 chroma_hps | 59.75% |
i420 chroma_vsp | 22.14% | psyRdoQuant | 44.02% | Weight_pp | 35.37% | copy_sp | 60.09% |
i422 chroma_vpp | 22.28% | sad_x3 | 44.17% | i444 chroma_vss | 35.51% | copy_ps | 60.09% |
i420 chroma_vpp | 22.28% | pixelavg_pp | 44.23% | luma_vps | 35.63% | luma_hps | 60.23% |
i444 chroma_vpp | 22.28% | luma_hvpp | 44.24% | i422 chroma_hps | 35.68% | psyRdoQuant | 60.25% |
i422 chroma_vpp | 22.35% | luma_hvpp | 44.28% | i444 chroma_vps | 36.38% | luma_hpp | 60.26% |
ssd_ss | 22.60% | luma_vsp | 44.31% | i422 chroma_vss | 36.56% | i444 chroma_hps | 60.28% |
i444 chroma_vpp | 23.06% | dequant_scaling | 44.37% | sad | 36.66% | i420 chroma_hps | 60.48% |
sad_x4 | 23.09% | convert_p2s | 44.40% | luma_vpp | 36.68% | luma_hps | 60.76% |
luma_vpp | 23.67% | luma_vpp | 44.41% | i444 chroma_vpp | 36.70% | copy_pp | 60.87% |
luma_vpp | 23.82% | luma_vss | 44.42% | luma_vsp | 36.71% | i444 chroma_hps | 60.92% |
i444 chroma_vpp | 23.84% | sad_x4 | 44.42% | sad_x3 | 36.75% | i422 chroma_hps | 61.09% |
i444 chroma_vss | 24.15% | luma_vpp | 44.60% | sad_x4 | 36.78% | luma_hpp | 61.28% |
i422 chroma_vss | 24.15% | luma_vss | 44.61% | pixel_satd | 36.88% | i444 chroma_hpp | 61.38% |
i420 chroma_vss | 24.15% | luma_hvpp | 44.61% | i422 chroma_vpp | 36.91% | luma_hpp | 61.43% |
intra_pred_ang9 | 24.37% | getResidual32 | 44.64% | copy_pp | 36.96% | luma_hpp | 61.44% |
i444 chroma_vpp | 24.41% | luma_hpp | 44.68% | addAvg | 37.08% | i422 chroma_hps | 61.55% |
luma_vpp | 24.48% | luma_vss | 44.70% | sad_x4 | 37.09% | luma_hpp | 61.58% |
i422 addAvg | 24.62% | luma_hvpp | 44.73% | i420 chroma_vpp | 37.29% | luma_hpp | 62.26% |
psyCost_pp | 24.88% | i444 chroma_vsp | 44.76% | i422 chroma_vpp | 37.36% | i422 chroma_hps | 62.31% |
i420 chroma_vpp | 24.90% | i422 chroma_vsp | 44.76% | i420 chroma_vpp | 37.36% | luma_hpp | 62.35% |
i422 chroma_vpp | 25.11% | i420 chroma_vsp | 44.76% | luma_vss | 37.49% | i420 chroma_hpp | 62.39% |
i420 chroma_vpp | 25.11% | sad_x4 | 44.85% | luma_vpp | 37.53% | i420 chroma_hps | 62.39% |
i444 chroma_vps | 25.17% | luma_hvpp | 45.15% | i444 chroma_vps | 37.54% | i444 chroma_hpp | 62.46% |
i422 chroma_vps | 25.17% | luma_vps | 45.19% | i422 chroma_vps | 37.54% | luma_hpp | 62.63% |
i420 chroma_vps | 25.17% | i422 chroma_hpp | 45.23% | i420 chroma_vps | 37.54% | i444 chroma_hps | 62.88% |
i444 chroma_vss | 25.17% | intra_pred_dc | 45.26% | i444 chroma_vpp | 37.59% | i420 chroma_hps | 62.95% |
i422 chroma_vss | 25.17% | sad | 45.31% | i420 chroma_vpp | 37.59% | luma_hpp | 63.07% |
i420 chroma_vss | 25.17% | luma_vps | 45.36% | i444 chroma_vps | 37.59% | i444 chroma_hps | 63.15% |
i422 chroma_vps | 25.28% | psyRdoQuant | 45.40% | i422 chroma_vps | 37.59% | luma_hps | 63.16% |
i444 chroma_vps | 25.97% | i420 add_ps | 45.40% | pixel_satd | 37.60% | i420 chroma_hpp | 63.34% |
i422 chroma_vps | 25.97% | pixelavg_pp | 45.52% | i444 chroma_vps | 37.60% | luma_hpp | 63.61% |
i420 chroma_vps | 25.97% | addAvg | 45.54% | i420 chroma_vps | 37.60% | i420 chroma_hps | 63.85% |
luma_vpp | 26.22% | i420 addAvg | 45.54% | i444 chroma_vsp | 37.66% | luma_hpp | 63.91% |
sad | 26.25% | i422 addAvg | 45.54% | i422 chroma_vps | 37.68% | i420 chroma_hpp | 64.12% |
psyCost_pp | 26.30% | i444 chroma_vsp | 45.57% | i444 chroma_vpp | 37.69% | i444 chroma_hps | 64.15% |
i444 chroma_vsp | 26.38% | i422 chroma_vsp | 45.57% | i444 chroma_vps | 37.71% | i444 chroma_hpp | 64.23% |
i420 chroma_vsp | 26.38% | i420 chroma_vsp | 45.57% | i420 chroma_vps | 37.71% | i422 chroma_hpp | 64.39% |
i420 addAvg | 26.39% | luma_vps | 45.58% | convert_p2s | 37.73% | i422 chroma_hpp | 64.56% |
i422 addAvg | 26.39% | pixelavg_pp | 45.61% | i420 p2s | 37.73% | i444 chroma_hps | 64.84% |
pixel_satd | 26.62% | luma_vps | 45.62% | i422 p2s | 37.73% | i422 chroma_hps | 64.87% |
i444 chroma_vss | 26.71% | luma_vps | 45.64% | i444 p2s | 37.73% | i444 chroma_hpp | 64.92% |
i422 chroma_vss | 26.71% | sad_x3 | 45.65% | i444 chroma_vpp | 37.74% | i420 chroma_hps | 64.93% |
i420 chroma_vss | 26.71% | i422 add_ps | 45.68% | i444 chroma_vpp | 37.76% | i422 chroma_hpp | 65.05% |
luma_vsp | 26.77% | addAvg | 45.72% | addAvg | 37.80% | i444 chroma_hps | 65.06% |
luma_vps | 27.04% | i420 addAvg | 45.72% | i422 chroma_vpp | 37.99% | i420 chroma_hpp | 65.14% |
luma_vpp | 27.10% | pixelavg_pp | 45.80% | i444 chroma_vss | 38.04% | i422 chroma_hps | 65.35% |
i444 chroma_vss | 27.24% | i444 chroma_hpp | 45.95% | i420 chroma_hpp | 38.04% | i422 chroma_hps | 65.63% |
i422 chroma_vss | 27.24% | psyRdoQuant | 45.96% | luma_vps | 38.08% | i444 chroma_hps | 65.72% |
i422 chroma_vps | 27.26% | luma_vsp | 45.97% | i444 chroma_vpp | 38.09% | i422 chroma_hpp | 65.80% |
i420 addAvg | 27.28% | sad | 46.04% | i444 chroma_vpp | 38.27% | i444 chroma_hpp | 65.88% |
i422 addAvg | 27.28% | luma_hvpp | 46.17% | i422 chroma_vpp | 38.27% | i420 chroma_hpp | 65.92% |
addAvg | 27.55% | luma_vss | 46.31% | i444 chroma_hps | 38.30% | i420 chroma_hpp | 65.94% |
i422 chroma_vpp | 27.71% | sad_x3 | 46.36% | intra_pred_ang2 | 38.34% | i444 chroma_hps | 66.03% |
i420 chroma_vpp | 27.71% | sad_x3 | 46.42% | i444 chroma_hps | 38.37% | i422 chroma_hps | 66.03% |
pixel_satd | 27.93% | luma_vps | 46.44% | i444 chroma_vpp | 38.48% | i420 chroma_hps | 66.15% |
ssd_s | 28.04% | luma_hpp | 46.46% | copy_pp | 38.51% | i422 chroma_hpp | 66.20% |
pixel_satd | 28.10% | i444 chroma_vsp | 46.66% | addAvg | 38.54% | i422 chroma_hps | 66.20% |
pixelavg_pp | 28.47% | sad_x3 | 46.71% | nonPsyRdoQuant | 38.57% | i420 chroma_hps | 66.29% |
i420 pixel_satd | 28.54% | luma_hpp | 46.82% | sad_x3 | 38.74% | i422 chroma_hpp | 66.32% |
i422 pixel_satd | 28.54% | luma_vss | 46.88% | sad_x3 | 38.80% | i444 chroma_hpp | 66.38% |
pixel_satd | 28.56% | i422 chroma_hps | 46.99% | sad | 38.84% | i444 chroma_vpp | 66.41% |
i420 pixel_satd | 28.56% | intra_pred_ang26 | 47.26% | Weight_sp | 38.86% | i444 chroma_hps | 66.50% |
i422 pixel_satd | 28.56% | luma_vps | 47.31% | pixel_satd | 38.88% | i444 chroma_vpp | 66.61% |
i444 chroma_vps | 28.75% | luma_hvpp | 47.44% | i420 pixel_satd | 38.88% | i444 chroma_vpp | 66.63% |
luma_vps | 28.78% | pixelavg_pp | 47.50% | copy_pp | 38.96% | i444 chroma_hps | 66.64% |
luma_vps | 28.82% | luma_vss | 47.64% | i422 sub_ps | 39.19% | i444 chroma_hpp | 66.64% |
i422 chroma_hps | 28.86% | luma_vps | 47.69% | i420 sub_ps | 39.34% | i420 chroma_hpp | 66.64% |
i420 chroma_hps | 29.02% | i420 chroma_hpp | 47.78% | i420 chroma_hps | 39.47% | i420 chroma_hpp | 66.65% |
sad_x3 | 29.04% | i422 chroma_hps | 47.82% | luma_vpp | 39.54% | i444 chroma_hps | 66.71% |
i444 chroma_hps | 29.11% | luma_vsp | 47.93% | luma_hvpp | 39.63% | i422 chroma_hpp | 66.71% |
luma_vsp | 29.13% | luma_hvpp | 48.30% | i444 chroma_vps | 39.68% | i444 chroma_hps | 66.75% |
luma_vss | 29.26% | addAvg | 48.40% | i420 chroma_vps | 39.68% | i444 chroma_hps | 66.91% |
i444 chroma_vss | 29.29% | i420 addAvg | 48.40% | luma_hpp | 39.72% | i422 chroma_hpp | 66.92% |
luma_vpp | 29.39% | luma_hps | 48.96% | addAvg | 39.77% | i444 chroma_hpp | 67.59% |
luma_vss | 29.59% | luma_hps | 49.05% | convert_p2s | 39.79% | i444 chroma_hpp | 67.78% |
| | | | | i420 p2s | 39.79% | i420 chroma_hpp | 69.14% |
| | | | | i444 p2s | 39.79% | i444 chroma_hpp | 69.23% |
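Tables A1 and A2 pack four (primitive, IPC gain) pairs into each row, which makes them awkward to sort or summarize directly. The following is a minimal Python sketch for flattening such rows into a single list of pairs. The function name `flatten_gain_table` and the variable names are hypothetical helpers introduced here for illustration; they are not part of x265 or of the tooling used to produce these results.

```python
def flatten_gain_table(rows):
    """Turn 'primitive | gain% | primitive | gain% ...' rows into (primitive, gain) pairs."""
    pairs = []
    for row in rows:
        # Drop the optional leading and trailing pipe, then split into cells.
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        # Cells alternate between primitive name and gain percentage.
        for name, gain in zip(cells[0::2], cells[1::2]):
            # Skip padding cells, the header row, and the separator row.
            if name and gain.endswith("%"):
                pairs.append((name, float(gain.rstrip("%"))))
    return pairs


# Two rows copied from table A1.
sample_rows = [
    "sad | 0.16% | i422 chroma_vss | 32.70% | i420 chroma_vpp | 23.19% | luma_vss | 43.18% |",
    "dct | 7.06% | i444 chroma_vss | 33.89% | i422 chroma_hps | 24.65% | luma_vsp | 44.66% |",
]
pairs = flatten_gain_table(sample_rows)
lowest = min(pairs, key=lambda p: p[1])
highest = max(pairs, key=lambda p: p[1])
print(lowest, highest)
```

Because the same primitive name appears many times in each table (the rows do not record the block size behind each entry), the sketch keeps a list of pairs rather than a dictionary keyed by name.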
Appendix B
1080p Test Clips and Bitrates Used
The following 1080p clips were used for generating test results.
park_joy_1080p.y4m
crowd_run_1080p50.y4m
ducks_take_off_1080p50.y4m
old_town_cross_1080p50.y4m
4K Test Clips and Bitrates Used
The following 4K clips were used for generating test results.
Netflix_Boat_4096x2160_60fps_10bit_420.y4m
Netflix_Tango_4096x2160_60fps_10bit_420.y4m
Netflix_FoodMarket_4096x2160_60fps_10bit_420.y4m
Appendix C
Configurations for Testing on Intel® Core™ i7-4500U Processor | |
---|---|
System Attribute | Value |
OS Name | Windows 10 Professional |
Version | 10.0.16299 Build 16299 |
System Model | MS-7A93 |
System Type | x64-based PC |
Processor | Intel® Core™ i7-4500U CPU @ 3.30 GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s) |
Core(s) per socket: | 2 |
Thread(s) per core: | 2 |
Socket(s): | 1 |
NUMA node(s): | 1 |
BIOS | |
BIOS Version/Date | American Megatrends Inc. 1.00, 6/2/2017 |
SMBIOS Version | 3 |
BIOS Mode | UEFI |
Graphic Interface: | |
Version | PCI-Express |
Link Width | x16 |
Max. Supported | x16 |
Memory: | |
Type | DDR3 |
Channel | 1 |
Size | 8 GB |
DRAM Frequency | 800 MHz |
Command Rate (CR) | 2T |
Configurations for Testing on Intel® Core™ i9-7900X Processor | |
---|---|
System Attribute | Value |
OS Name | Microsoft Windows 10 Enterprise |
Version | 10.0.16299 Build 16299 |
System Model | MS-7A93 |
System Type | x64-based PC |
Processor | Intel® Core™ i9-7900X CPU at 3.30 GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s) |
Core(s) per socket: | 10 |
Thread(s) per core: | 2 |
Socket(s): | 1 |
NUMA node(s): | 1 |
BIOS | |
BIOS Version/Date | American Megatrends Inc. 1.00, 6/2/2017 |
SMBIOS Version | 3 |
BIOS Mode | UEFI |
Graphic Interface: | |
Version | PCI-Express |
Link Width | x16 |
Max. Supported | x16 |
Memory: | |
Type | DDR4 |
Channel | 2 |
Size | 32 GB |
DRAM Frequency | 1066.8 MHz |
Command Rate (CR) | 2T |
Configurations for Testing on Intel® Xeon® Platinum 8180 Processor | |
---|---|
System Attribute | Value |
OS Name | CentOS |
Version | 7.2 |
System Model | Intel S4PR1SY2B |
System Type | x86_64 |
Processor | Intel® Xeon® Platinum 8180 CPU at 2.50 GHz |
Core(s) per socket: | 28 |
Thread(s) per core: | 2 |
Socket(s): | 2 |
NUMA node(s): | 2 |
BIOS | |
BIOS Version/Date | SE5C620.86B.0X.01.0007.062120172125 / 06/21/2017 |
SMBIOS Version | 2.8 |
BIOS Mode | UEFI |
Graphic Interface: | |
Version | PCI-Express |
Link Width | x16 |
Max. Supported | x16 |
Memory: | |
Type | DDR4 |
Channel | 2 |
Size | 192 GB |
DRAM Frequency | 1333 MHz |
Command Rate (CR) | 2T |
References
- David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998, p. 751.
- VideoLAN Organization, x264, The best H.264/AVC encoder. https://www.videolan.org/developers/x264.html
- MulticoreWare Inc., x265 HEVC Encoder/H.265 Video Codec. http://x265.org/
- G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
- Intel Corporation, Intel Advanced Vector Instructions 512. https://www.intel.in/content/www/in/en/architecture-and-technology/avx-512-overview.html
- Intel Corporation, "Intel® Xeon® Processor Scalable Family Specification Update", February, 2018. https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
- x265.org
- HandBrake, An Open Source Video Transcoder. https://handbrake.fr/
- FFMPEG, A complete, cross-platform solution to record, convert and stream audio and video.
- MulticoreWare Inc., "x265 Receives Significant Boost from Intel Xeon Scalable Processor Family." http://x265.org/x265-receives-significant-boost-intel-xeon-scalable-processor-family/