Abstract
Video streaming is becoming common practice across many fields and companies. Software vendors must deliver video streams quickly and efficiently while maintaining a high level of content quality. FFmpeg*1 is widely used to meet these requirements for video and audio compression and decompression, and provides a way to use multiple libraries in one package. This white paper showcases Linux* performance improvements for an Intel® Xeon® Scalable processor on video transcoding using FFmpeg with the x264 and x265 libraries, compared to the previous-generation Intel Xeon processor E5-2699 v4.
FFmpeg* Basics
FFmpeg is a free software framework for transcoding multimedia files, including audio and video. It draws on several other libraries and codecs and packages them into one software bundle. Users can either complete a transcode quickly with minimal options, or supply more advanced options and optimizations, depending on their needs. In most cases, a raw input video is passed into FFmpeg, which converts it to a more common file format that a wide range of devices, including smartphones and desktops, can play. FFmpeg does this by decoding the stream into individual frames, which are then encoded and repackaged in the new format. FFmpeg is often used to do this live on the Internet: when you watch a video online, this process is what gets the video to stream on your device. FFmpeg can also combine video and audio, and apply various filters to improve or modify the resulting output video file.
Software vendors large and small use FFmpeg for their multimedia content creation and delivery. FFmpeg's popularity can largely be attributed to it being free software and supporting a wide array of formats and file types, both legacy and new. It is also highly portable across many different hardware configurations and operating system versions. Using FFmpeg instead of using the codecs directly is also advantageous, because a single FFmpeg invocation can produce multiple outputs from one input.
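As a minimal sketch of the multiple-outputs point above, the following Python snippet assembles one ffmpeg command line that reads a single input and writes two encoded outputs. The helper name multi_output_command and the file names are illustrative assumptions; the snippet only builds and prints the command, it does not run ffmpeg.

```python
def multi_output_command(source, outputs):
    """Build one ffmpeg argument list that reads `source` once and writes
    every (encoder, filename) pair in `outputs`.

    Options placed before an output file name apply to that output only,
    so each output gets its own -codec:v setting.
    """
    cmd = ["ffmpeg", "-i", source]
    for encoder, filename in outputs:
        cmd += ["-codec:v", encoder, filename]
    return cmd


# Hypothetical file names, chosen to match the workload scripts in this paper.
cmd = multi_output_command(
    "in.y4m",
    [("libx264", "out.264"), ("libx265", "out.265")],
)
print(" ".join(cmd))
```

Passing the list to subprocess.run would execute it; the input is then decoded once and fed to both encoders.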
Figure 1: FFmpeg transcoding high-level flowchart.
x264 and x265
x264 and x265 are encoder implementations of the H.264 and H.265 encoding standards, respectively. The H.264 standard (also known as Advanced Video Coding, or AVC) was developed first. H.265 (known as High Efficiency Video Coding, or HEVC) is its successor and builds on the same general design. In most cases, H.265 outperforms H.264 by providing extra compression without losing video quality. However, H.265 requires more computing power, and hardware support for it is still being adopted. Because H.264 has been around longer, it is currently supported by most hardware and can be used on a wider range of devices.
Best Practice
Setup guide. The following guide was used for this white paper:
https://trac.ffmpeg.org/wiki/CompilationGuide/Centos
During analysis we installed only Yasm*, libx264, and libx265, and then FFmpeg. Because not all libraries were used, not all of them needed to be enabled when configuring FFmpeg. For this workload, General Public License (GPL) components, libx264, libx265, and non-free libraries were enabled, which corresponds to the --enable-gpl, --enable-libx264, --enable-libx265, and --enable-nonfree configure options.
In FFmpeg, the preset determines how fast the encoding process will be, at the expense of compression efficiency. Put differently, in constant-quality mode, choosing ultrafast makes the encoding run fast, but the file is larger than with medium while the visual quality stays the same. When a target bitrate is fixed, as in the workloads in this paper, the file size is similar across presets and a slower preset instead yields better quality. Valid presets are ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow, and placebo.
Video bitrate is the amount of video data transmitted per unit of time. The higher the bitrate, the better the quality, but the longer the video takes to encode. Together with the preset, the bitrate influences encoding speed and video quality, which matters when choosing parameters for streaming to different target devices and connections.
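To make the bitrate definition concrete, here is a small worked calculation of how a target bitrate translates into stream size. The function name and the 60-second duration are illustrative assumptions, 1 MB is taken as 10^6 bytes, and container overhead and audio are ignored.

```python
def stream_size_megabytes(bitrate_mbps, duration_s):
    """Approximate size of a video stream encoded at a constant bitrate.

    bitrate_mbps: target video bitrate in megabits per second
    duration_s:   clip duration in seconds
    Eight bits make one byte, so megabits / 8 = megabytes.
    """
    return bitrate_mbps * duration_s / 8


# A hypothetical 60-second clip at the x264 bitrates used in this paper.
for mbps in (1, 5, 10, 15):
    print(f"{mbps:>2} Mbps for 60 s is about {stream_size_megabytes(mbps, 60)} MB")
```

At a fixed bitrate the output size depends only on duration, which is why the preset choice in the workloads of this paper affects speed and quality rather than file size.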
How different preset settings and bitrates affect encoding time:
Figure 2: Image from https://trac.ffmpeg.org/wiki/Encode/H.264.
Intel® Xeon® Platinum 8180 Processor versus Intel® Xeon® Processor E5-2699 v4
Platform Test Configurations
Hardware and software
FFmpeg: http://ffmpeg.org/releases/ffmpeg-snapshot.tar.bz2
x264: https://git.videolan.org/
x265: https://bitbucket.org/multicoreware/x265
Table 1: Test systems hardware and software configurations.
Specification | Intel® Xeon® Platinum 8180 processor-based system | Intel® Xeon® processor E5-2699 v4-based system |
---|---|---|
Sockets, cores per socket | 2 sockets, 28 cores | 2 sockets, 22 cores |
#Logical cores (Intel® Hyper-Threading Technology enabled) | 112 | 88 |
Processor base frequency | 2.50 GHz | 2.20 GHz |
Max turbo frequency | 3.80 GHz | 3.60 GHz |
Memory capacity (test systems setup) | 192 GB | 128 GB |
Memory frequency | DDR4 2666 MHz | DDR4 2400 MHz |
Max #of memory channels | 6 | 4 |
Operating system | CentOS* 7.4 | CentOS 7.4 |
TDP (thermal design power) | 205 W | 145 W |
Memory Latency and Bandwidth
The memory latency and bandwidth of both platforms were measured using the default benchmark of Intel® Memory Latency Checker (Intel® MLC) v3.4. The Intel® Xeon® Platinum 8180 processor system shows lower memory latency and higher memory bandwidth than the Intel Xeon processor E5-2699 v4-based platform.
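For context on the bandwidth difference, the theoretical peak memory bandwidth per socket can be estimated from the channel count and memory speed in Table 1. This is a back-of-the-envelope sketch, not a measurement: it assumes a 64-bit (8-byte) data path per DDR4 channel and treats the table's memory frequency as millions of transfers per second. Measured Intel MLC figures will be lower than these peaks.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth per socket in GB/s.

    channels:             memory channels per socket (Table 1)
    mega_transfers_per_s: memory speed, e.g. 2666 for DDR4 2666
    bytes_per_transfer:   8 bytes, assuming a 64-bit DDR4 channel
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


platinum_8180 = peak_bandwidth_gb_s(6, 2666)  # six channels of DDR4 2666
e5_2699_v4 = peak_bandwidth_gb_s(4, 2400)     # four channels of DDR4 2400

print(f"Platinum 8180: {platinum_8180:.1f} GB/s per socket")
print(f"E5-2699 v4:    {e5_2699_v4:.1f} GB/s per socket")
print(f"Ratio:         {platinum_8180 / e5_2699_v4:.2f}x")
```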
Figure 3: Memory latency and bandwidth comparisons.
FFmpeg Workload2
Transcoding profiles and workloads: speed test, capacity test (channel density):
- Input y4m video source: https://media.xiph.org/video/derf/
- A Y4M (YUV4MPEG2) file is a format recognized by FFmpeg that stores the uncompressed images making up the video frames; it serves here as the raw input to the encoders.
Workload scripts:
ffmpeg -i in.y4m -codec:v libx264 -preset <preset> -b:v <bitrate> -maxrate <bitrate> -bufsize <2*bitrate> -psnr out.264
ffmpeg -i in.y4m -codec:v libx265 -preset <preset> -b:v <bitrate> -maxrate <bitrate> -bufsize <2*bitrate> -psnr out.265
Preset: slow, faster
Bitrate: x264: 1Mbps, 5Mbps, 10Mbps, 15Mbps | x265: 5Mbps, 10Mbps, 15Mbps, 20Mbps
2 Represents a typical video transcoding workload; other usages will have different FFmpeg configurations and settings.
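The workload scripts above leave the preset and bitrate as placeholders. The following sketch expands them into the full test matrix listed above (two presets, four bitrates per encoder). The helper name workload_command and the use of FFmpeg's M suffix for megabits per second (for example 5M) are assumptions made for illustration; the snippet only builds the command lines and does not run them.

```python
PRESETS = ("slow", "faster")
BITRATES_MBPS = {
    "libx264": (1, 5, 10, 15),
    "libx265": (5, 10, 15, 20),
}
EXTENSIONS = {"libx264": "264", "libx265": "265"}


def workload_command(encoder, preset, bitrate_mbps, source="in.y4m"):
    """Fill in one workload script: maxrate equals the target bitrate and
    the rate-control buffer (bufsize) is twice the bitrate."""
    rate = f"{bitrate_mbps}M"
    bufsize = f"{2 * bitrate_mbps}M"
    return [
        "ffmpeg", "-i", source,
        "-codec:v", encoder,
        "-preset", preset,
        "-b:v", rate, "-maxrate", rate, "-bufsize", bufsize,
        "-psnr", f"out.{EXTENSIONS[encoder]}",
    ]


# Every encoder, preset, and bitrate combination used in the tests.
matrix = [
    workload_command(encoder, preset, mbps)
    for encoder, bitrates in BITRATES_MBPS.items()
    for preset in PRESETS
    for mbps in bitrates
]

for cmd in matrix:
    print(" ".join(cmd))
```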
Performance
FFmpeg speed test
The speed test's performance metric is frames per second (FPS), measured while encoding a single y4m input to x264 and x265 files. For the x264 speed test, the performance scaling of the Intel Xeon Platinum 8180 processor over the Intel Xeon processor E5-2699 v4 did not vary much with bitrate. The scaling is similar for all resolutions and presets, except for 4K (3840 x 2160) with the faster preset, where scaling is a little lower. In the x265 speed test, the scaling is much higher than for similar x264 workloads, and the bitrate makes a bigger difference in performance.
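The FPS metric and the scaling between the two platforms reduce to two simple ratios, sketched below. The function names are illustrative, and the frame counts and timings are placeholder values chosen only to show the arithmetic; they are not measurements from this paper.

```python
def frames_per_second(frame_count, elapsed_s):
    """Encoding speed: frames encoded divided by wall-clock seconds."""
    return frame_count / elapsed_s


def scaling(fps_new, fps_baseline):
    """Speedup of the new platform relative to the baseline platform."""
    return fps_new / fps_baseline


# Placeholder numbers, NOT results from the tests described here.
fps_new = frames_per_second(600, 10.0)
fps_baseline = frames_per_second(600, 15.0)
print(f"scaling: {scaling(fps_new, fps_baseline):.2f}x")
```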
Figure 4: x264 speed test (converting y4m file to x264).
Figure 5: x265 speed test (converting y4m file to x265).
FFmpeg capacity test (channel density) N-to-N
In the capacity tests, the goal is to run as many simultaneous instances of the speed test as it takes for platform %CPU utilization to exceed 90 percent. We use numactl on both systems, combined with the speed test scripts, to maximize local memory utilization. In both the x264 and x265 workloads, the Intel Xeon Platinum 8180 processor scaled well against the previous-generation platform, although performance scaling is higher with x265 workloads.
Figure 6: Capacity test (run N instances of speed test until %CPU is > 90 percent).
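The capacity test procedure can be sketched as a simple search for the instance count at which utilization crosses the threshold. This is an illustrative outline, not the harness used for the tests: find_capacity and its measure_cpu_percent callback are hypothetical names, and the linear utilization model in the example stands in for launching real FFmpeg instances and sampling platform %CPU.

```python
def find_capacity(measure_cpu_percent, threshold=90.0, max_instances=1024):
    """Return the smallest instance count whose measured platform CPU
    utilization exceeds `threshold` percent.

    measure_cpu_percent: callable taking the number of concurrent
    instances and returning the observed utilization in percent.
    """
    for instances in range(1, max_instances + 1):
        if measure_cpu_percent(instances) > threshold:
            return instances
    raise RuntimeError("threshold not reached within max_instances")


# Toy model for demonstration only: each instance adds 12.5% utilization.
def toy_model(instances):
    return min(100.0, 12.5 * instances)


print(find_capacity(toy_model))
```

In a real run, the callback would start the given number of speed test processes, wait for them to reach steady state, and return a utilization reading from a system monitoring tool.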
Performance Analysis
So why does the Intel® Xeon® Scalable processor system perform better than the previous-generation Intel Xeon processor E5-2699 v4-based platform when transcoding videos? Several improvements in the new architecture contribute: more instructions per cycle, higher turbo frequency, and higher memory bandwidth.
To further understand what's going on with the two platforms, we ran one of the capacity test workloads (launched several instances of FFmpeg encoding a 1920 x 1080 y4m video to x264 until platform %CPU utilization was > 90 percent) and analyzed the CPU activity.
We used numactl --physcpubind [cpus] to bind each instance to specific CPUs on the non-uniform memory access (NUMA) system, so that each core uses local memory as much as possible.
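As a rough illustration of this binding, the sketch below splits the logical CPUs into equal groups and prefixes each FFmpeg instance with a numactl --physcpubind option. The helper names and the even split are assumptions for illustration; the paper does not state which CPUs were assigned to each instance. The sketch also assumes consecutive CPU IDs belong to the same NUMA node, which varies by system and should be checked with numactl --hardware.

```python
def partition_cpus(total_cpus, instances):
    """Split CPU IDs 0..total_cpus-1 into `instances` equal consecutive
    groups. Any remainder CPUs are left unassigned."""
    per_instance = total_cpus // instances
    return [
        list(range(i * per_instance, (i + 1) * per_instance))
        for i in range(instances)
    ]


def numactl_prefix(cpus):
    """Build the numactl arguments that pin a process to `cpus`."""
    cpu_list = ",".join(str(cpu) for cpu in cpus)
    return ["numactl", f"--physcpubind={cpu_list}"]


# Example: 112 logical cores (Table 1) shared by 8 hypothetical instances.
groups = partition_cpus(112, 8)
ffmpeg_cmd = ["ffmpeg", "-i", "in.y4m", "-codec:v", "libx264", "out.264"]
for cpus in groups:
    print(" ".join(numactl_prefix(cpus) + ffmpeg_cmd))
```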
As seen in Figures 7 and 8, the Intel Xeon Platinum 8180 processor not only runs at a higher CPU operating frequency, but also achieves a more efficient (lower) cycles per instruction (CPI) than the Intel Xeon processor E5-2699 v4.
Figure 7: CPU operating frequency.
Figure 8: CPI (lower is better).
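Frequency and CPI combine into a single throughput figure: instructions retired per second equals the operating frequency divided by CPI, so a higher frequency and a lower CPI both raise throughput. The sketch below shows the arithmetic. The function names and counter values are placeholders for illustration, not readings from the test systems.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI from two hardware counter readings taken over the same interval."""
    return cycles / instructions


def instructions_per_second(frequency_hz, cpi):
    """Per-core throughput: cycles per second divided by cycles per instruction."""
    return frequency_hz / cpi


# Placeholder counter values, NOT measurements from this paper.
cpi = cycles_per_instruction(cycles=2_000_000, instructions=2_500_000)
ipc = 1 / cpi  # instructions per cycle is the reciprocal of CPI
throughput = instructions_per_second(2.5e9, cpi)

print(f"CPI {cpi:.2f}, IPC {ipc:.2f}, {throughput / 1e9:.3f} G instructions/s")
```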
Consistent with the Intel MLC results in Figure 3, the memory bandwidth of the Intel Xeon Platinum 8180 processor system, which has six memory channels, is much higher while running one of the capacity test workloads. This means there is more headroom to run more instances (channels) of the transcoding workload simultaneously.
Figure 9: Memory bandwidth while running a workload.
Summary
Video is becoming one of the most popular mediums for delivering or sharing content, whether through live streaming or video on demand. FFmpeg has become a popular framework for video transcoding with codecs such as AVC/x264 and HEVC/x265. The Intel Xeon Scalable processor-based platform provides a performance improvement over the previous-generation Intel Xeon processor E5- v4 family. Improvements in instructions per cycle, higher turbo frequency, and higher memory bandwidth provide a more efficient way to transcode videos.
About the Authors
Meghan Gorman is an Application Engineer in Intel's Software and Services Group, working on application tuning and optimization for Intel® architecture.
Rodolfo De Vega is an Application Engineer in Intel's Software and Services Group, working on application tuning and optimization for Intel architecture.
References
- FFmpeg documentation: https://www.ffmpeg.org/documentation.html
- Compile FFmpeg on CentOS: https://trac.ffmpeg.org/wiki/CompilationGuide/Centos
Related Resources
- Intel Xeon Scalable Processor https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-platform.html
- Intel® VTune™ Amplifier https://software.intel.com/en-us/intel-vtune-amplifier-xe
- Intel® MLC https://software.intel.com/en-us/articles/intelr-memory-latency-checker