Employing performance libraries can be a great way to streamline and unify the computational execution flow for data intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. Here we will describe the two libraries that can be used for signal processing within Intel® System Studio.
Intel® Integrated Performance Primitives (Intel® IPP)
Performance libraries such as Intel IPP contain highly optimized algorithms and code for common functions, including signal processing, image processing, video/audio encode/decode, cryptography, data compression, speech coding, and computer vision. Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored for certain applications. A low-power or embedded application calls Intel IPP as a black box of computation: data flows in and the result flows out. In this fashion, Intel IPP can take the place of many processing units created for specific computational tasks. Intel IPP excels in a wide variety of domains where intelligent systems are utilized.
Without the benefit of highly optimized performance libraries, developers would need to carefully hand-optimize computationally intensive functions themselves to obtain adequate performance. This optimization process is complicated and time consuming, and it must be repeated with each new processor generation. Because intelligent systems often have a long lifetime in the field, hand-optimized functions carry a high maintenance effort.
Signal processing and advanced vector math are the two function domains that are most in demand across the different types of intelligent systems. Frequently, a digital signal processor (DSP) is employed to assist the general purpose processor with these types of computational tasks. A DSP may come with its own well-defined application interface and library function set. However, it is usually poorly suited for general purpose tasks; DSPs are designed to quickly execute basic mathematical operations (add, subtract, multiply, and divide). The DSP repertoire includes a set of very fast multiply and accumulate (MAC) instructions to address matrix math evaluations that appear frequently in convolution, dot product and other multi-operand math operations. The MAC instructions that comprise much of the code in a DSP application are the equivalent of SIMD instruction sets. Like the MAC instructions on a DSP, these instruction sets perform mathematical operations very efficiently on vectors and arrays of data. Unlike a DSP, the Single Instruction Multiple Data (SIMD) instructions are easier to integrate into applications using complex vector and array mathematical algorithms since all computations execute on the same processor and are part of a unified logical execution stream.
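To make the multiply-accumulate pattern concrete, the following minimal sketch writes a dot product and a convolution as plain MAC loops. It is scalar Python meant only to show the arithmetic that DSP MAC instructions or SIMD instructions would execute on several elements at once; the function names are illustrative and do not come from any Intel library.

```python
def dot_mac(a, b):
    """Dot product as a chain of multiply-accumulate (MAC) steps."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # one MAC step: multiply, then add into the accumulator
    return acc


def convolve_mac(signal, kernel):
    """Full linear convolution of two non-empty sequences.

    Every output sample is built up by repeated MAC steps.
    """
    out = [0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k  # MAC into the affected output sample
    return out
```

A vectorized implementation performs the same accumulation, but on a whole register of elements per instruction.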
For example, an algorithm that changes image brightness by adding (or subtracting) a constant value to each pixel of that image must read the RGB values from memory, add (or subtract) the offset, and write the new pixel values back to memory. When using a DSP coprocessor, that image data must be packaged for the DSP (placed in a memory area that is accessible by the DSP), signaled to execute the transformation algorithm, and finally returned to the general-purpose processor. Using a general-purpose processor with SIMD instructions simplifies this process of packaging, signaling, and returning the data set. Intel IPP primitives are optimized to match each SIMD instruction set architecture so that multiple versions of each primitive exist in the library.
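The brightness adjustment described above can be sketched in a few lines. This is a scalar illustration of the read, add, and write-back steps, not the Intel IPP implementation. It assumes 8-bit interleaved RGB data and clamps results to the 0..255 range; the clamping is an assumption added here, since the text does not say how overflow is handled.

```python
def adjust_brightness(pixels, offset):
    """Add `offset` to every 8-bit channel value, clamping to 0..255.

    `pixels` is a bytes-like sequence of interleaved R, G, B values.
    Clamping (saturation) is an assumption of this sketch.
    """
    return bytes(min(255, max(0, value + offset)) for value in pixels)


image = bytes([10, 128, 250, 0, 64, 255])   # two RGB pixels
brighter = adjust_brightness(image, 20)     # positive offset brightens
darker = adjust_brightness(image, -20)      # negative offset darkens
```

A SIMD version applies the same add-and-clamp to many channel values per instruction, with no hand-off to a separate processor.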
Intel IPP can be reused over a wide range of Intel® Architecture based processors, and due to automatic dispatching, the developer’s code base will always pick the execution flow optimized for the architecture in question without having to change the underlying function call (Figure 1). This is especially helpful if an embedded system employs both an Intel® Core™ processor for data analysis/aggregation as well as a series of Intel® Atom™ Processor based SoCs for data pre-processing/collection. In that scenario, the same code base may be used in part on both the Intel® Atom™ Processor based SoCs in the field and the Intel® Core™ processor in the central data aggregation point.
Figure 1: Library Dispatch for Processor Targets
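The dispatching idea can be illustrated with a small table of function variants. This is a conceptual sketch only: the feature names, the variant functions, and `select_variant` are all hypothetical, and the library's real dispatcher detects the processor itself rather than receiving a feature set from the caller.

```python
# Hypothetical feature names, ordered from most to least specialized.
PREFERENCE = ("ssse3", "sse2", "generic")


def _sum_generic(values):
    return sum(values)


def _sum_sse2(values):
    return sum(values)  # stand-in for an SSE2-tuned variant


def _sum_ssse3(values):
    return sum(values)  # stand-in for an SSSE3-tuned variant


VARIANTS = {"generic": _sum_generic, "sse2": _sum_sse2, "ssse3": _sum_ssse3}


def select_variant(cpu_features):
    """Return (name, function) for the best variant the CPU supports.

    The generic variant is always available, so a match is always found.
    """
    supported = set(cpu_features) | {"generic"}
    for name in PREFERENCE:
        if name in supported:
            return name, VARIANTS[name]


# The caller's code is identical on every target; only the selection differs.
name, vector_sum = select_variant({"sse2", "ssse3"})
total = vector_sum([1, 2, 3])
```

Because every variant returns the same result, the application code does not change when it moves between processor targets.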
With specialized SoC components for data streaming and I/O handling combined with a limited user interface, one may think that there are not a lot of opportunities to take advantage of optimizations and/or parallelism, but that is not the case. There is room for
- heterogeneous asymmetric multi-processing (AMP) based on different architectures, and
- symmetric multi-processing (SMP) taking advantage of the Intel® Hyper-Threading Technology and dual-core design used with the latest generation of processors designed for low-power intelligent systems.
Both concepts often coexist in the same SoC. Code with failsafe real-time requirements is protected within its own wrapper managed by a modified round-robin real-time scheduler, while the rest of the operating system (OS) and application layers are managed using standard SMP concepts. Intel Atom Processors contain two Intel Hyper-Threading Technology based cores and may contain an additional two physical cores, resulting in a quad-core system. In addition, Intel Atom Processors support the Intel SSSE3 instruction set. A wide variety of Intel IPP functions, listed at http://software.intel.com/en-us/articles/new-atom-support, are tuned to take advantage of Intel Atom Processor architecture specifics as well as Intel SSSE3.
Figure 2: Intel IPP is tuned to take advantage of the Intel Atom Processor and the Intel SSSE3 instruction set
Throughput-intensive applications can benefit from the use of Intel SSSE3 vector instructions and parallel execution of multiple data streams through the use of extra-wide vector registers for SIMD processing. As just mentioned, modern Intel Atom Processor designs provide up to four virtual processor cores. This fact makes threading an interesting proposition. While there is no universal threading solution that is best for all applications, Intel IPP has been designed to be thread-safe:
- Primitives within the library can be called simultaneously from multiple threads within your application.
- The threading model you choose may have varying granularity.
- Intel IPP functions can directly take advantage of the available processor cores via OpenMP*.
- Intel IPP functions can be combined with OS-level threading using native threads or Intel® Cilk™ Plus.
The quickest way to multithread an application that uses the Intel IPP extensively is to take advantage of the OpenMP* threading built into the library. No significant code rework is required. However, only about 15-20 percent of Intel IPP functions are threaded. In most scenarios it is therefore preferable to also look to higher level threading to achieve optimum results. Since the library primitives are thread safe, the threads can be implemented directly in the application, and the performance primitives can be called directly from within the application threads. This approach provides additional threading control and allows meeting the exact threading needs of the application.
Figure 3: Function level threading and application level threading using the Intel IPP
When applying threading at the application level, it is generally recommended to disable the library’s built-in threading. Doing so eliminates competition for hardware thread resources between the two threading models, and thus avoids oversubscription of software threads for the available hardware threads.
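Application-level threading can be sketched as follows: the application splits its data into blocks and calls a thread-safe primitive from each worker thread. Here `scale_block` is a hypothetical stand-in for a library primitive, and the worker count and block size are arbitrary assumptions. In CPython the interpreter lock limits real parallelism, so this shows the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_block(block, gain):
    """Stand-in for a thread-safe primitive: it touches no shared state."""
    return [sample * gain for sample in block]


def process_in_threads(samples, gain, workers=4, block_size=1024):
    """Split `samples` into blocks and process each block on a worker thread."""
    blocks = [samples[i:i + block_size]
              for i in range(0, len(samples), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves block order, so the output lines up with the input.
        results = list(pool.map(lambda b: scale_block(b, gain), blocks))
    return [sample for block in results for sample in block]
```

Keeping the worker count at or below the number of hardware threads, with the library's own threading disabled, is what avoids the oversubscription described above.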
Intel IPP provides flexibility in linkage models to strike the right balance between portability and footprint management.
Table 1: Intel IPP Linkage Model Comparison
The standard dynamic and dispatched static models are the simplest options to use in building applications with the Intel IPP. The standard dynamic library includes the full set of processor optimizations and provides the benefit of runtime code sharing between multiple Intel IPP-based applications. Detection of the runtime processor and dispatching to the appropriate optimization layer is automatic.
If the number of Intel IPP functions used in your application is small, and the standard shared library objects are too large, using a custom dynamic library may be an alternative.
To optimize for minimal total binary footprint, linking against a non-dispatched static version of the library may be the approach to take. This approach yields an executable containing only the optimization layer required for your target processor. This model achieves the smallest footprint at the expense of restricting your optimization to one specific processor type and one SIMD instruction set. This linkage model is useful when a self-contained application running on only one processor type is the intended goal. It is also the recommended linkage model for use in kernel mode (ring 0) or device driver applications.
Intel IPP addresses both the needs of the native application developer found in the personal computing world and the intelligent system developer who must satisfy system requirements with the interaction between the application layer and the software stack underneath. By taking Intel IPP into the world of middleware, drivers and OS interaction, it can be used for embedded devices. Its limited dependency on OS libraries and its support for flexible linkage models make it simple to add to embedded cross-build environments with popular GNU* based cross-build setups like Poky-Linux* or MADDE*.
Developing for intelligent systems and small form factor devices frequently means that native development is not a feasible option. Intel IPP can be easily integrated with a cross-build environment and be used with cross-build toolchains that accommodate the flow requirements of many of these real-time systems. Use of the Intel IPP allows embedded intelligent systems to take advantage of vector instructions and extra-wide vector registers on the Intel Atom Processor. Developers can also meet determinism requirements without increasing the risks associated with cross-architecture data handshakes of complex SoC architectures.
Developing for embedded small form factor devices also means that applications with deterministic execution flow requirements have to interface more directly with the system software layer and the OS scheduler. Software development utilities and libraries for this space need to be able to work with the various layers of the software stack, whether it is the end-user application or the driver that assists with a particular data stream or I/O interface. Intel IPP has minimal OS dependencies and a well-defined ABI to work with the various modes. One can apply highly optimized functions for embedded signal and multimedia processing across the platform software stack while taking advantage of the underlying application processor architecture and its strengths, all without redesigning and retuning the critical functions with successive hardware platform upgrades.
Intel® Math Kernel Library (Intel® MKL)
Intel MKL includes routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions.
The Intel® Math Kernel Library includes the following groups of routines:
- Basic Linear Algebra Subprograms (BLAS):
  - Vector operations
  - Matrix-vector operations
  - Matrix-matrix operations
- Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)
- LAPACK routines for solving systems of linear equations
- LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations
- Auxiliary and utility LAPACK routines
- ScaLAPACK computational, driver and auxiliary routines
- PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operations
- Direct and Iterative Sparse Solver routines
- Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments
- Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations
- General Fast Fourier Transform (FFT) Functions, providing fast computation of Discrete Fourier Transform via the FFT algorithms
- Tools for solving partial differential equations - trigonometric transform routines and Poisson solver
- Optimization Solver routines for solving nonlinear least squares problems through the Trust-Region (TR) algorithms and computing Jacobi matrix by central differences
- Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface
- Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search
Intel IPP and Intel MKL for Signal Processing
The next question is when to use one Fourier Transform over another with respect to Intel IPP and Intel MKL.
DFT processing time can dominate a software application. Using a fast algorithm, the Fast Fourier transform (FFT), reduces the number of arithmetic operations from O(N^2) to O(N log2 N). Intel MKL and Intel IPP are highly optimized for Intel architecture-based multi-core processors using the latest instruction sets, parallelism, and algorithms.
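The difference in operation count is easy to see in a minimal sketch that puts a direct DFT next to a textbook radix-2 FFT. For N = 1024 the direct form needs on the order of N^2 = 1,048,576 complex multiply-adds, while N log2 N is 10,240. Both functions below are illustrative pure-Python versions and are not the algorithms used inside Intel MKL or Intel IPP.

```python
import cmath


def dft_naive(x):
    """Direct DFT: N output bins, each a sum over N inputs, so O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT, O(N log2 N).

    The input length must be a power of 2.
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n == 0 or n % 2:
        raise ValueError("length must be a power of 2")
    even = fft_radix2(x[0::2])   # transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle            # butterfly, upper half
        out[k + n // 2] = even[k] - twiddle   # butterfly, lower half
    return out
```

Both functions return the same spectrum for power-of-2 lengths; only the amount of work differs.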
Read further to decide which FFT is best for your application.
Table 2: Comparison of Intel MKL and Intel IPP Functionality
Feature | Intel MKL | Intel IPP |
Target Applications | Mathematical applications for engineering, scientific and financial applications | Media and communications applications for audio, video, imaging, speech recognition and signal processing |
Library Structure | | |
Linkage Models | Static, dynamic, custom dynamic | Static, dynamic, custom dynamic |
Operating Systems | Linux* | Linux* |
Processor Support | IA-32 and Intel® 64 architecture-based and compatible platforms, IA-64 | IA-32 and Intel® 64 architecture-based and compatible platforms, IA-64, Intel IXP4xx Processors |
Intel MKL and Intel IPP Fourier Transform Features
The Fourier transforms in Intel MKL and Intel IPP are each aimed at the applications their library targets: engineering and scientific computing for Intel MKL, media and communications for Intel IPP. The table below highlights specifics of each library's Fourier transforms that will help you decide which may be best for your application.
Table 3: Comparison of Intel MKL and Intel IPP DFT Features
Feature | Intel MKL | Intel IPP |
API | DFT | FFT |
Interfaces | C LP64 (64-bit long and pointer) | C |
Dimensions | 1-D up to 7-D | 1-D (Signal Processing) |
Transform Sizes | 32-bit platforms - maximum size is 2^31-1 | FFT - powers of 2 only; DFT - 2^32 maximum size (*) |
Mixed Radix Support | 2, 3, 5, 7 kernels (**) | DFT - 2, 3, 5, 7 kernels (**) |
Data Types (see Table 4 for detail) | Real & Complex | Real & Complex |
Scaling | Transforms can be scaled by an arbitrary floating point number (with the same precision as the input data) | Integer ("fixed") scaling |
Threading | Platform dependent. Can use as many threads as needed on MP systems. | 1D and 2D |
Accuracy | | High accuracy |
Data Types and Formats
The Intel MKL and Intel IPP Fourier transform functions support a variety of data types and formats for storing signal values. Mixed types interfaces are also supported. Please see the product documentation for details.
Table 4: Comparison of Intel MKL and Intel IPP Data Types and Formats
Feature | Intel MKL | Intel IPP |
Real FFTs | | |
Precision | Single, Double | Single, Double |
1D Data Types | Real for all dimensions | Signed short, signed int, float, double |
2D Data Types | Real for all dimensions | Unsigned char, signed int, float |
1D Packed Formats | CCS, Pack, Perm, CCE | CCS, Pack, Perm |
2D Packed Formats | CCS, Pack, Perm, CCE | RCPack2D |
3D Packed Formats | CCE | N/A |
Format Conversion Functions | | |
Complex FFTs | | |
Precision | Single, Double | Single, Double |
1D Data Types | Complex for all dimensions | Signed short, complex short, signed int, complex integer, complex float, complex double |
2D Data Types | Complex for all dimensions | Complex float |
Formats Legend
CCE - stores the values of the first half of the output complex conjugate-even signal
CCS - same as the CCE format for 1D transforms; slightly different for multi-dimensional real transforms. CCS, Pack, and Perm are available for 2D transforms but are not supported for 3D and higher rank
Pack - compact representation of a complex conjugate-symmetric sequence
Perm - same as Pack format for odd lengths, arbitrary permutation of the Pack format for even lengths
RCPack2D - exploits the complex conjugate symmetry of the transformed data to store only half of the resulting Fourier coefficients
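All of these formats rely on the same property: the DFT of a real signal is conjugate-even, so X[N-k] equals the complex conjugate of X[k] and only about half of the bins carry independent information. The sketch below demonstrates that symmetry with a direct DFT. It does not reproduce the exact memory layout of CCE, CCS, Pack, Perm, or RCPack2D, and the helper names are illustrative.

```python
import cmath


def dft(x):
    """Direct DFT, used here only to demonstrate the symmetry."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def half_spectrum(x):
    """Keep bins 0..N/2 of a real signal's DFT (the non-redundant part)."""
    return dft(x)[: len(x) // 2 + 1]


def expand_half_spectrum(half, n):
    """Rebuild all n bins using X[n - k] = conj(X[k])."""
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full
```

For an 8-point real signal, `half_spectrum` keeps 5 bins and `expand_half_spectrum` recovers the remaining 3 without recomputing them.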
Performance
The Intel MKL and Intel IPP are optimized for current and future Intel® processors, and they are specifically tuned for two different usage areas:
- Intel MKL is suitable for large problem sizes
- Intel IPP is specifically designed for smaller problem sizes including those used in multimedia, data processing, communications, and embedded C/C++ applications.
Choosing the Best FFT for Your Application
Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:
- What are the performance requirements for the application? How is performance measured, and what are the measurement criteria? Is a specific benchmark being used? What are the known performance bottlenecks?
- What type of application is being developed? What are the main operations being performed and on what kind of data?
- What API is currently being used in the application for transforms? What programming language(s) is the application code written in?
- Does the FFT output data need to be scaled (normalized)? What type of scaling is required?
- What kind of input and output data does the transform process? What are the valid and invalid values? What type of precision is required?
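The scaling question matters because an unscaled forward transform followed by an unscaled inverse transform returns the input multiplied by N. The following sketch uses a direct DFT with a caller-chosen floating point scale factor to show the effect. The `dft_scaled` function and its parameters are illustrative assumptions, not either library's API.

```python
import cmath


def dft_scaled(x, scale=1.0, inverse=False):
    """Direct DFT with a caller-chosen scale factor on every output bin.

    `inverse=True` flips the sign of the exponent; no scaling is applied
    unless the caller passes one.
    """
    n = len(x)
    sign = 1 if inverse else -1
    return [scale * sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                        for t in range(n))
            for k in range(n)]


signal = [1.0, 2.0, 3.0, 4.0]
spectrum = dft_scaled(signal)
unscaled = dft_scaled(spectrum, inverse=True)                         # N * signal
restored = dft_scaled(spectrum, scale=1 / len(signal), inverse=True)  # signal
```

Applying 1/N on the inverse transform restores the original amplitudes; an application may equally split the factor between the two directions, as long as it does so consistently.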