About the Author
Zhen is an application engineer in the Intel Software and Service Group focusing on Android application development and optimization for x86 devices, as well as Web HTML5 application development. He also has rich experience in mobile application performance tuning and power optimization.
Introduction
Everimaging HaoZhaoPian (Good Photo) is a versatile image processing application. It provides a whole new level of processing experience with High-dynamic-range (HDR) imaging, plus editing, managing, sharing, and other powerful functions.
The team chose to use the Intel® C++ Compiler to recompile the native (NDK) HDR component of this app. The HDR algorithm contains three main modules: OpenCV, merging, and tone mapping. Here, OpenCV and merging were recompiled with the Intel® C++ Compiler, and performance improved by up to 300%1.
This document describes:
- Why the performance improved so much
- What the hotspots in this app are, and the right way to apply Intel® C++ Compiler optimizations
- How the Intel® C++ Compiler helps application developers get the most benefit
Benchmark on HaoZhaoPian
The next table shows the performance results of running the Merging module of the HaoZhaoPian HDR app after compiling with Intel® C++ Compiler.
Feature: HDR Merging Process
Photo Size: 2048x1536
1 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configuration(s): [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance
Figure 1. Intel® C++ Compiler improved performance on the Intel® Medfield* platform by more than 200 percent, compared to less than 35 percent with the typical benchmark.
You may be asking “why?” The next sections will tell you.
HaoZhaoPian Optimization Benchmark Overview
Benchmark testing focused on the two main parts of HDR processing: thumbnail merging and complete photo processing/storage. We ran the software on several platforms (the Intel® architecture version of the app on our PR3 device, the ARM version on an ARM phone, and the Apple version on an iPhone, all with the same or similar release version numbers) and recorded the total time.
We considered several possibilities for the performance improvement. The original version of OpenCV compiled with GCC may have obscure compatibility issues on Intel® architecture that the Intel® C++ Compiler avoids.
Automatic vectorization plays a leading role here because the alignment algorithm (the main hot spot of HDR) in OpenCV benefits a lot from automatic vectorization.
Hotspots of the Key Algorithm
Figure 2. Hotspots in HaoZhaoPian on Intel® Architecture Form Factor Reference Design
Figure 3. Hotspots in HaoZhaoPian on Samsung Galaxy S3*
Two Hotspots in HaoZhaoPian That Benefitted from Intel® C++ Compiler
The first was hmMergeN, a secondary hotspot. Its performance improved 2x. This module, developed by Everimaging, was designed to merge the different elements in HDR processing. It is hard to trace into the source code because the core algorithm in this function is proprietary to Everimaging, so we could not analyze it further.
The second was hmAlign, a main hotspot. Its performance improved 7x.2 hmAlign uses the image alignment algorithm in OpenCV to align two photos shot in HDR mode. It can be traced deeply since OpenCV is an open source project initiated by Intel. The next section is based on this function.
2 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configuration(s): [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance
Analysis of the Key Hotspots in OpenCV
Our OpenCV test environment included the following:
- Software Development Platform: Tablet DV2.1
- Android version: 4.0.4
- Test Application: HelloOpenCV_gcc; HelloOpenCV_icc
- Main Function: Extract SURF features and calculate the total number of SURF features.
- Dependent Library: libopencv.so (ICC version and GCC version)
The include files we used are listed below.
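The original include list did not survive editing. A minimal set for an OpenCV 2.x-era SURF test using the legacy C API would look like the following (treat the exact header names as an assumption; check them against your OpenCV version):

#include <opencv/cv.h>      // core types, image processing, and the legacy SURF API
#include <opencv/highgui.h> // cvLoadImage for reading the test photo
#include <stdio.h>          // printf for reporting the feature count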
The code for the Main Function is shown below.
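The original listing is also missing. A minimal sketch of such a test follows (the image path and code structure are our assumptions, not the app's actual code):

// Load a grayscale image, extract SURF features with the legacy
// OpenCV C API, and report the total feature count.
#include <opencv/cv.h>
#include <opencv/highgui.h>
#include <stdio.h>

int main(void)
{
    // Placeholder path; the benchmark used its own test photo.
    IplImage* img = cvLoadImage("test.jpg", CV_LOAD_IMAGE_GRAYSCALE);
    if (!img) return 1;

    CvMemStorage* storage = cvCreateMemStorage(0);
    CvSeq* keypoints = 0;
    CvSeq* descriptors = 0;

    // Hessian threshold 500, basic (64-element) descriptors.
    CvSURFParams params = cvSURFParams(500, 0);
    cvExtractSURF(img, 0, &keypoints, &descriptors, storage, params);

    printf("Total SURF features: %d\n", keypoints ? keypoints->total : 0);

    cvReleaseMemStorage(&storage);
    cvReleaseImage(&img);
    return 0;
}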
We obtained the following test results:
After trying a variety of functions in OpenCV, we finally found the key function that benefitted the most from recompiling with the Intel® C++ Compiler using default parameters: the cvExtractSURF() function.
What is cvExtractSURF()?
The cvExtractSURF function extracts the most robust features from an image. For each feature it returns its location, size, orientation, and optionally the basic or extended descriptor.
What does cvExtractSURF() do?
The cvExtractSURF() function can be used for object tracking and localization, image stitching, and so on. The following snippet shows its C API declaration and what each parameter is for.
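This declaration is reproduced from the OpenCV 2.x C API (check it against the headers of your OpenCV version; later releases add an optional useProvidedKeyPts argument):

// image       - 8-bit grayscale input image
// mask        - optional 8-bit mask selecting the region of interest
// keypoints   - output sequence of CvSURFPoint structures
// descriptors - optional output sequence of 64- or 128-element descriptors
// storage     - memory storage that holds the output sequences
// params      - detector settings created with cvSURFParams()
void cvExtractSURF( const CvArr* image, const CvArr* mask,
                    CvSeq** keypoints, CvSeq** descriptors,
                    CvMemStorage* storage, CvSURFParams params );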
Using the Intel® VTune™ Analyzer to Chart Hotspots, and Some Other Tips
These screen shots show tracing the hotspots of libOpenCV-GCC using the VTune analyzer.
These screen shots show tracing the hotspots of libOpenCV-ICC using the VTune analyzer. Comparing the two traces reveals which functions dominate CPU time in each build; this is how we isolated the library-level differences described below.
What is Intel® Threading Building Blocks (Intel® TBB)?
Intel® Threading Building Blocks (Intel® TBB) is a C++ template library that Intel developed for writing software programs that take advantage of multi-core processors.
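As a minimal illustration (generic code, not from the app), splitting a loop across cores with tbb::parallel_for looks like this:

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Scale an array in parallel; TBB splits the index range across worker threads.
void scale(float* data, std::size_t n, float k)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [=](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= k;
        });
}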
What is the Intel® Math Kernel Library (Intel® MKL)?
The Intel® Math Kernel Library (Intel® MKL) is a library of optimized math routines for science, engineering, and financial applications. It includes the following (a small usage sketch follows the list):
- BLAS (Basic Linear Algebra Subprograms): vector and matrix operations
- LAPACK (Linear Algebra PACKage): routines for linear systems, least squares, eigenvalue problems, and factorizations
- ScaLAPACK: a distributed-memory version of LAPACK
- Sparse solvers: routines for sparse linear systems
- Fast Fourier transforms
- Vector math: optimized elementary functions over arrays
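For example, a matrix multiply through MKL's standard CBLAS interface (a generic sketch, not code from this app):

#include <mkl.h> // Intel MKL CBLAS interface

// C = A * B for square n-by-n row-major matrices, using MKL's optimized dgemm.
void matmul(int n, const double* A, const double* B, double* C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,  // alpha, A, leading dimension of A
                B, n,       // B, leading dimension of B
                0.0, C, n); // beta, C, leading dimension of C
}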
The next screen shot, captured by the VTune analyzer, shows the CPU time consumed running the GCC version of the app.
The next screen shot shows the CPU time consumed to run the ICC version.
Issues when Using GCC to Compile OpenCV
We believe the GCC build suffered from these issues:
- OpenCV (since v2.0) no longer uses OpenMP for parallelism; it relies on Intel TBB instead.
- Intel TBB does NOT take effect with the default GCC compile settings, so the GCC build loses that parallelism.
- Compared to Intel MKL, libm.so (the system math library) spends a lot of extra time on inter-process communication.
Summary
The next two graphs show the performance differences between OpenCV compiled with GCC and OpenCV compiled with the Intel® C++ Compiler.
OpenCV compiled by GCC vs. OpenCV compiled by ICC
Different Ways to Solve This Issue – Intel TBB
Use Intel TBB to enable parallel code in OpenCV; OpenMP is no longer used. Important note: only Intel TBB 2.2 or later will work. If Intel TBB is installed (using it is optional), turn it on with the WITH_TBB CMake option and adjust the Intel TBB include and library paths if needed. You should see the following message in the CMake output:
USE TBB: YES
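A typical configure command looks like this (the exact variable names depend on your OpenCV version, so treat this as a sketch):

cmake -D WITH_TBB=ON -D TBB_INCLUDE_DIRS=/path/to/tbb/include ..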
Other Ways to Solve This Issue – Intel® Integrated Performance Primitives (IPP)
OpenCV does not require Intel® IPP, but it can be configured to use IPP to make color conversion, Haar training, and DFT functions run faster. Using IPP is optional: if you have IPP installed and want to build OpenCV with IPP support, check that IPP is turned on in the CMake configuration options and that it was detected.
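For example (again a sketch; check the option name for your OpenCV version):

cmake -D WITH_IPP=ON ..

Then verify in the CMake output that IPP was detected before building.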
Further Research
We hope to have time to answer these questions:
- Why does libm take such a long time to do these jobs? Low performance on IPC does not seem to be the only reason.
- Compared to libm, why is Intel MKL so efficient?
- Why does the ARM version have less impact on this issue?
- Are there other performance-critical cases like OpenCV, depending on multi-threading or math libraries, where we can help?
Optimized Intel® C++ Compiler Parameters for X86 NDK Applications
Optimize x86 NDK applications
Compiler flags
To use compiler flags in the ndk-build script, add the following line to the jni/Android.mk file:
LOCAL_CFLAGS := #add here the compiler flags
Be sure to benchmark whether particular flags really improve performance!
Compiler flags: GCC
So the suggested flags for good performance would be: -O3 -ffast-math -mtune=atom -msse3 -mfpmath=sse
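Applied in jni/Android.mk, that is:

LOCAL_CFLAGS := -O3 -ffast-math -mtune=atom -msse3 -mfpmath=sse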
Intel® C/C++ Compiler XE 12.1 for Linux*
The Intel® C++ Compiler for Android is based on the high-performance Intel® C/C++ Compiler XE 12.1 for Linux*, which is widely used by HPC customers to achieve better performance on Intel® architecture and comes with a well-established support infrastructure.
Intel® C++ Compiler for Android can be used for nearly all Android components. It integrates seamlessly into AOSP and can be used with NDK projects and the standalone toolchain.
SIMD Instruction Enhancements, especially on Intel® Atom™ processor
Intel® C++ Compiler Optimizations Specific for the Intel® Atom™ Processor
Use the -xSSSE3_ATOM optimization switch. The optimizations include:
- Supplemental Streaming SIMD Extensions 3 (SSSE3) instruction support
- Use of LEA for stack operations for instruction reordering
- Support for the movbe instruction (-minstruction=movbe)
What you should know about Interprocedural Optimization (IPO)
IPO:
- -O2 and -O3 activate "almost" file-local IPO (-ip)
- Increases compilation time and memory usage
- Works for libraries
- In-lining of functions is the most important feature of IPO, but there is much more
- Extends optimizations across file boundaries
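A minimal illustration of whole-program IPO with the Intel compiler (generic command lines, not this app's build):

icc -O3 -ipo -c module1.c module2.c       # objects carry intermediate representation
icc -O3 -ipo -o app module1.o module2.o   # cross-file optimization happens at link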
Automatic Vectorization
Vectorization uses SIMD (single instruction, multiple data) instructions. The relevant options are -vec (enabled by default with -O2/-O3) and -vec-report[n]. A variety of auto-vectorizing hints and user-mandated pragmas can help the compiler generate effective vector instructions.
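For example, here is a simple loop the compiler can auto-vectorize, with an Intel-specific pragma asserting that iterations are independent (a generic sketch):

// Compile with, e.g., icc -O2 -vec-report2 to see which loops vectorized.
void saxpy(float* y, const float* x, float a, int n)
{
    #pragma ivdep // hint: no loop-carried dependences
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}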
Optimize x86 NDK applications
For good performance, use the following compiler flags with the Intel® C++ Compiler.
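The exact flag list did not survive editing. Based on the switches discussed above, a reasonable set would be the following (treat this as our assumption, and benchmark it):

LOCAL_CFLAGS := -O3 -xSSSE3_ATOM -ipo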
Package the Intel Libraries with Your Application
The Android build system can include third-party libraries in the final apk files. Be sure to copy the Intel compiler libraries libintlc.so, libimf.so, libsvml.so, and libirc.so into your jni folder. Add the following code to your Android.mk file in the jni folder:
include $(CLEAR_VARS)
LOCAL_MODULE := libintlc
LOCAL_SRC_FILES := libintlc.so
include $(PREBUILT_SHARED_LIBRARY)
include $(CLEAR_VARS)
LOCAL_MODULE := libimf
LOCAL_SRC_FILES := libimf.so
include $(PREBUILT_SHARED_LIBRARY)
include $(CLEAR_VARS)
LOCAL_MODULE := libsvml
LOCAL_SRC_FILES := libsvml.so
include $(PREBUILT_SHARED_LIBRARY)
include $(CLEAR_VARS)
LOCAL_MODULE := libirc
LOCAL_SRC_FILES := libirc.so
include $(PREBUILT_SHARED_LIBRARY)
Then load the Intel compiler libraries from Java* code. This needs to be done manually because only libraries from the system lib folder are loaded automatically. The application library needs to be loaded after the Intel compiler libraries.
System.loadLibrary("intlc");
System.loadLibrary("imf"); // libimf.so depends on libintlc.so! Must put after libintlc.so!
System.loadLibrary("svml"); // libsvml.so depends on libintlc.so! Must put after libintlc.so!
System.loadLibrary("irc"); // libirc.so depends on libintlc.so! Must put after libintlc.so!
System.loadLibrary("hello-jni"); // replace with your application library name