Installing Android Things* on Intel® Edison Breakout Board
This document describes how to set up your Intel® Edison Breakout Board with Android Things*.
Installing Android Things* and Unbricking a Sparkfun* Blocks for Intel® Edison Module
This document describes how to set up your Sparkfun* Blocks for Intel® Edison Module with Android Things*.
How to Make Floating Licenses Accessible Across a Local Network
This guide explains how to make the floating license server available to Intel® software development tool client systems for license check-out.
Method 1: Propagate Environment Variable
If you select this method, you need to apply it each time a floating license is installed.
Windows* Network
Make sure PowerShell remoting is enabled.
Execute the following in the PowerShell console:
$comps = @(<list>)
$setenv = {[Environment]::SetEnvironmentVariable("INTEL_LICENSE_FILE", <value>, "Machine")}
If you are a domain admin, then execute this command:
Invoke-Command -Computer $comps -ScriptBlock $setenv
If you are not a domain admin, then execute this command:
foreach ($comp in $comps) { Invoke-Command -Computer $comp -Credential $comp\Administrator -ScriptBlock $setenv }
Here, <list> is the list of computers on the local network, and <value> is the port@server-hostname value provided by the ISL Manager installer.
Linux* Network
Establish passwordless SSH connectivity (this can be done using the sshconnectivity.exp script provided with the ISL Manager).
Execute the following commands:
comps=<list>
for comp in $comps; do
    ssh $comp "echo INTEL_LICENSE_FILE=<value> >> /etc/environment"
done
Method 2: Share Directory
If you choose this method, you only need to apply it once.
Windows Network
Share the directory "C:\Program Files (x86)\Common Files\Intel\Licenses" using the "Properties -> Sharing" context menu item. Then, on each machine in the network, set the INTEL_LICENSE_FILE environment variable to \\<admin_host>\<license_share> as described in Method 1.
Linux Network
On admin host, type
echo "/opt/intel/licenses *(ro)" >> /etc/exports
exportfs -ra
In the command above, "*" can be replaced with more specific wildcard or IP range.
Execute the following code on each machine on the network (for example via SSH, as described in Method 1):
mkdir -p /opt/intel/licenses
echo "<admin_host>:/opt/intel/licenses /opt/intel/licenses nfs ro" >> /etc/fstab
mount -a
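As a quick sanity check for either method, a client machine can verify that the INTEL_LICENSE_FILE variable is visible and that the license server is reachable. The following Python sketch is purely illustrative (it is not part of the ISL Manager tooling) and only assumes the port@server-hostname format described in Method 1:

import os
import socket

# Illustrative client-side check; not part of the ISL Manager tooling.
value = os.environ.get("INTEL_LICENSE_FILE", "")
print("INTEL_LICENSE_FILE = %s" % (value or "<not set>"))

if "@" in value:
    # port@server-hostname form provided by the ISL Manager installer
    port, host = value.split("@", 1)
    try:
        conn = socket.create_connection((host, int(port)), timeout=5)
        conn.close()
        print("License server %s is reachable on port %s" % (host, port))
    except socket.error as err:
        print("Cannot reach %s on port %s: %s" % (host, port, err))
elif value:
    # Path form, e.g. a shared license directory mounted on the client
    print("License path exists: %s" % os.path.exists(value))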
For more information about the floating license server, see https://software.intel.com/en-us/articles/intel-software-license-manager-getting-started-tutorial.
Intel® Aero Ready to Fly Drone
Get your drone applications airborne quickly.1 This quadcopter is a fully assembled development platform that combines the Intel® Aero Compute Board and the Intel® Aero Vision Accessory Kit with the Intel® Aero Flight Controller, GPS, compass, airframe, ESCs, motors, transmitter, and receiver. The only thing needed to start flying is a charged battery.2
Preview of the Intel® Aero Ready to Fly Drone
The Intel® Aero Ready to Fly Drone includes the following:
Intel Aero Compute Board
- Intel® Atom™ x7-Z8750 processor
- 4 GB LPDDR3-1600
- 32 GB eMMC
- MicroSD* memory card slot
- M.2 connector 1 lane PCIe for SSD
- Intel® Dual Band Wireless-AC 8260
- USB 3.0 OTG
- Reprogrammable I/O via Altera® MAX® 10 FPGA
- 8 MP RGB camera (front-facing)
- VGA camera, global shutter, monochrome (down-facing)
- Open source, embedded Linux*, Yocto Project*
- Insyde Software's InsydeH2O* UEFI BIOS, optimized for the Intel® Aero Platform for UAVs
Intel® RealSense™ Camera (R200)
Intel® Aero Flight Controller with Dronecode* PX4* Autopilot
- STM32 microcontroller
- 6 DoF IMU, magnetometer, and altitude sensors
- Connected to the Aero Compute Board over HSUART and communicates using MAVLink* protocol
Pre-assembled Quadcopter
- Carbon fiber airframe
- GPS and compass
- Power distribution board
- 4 electronic speed controllers
- 4 motors
- 8 snap-on propellers
- Spektrum DSMX* serial receiver
- Spektrum DXe transmitter (2.4 GHz DSMX)
Specifications
Name | Measurement |
---|---|
Drone Dimensions - hub-to-hub (diagonal) | 360 mm |
Drone Height - from the base to the top of a GPS antenna | 222 mm |
Propeller - length | 230 mm |
Weight of Drone - basic configuration without battery | 865 g |
Gross Weight (maximum) - takeoff weight | 1900 g 3 |
Flight Time (maximum) - with 4S, 4000 mAh battery, hovering, no added payload | 20 min 3 |
Sustained Wind (maximum) | 15 knots 3 |
Control Distance (maximum) - with supplied remote control | 300 m 3 |
Airspeed (maximum) | 15 m/s 3 |
Altitude of Operation (maximum) - height above sea level | 5000 m 3 |
Outside Air Temperature (minimum / maximum) | 0 °C / +45 °C |
ESC and Motor - designed and manufactured by Yuneec | Modified for Intel® Aero |
1 The Intel Aero Ready to Fly Drone is a kit for developers and is intended to be modified by developers according to their professional judgment. Intel has not established operating limitations for the kit nor tested any configurations other than the base configuration. Developers are responsible for testing and ensuring the safety of their own configurations, and establishing the operating limits of those configurations.
2 Recommended battery: Li-Po, 4S, 4000+ mAh, with XT60 connector
3 Estimated
Intel’s Optimized Tools and Frameworks for Machine Learning and Deep Learning
This article introduces Intel's optimized machine learning and deep learning tools and frameworks, and describes the Intel libraries that have been integrated into them so they can take full advantage of, and run fastest on, Intel® architecture. This information will be useful to first-time users, data scientists, and machine learning practitioners getting started with Intel optimized tools and frameworks.
Introduction
Machine learning (ML) is a subset of the more general field of artificial intelligence (AI). ML is based on a set of algorithms that learn from data. Deep learning (DL) is a specialized ML technique that is based on a set of algorithms that attempt to model high-level abstractions in data by using a graph with multiple processing layers (https://en.wikipedia.org/wiki/Deep_learning).
ML, and in particular DL, are currently used in a growing number of applications and industries, including image and video recognition/classification, face detection, natural language processing, and financial forecasting and prediction.
A convenient way to work with DL is to use Intel's optimized ML and DL frameworks. Using Intel optimized tools and frameworks to train and deploy deep networks ensures that these tools use Intel® architecture in the most efficient way. Examples of how some of these frameworks have been optimized to take advantage of Intel architecture, as well as charts showing the speed-up of these optimized frameworks compared to non-optimized ones, can be found at https://software.intel.com/en-us/ai/deep-learning (see, for example, http://itpeernetwork.intel.com/myth-busted-general-purpose-cpus-cant-tackle-deep-neural-network-training/).
Intel's solution stack for ML and DL spans several layers, as shown in the figure. On top of the hardware, Intel has developed highly optimized math libraries that make the most efficient use of the several families of Intel® processors. Those optimized math libraries are the foundation for the higher-level tools and frameworks used to solve ML and DL problems across different domains.
The next section gives a brief summary of the Intel libraries and tools used to optimize the frameworks. Although these libraries and tools have been used to optimize the ML and DL frameworks for Intel architecture, they can also be used in other applications or software packages that require highly optimized numerical routines able to take advantage of the vectorization, multithreading, and distributed computing capabilities of Intel® hardware.
Intel® software tools for machine learning and deep learning
Intel is actively working with the open source community to ensure that existing and new frameworks are optimized to take advantage of Intel architecture and is optimizing these ML and DL tools by using powerful libraries that provide building blocks to accelerate these tasks.
Intel has developed three libraries that are highly optimized to run on Intel architecture.
- Intel® Math Kernel Library (Intel® MKL) (https://software.intel.com/en-us/intel-mkl) includes a set of highly-optimized performance primitives for DL applications (https://software.intel.com/en-us/node/684759). This library also includes functions that have been highly optimized (vectorized and threaded) to maximize performance on each Intel® processor family. These functions have been optimized for single-core vectorization and cache memory utilization, as well as with automatic parallelism for multi-core and many-core processors.
Intel MKL provides standard C and Fortran APIs for popular math libraries like BLAS, LAPACK, and FFTW, which means no code changes are necessary. Just relinking the application to use Intel MKL will maximize performance on each Intel processor family. This will provide great performance in DL applications with minimum effort.
Intel MKL is optimized for the most recent Intel processors, including Intel® Xeon® and Intel® Xeon Phi™ processors. In particular, it is optimized for Intel® Advanced Vector Extensions 2 and Intel® Advanced Vector Extensions 512 ISAs.
The Intel MKL library can be downloaded for free via the Community Libraries program (https://software.intel.com/sites/campaigns/nest/).
- Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) (https://01.org/mkl-dnn) is an open source performance library for DL applications that can be used to maximize performance on Intel architecture.
This library provides optimized deep neural network primitives for rapid integration into DL frameworks. It welcomes community contributions for new functionality, which can be immediately used in applications ahead of Intel MKL releases. This way, DL scientists and software developers can both contribute to and benefit from this open source library.
- Intel® Data Analytics Acceleration Library is a performance library of highly optimized algorithmic building blocks for all data analysis stages (preprocessing, transformation, analysis, modeling, validation, and decision making). It is designed for use with popular data platforms including Hadoop*, Spark*, R*, and others, for efficient data access. This library is available for free via the Community Licensing for Intel® Performance Libraries (https://software.intel.com/sites/campaigns/nest/).
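For Python users, a quick way to confirm that these optimized libraries are actually behind your numerical stack is to inspect NumPy's build configuration. The snippet below is a minimal sketch; the exact output format varies between NumPy builds, but a NumPy from the Intel® Distribution for Python typically reports MKL-based BLAS/LAPACK libraries.

import numpy as np

# Print the BLAS/LAPACK configuration NumPy was built against.
# On an Intel MKL-backed build, the reported library names typically include "mkl".
np.show_config()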
Deep learning frameworks
Intel’s optimized ML and DL frameworks use the functionality of the libraries described in the previous section. They allow us to perform training and inference in a highly efficient manner using Intel processors.
Intel is actively working on integrating the math libraries into the various frameworks so that users of these frameworks can run their DL training and inference tasks in the most efficient way on Intel processors. For example, Intel® Distribution of Caffe* and Intel® Optimization for Theano* are integrated with the most recent versions of Intel MKL. Intel is also adding multinode capabilities to those frameworks in order to distribute the training workload across nodes, which reduces the total training time.
The interaction between the different libraries and frameworks previously described can be represented visually in the following block diagram, which shows how Intel MKL and Intel MKL-DNN libraries are used as building blocks for the several optimized frameworks.
When working with DL techniques, there are two main steps (a minimal sketch of both follows this list):
- Training: In this step we attempt to create a model based on labeled data (for example, labeled images).
- Inference (also known as scoring): Once a model has been created, it can be deployed to make predictions on new data (for example, to find objects in unlabeled images).
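As a minimal illustration of the two steps (assuming scikit-learn is available in your Python environment; this is a generic example rather than one of the Intel-optimized DL frameworks, but every framework follows the same train-then-predict pattern):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Training: fit a model on labeled data (toy random data used here).
X_train = np.random.rand(1000, 20)
y_train = (X_train[:, 0] > 0.5).astype(int)   # synthetic labels
model = LogisticRegression()
model.fit(X_train, y_train)

# Inference (scoring): apply the trained model to new, unlabeled data.
X_new = np.random.rand(5, 20)
print(model.predict(X_new))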
The Intel optimized ML and DL frameworks allow maximum performance on both steps when running on Intel architecture.
In DL, it is important to use frameworks and tools that have been optimized for the underlying hardware, because DL tasks (either training or inference) require a large amount of computation to complete. Although many of the popular DL frameworks (like Caffe, Theano, and so on) are open source software, their stock versions are not optimized to run efficiently on Intel architecture. You will only get high performance on Intel architecture when you use the versions of these frameworks that Intel has optimized.
Whether you are interested in learning about ML and DL or if you are a data scientist with specific ML or DL tasks to perform, you can benefit from using Intel optimized frameworks. The best way to start your exploration of Intel optimized tools is to visit the Intel® Developer Zone portal for AI, https://software.intel.com/ai, where you can get general information about Intel® technologies supported for AI.
To download Intel optimized frameworks, as well as install documentation and training, you can go to https://software.intel.com/ai/deep-learning.
This webpage contains links to GitHub* pages to download the optimized frameworks, as well as links to varied documentation, videos, and examples.
If you are a software developer interested in creating or optimizing your own DL tools or frameworks, you can look at examples of how Intel's modern code expert engineers have optimized popular frameworks. One example is at https://software.intel.com/videos/getting-the-most-out-of-ia-using-the-caffe-deep-learning-framework.
There you can learn how modern code techniques have been used to optimize the popular Caffe* DL framework, and you can apply those techniques to analyze and optimize your own ML or DL tool or framework to run with maximum efficiency on Intel architecture.
In addition to the Intel-optimized frameworks, Intel is also releasing a DL SDK, which is an integrated environment that allows the user to visualize different aspects of the DL process in real time, as well as handle visual representations of the DL models. Intel plans to continue working on this high-level tool for DL. You can visit https://software.intel.com/en-us/deep-learning-sdk to get more information about this new tool.
Conclusion
The use of data analytics (in particular ML and DL) has become a competitive advantage in many industries. Given the fast pace at which new ML and DL tools are developed, it is important for ML practitioners and data scientists to take advantage of tools and frameworks that have already been optimized and tuned for the underlying hardware, instead of investing time and resources trying to optimize them themselves, or losing ground to long processing times when using non-optimized tools. Intel is actively working with the open source community to optimize ML and DL tools and frameworks, which offer maximum performance on Intel architecture with minimum effort from the user. To start using Intel's optimized tools for your ML and DL needs, visit https://software.intel.com/en-us/ai.
Thread Parallelism in Cython*
Introduction
Cython* is a superset of Python* that additionally supports C functions and C types on variable and class attributes. Cython is used for wrapping external C libraries that speed up the execution of a Python program. Cython generates C extension modules, which are used by the main Python program using the import
statement.
One interesting feature of Cython is that it supports native parallelism (see the cython.parallel
module). The cython.parallel.prange
function can be used for parallel loops; thus one can take advantage of Intel® Many Integrated Core Architecture (Intel® MIC Architecture) using the thread parallelism in Python.
Cython in Intel® Distribution for Python* 2017
Intel® Distribution for Python* 2017 is a binary distribution of Python interpreter, which accelerates core Python packages, including NumPy, SciPy, Jupyter, matplotlib, Cython, and so on. The package integrates Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL), pyDAAL, Intel® MPI Library and Intel® Threading Building Blocks (Intel® TBB). For more information on these packages, please refer to the Release Notes.
The Intel Distribution for Python 2017 can be downloaded here. It is available for free for Python 2.7.x and 3.5.x on OS X*, Windows* 7 and later, and Linux*. The package can be installed as a standalone or with the Intel® Parallel Studio XE 2017.
Intel Distribution for Python supports both Python 2 and Python 3. There are two separate packages available in the Intel Distribution for Python: Python 2.7 and Python 3.5. In this article, the Intel® Distribution for Python 2.7 on Linux (l_python27_pu_2017.0.035.tgz
) is installed on a 1.4 GHz, 68-core Intel® Xeon Phi™ processor 7250 with four hardware threads per core (a total of 272 hardware threads). To install, extract the package content, run the install script, and then follow the installer prompts:
$ tar -xvzf l_python27_pu_2017.0.035.tgz
$ cd l_python27_pu_2017.0.035
$ ./install.sh
After the installation completes, activate the root environment (see the Release Notes):
$ source /opt/intel/intelpython27/bin/activate root
Thread Parallelism in Cython
In Python, there is a mutex (the Global Interpreter Lock, or GIL) that prevents multiple native threads from executing bytecode at the same time. Because of this, threads in pure Python code cannot run in parallel. This section explores thread parallelism in Cython. The parallel functionality is compiled into an extension module that the Python code imports, allowing the Python code to utilize all the cores and threads of the underlying hardware.
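The effect of the GIL on CPU-bound Python code can be seen with a short experiment. The sketch below (plain Python, no Cython) typically shows that two threads take roughly as long as the serial version, because only one thread executes Python bytecode at a time:

import threading
import time

def count(n):
    # CPU-bound work in pure Python bytecode, so the GIL stays contended.
    while n > 0:
        n -= 1

N = 10**7

start = time.time()
count(N)
count(N)
print("serial:   %.2f s" % (time.time() - start))

start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
print("threaded: %.2f s" % (time.time() - start))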
To generate an extension module, one can write Cython code (file with extension .pyx
). The .pyx
file is then compiled by the Cython compiler to convert it into efficient C code (file with extension .c
). The .c
file is in turn compiled and linked by a C/C++ compiler to generate a shared library (.so
file). The shared library can be imported in Python as a module.
In the following multithreads.pyx
file, the function serial_loop
computes log(a)*log(b)
for each entry in the A and B arrays and stores the result in the C array. The log function is imported from the C math library. The NumPy module, the high-performance scientific computation and data analysis package, is used in order to vectorize operations on A and B arrays.
Similarly, the function parallel_loop performs the same computation using OpenMP* threads to execute the loop body. Instead of range, prange (parallel range) is used to allow multiple threads to execute in parallel. prange is a function of the cython.parallel module and can be used for parallel loops. When this function is called, OpenMP starts a thread pool and distributes the work among the threads. Note that the prange function can be used only when the Global Interpreter Lock (GIL) is released by putting the loop in a nogil context (the GIL prevents multiple threads from running Python bytecode concurrently). With wraparound(False), Cython never checks for negative indices; with boundscheck(False), Cython does not perform bounds checking on the arrays.
$ cat multithreads.pyx
cimport cython
import numpy as np
cimport openmp
from libc.math cimport log
from cython.parallel cimport prange
from cython.parallel cimport parallel

THOUSAND = 1024
FACTOR = 100
NUM_TOTAL_ELEMENTS = FACTOR * THOUSAND * THOUSAND

X1 = -1 + 2*np.random.rand(NUM_TOTAL_ELEMENTS)
X2 = -1 + 2*np.random.rand(NUM_TOTAL_ELEMENTS)
Y = np.zeros(X1.shape)

def test_serial():
    serial_loop(X1,X2,Y)

def serial_loop(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i
    for i in range(N):
        C[i] = log(A[i]) * log(B[i])

def test_parallel():
    parallel_loop(X1,X2,Y)

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_loop(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i
    with nogil:
        for i in prange(N, schedule='static'):
            C[i] = log(A[i]) * log(B[i])
After completing the Cython code, the Cython compiler converts it to a C extension file. This can be done with a distutils setup.py file (distutils is used to distribute Python modules). To use OpenMP support, one must tell the compiler to enable OpenMP by providing the -fopenmp flag in the compile and link arguments in the setup.py file, as shown below. The setup.py file invokes the setuptools build process that generates the extension modules. By default, this setup.py uses GNU GCC* to compile the C code of the Python extension. In addition, we add the -O0 compile flag (all optimizations disabled) to create a baseline measurement.
$ cat setup.py
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    name = "multithreads",
    cmdclass = {"build_ext": build_ext},
    ext_modules = [
        Extension("multithreads", ["multithreads.pyx"],
                  extra_compile_args = ["-O0", "-fopenmp"],
                  extra_link_args = ['-fopenmp'])
    ]
)
Use the command below to build C/C++ extensions:
$ python setup.py build_ext --inplace
Alternatively, you can also manually compile the Cython code:
$ cython multithreads.pyx
This generates the multithreads.c
file, which contains the Python extension code. You can compile the extension code with the gcc
compiler to generate the shared object multithreads.so
file.
$ gcc -O0 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing -fopenmp multithreads.c -I/opt/intel/intelpython27/include/python2.7 -L/opt/intel/intelpython27/lib -lpython2.7 -o multithreads.so
After the shared object is generated, Python code can import this module to take advantage of thread parallelism. The following section shows how one can improve its performance.
You can import the timeit
module to measure the execution time of a Python function. Note that by default, timeit
runs the measured function 1,000,000 times. Set the number of execution times to 100 in the following examples for a shorter execution time. Basically, timeit.Timer ()
imports the multithreads
module and measures the time spent by the function multithreads.test_serial()
. The argument number=100
tells the Python interpreter to perform the run 100 times. Thus, t1.timeit(number=100)
measures the time to execute the serial loop (only one thread performs the loop) 100 times.
Similarly, t2.timeit(number=100) measures the time when executing the parallel loop (multiple threads perform the computation in parallel) 100 times.
- Measure the serial loop with the gcc compiler, compiler option -O0 (all optimizations disabled).
$ python
Python 2.7.12 |Intel Corporation| (default, Oct 20 2016, 03:10:12)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
Import timeit and create a timer t1 to measure the time spent in the serial loop. Note that you built with the gcc compiler and disabled all optimizations. The result is displayed in seconds.
>>> import timeit
>>> t1 = timeit.Timer("multithreads.test_serial()","import multithreads")
>>> t1.timeit(number=100)
2874.419779062271
- Measure the parallel loop with the gcc compiler, compiler option -O0 (all optimizations disabled).
The parallel loop is measured by t2
(again, you built with gcc
compiler and disabled all optimizations).
>>> t2 = timeit.Timer("multithreads.test_parallel()","import multithreads")
>>> t2.timeit(number=100)
26.016316175460815
As you observe, the parallel loop improves the performance by roughly a factor of 110x.
- Measure the parallel loop with the icc compiler, compiler option -O0 (all optimizations disabled).
Next, recompile using the Intel® C Compiler and compare the performance. For the Intel® C/C++ Compiler, use the -qopenmp flag instead of -fopenmp to enable OpenMP. After installing Intel Parallel Studio XE 2017, set the proper environment variables and delete all previous builds:
$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
$ rm multithreads.so multithreads.c -r build
To explicitly use the Intel icc
to compile this application, execute the setup.py
file with the following command:
$ LDSHARED="icc -shared" CC=icc python setup.py build_ext –-inplace
The parallel loop is measured by t2
(this time, you built with Intel compiler, disabled all optimizations):
$ python
>>> import timeit
>>> t2 = timeit.Timer("multithreads.test_parallel()","import multithreads")
>>> t2.timeit(number=100)
23.89365792274475
- Measure the parallel loop with the icc compiler, compiler option -O3.
For the third try, you may want to see whether using -O3 optimization and enabling the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA on the Intel® Xeon Phi™ processor can improve the performance. To do this, in the setup.py file, replace -O0 with -O3 and add -xMIC-AVX512. Repeat the compilation, and then run the parallel loop as indicated in the previous step, which results in 21.027512073516846 seconds. The following graph shows the results (in seconds) when compiling with gcc, with icc without optimization enabled, and with icc with optimization and the Intel AVX-512 ISA:
The best time (21.03 seconds) is obtained when you compile the parallel loop with the Intel compiler and enable auto-vectorization (-O3) combined with the Intel AVX-512 ISA (-xMIC-AVX512) for the Intel Xeon Phi processor.
By default, the Intel Xeon Phi processor uses all available resources: it has 68 cores, and each core uses four hardware threads, so a total of 272 threads (four threads/core) run in a parallel region. It is possible to modify the number of cores used and the number of threads running on each core. The last section shows how to use an environment variable to accomplish this.
- To run 68 threads on 68 cores (one thread per core) when executing the loop body 100 times, set the KMP_PLACE_THREADS environment variable as shown below:
$ export KMP_PLACE_THREADS=68c,1t
- To run 136 threads on 68 cores (two threads per core) when running the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as shown below:
$ export KMP_PLACE_THREADS=68c,2t
- To run 204 threads on 68 cores (three threads per core) when running the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as shown below:
$ export KMP_PLACE_THREADS=68c,3t
The following graph summarizes the result:
Conclusion
This article showed how to use Cython to build an extension module for Python in order to take advantage of the multithreading support of the Intel Xeon Phi processor. It showed how to use a setup script to build a shared library, and how the parallel loop performance can be improved by trying different compiler options in the setup script. It also showed how to set a different number of threads per core.
Exploring MPI for Python* on Intel® Xeon Phi™ Processor
Introduction
Message Passing Interface (MPI) is a standardized message-passing library interface designed for distributed memory programming. MPI is widely used in the High Performance Computing (HPC) domain because it is well-suited for distributed memory architectures.
Python* is a modern, powerful interpreted language that supports modules and packages, as well as extensions written in C/C++. While HPC applications are usually written in C or Fortran for speed, Python can be used to quickly prototype a proof of concept and for rapid application development because of its simplicity and modularity support.
The MPI for Python (mpi4py) package provides Python bindings for the MPI standard. The mpi4py package translates MPI syntax and semantics and uses Python objects to communicate, so programmers can implement MPI applications in Python quickly. Note that mpi4py is object-oriented. Not all functions in the MPI standard are available in mpi4py; however, almost all the commonly used functions are. More information on mpi4py can be found here. In mpi4py, COMM_WORLD is an instance of the base class of communicators.
mpi4py supports two types of communication (a short sketch contrasting the two styles follows this list):
- Communication of generic Python objects: the methods of a communicator object are lower-case (send(), recv(), bcast(), scatter(), gather(), etc.). In this type of communication, the sent object is passed as a parameter to the communication call.
- Communication of buffer-like objects: the methods of a communicator object start with an upper-case letter (Send(), Recv(), Bcast(), Scatter(), Gather(), etc.). Buffer arguments to these calls are specified using tuples. This type of communication is much faster than communication of generic Python objects.
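The sketch below contrasts the two styles on two ranks. It is a hypothetical stand-alone script (the file name comm_styles.py is chosen here for illustration) and is separate from the sample application developed later in this article:

# Run with: mpirun -n 2 python comm_styles.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Lower-case API: any picklable Python object can be sent.
    comm.send({"step": 1, "note": "generic object"}, dest=1, tag=11)
    # Upper-case API: buffer-like objects (such as NumPy arrays) are sent
    # directly, avoiding the pickling overhead.
    data = np.arange(8, dtype=np.float64)
    comm.Send([data, MPI.DOUBLE], dest=1, tag=22)
elif rank == 1:
    obj = comm.recv(source=0, tag=11)
    buf = np.empty(8, dtype=np.float64)
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=22)
    print("received: %s and %s" % (obj, buf))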
Intel® Distribution for Python* 2017
Intel® Distribution for Python* is a binary distribution of the Python interpreter; it accelerates core Python packages including NumPy, SciPy, Jupyter, matplotlib, mpi4py, etc. The package integrates Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL), pyDAAL, Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB).
The Intel Distribution for Python 2017 is available free for Python 2.7.x and 3.5.x on OS X*, Windows* 7 and later, and Linux*. The package can be installed standalone or with Intel® Parallel Studio XE 2017.
In the Intel Distribution for Python, mpi4py is a Python wrapper around the native Intel MPI implementation (the Intel MPI Library). This document shows how to write an MPI program in Python, and how to take advantage of Intel® multicore architecture using OpenMP threads and Intel® AVX-512 instructions.
Intel Distribution for Python supports both Python 2 and Python 3. There are two separate packages available in the Intel Distribution for Python: Python 2.7 and Python 3.5. In this example, the Intel Distribution for Python 2.7 on Linux (l_python27_pu_2017.0.035.tgz
) is installed on an Intel® Xeon Phi™ processor 7250 @ 1.4 GHz and 68 cores with 4 hardware threads per core (a total of 272 hardware threads). To install, extract the package content, run the install script, and follow the installer prompts:
$ tar -xvzf l_python27_pu_2017.0.035.tgz
$ cd l_python27_pu_2017.0.035
$ ./install.sh
After the installation completes, activate the root Intel® Python Conda environment:
$ source /opt/intel/intelpython27/bin/activate root
Parallel Computing: OpenMP and SIMD
While multithreaded Python workloads can use Intel TBB optimized thread scheduling, another approach is to use OpenMP to take advantage of Intel multicore architecture. This section shows how to use OpenMP multithreading and the C math library in Cython*.
Cython is a language, similar to Python, that is compiled into native code; it supports C function calls and C-style declaration of variables and class attributes. Cython is used for wrapping external C libraries that speed up the execution of a Python program. Cython generates C extension modules, which are used by the main Python program via the import statement.
For example, to generate an extension module, one can write Cython code (a .pyx file). The .pyx file is then compiled by Cython to generate a .c file containing the code of a Python extension. The .c file is in turn compiled by a C compiler to generate a shared object library (a .so file).
One way to build Cython code is to write a distutils setup.py file (distutils is used to distribute Python modules). In the following multithreads.pyx file, the function vector_log_multiplication computes log(a)*log(b) for each entry in the A and B arrays and stores the result in the C array. Note that a parallel loop (prange) is used to allow multiple threads to execute in parallel. The log function is imported from the C math library. The function getnumthreads() returns the number of OpenMP threads:
$ cat multithreads.pyx
cimport cython
import numpy as np
cimport openmp
from libc.math cimport log
from cython.parallel cimport prange
from cython.parallel cimport parallel

@cython.boundscheck(False)
def vector_log_multiplication(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i
    with nogil, cython.boundscheck(False), cython.wraparound(False):
        for i in prange(N, schedule='static'):
            C[i] = log(A[i]) * log(B[i])

def getnumthreads():
    cdef int num_threads
    with nogil, parallel():
        num_threads = openmp.omp_get_num_threads()
        with gil:
            return num_threads
The setup.py file invokes the setuptools build process that generates the extension modules. By default, this setup.py uses GNU GCC* to compile the C code of the Python extension. In order to take advantage of Intel AVX-512 and OpenMP multithreading on the Intel Xeon Phi processor, one can specify the options -xMIC-avx512 and -qopenmp in the compile and link flags and use the Intel® compiler icc. For more information on how to create the setup.py file, refer to the Writing the Setup Script section of the Python documentation.
$ cat setup.py
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    name = "multithreads",
    cmdclass = {"build_ext": build_ext},
    ext_modules = [
        Extension("multithreads", ["multithreads.pyx"],
                  libraries = ["m"],
                  extra_compile_args = ["-O3", "-xMIC-avx512", "-qopenmp"],
                  extra_link_args = ['-qopenmp', '-xMIC-avx512'])
    ]
)
In this example, the Parallel Studio XE 2017 update 1 is installed. First, set the proper environment variables for the Intel C compiler:
$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
To explicitly use the Intel compiler icc
to compile this application, execute the setup.py file with the following command:
$ LDSHARED="icc -shared" CC=icc python setup.py build_ext –inplace running build_ext cythoning multithreads.pyx to multithreads.c building 'multithreads' extension creating build creating build/temp.linux-x86_64-2.7 icc -fno-strict-aliasing -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector -O3 -fpic -fPIC -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/intel/intelpython27/include/python2.7 -c multithreads.c -o build/temp.linux-x86_64-2.7/multithreads.o -O3 -xMIC-avx512 -march=native -qopenmp icc -shared build/temp.linux-x86_64-2.7/multithreads.o -L/opt/intel/intelpython27/lib -lm -lpython2.7 -o /home/plse/test/v7/multithreads.so -qopenmp -xMIC-avx512
As mentioned above, this process first generates the extension code multithreads.c
. The Intel compiler compiles this extension code to generate the dynamic shared object library multithreads.so
.
How to write a Python Application with Hybrid MPI/OpenMP
In this section, we write an MPI application in Python. This program imports mpi4py
and multithreads
modules. The MPI application uses a communicator object, MPI.COMM_WORLD
, to identify a set of processes which can communicate within the set. The MPI functions MPI.COMM_WORLD.Get_size()
, MPI.COMM_WORLD.Get_rank()
, MPI.COMM_WORLD.send()
, and MPI.COMM_WORLD.recv()
are methods of this communicator object. Note that in mpi4py there is no need to call MPI_Init()
and MPI_Finalize()
as in the MPI standard because these functions are called when the module is imported and when the Python process ends, respectively.
The sample Python application first initializes two large input arrays consisting of random numbers between 1 and 2. Each MPI rank uses OpenMP threads to do the computation in parallel; each OpenMP thread in turn computes the product of two natural logarithms c = log(a)*log(b)
where a and b are random numbers between 1 and 2 (1 <= a,b <= 2
). To do that, each MPI rank calls the vector_log_multiplication
function defined in the multithreads.pyx
file. Execution time of this function is short, about 1.5 seconds. For illustration purposes, we use the timeit
utility to invoke the function ten times just to have enough time to demonstrate the number of OpenMP threads involved.
Below is the application source code mpi_sample.py
:
from mpi4py import MPI
from multithreads import *
import numpy as np
import timeit

def time_vector_log_multiplication():
    vector_log_multiplication(A, B, C)

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

THOUSAND = 1024
FACTOR = 512
NUM_TOTAL_ELEMENTS = FACTOR * THOUSAND * THOUSAND
NUM_ELEMENTS_RANK = NUM_TOTAL_ELEMENTS / size
repeat = 10
numthread = getnumthreads()

if rank == 0:
    print "Initialize arrays for %d million of elements" % FACTOR

A = 1 + np.random.rand(NUM_ELEMENTS_RANK)
B = 1 + np.random.rand(NUM_ELEMENTS_RANK)
C = np.zeros(A.shape)

if rank == 0:
    print "Start timing ..."
    print "Call vector_log_multiplication with iter = %d" % repeat
    t1 = timeit.timeit("time_vector_log_multiplication()", setup="from __main__ import time_vector_log_multiplication", number=repeat)
    print "Rank %d of %d running on %s with %d threads in %d seconds" % (rank, size, name, numthread, t1)
    for i in xrange(1, size):
        rank, size, name, numthread, t1 = MPI.COMM_WORLD.recv(source=i, tag=1)
        print "Rank %d of %d running on %s with %d threads in %d seconds" % (rank, size, name, numthread, t1)
    print "End timing ..."
else:
    t1 = timeit.timeit("time_vector_log_multiplication()", setup="from __main__ import time_vector_log_multiplication", number=repeat)
    MPI.COMM_WORLD.send((rank, size, name, numthread, t1), dest=0, tag=1)
Run the following command line to launch the above Python application with two MPI ranks:
$ mpirun -host localhost -n 2 python mpi_sample.py
Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 2 running on knl-sb2.jf.intel.com with 136 threads in 14 seconds
Rank 1 of 2 running on knl-sb2.jf.intel.com with 136 threads in 15 seconds
End timing ...
While the Python program is running, the top command in a new terminal displays two MPI ranks (shown as two Python processes). When the main module enters the loop (shown with the message “Start timing
…”), the top
command reports almost 136 threads running (~13600 %CPU). This is because by default, all 272 hardware threads on this system are utilized by two MPI ranks, thus each MPI rank has 272/2 = 136 threads.
To get detailed information about MPI at run time, we can set the I_MPI_DEBUG
environment variable to a value ranging from 0 to 1000. The following command runs 4 MPI ranks and sets the I_MPI_DEBUG
to the value 4. Each MPI rank has 272/4 = 68 OpenMP threads as indicated by the top
command:
$ mpirun -n 4 -genv I_MPI_DEBUG 4 python mpi_sample.py
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name             Pin cpu
[0] MPI startup(): 0       84484    knl-sb2.jf.intel.com  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220}
[0] MPI startup(): 1       84485    knl-sb2.jf.intel.com  {17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237}
[0] MPI startup(): 2       84486    knl-sb2.jf.intel.com  {34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254}
[0] MPI startup(): 3       84487    knl-sb2.jf.intel.com  {51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271}
Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 4 running on knl-sb2.jf.intel.com with 68 threads in 16 seconds
Rank 1 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
Rank 2 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
Rank 3 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
End timing ...
We can specify the number of OpenMP threads used by each rank in the parallel region by setting the OMP_NUM_THREADS environment variable. The following command starts 4 MPI ranks with 34 threads for each MPI rank (or 2 threads/core):
$ mpirun -host localhost -n 4 -genv OMP_NUM_THREADS 34 python mpi_sample.py
Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 4 running on knl-sb2.jf.intel.com with 34 threads in 18 seconds
Rank 1 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
Rank 2 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
Rank 3 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
End timing ...
Finally, we can force the program to allocate memory in MCDRAM (the high-bandwidth memory on the Intel Xeon Phi processor). For example, before the execution of the program, the numactl --hardware command shows that the system has 2 NUMA nodes: node 0 consists of the CPUs and 96 GB of DDR4 memory, and node 1 is the on-board 16 GB MCDRAM:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 73585 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10
Run the following command, which indicates allocating memory in MCDRAM if possible:
$ mpirun -n 4 numactl --preferred 1 python mpi_sample.py
While the program is running, we can observe that it allocates memory in MCDRAM (NUMA node 1):
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 73590 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 3428 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10
Readers can also try the above code on an Intel® Xeon® processor system with the appropriate settings: for example, on an Intel® Xeon® processor E5-2690 v4, use -xCORE-AVX2 instead of -xMIC-AVX512 and set the number of available threads to 28 instead of 272. Also note that the E5-2690 v4 does not have high-bandwidth MCDRAM memory. A minimal sketch of such a setup.py variant is shown after this paragraph.
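The sketch below is simply the setup script from the earlier section with -xCORE-AVX2 substituted for -xMIC-avx512, assuming the Intel compiler is still used for the build:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    name = "multithreads",
    cmdclass = {"build_ext": build_ext},
    ext_modules = [
        Extension("multithreads", ["multithreads.pyx"],
                  libraries = ["m"],
                  # -xCORE-AVX2 targets the Intel Xeon processor E5-2690 v4 (AVX2);
                  # the Intel Xeon Phi build used -xMIC-avx512 here instead.
                  extra_compile_args = ["-O3", "-xCORE-AVX2", "-qopenmp"],
                  extra_link_args = ["-qopenmp", "-xCORE-AVX2"])
    ]
)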
Conclusion
This article introduced the MPI for Python package and demonstrated how to use it via the Intel Distribution for Python. Furthermore, it showed how to use OpenMP and Intel AVX-512 instructions in order to take full advantage of the Intel Xeon Phi processor architecture. A simple example showed how one can write a parallel Cython function with OpenMP, compile it with the Intel compiler with the AVX-512 option enabled, and integrate it with an MPI Python program.
References:
- MPI Forum
- MPI for Python
- Intel® Distribution for Python*
- Intel® Parallel Studio XE 2017
- Intel® MPI Library
- Intel® AVX-512 instructions
- OpenMP
- Cython C-Extensions for Python
- Writing the Setup Script
About the Author
Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.
Quick Analysis of Vectorization Using the Intel® Advisor 2017 Tool
In this article we continue our exploration of vectorization on an Intel® Xeon Phi™ processor using examples of loops that we used in a previous article. We will discuss how to use the command-line interface in Intel® Advisor 2017 for a quick, initial analysis of loop performance that gives an overview of the hotspots in the code. This initial analysis can be then followed by more in-depth analysis using the graphical user interface (GUI) in Intel Advisor 2017.
Introduction
Intel has developed several software products aimed at increasing productivity of software developers and helping them to make the best use of Intel® processors. One of these tools is Intel® Parallel Studio XE, which contains a set of compilers and analysis tools that let the user write, analyze and optimize their application on Intel hardware.
In this article, we explore Intel® Advisor 2017, which is one of the analysis tools in the Intel Parallel Studio XE suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.
Why is vectorization important? And how does Intel® Advisor help?
Vector-level parallelism allows software to use special hardware such as vector registers and SIMD (Single Instruction Multiple Data) instructions. Newer Intel® processors, like the Intel® Xeon Phi™ processor, feature 512-bit-wide vector registers which, in conjunction with the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA, allow the use of two vector processing units in each core, each of them capable of processing 16 single-precision (32-bit) or 8 double-precision (64-bit) floating-point numbers.
To further realize the full performance of modern processors, code must be also threaded to take advantage of multiple cores. The multiplicative effect of vectorization and threading will accelerate code more than the effect of only vectorization or threading.
Intel Advisor analyzes our application and reports not only the extent of vectorization but also possible ways to achieve more vectorization and increase the effectiveness of the current vectorization.
Although Intel Advisor works with any compiler, it is particularly effective when applications are compiled using Intel compilers, because Intel Advisor will use the information from the reports generated by Intel compilers.
How to use Intel® Advisor
The most effective way to use Intel Advisor is via the GUI, which gives us access to all the information and recommendations that Intel Advisor collects from our code. Detailed information is available at https://software.intel.com/en-us/intel-advisor-xe-support, including documentation, training materials, and code samples. Product support and access to the Intel Advisor community forum are also available through that link.
Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and generate information in a way that makes it easy to automate analysis tasks, for example using scripts.
When working on an Intel Xeon Phi processor system, which typically runs the Linux* OS, we might need to use a combination of Advisor's GUI and CLI for our specific analysis workflow; in some cases the CLI is a good starting point for a quick view of a performance summary, as well as in the initial phases of our workflow analysis. Detailed information about the Intel Advisor CLI for Linux can be found at https://software.intel.com/en-us/node/634769.
In the next sections, a procedure for a quick initial performance analysis on Linux using the Intel Advisor CLI is described. This quick analysis gives us an idea of the performance bottlenecks in our application and where to focus initial optimization efforts. For testing purposes, this procedure also allows the user to automate testing and results reporting.
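As one example of that kind of automation, the short Python sketch below simply wraps the two advixe-cl commands used later in this article with subprocess; the project directory, search directory, and binary name are placeholders taken from the example that follows:

import subprocess

# Placeholders taken from the example later in this article.
project = "./AdvProj-Example-AVX512"
target = ["./run512", "image01.jpg"]

# Collect a survey, then dump it as a text report.
subprocess.check_call(["advixe-cl", "-collect", "survey",
                       "-project-dir", project,
                       "-search-dir", "all:=./src", "--"] + target)
subprocess.check_call(["advixe-cl", "-report", "survey",
                       "-project-dir", project,
                       "-format=text",
                       "-report-output=./REPORTS/survey-AVX512.txt"])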
This analysis is intended as an initial step and will provide access to only limited information. The full extent of the information and help offered by Intel Advisor is available using a combination of the Intel Advisor GUI and CLI.
Using Intel Advisor on an Intel® Xeon Phi™ processor: Running a quick survey analysis
To illustrate this procedure, I will use the code sample from a previous article that shows vectorization improvements when using the Intel AVX-512 ISA. Details of the source code are discussed in that article. The sample code can be downloaded from here.
This example will be run on the following hardware:
Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272
The first step for a quick analysis is to create an optimized executable that will run on the Intel Xeon Phi processor. For this, we start by compiling our application with a set of options that direct the compiler to create the executable in a way that lets Intel Advisor extract information from it. The options that must be used are -xMIC-AVX512, which enables the use of all the subsets of Intel Advanced Vector Extensions 512 that are supported by the Intel® Xeon Phi™ processor (Zhang, 2016), and -g to generate debugging information and symbols. The -O3 option is also used because the executable must be optimized; we can use either the -O2 or the -O3 option for this purpose.
$ icpc Histogram_Example.cpp -g -O3 -restrict -xMIC-AVX512 -o run512 -lopencv_highgui -lopencv_core -lopencv_imgproc
Notice that we have also used the -restrict option, which informs the compiler that the pointers used in this application are not aliased. Also notice that we are linking the application with the OpenCV* library (www.opencv.org), which we use in this application to read an image from disk. A Makefile is included if you download the sample code; it can be used to generate an executable for Intel Advisor.
Next, we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point for analysis, because it provides information that will let us identify how our code is using vectorization and where the hotspots for analysis are.
$ advixe-cl -collect survey -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg
The above command runs the Intel Advisor tool and creates a project directory AdvProj-Example-AVX512
. Inside this directory, Intel Advisor creates, among other things, a directory named e000
containing the results of the analysis. If we list the contents of the results directory, we see the following:
$ ls AdvProj-Example-AVX512/e000/
e000.advixeexp  hs000  loop_hashes.def
$
The directory hs000
contains results from the survey analysis just created.
The next step is to view the results of the survey analysis performed by the Intel Advisor tool. Here we will use the CLI to generate the report. To do this, we replace the -collect
option with the -report
one, making sure we refer to the same project directory where the data has been collected. We can use the following command to generate a survey report from the survey data that is contained in the results directory in our project directory:
$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=text -report-output=./REPORTS/survey-AVX512.txt
The above command creates a report named survey-AVX512.txt in the REPORTS subdirectory. The report is in a column format with several columns, so it can be a little difficult to read on a console. One option for a quick read is to limit the number of columns displayed using the -filter option (only the survey report is supported by this option in the current version of Intel Advisor).
Another option is to create an xml-formatted report. We can do this if we change the value for the -format
option from text
to xml
:
$ advixe-cl -report survey -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/survey-AVX512.xml
The xml-formatted report might be easier to read on a small screen, because the information in the columns in the report file is condensed into one column. Here is a fragment of it:
(…)
</function_call_site_or_loop>
<function_call_site_or_loop ID="4" Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]" Self_Time="0.060s" Total_Time="0.120s" Type="Vectorized (Body)" Why_No_Vectorization="" Vector_ISA="AVX512" Compiler_Estimated_Gain="3.37x" Trip_Counts_Average="" Trip_Counts_Min="" Trip_Counts_Max="" Trip_Counts_Call_Count="" Transformations="" Source_Location="Histogram_Example.cpp:107" Module="run512">
(…)
</function_call_site_or_loop>
<function_call_site_or_loop ID="8" name="[loop in main at Histogram_Example.cpp:87]" Self_Time="0.030s" Total_Time="0.030s" Type="Vectorized (Body; [Remainder])" Why_No_Vectorization="1 vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override " Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x" Trip_Counts_Average="" Trip_Counts_Min="" Trip_Counts_Max="" Trip_Counts_Call_Count="" Transformations="" Source_Location="Histogram_Example.cpp:87" Module="run512">
</function_call_site_or_loop>
<function_call_site_or_loop ID="1" Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:87]" Self_Time="0.030s" Total_Time="0.030s" Type="Vectorized (Body)" Why_No_Vectorization="" Vector_ISA="AVX512" Compiler_Estimated_Gain="20.53x" Trip_Counts_Average="" Trip_Counts_Min="" Trip_Counts_Max="" Trip_Counts_Call_Count="" Transformations="" Source_Location="Histogram_Example.cpp:87" Module="run512">
Recall that the survey option in the Intel Advisor tool generates a performance overview of the loops in the application. For example, the fragment above shows that the loop starting on line 107 of the source code has been vectorized using the Intel AVX-512 ISA. It also shows an estimate of the improvement in the loop's performance (compared to a scalar version) and timing information. The second and third blocks in the fragment give a performance overview for the loop at line 87 of the source code: the body of the loop has been vectorized, but the remainder of the loop has not.
Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops in order to keep track of them in future analysis (for example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line).
The above is a quick way to run and visualize a vectorization analysis on the Intel Xeon Phi processor. This procedure lets us quickly view the basic vectorization information for our code with minimum effort. It also lets us create quick summaries of progressive optimization steps in the form of tables or plots (if we have run several of these analyses at different stages of the optimization process). However, if we need to access more advanced information, such as traits or the assembly code, we can use the Intel Advisor GUI, possibly from a different computer (either by copying the project folder to another computer or by accessing it over the network), and access the complete information that Intel Advisor offers.
For example, the next figure shows what the Intel Advisor GUI looks like for the survey analysis shown above. We can see that, besides the information contained in the CLI report, the Intel Advisor GUI offers other information, such as traits and the source and assembly code.
Collecting more detailed information
Once we have looked at the performance summary reported by the Intel Advisor tool using the survey option, we can use other options to add more specific information to the reports. One option is to run the tripcounts analysis to get information about the number of times loops are executed.
To add this information to our project, we can use the Intel Advisor tool to run a tripcounts analysis on the same project we used for the survey analysis:
$ advixe-cl -collect tripcounts -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg
And, similarly, to generate a tripcounts report:
$ advixe-cl -report tripcounts -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/tripcounts-AVX512.xml
Now the xml-formatted report will contain information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the xml report will be populated, while the information from the survey report will be preserved. Next is a fragment of the enhanced report (only the first, most time-consuming loop is shown):
(…)
</function_call_site_or_loop>
<function_call_site_or_loop ID="4" Function_Call_Sites_and_Loops="[child]-[loop in main at Histogram_Example.cpp:107]" Self_Time="0.070s" Total_Time="0.120s" Type="Vectorized (Body)" Why_No_Vectorization="" Vector_ISA="AVX512" Compiler_Estimated_Gain="3.37x" Trip_Counts_Average="761670" Trip_Counts_Min="761670" Trip_Counts_Max="761670" Trip_Counts_Call_Count="1" Transformations="" Source_Location="Histogram_Example.cpp:107" Module="run512">
In a similar way, we can generate other types of reports that will give us other useful information about our loops. The -help collect and -help report options of the command-line Intel Advisor tool show what types of collections and reports are available:
$ advixe-cl -help collect
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
-c, -collect=<string>
Collect specified data. Specifying --search-dir when collecting data is strongly recommended.
Usage: advixe-cl -collect=<string> [-action-option] [-global-option] [--] <target> [<target options>]
<string> is one of the following analysis types to perform on <target>:
survey - Explore where to add efficient vectorization and/or threading.
dependencies - Identify and explore loop-carried dependencies for marked loops.
map - Identify and explore complex memory accesses for marked loops.
suitability - Analyze the annotated program to check its predicted parallel performance.
tripcounts - Find how many iterations are executed.
$ advixe-cl -help report
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
-R, -report=<string>
Report the results that were previously gathered. Generates a formatted data report with the specified type and action options.
Usage: advixe-cl -report=<string> [-action-option] [-global-option] [--] <target> [<target options>]
<string> is the list of available reports:
survey - shows results of the survey analysis
annotations - lists the annotations in the sources
dependencies - shows possible dependencies
hotspots -
issues -
map - reports memory access patterns
suitability - shows possible performance gains
summary - shows the collection summary
threads - shows the list of threads
top-down - shows the report in a top-down view
tripcounts - shows survey report with tripcounts data added
For example, to obtain memory access pattern details in our source code, we can run a memory access patterns (MAP) analysis using the map option:
$ advixe-cl -collect map -project-dir ./AdvProj-Example-AVX512 -search-dir all:=./src -- ./run512 image01.jpg
$ advixe-cl -report map -project-dir ./AdvProj-Example-AVX512 -format=xml -report-output=./REPORTS/map-AVX512.xml
In all the above cases, the project directory (in this example, AdvProj-Example-AVX512) contains all the information necessary to perform a full analysis using the GUI. When we are ready to use the GUI, we can copy the project directory to a workstation/laptop (or access it over the filesystem) and run the GUI-based Intel Advisor from there, as was shown in a previous section of this article.
Summary
This article showed a simple way to quickly explore vectorization performance using Intel Advisor 2017. This was achieved by using the CLI of Intel Advisor to perform a quick, preliminary analysis and report on the Intel Xeon Phi processor from a text console, with the idea of later obtaining more information about our codes by using the Intel Advisor GUI.
This procedure will also be useful for consolidating performance information after several iterations of source code optimization. A Unix* script (or similar) can be used to collect information from different reports and quickly consolidate it into tables or plots.
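As a minimal sketch of that idea (assuming the xml reports were written to the ./REPORTS directory used above), a one-line grep can pull the self-time attributes out of every survey report for a quick side-by-side comparison:
$ grep -o 'Self_Time="[^"]*"' ./REPORTS/survey-*.xml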
References
Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."
DXF File for Intel® Joule™ Expansion Board
The DXF file for the Intel® Joule™ expansion board is a .ZIP archive that contains AutoCAD graphic data for the expansion board. This DXF file is for reference only and requires a viewer in order to read it.
Case Study – Using the Intel® Deep Learning SDK for Training Image Recognition Models
Introduction
The Intel® Deep Learning SDK is a free set of tools for data scientists and software developers to develop, train, and deploy deep learning solutions. The SDK encompasses a training tool and a deployment tool that can be used separately or together in a complete deep learning workflow. In this case study, we explore LeNet*, one of the prominent image recognition topologies for handwritten digit recognition, and show how the training tool can be used to visually set up, tune, and train a model on the Modified National Institute of Standards and Technology (MNIST) dataset using Caffe* optimized for Intel® architecture. Data scientists are the intended audience.
Human Visual System and Convolutional Neural Networks
Before we dive into the use of the Intel Deep Learning SDK, it helps to have a basic understanding of how the human visual system works and how it relates to the design of computer neural networks. The neuron is the basic computational unit in the brain. It receives input from dendrites, and when the combination of all inputs exceeds a certain threshold, it fires an output that triggers the connected neurons. Mathematically, the biological system can be represented as shown below [1]:
This is an overly simplistic model. In reality, the human brain processes the input signal through multiple layers within the visual cortex, which handle feature extraction, feature detection, and classification. Feature extraction is handled by cells in the visual cortex that subsample overlapping areas in the visual field (receptive fields), thus acting as filters over the input image. Feature detection is handled in cortical areas 17, 18, and 19, and classification in areas 20 and 21. Take a look at the picture below [2]. The processed information is also fed forward and back-propagated to the previous layers before an image is correctly recognized.
In convolutional neural networks (CNN), the convolutional layers act as feature detectors and the fully connected layers as classifiers. The feature extraction is handled at runtime by passing multiple input sets simultaneously through the CNN (called mini-batches) and adjusting the weights in each iteration, thus aggregating features in each forward pass and fine-tuning features during backward propagation. Now let’s take a look at the LeNet topology, which is a prominent CNN for handwritten-digit recognition.
The LeNet topology
The LeNet-5 architecture as published in [3] is shown below:
The topology has seven layers excluding the input: two sets of convolutional and subsampling layers followed by two fully connected layers and an output classifying layer.
The first convolution layer C1 has six feature maps of dimension 28×28. A kernel size of 5×5 with random weights and a constant bias is chosen. Across the six feature maps, this amounts to a total of 156 trainable parameters (6 feature maps * (5 * 5 weights + 1 bias term)). The input is scanned in overlapping areas by moving one pixel at a time (a stride of one), forming a total of 122,304 connections in the very first layer. Depending on the complexity of the problem we are trying to solve, you can see how quickly the number of neurons in each layer can increase manyfold. In order to reduce the number of computational units, we use subsampling. In layer S2, there are six feature maps of dimension 14×14. This reduction in dimension is obtained by sampling 2×2 pixels in the corresponding feature map of C1, adding the four inputs, multiplying the sum by a trainable coefficient, and adding a trainable bias. Note that the 2×2 regions in the subsampling (pooling) layer are non-overlapping. So in S2, we end up with 12 trainable parameters ((one coefficient + one bias) * 6 feature maps) and 5,880 connections.
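As a quick sanity check on those numbers: each of the 156 trainable parameters in C1 is applied at every one of the 28 × 28 = 784 positions of its output feature maps, giving 156 × 784 = 122,304 connections; similarly, each of the 6 × 14 × 14 = 1,176 units in S2 has four inputs plus a bias, giving 1,176 × 5 = 5,880 connections.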
Now look at layer C3. There are 16 feature maps that are 10×10 each. The table below explains how we achieved the reduction in the number of pixels between S2 and C3 [3]:
The asymmetric choice of pixels from S2 into C3 ensures that different feature maps extract different features as they each get different pixels while keeping the number of connections reasonable. I will keep the explanation of S4 brief. The concept is exactly the same as S2.
Finally, let’s look at the fully connected layers. Since these are classifiers, all of the features extracted in previous layers are used to match the input to the correct output. Remember that in C1, S2, C3, and S4 we chose random weights and biases on the first pass. When the input is evaluated against the actual output from the CNN, the accuracy, as expected, would be very low. Our goal is to increase the accuracy of the model in such a way that for most (if not all) inputs, the expected output will match the provided label. The method we use to accomplish this is gradient descent on a cost function. In the most simplistic terms, the formula can be shown as below [4]:
C is the cost function, which is a function of the weights and biases chosen. Our goal is to minimize the error between the actual output “y” for any input “x” and the expected output “a” over all “n” inputs, using the features extracted with weights “w” and biases “b”.
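Written out explicitly (a sketch of the standard quadratic cost used in [4]; the loss actually used by the trained model is the softmax-with-loss layer shown later in this article):

C(w, b) = \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^{2}

Minimizing C by gradient descent means repeatedly nudging w and b in the direction that reduces this average squared difference.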
Later, we’ll elaborate on how to create each convolution, pooling and fully connected layers in Caffe and show you how the choice of gradients and other hyper-parameters can be achieved using the Intel Deep Learning SDK.
The MNIST dataset
The MNIST dataset is a repository of 70k grayscale images of handwritten digits [5]. Each image is 28×28 in dimension. The collection was created from two NIST datasets, one of which was collected by Census Bureau employees and another by high-school students. To increase the variation in data, the final MNIST collection uses 30k images from each dataset for training and 5k images from each for testing.
You can obtain the dataset from here. We use the LeNet topology explained above with the MNIST dataset to demonstrate the training tool in the Intel Deep Learning SDK. Before we begin, make sure to process the data so that the category labels (0,1…9) are at the highest level of the directory structure with the corresponding images within each folder:
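For example (a hypothetical layout; the folder and file names below are illustrative only), the prepared archive might look like this:

mnist_train/
    0/    img_00001.png, img_00002.png, ...
    1/    ...
    ...
    9/    ...

Each top-level folder name is the label, and every image inside it is an example of that digit.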
Now let’s dive into training the model using the Intel Deep Learning SDK. If you have not already installed the Intel Deep Learning SDK, you can do so now from here. There are installers for both Windows* and MacOS*. Behind the scenes, the installer installs the SDK on an Ubuntu* 14.04 or higher machine using Docker. The training process runs on the Caffe framework that is optimized for Intel architecture. Read [7] for assistance. The Intel Deep Learning SDK Training Tool provides a graphical user interface to manipulate parameters and visualize the training process.
Using the Intel® Deep Learning SDK to train the model
One of the main advantages of using the Intel Deep Learning SDK to train a model is its ease of use. As a data scientist, your focus would be more on easily preparing training data, using existing topologies where possible, designing new models if required, and training models with automated experiments and advanced visualizations. The training tool provides all of these benefits while also simplifying the installation of popular deep learning frameworks. Below is a snapshot of the training tool:
Now let’s look at the steps required to generate a trained model from an existing topology. To begin, launch the training UI using the IP address and the port of the device on which you have installed the training tool. If a port number is not specified, it defaults to 8080. Enter your username and the password you created during installation and log in to the interface.
Step 1: Upload the dataset
From the Uploads tab on the left, select the dataset zip file containing the labeled folders and associated images as explained above. Choose a folder path, and then click Upload. Once complete, the uploaded dataset path can be obtained from the Completed section.
Step 2: Create a new dataset
Select training/validation percentages and data augmentation
In the Datasets tab, click New Dataset. Name the dataset and choose an already uploaded dataset from the drop-down list. You can choose the percentage of data you want to use for training, validation, and testing.
The effectiveness of the training process lies in the variability of the data in the dataset, so without altering our base dataset, we can augment the inputs to the model by applying some transformations. Some augmentation techniques available are rotate, bidirectional shift, zoom, and mirror. You can learn more about each of these options in [8].
Process the data
The MNIST dataset we have used has grayscale images that are 28×28 pixels, so we adjust the settings accordingly. If you need to resize the data you have, use one of the available options. The user guide has more details on each option.
Select the database backend and image format
Create the dataset
Click Create Dataset. Once the process is complete, you can visualize the number of images in each label for both the training and testing datasets.
Step 3: Create a model
Select the dataset
In the Models tab on the left, click New Model, and then name the model (for simplicity, I choose the same name as the dataset, but you can choose a different name).
Select the topology
The training tool currently supports three image recognition topologies: LeNet, AlexNet*, and GoogLeNet*. Since we are training a model for handwritten digit recognition, I will choose LeNet, which is well suited for the job. The base data (number of channels in convolution layers, pooling layers, fully connected layers, and so on) is obtained from the Intel® Distribution for Caffe* that runs underneath. If you are training a different model, say using color images from Cifar 100* or ImageNet, you could choose AlexNet or GoogLeNet to train the model.
You also have the provision to create a new custom topology based on LeNet, AlexNet, or GoogLeNet, which I discuss in a later section.
Perform the data transformation
If you need to introduce some randomness during the training process, you could perform some data transformation operation without affecting the raw data in your dataset. These steps are covered in detail in the user guide. For now, I will use the default settings.
Select hyper-parameters
Here you choose settings to fine-tune your model. Some of the important parameters include the following:
Epoch: One complete pass of the entire dataset through all the layers of the CNN; after one epoch, every data file in the dataset has gone through training once.
Batch size: CNN datasets are usually large. To allow for more effective training, the dataset is split into batches of “x” images. A complete pass of one batch through the CNN is called an iteration. While a large batch size may reduce the variance in the parameter updates, it also consumes a lot of memory, so it is important to balance the two (see the worked example after this list).
Base learning rate: In the gradient descent algorithm we discussed earlier, we said that on the first pass, since the weights and biases are chosen at random, the accuracy is low. In order to converge to the global minimum, we change the base learning rate of the algorithm. While a larger learning rate could mean faster convergence, a learning rate that is too large will drive our model away from convergence. On the other hand, a smaller learning rate takes tiny steps toward the global minimum.
Solver type: The default option is the stochastic gradient descent, which takes random samples in the input data set in successive batches to get the solution to converge at the global minimum more quickly. You can choose between other options that are described in the user guide.
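As a worked example of how batch size, iterations, and epochs interact (using numbers consistent with the solver file shown later in this article, not values taken from the tool itself): a training split of roughly 54,000 MNIST images with a batch size of 64 gives about 843 iterations per epoch (843 × 64 = 53,952 images), so 15 epochs correspond to 15 × 843 = 12,645 iterations, which matches the max_iter value in the solver.prototxt shown below.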
Now you are ready to run the model. Once complete, you will see the accuracy and loss at each epoch of execution and final training, validation, and testing accuracies.
At this point, the training is complete. We will need the list of Caffe files generated by the training tool. Click Download, and then save the files. In the next section, we’ll walk through each of these files and connect the concepts we have learned so far.
Understanding the Caffe* model files
The downloaded model.zip has all the necessary files to understand the CNN. This is important in case you want to create a custom topology or debug the model. These files are also used on the deployment platform to validate the model against real-time data. Let’s take a closer look at some of the files in this archive.
The most important file is the train_val.prototxt that contains the architecture of the model [9].
The data layer
Let’s take a look at the description of the data layer.
name: "MNISTLeNet" layer { name: "MNISTFinal" type: "Data" top: "data" top: "label" include { phase: TRAIN } transform_param { mirror: false scale: 0.00390625 } data_param { source: "/workspace/dl-genie/jobs/datasets/ca04ddcd-66b2-422d-8de8-dcc1ad38cfd4/train.txt_LMDB" batch_size: 64 backend: LMDB } }
The code above generates two blobs as indicated by “top”: one for the data and another for the labels. The “include” section indicates that the data layer is being created for the TRAIN phase. Since we need the same structure for validation as well, you will notice that there is another data layer within the train_val.prototxt file that indicates TEST in the “include” section. Next, we scale the pixels so that they are all in the range 0 to 1 (we do this by 1/256 = 0.00390625). Following this, we set the data source to point to the docker container where the dataset is created. Note that the back-end is set to LMDB and the batch size is set to 64, which is what we indicated in the training tool.
The convolutional layer
layer { name: "conv1" type: "Convolution" bottom: "data" top: "conv1" param { lr_mult: 1 } param { lr_mult: 2 } convolution_param { num_output: 20 kernel_size: 5 stride: 1 weight_filler { type: "xavier" } bias_filler { type: "constant" } } }
From the LeNet topology we discussed above, we know that the first convolution layer takes as input the 28×28 image. This is indicated by the “bottom:” parameter, which specifies that the data blob is the input. The “param” section indicates the learning rate adjustments for both the weights and the bias, which are the learnable parameters in this layer. The learning rate for the weights (lr_mult: 1) is set to the value that we indicated in the hyper-parameters. The bias learning rate is set to twice the weight learning rate. Empirically, this is known to provide better convergence rates.
Next, let’s define the number of channels in the convolution layer. This is defined by “num_output”. Note that the number of channels here is different from the LeNet topology we described above; the LeNet topology in the Caffe framework uses a variant of the LeNet algorithm described above, but conceptually they are similar. The kernel_size is set to 5×5 pixels (same as LeNet-5), which is used to scan through the 28×28 input image by moving one pixel at a time (as indicated by “stride”). Also recall that the weights and bias are chosen at random on the first iteration, and these parameters learn how to fine-tune features as the training progresses. The weight_filler in this case is set to “xavier”, which samples weights from a uniform distribution [-scale, scale] where scale = sqrt(3/n) and n = number of inputs. If you are curious to learn more about the other types of fillers available in Caffe, read this file. We then set the bias_filler to a constant value = 0.
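To make the Xavier formula concrete (assuming n counts the fan-in of the conv1 kernel, that is, 5 × 5 × 1 = 25 inputs), the scale would be sqrt(3/25) ≈ 0.35, so the initial weights are drawn uniformly from roughly [-0.35, 0.35].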
The sub-sampling or pooling layer
layer { name: "pool1" type: "Pooling" bottom: "conv1" top: "pool1" pooling_param { pool: MAX kernel_size: 2 stride: 2 } }
Defining a sub-sampling layer is easier. It takes as input the previous convolution layer and performs max pooling over a 2×2 pixel area, moving two pixels at a time as indicated by “stride”, so there is no overlap. By subsampling, we get feature maps that are half the width and height of the first convolutional layer’s output.
The train_val.prototxt file has two more sections listing the second convolutional layer with 50 channels and the second pooling layer. We will not explain those here, as they are similar to the above.
The fully connected layers
layer { name: "ip1" type: "InnerProduct" bottom: "pool2" top: "ip1" param { lr_mult: 1 } param { lr_mult: 2 } inner_product_param { num_output: 500 weight_filler { type: "xavier" } bias_filler { type: "constant" } } } layer { name: "relu1" type: "ReLU" bottom: "ip1" top: "ip1" } layer { name: "ip2" type: "InnerProduct" bottom: "ip1" top: "ip2" param { lr_mult: 1 } param { lr_mult: 2 } inner_product_param { num_output: 10 weight_filler { type: "xavier" } bias_filler { type: "constant" } } }
In Caffe, “InnerProduct” refers to the fully connected layers. The only change to note here is the number of outputs (num_output), which is set to 500 for the first fully connected layer. The Rectified Linear Unit (ReLU) performs the element-wise function f(x) = max(0, x). The last fully connected layer outputs 10 signals, one per digit class: an activation on output n (say n = 3) indicates that the trained model predicts the input to be 3.
The loss layer
layer { name: "loss" type: "SoftmaxWithLoss" bottom: "ip2" bottom: "label" top: "loss" }
As mentioned previously, in CNNs, we have information regarding the accuracy of the model being back-propagated so that the weights and bias can be adjusted for better feature extraction and prediction. The loss layer is responsible for taking as input the label blob and the current prediction, computing the loss, and sending the data back during back-propagation.
You can also include certain rules within layer definitions. For example, to define an accuracy layer only during training, you could include the rule as shown below:
layer {
name: "accuracy_top5"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy_top5"
include {
phase: TEST
}
accuracy_param {
top_k: 5
}
}
Let’s look at the solver.prototxt file. This file includes all of the hyper-parameters that we defined using the training tool interface. You can experiment by changing one or more of these parameters to see how the training changes.
# The train/test net protocol buffer definition
net: "/workspace/dl-genie/jobs/models/ce7c60e9-be99-47bf-b166-7f9036cde0c8/train_val.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 5.0E-4
# The learning rate policy
lr_policy: "inv"
gamma: 1.0E-4
power: 0.75
# Display every 100 iterations
display: 31
# The maximum number of iterations
max_iter: 12645
# snapshot intermediate results
snapshot: 843
snapshot_prefix: "/workspace/dl-genie/jobs/models/ce7c60e9-be99-47bf-b166-7f9036cde0c8/snapshot"
# solver mode: CPU or GPU
solver_mode: CPU
type: "SGD"
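Two details in this file are worth calling out (a hedged reading based on standard Caffe semantics). First, with lr_policy set to "inv", the effective learning rate at iteration i decays as

lr_i = base\_lr \cdot (1 + \gamma i)^{-power}

so with base_lr 0.01, gamma 1.0E-4, and power 0.75, the rate shrinks gradually as training progresses. Second, the snapshot interval of 843 iterations matches the per-epoch iteration count from the worked example in the hyper-parameters section, so the tool saves a snapshot roughly once per epoch, and max_iter of 12,645 corresponds to 15 such epochs.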
Note that the Intel Deep Learning SDK by default runs the training on the CPU without having to configure Caffe manually. If you want to run the training on Caffe optimized for Intel architecture outside of the SDK, please follow the instructions in this article.
Now that you understand how to interpret the Caffe model files, let’s look at how you can customize the topology using the training tool interface.
Customizing topologies using the Intel® Deep Learning SDK
Let’s revisit the process of creating a new model. After naming the new model and selecting the dataset, go to the Topology tab and select one of the existing topologies. In this case, I will select the LeNet topology. Click Edit as shown below:
A text box prepopulated with the train_val.prototxt file for the chosen model displays:
You can now change the parameters for the convolutional, pooling, and fully connected layers. Type a name for this new custom topology, and then save it. Once saved, the custom topologies you have created become available in a list for you to choose from.
You can now proceed with the rest of the training process as explained above. Once complete, the model files downloaded from the training tool can be used on your deployment platform to make predictions in real time.
References and Resources
[1] Using Convolutional Neural Networks for Image Recognition
[2] A Neural Network Architecture for General Image Recognition
[3] Gradient-Based Learning Applied to Document Recognition
[4] Using neural nets to recognize handwritten digits
[7] Intel® Deep Learning SDK – Training tool Installation Guide
[8] Intel® Deep Learning SDK – Training tool user guide
[9] Training LeNet on MNIST with Caffe
[10] Training and deploying deep learning networks with Caffe* optimized for Intel® Architecture
Notice
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.
This sample source code is released under the Intel Sample Source Code License Agreement.
OvS-DPDK Datapath Classifier – Part 2
Overview
This article describes the design and implementation of Open vSwitch* (OvS) with the Data Plane Development Kit (DPDK) (OvS-DPDK) datapath classifier, also known as dpcls. We recommend that you first read the introductory article where we introduce the high-level dpcls design, the Tuple-Space search (TSS) implementation and its performance optimization strategy, and a few other scenarios. Part 2 (this article) focuses on call graphs and the subroutines involved.
OvS-DPDK has three-tier look-up tables/caches. Incoming packets are first matched against Exact Match Cache (EMC) and in case of a miss are sent to the dpcls (megaflow cache). The dpcls is implemented as a tuple space search (TSS) that supports arbitrary bitwise matching on packet header fields. Packets that miss the dpcls are sent to the OpenFlow* pipeline, also known as ofproto classifier, configured by an SDN controller as depicted in Figure 1.
Figure 1. Open vSwitch* three-tier cache architecture.
Packet Reception Path
The poll mode driver (PMD) thread continuously polls the input ports in its poll list. From each port it can retrieve a burst of packets not exceeding 32 (NETDEV_MAX_BURST). Each input packet then can be classified based on the set of active flow rules. The purpose of classification is to find a matching flow so that the packet can be processed accordingly. Packets are grouped per flow and each group will be processed to execute the specified actions. Figure 2 shows the stages of packet processing.
Figure 2. Packet-processing stages.
Flows are defined by the dp_netdev_flow data structure (not to be confused with the struct flow) and are stored in a hash table called flow_table. Some of the information stored in a flow includes the following:
- Rule
- Actions
- Statistics
- Batch (queue for processing the packets that matched this flow)
- Thread ID (owning this flow)
- Reference count
Often the words “flows” and “rules” are used interchangeably; however, note that the rule is part of the flow.
The entries stored in the dpcls subtables are {rule, flow pointer} couples. The dpcls classification can then be summarized as a two-step process where:
- A matching rule is found from the subtables after lookup.
- The actions of the corresponding flow (from the flow pointer) are then executed.
Figure 3 shows a call graph for packet classification and action execution.
Figure 3. Lookup, batching, and action execution call graph.
When a packet is received, the packet header fields get extracted and stored into a “miniflow” data structure as part of EMC processing. The miniflow is a compressed representation of the packet that helps reduce the memory footprint and the number of cache lines accessed.
Various actions can be executed on the classified packet, such as forwarding it to a certain port. Other actions are available, for example adding a VLAN tag, dropping the packet, or sending the packet to the Connection Tracker module.
EMC Call Graph
EMC is implemented as a hash table where the match must occur exactly on the whole miniflow structure. The hash value can be pre-computed by the network interface card (NIC) when RSS mode is enabled; otherwise, the hash is computed in software using miniflow_hash_5tuple().
Figure 4. EMC processing call graph.
In emc_processing(), after the miniflow extraction and hash computation, a lookup is performed in EMC for a match. When a valid entry is found, an exact match check will be carried out to compare that entry with the miniflow extracted from the packet. In case of a miss, the check is repeated on the next linked entry, if any. If no match is found, the packet will be later checked against the dpcls, the second level cache. Figure 4 depicts the call graph of packet processing in exact match cache.
Datapath Classifier Call Graph
Figure 5 shows the main dpcls functions involved in the overall classification and batching processes.
Figure 5. dpcls call graph.
As described in our introductory article, a dpcls lookup is performed by traversing all of its subtables. For each subtable, the miniflow extracted from the packet is used to derive a search key in conjunction with the subtable mask. See the netdev_flow_key_hash_in_mask() function in Figure 6, where the hash functions are also shown.
Figure 6. dpcls lookup call graph.
A dpcls lookup involves an iterative search over multiple subtables until a match is found. On a miss, the ofproto table will be consulted. On a hit, the retrieved value is a pointer to the matching flow, which determines the actions to be performed on the packet. A lookup on a subtable involves the following operations (a simplified sketch of these steps follows the list):
- The subtable mask will determine the subset of the packet miniflow fields to consider. The mask is then applied on the subset of fields to obtain a search-key.
- A hash value is computed on the search-key to point to a subtable entry.
- If an entry is found, due to the possibility of collisions (as is the behavior in hash table data structures), a check has to be performed to determine whether or not that entry matches with the search key. If it doesn’t match, the check is repeated on the chained list of entries, if any.
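To make those three steps concrete, here is a minimal, self-contained C sketch. It is not the OvS code: the real implementation uses struct netdev_flow_key, cmap-based buckets, and murmurhash3 or CRC32-based hashing; the type and function names below are illustrative only.

#include <stdint.h>
#include <string.h>

#define KEY_WORDS 4

struct toy_key   { uint64_t w[KEY_WORDS]; };   /* subset of packet miniflow fields */
struct toy_entry { struct toy_key masked_key; void *flow; struct toy_entry *next; };
struct toy_subtable {
    struct toy_key mask;              /* wildcard mask shared by every rule in this subtable */
    struct toy_entry *buckets[256];   /* tiny hash table; entries chain on collision */
};

/* Stand-in for the murmurhash3/CRC32 hashing used by the real classifier. */
static uint32_t toy_hash(const struct toy_key *k)
{
    uint64_t h = 0x9e3779b97f4a7c15ULL;
    for (int i = 0; i < KEY_WORDS; i++) {
        h ^= k->w[i];
        h *= 0xff51afd7ed558ccdULL;
    }
    return (uint32_t)(h ^ (h >> 32));
}

/* Steps 1-3 from the list above: mask the fields, hash the search key,
 * probe the bucket, and walk the collision chain until an exact match. */
static void *toy_subtable_lookup(const struct toy_subtable *t,
                                 const struct toy_key *pkt_fields)
{
    struct toy_key search_key;
    uint32_t hash;

    for (int i = 0; i < KEY_WORDS; i++) {
        search_key.w[i] = pkt_fields->w[i] & t->mask.w[i];      /* step 1 */
    }
    hash = toy_hash(&search_key);                               /* step 2 */

    for (struct toy_entry *e = t->buckets[hash & 255]; e; e = e->next) {
        if (memcmp(&e->masked_key, &search_key, sizeof search_key) == 0) {
            return e->flow;                                     /* step 3: hit */
        }
    }
    return NULL;    /* miss: the caller moves on to the next subtable */
}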
Due to their wildcarded nature, the entries stored in dpcls are also referred to as “megaflows,” because an entry can be matched by “several” packets. For example with “Src IP = 21.2.10.*”, incoming packets with Src IPs “21.2.10.1” or “21.2.10.2” or “21.2.10.88” will match the above rule. On the other hand, the entries stored in EMC are referred to as “microflows” (not to be confused with the miniflow data structure).
Note that the term “fast path” comes from the vanilla OvS kernel datapath implementation where the two-level caches are located in kernel space. “Slow path” refers to the user-space datapath that involves the ofproto table and requires a context switch.
Subtables Creation and Destruction
The order of subtables in the datapath classifier is random, and the subtables can be created and destroyed at runtime. Each subtable can collect rules with a specific predefined mask, see the introductory article for more details. When a new rule has to be inserted into the classifier, all the existing subtables are traversed until a suitable subtable matching the rule mask is found. Otherwise a new subtable will be created to store the new rule. The subtable creation call graph is depicted in Figure 7.
Figure 7. Rule insertion and subtable creation call graph.
When the last rule in a subtable is deleted, the subtable becomes empty and can be destroyed. The subtable deletion call graph is depicted in Figure 8.
Figure 8. Subtable deletion call graph.
Slow Path Call Graph
On a dpcls lookup miss, the packet will be classified by the ofproto table, see Figure 9.
An “upcall” is triggered by means of dp_netdev_upcall(). The reply from Ofproto layer will contain all the information about the packet classification. In addition to the execution of actions, a learning mechanism will be activated: a new flow will be stored, a new wildcarded rule will be inserted into the dpcls, and an exact-match rule will be inserted into EMC. This way the two-layer caches will be able to directly manage similar packets in the future.
Figure 9. Upcall processing call graph.
Packet Batching
Packet classification categorizes packets with the active flows. The set of received packets is divided into groups depending on the flows that were matched. Each group is enqueued into a batch specific to the flow, as depicted in Figure 10.
Figure 10. Packets are grouped depending on the matching flow.
All packets enqueued into the same batch will be processed with the list of actions defined by that flow. To improve the packet forwarding performance, the packets belonging to the same flow are batched and processed together.
In some cases there could be very few packets in a batch. In the worst-case scenario, each packet of the fetched set hits a distinct flow, so each batch holds only one packet. That becomes particularly inefficient when transmitting few packets over the DPDK interfaces, as it incurs expensive MMIO writes. To optimize the MMIO transactions and improve performance, an intermediate queue is implemented: enqueued packets are transmitted when their count is greater than or equal to 32. Figure 11 depicts how the packet batching is triggered from emc_processing() and fast_path_processing().
Figure 11. Packet batching call graph.
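As a toy illustration of the intermediate queue described above (this is not the OvS-DPDK implementation; the type names and the send callback are invented for the sketch, which simply shows packets accumulating until a 32-packet burst can be sent with a single flush):

#include <stddef.h>

#define TXQ_FLUSH_THRESHOLD 32   /* matches the 32-packet burst mentioned above */

/* A caller-supplied burst-send function, e.g., one wrapping rte_eth_tx_burst(). */
typedef void (*burst_send_fn)(void **pkts, int count);

struct tx_queue {
    void *pkts[TXQ_FLUSH_THRESHOLD];
    int count;
    burst_send_fn send;
};

/* One MMIO-heavy transmit for the whole accumulated burst. */
static void txq_flush(struct tx_queue *q)
{
    if (q->count > 0) {
        q->send(q->pkts, q->count);
        q->count = 0;
    }
}

/* Packets accumulate until the threshold is reached, then go out as one burst. */
static void txq_enqueue(struct tx_queue *q, void *pkt)
{
    q->pkts[q->count++] = pkt;
    if (q->count >= TXQ_FLUSH_THRESHOLD) {
        txq_flush(q);
    }
}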
Action Execution Call Graph
The action execution is done on a batch basis. A typical example of an action can be to forward the packets on the output interface. Figure 12 depicts the case where the packets are sent on the output port by calling netdev_send().
Figure 12. Action execution call graph.
Example of Packet Forwarding to a Physical Port
Figure 13. Overall call graph in the case of forwarding to a physical port.
The call graph in Figure 13 depicts an overall picture of the packet path from ingress until the action execution stage, wherein the packet of interest is forwarded to a certain output physical port.
Accelerating the Classifier on Intel® Processors
A hash table is a data structure that associates keys with values and provides constant-time O(1) lookup on average, regardless of the number of items in the table. The essential ingredient of a high-performance hash table is a good hash function that can reduce collisions and the speed at which the hash can be calculated.
Hash tables have been used in OvS-DPDK to implement both the EMC and the dpcls. As mentioned before, the EMC lookup can leverage the hash value computed by the NIC when RSS is enabled. In the dpcls, by contrast, the hash is computed entirely in software using murmurhash3. Improving the performance of the dpcls is critical to improving the overall OvS-DPDK performance.
In real-world deployments where the virtual switch is handling a few thousand flows, EMC gets quickly saturated due to its limited capacity (8192 entries). So most of the input packets will be checked against the dpcls. Though a single subtable lookup is inherently fast, significant performance degradation is seen with tens of subtables. The classifier performance suffers because the hash value computation has to be repeated for every subtable until a match is found.
To speed up the hash computation, the built-in CRC32 intrinsics can be leveraged on Intel processors. On an x86_64 processor with Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) support, the CRC32 intrinsics can be enabled by passing ‘-msse4.2’ at configuration time. For example, to leverage the intrinsics, OvS-DPDK can be configured with the appropriate CFLAGS as shown below.
./configure CFLAGS="-g -O2 -msse4.2"
If you are on a different processor and don't know what flags to choose, we recommend using the ‘-march=native’ setting. With this, GCC detects the processor and automatically sets the appropriate flags for it. Do not use this method if you are compiling OvS on a machine other than the one it will run on.
./configure CFLAGS="-g -O2 -march=native"
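For illustration only (this is not the OvS hash code; it is a minimal sketch of how the SSE4.2 CRC32 intrinsic can fold a multi-word key into a 32-bit hash, compiled with -msse4.2):

#include <nmmintrin.h>   /* SSE4.2 intrinsics, requires -msse4.2 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fold an array of 64-bit words into a 32-bit hash, one CRC32 instruction per word. */
static uint32_t crc32_hash_words(const uint64_t *words, size_t n, uint32_t basis)
{
    uint64_t hash = basis;
    for (size_t i = 0; i < n; i++) {
        hash = _mm_crc32_u64(hash, words[i]);
    }
    return (uint32_t)hash;
}

int main(void)
{
    /* A toy 4-word key standing in for masked packet header fields. */
    uint64_t key[4] = { 0x0a000001, 0x0a000002, 0x1f90, 0x06 };
    printf("hash = 0x%08x\n", (unsigned)crc32_hash_words(key, 4, 0));
    return 0;
}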
Conclusion
In this article, we discussed the different stages in the packet-processing pipeline with a focus on the dpcls. We also discussed batching and the positive impact of flow batching and the intermediate queue on performance when packets are forwarded to a physical port. Finally, we described the intrinsics that can be used to accelerate the dpcls and the overall switching performance on Intel processor-based platforms.
For Additional Information
For any questions, feel free to follow up on the Open vSwitch discussion mailing list.
Part 1 of this two part series: OVS-DPDK Datapath Classifier
Tuple Space Search
To learn more about the Tuple Space Search:
About the Authors
Bhanuprakash Bodireddy is a network software engineer at Intel. His work primarily focuses on accelerated software switching solutions in user space running on Intel architecture. His contributions to OvS-DPDK include usability documentation, the Keep-Alive feature, and improving the datapath Classifier performance.
Antonio Fischetti is a network software engineer at Intel. His work primarily focuses on accelerated software switching solutions in user space running on Intel architecture. His contributions to OVS with DPDK are mainly focused on improving the datapath Classifier performance.
SaffronArchitect User Guide
SaffronArchitect is the tool you use to access dashboards. It is set up by your system administrator. From it, you can do the following:
- Access dashboards by logging in to SaffronArchitect.
- View configuration settings related to ingested data sources.
- Activate an ingestion.
- Access dashboard documentation.
Log In / Out
Users must log in to the SaffronArchitect application. Login credentials are provided by your system administrator:
Tabs
Navigational tabs appear at the top of SaffronArchitect. Each tab is described in this document:
Dashboards
Select Dashboards to access all dashboards that are included in SaffronArchitect. Once clicked, the tab view changes to include the names of all dashboards so that you can easily navigate between them. In the example below, dashboard tabs appear for SaffronStreamline IDR:
Settings
The Settings tab reveals configuration information set up by your system administrator. It cannot be edited:
Control Panel
Access the Control Panel to begin a data ingestion. Note that the data source location and other configuration considerations have been previously added by your system administrator.
When you access this tab, the system checks to see if an ingestion has been previously run.
If it determines that an ingestion has not yet run, click Ingest Data. This tells the back-end system to run an ingestion.
When data is fully ingested, Ingestion completed is returned. The various dashboards are populated with information.
As you periodically return to the Control Panel, the system checks to see if new data has been added. The example at left indicates that no new data has been ingested.
If for any reason you need to force a new ingestion, check Force Ingestion and then click Ingest Data.
Help
Click Help to reveal a list of helpful documents:
Issue Similarity and Expert Recommendation
The Issue Similarity and Expert Recommendation dashboard provides an overall view of possible similar issues and expertise for a given issue in a given date range.
In the example below, issues ranked with the highest similarity (and which might be duplicates) are listed first. Supporting information (Matches) for the ranked score is listed. Differences, or how the issues are not similar, are also listed.
Recommended experts (based on skills and previous work assignments) are provided to help you decide who should work on a given issue.
Click Export to save the data as a CSV file.
Project Status
The Project Status dashboard displays a report-like view of highly-similar issues and possible duplicates in a given date range.
The Similarity of Issues chart visually displays the degree of similarity among possible duplicate issues identified by the system. In the example below, only a small percentage of similar items are labeled High. Top Mentions displays the most common attributes among issues that have been identified as similar.
Hover over the chart to see how many issues are labeled at which level:
Test Case Similarity and Expert Recommendation
The Test Case Similarity and Expert Recommendation dashboard displays similar and possibly duplicate test cases that are related to a queried test case.
In the example below, tests ranked with the highest similarity (and which might be duplicates) are listed first. Supporting information (Matches) for the ranked score is listed. Differences, or how the tests are not similar, are also listed.
Recommended experts (based on skills and previous work assignments) are provided to help you decide who should work on a given test.
Click Export to save the data as a CSV file.
Intel® Aero Compute Board and Intel® RealSense™ Technology for Wi-Fi* Streaming of RGB and Depth Data
Contents
Introduction
Target Audience
General Information
What’s Needed for this Sample Application?
What is the Intel® Aero Platform?
Two examples:
The Intel® Aero Compute Board
Operating system
Connector information
Intel® Aero Vision Accessory Kit
Intel® RealSense™ Technology
GStreamer
Setting Up Eclipse Neon*
Header files
Libraries
The Source Code
My Workflow
Some Initial Thoughts
gint main(gint argc, gchar *argv[])
static void print_realsense_error( rs_error *err )
static gboolean init_realsense( rs_device **camera )
static gboolean init_streams( rs_device *camera )
static void cb_need_data ( GstAppSrc *appsrc, guint unused_size, gpointer user_data )
Intel® Aero Compute Board Setup
Connecting Wirelessly to the Intel Aero Compute Board
Troubleshooting
Useful Bash Shell Scripts
migrateAero
makeAero
How to Configure QGroundControl
Step 1
Step 2
Step 3
Step 4
Launch the App
Intel® Aero Compute Board and GitHub*
Other Resources
Summary
Introduction
This article shows you how to send a video stream including RGB and Depth data from the Intel® Aero Compute Board that has an Intel® RealSense™ R200 camera attached to it. This video stream will be broadcast over the compute board’s Wi-Fi* network to a machine that is connected to the Wi-Fi network. The video stream will be displayed in the QGroundControl* internal video window.
Target Audience
This article and the code sample within it are geared toward software engineers and drone developers who want to start learning about the Intel Aero Compute Board. Some information taken from other documents is also included.
General Information
The example given in this article assumes that you are working on an Ubuntu* 16.04 machine. Though you can work with GStreamer and LibRealSense on a Windows* platform, this article’s source code was written on Ubuntu 16.04; details for Windows are therefore out of the scope of this document.
Although I reference the Intel RealSense R200 camera throughout this article, this example uses the LibRealSense library only to pull RGB and depth frames; it does not take full advantage of the camera’s depth capabilities. Future articles will address that type of functionality. This is a simple app to get someone up and running with streaming to QGroundControl.
Note that the source code is running on the Intel Aero Compute Board, not a client computer. It sends a video stream out to a specified IP address. The client computer must be attached to the Intel Aero Compute Board network.
What’s Needed for this Sample Application?
Assuming you do not have an Intel® Ready-to-Fly Drone and will be working with the board itself, the following items will be needed:
- Intel Aero Compute Board
- Power supply
- Intel® Aero Vision Accessory Kit
- Eclipse* Neon IDE (or your IDE of choice)
- GStreamer Software Library
- Intel LibRealSense Library
- A computer that is capable of running QGroundControl. The GStreamer runtimes need to be installed.
What is the Intel® Aero Platform?
The Intel® Aero Platform for UAVs is a set of Intel® technologies that allow you to create applications that enable various drone functionalities. At its core is the Intel Aero Compute Board and the Intel® Aero Flight Controller. The combination of these two hardware devices allows for powerful drone applications. The flight controller handles all aspects of drone flight, while the Intel Aero Compute Board handles real-time computation. The two can work in isolation from one another or communicate via the MAVlink* protocol.
Two examples:
Video streaming: When connected to a camera, the Intel Aero Compute Board can handle all the computations of connecting to the camera, pulling the stream of data, and doing something with it; perhaps streaming that data back to a ground control station via the built-in Wi-Fi capabilities. All this computation is handled independently of the Intel Aero Flight Controller.
Collision avoidance: The Intel Aero Compute Board is connected to a camera, this time the Intel RealSense camera (R200). The application can pull depth data from the camera, crunch that data, and make tactical maneuvers based on the environment around the drone. These maneuvers can be calculated on the computer board, and then using Mavlink, an altered course can be sent to the flight controller.
This article discusses video streaming; collision avoidance is out of the scope of this article.
The Intel® Aero Compute Board
Operating system
The Intel Aero Compute Board uses a customized version of Yocto* Linux*. Plans are being considered to provide Ubuntu in the future. Keeping the Intel Aero Compute Board up to date with the latest image of Yocto is out of the scope of this document. For more information on this process, please see the Intel-Aero / meta-intel-aero wiki.
Connector information
1 | Power and console UART |
2 | USB 3.0 OTG |
3 | Interface for Intel® RealSense™ camera (R200) |
4 | 4-lane MIPI interface for high-resolution camera |
5 | 1-lane MIPI interface for VGA camera |
6 | 80-pin flexible I/O supports third-party flight controller and accessories (I2C, UART, GPIOs) |
7 | MicroSD memory card slot |
8 | Intel® Dual Band Wireless-AC |
9 | M.2 interface for PCIe SSD |
10 | Micro HDMI port |
R | RESERVED for future use |
Intel® Aero Vision Accessory Kit
The Intel® Aero Vision Accessory Kit contains three cameras: an Intel RealSense camera (R200), an 8-megapixel (MP) RGB camera, and a VGA camera that uses global shutter technology. With these three cameras, you have the ability to do depth detection using the Intel RealSense camera (R200) for use cases such as collision avoidance and creating point cloud data. With the 8-MP camera, you can collect and stream much higher-quality RGB data than the Intel RealSense camera (R200) is capable of streaming. With the VGA camera and its global shutter, one use case a developer could implement is optical flow.
More detailed information about each camera can be found here:
Intel RealSense camera (R200)
- /en-us/RealSense/R200Camera
- /sites/default/files/managed/d7/a9/realsense-camera-r200-product-datasheet.pdf
8-MP RGB camera
- http://www.ovt.com/products/sensor.php?id=139
- http://www.ovt.com/download_document.php?type=sensor&sensorid=139
VGA camera
- http://www.ovt.com/products/sensor.php?id=146
- http://www.ovt.com/download_document.php?type=part&partid=350
Intel® RealSense™ Technology
With Intel RealSense technology using the Intel RealSense camera (R200), a user can stream depth, RGB, and IR data. The Intel Aero Platform for UAVs uses the open source library LibRealSense. This open source library is analogous to being a driver for the Intel RealSense camera (R200), allowing you to easily get streaming data from the camera. The library comes with several easy-to-understand tutorials for getting streaming up and running. For more information on using LibRealSense, visit the LibRealSense GitHub* site.
GStreamer
In order to develop against GStreamer on your Ubuntu computer, you must install the proper libraries. An in-depth look into the workings of GStreamer is beyond the scope of this article. For more information, see the GStreamer documentation. We recommend starting with the “Application Development Manual." To get all the necessary GStreamer libraries, install the following on your Ubuntu machine.
- sudo apt-get update
- sudo apt-get install ubuntu-restricted-extras
- sudo apt-get install gstreamer1.0-libav
- sudo apt-get install libgstreamer-plugins-base1.0-dev
As a bit of tribal knowledge: I have been developing on two different machines, and these two Ubuntu instances installed GStreamer in different locations. On one machine, the GStreamer headers and libraries are installed in /usr/include and /usr/lib; on the other, they are installed in /usr/lib/x86_64-linux-gnu. You will see evidence of this in how I have included libraries and header files in my Eclipse project, which will appear to have duplicates. In hindsight, I could have just kept the source code in two separate project solutions.
Setting Up Eclipse Neon*
As mentioned, you can use whatever IDE you like. I gravitated toward the C++ version of Eclipse Neon.
I assume that you know how to create an Eclipse C++ application and will just show you how I set up my include files and what libraries I chose.
Unless I’ve missed anything, you should be ready to compile the following source code.
The Source Code
When looking at this source code, you may find that the spacing is off. This is because I copied and pasted this code directly out of my IDE. I didn’t change the spacing for this document so that it wouldn’t mess up the formatting for you if you copy this into your own IDE.
//=============================================================================
// AeroStreamDepth
// Demonstrates how to capture RGB and depth data from the RealSense camera,
// manipulate the data to create an RGB depth image, then put each individual
// frame into the GStreamer pipeline.
//
// Unlike other pipelines where the source is a physical camera, the source
// of this pipeline is an appsrc element. This element gets its data
// frame-by-frame from a continuous pull from the R200 camera.
//
// Built on Ubuntu 16.04 and Eclipse Neon.
//
// SOFTWARE DEPENDENCIES
// * LibRealSense
// * GStreamer
//
// Example
// ./AeroStream 192.168.1.1
//=============================================================================
#include <gst/gst.h>
#include <gst/app/gstappsrc.h>
#include <librealsense/rs.h>
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <stdint.h>
#include <time.h>

const int WIDTH  = 640;
const int HEIGHT = 480;
const int SIZE   = 640 * 480 * 3;

// The camera always returns this for 1 / get_depth_scale()
const uint16_t ONE_METER = 999;

// Holds the RGB data coming from the R200
struct rgb
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
};

// Function descriptions are with the definitions
static void     print_realsense_error( rs_error *err );
static gboolean init_realsense( rs_device **camera );
static gboolean init_streams( rs_device *camera );
static void     cb_need_data( GstAppSrc *appsrc, guint unused_size, gpointer user_data );

//=======================================================================================
// The main entry into the application.
// argv[] will contain the IP address of the machine running QGroundControl.
//=======================================================================================
gint main( gint argc, gchar *argv[] )
{
    // The app requires a valid IP address of the machine running QGroundControl.
    if( argc < 2 )
    {
        printf( "Inform address as first parameter.\n" );
        exit( EXIT_FAILURE );
    }

    char str_pipeline[ 200 ];       // Holds the pipeline description
    rs_device  *camera   = NULL;    // The R200 camera
    GMainLoop  *loop     = NULL;    // Main app loop keeps the app alive
    GstElement *pipeline = NULL;    // GStreamer pipeline for data flow
    GstElement *appsrc   = NULL;    // Used to inject buffers into the pipeline
    GstCaps    *app_caps = NULL;    // Defines the capabilities of the appsrc element
    GError     *error    = NULL;    // Holds the error message, if one is generated
    GstAppSrcCallbacks cbs;         // Callback functions/signals for appsrc

    // Initialize GStreamer.
    gst_init( &argc, &argv );

    // Create the main application loop.
    loop = g_main_loop_new( NULL, FALSE );

    // Build the pipeline description string.
    snprintf( str_pipeline, sizeof( str_pipeline ),
              "appsrc name=mysource ! videoconvert ! "
              "video/x-raw,width=640,height=480,format=NV12 ! vaapih264enc ! h264parse ! rtph264pay ! "
              "udpsink host=%s port=5600", argv[ 1 ] );

    // Instruct GStreamer to construct the pipeline, then get the beginning element, appsrc.
    pipeline = gst_parse_launch( str_pipeline, &error );
    if( !pipeline )
    {
        g_print( "Parse error: %s\n", error->message );
        return 1;
    }
    appsrc = gst_bin_get_by_name( GST_BIN( pipeline ), "mysource" );

    // Create a caps (capabilities) struct that gets fed into the appsrc structure.
    app_caps = gst_caps_new_simple( "video/x-raw",
                                    "format", G_TYPE_STRING, "RGB",
                                    "width",  G_TYPE_INT,    WIDTH,
                                    "height", G_TYPE_INT,    HEIGHT, NULL );

    // Specify the capabilities of the appsrc.
    gst_app_src_set_caps( GST_APP_SRC( appsrc ), app_caps );

    // The caps struct is no longer needed; unref it so the memory can be released.
    gst_caps_unref( app_caps );

    // Set a few properties on the appsrc element.
    g_object_set( G_OBJECT( appsrc ), "is-live", TRUE, "format", GST_FORMAT_TIME, NULL );

    // GStreamer is all set up, so initialize the camera and start the streaming.
    init_realsense( &camera );
    init_streams( camera );

    // Play.
    gst_element_set_state( pipeline, GST_STATE_PLAYING );

    // Connect the signals.
    cbs.need_data = cb_need_data;

    // Apply the callbacks to the appsrc element / connect the signals.
    // In other words, cb_need_data will constantly be called
    // to pull data from the R200. Why? Because it needs data. =)
    gst_app_src_set_callbacks( GST_APP_SRC_CAST( appsrc ), &cbs, camera, NULL );

    // Run the loop to keep the app alive rather than falling out and exiting.
    g_main_loop_run( loop );

    // Clean up. (app_caps was already unreffed above, so it is not unreffed again here.)
    gst_element_set_state( pipeline, GST_STATE_NULL );
    gst_object_unref( GST_OBJECT( pipeline ) );
    gst_object_unref( GST_OBJECT( appsrc ) );
    g_main_loop_unref( loop );

    return 0;
}

//=============================================================================
// Prints LibRealSense error messages.
//=============================================================================
static void print_realsense_error( rs_error *err )
{
    printf( "rs_error was raised when calling %s( %s ):\n",
            rs_get_failed_function( err ), rs_get_failed_args( err ) );
    printf( "    %s\n", rs_get_error_message( err ) );
}

//=============================================================================
// Initializes the R200 camera.
// Returns true if successful.
//=============================================================================
static gboolean init_realsense( rs_device **camera )
{
    gboolean ret = true;
    rs_error *e  = 0;

    // Create the context and make sure there are no errors.
    rs_context *ctx = rs_create_context( RS_API_VERSION, &e );
    if( e )
    {
        print_realsense_error( e );
        ret = false;
    }

    if( ret )
    {
        // Create the actual camera device and check for errors.
        *camera = rs_get_device( ctx, 0, &e );
        if( !camera || e )
        {
            print_realsense_error( e );
            ret = false;
        }
    }

    return ret;
}

//=============================================================================
// Initializes the RealSense RGB and depth streams and starts the camera.
// Returns true if successful.
//=============================================================================
static gboolean init_streams( rs_device *camera )
{
    gboolean ret = true;
    rs_error *e  = 0;

    // Configure all streams to run at VGA resolution at 60 frames per second.
    rs_enable_stream( camera, RS_STREAM_DEPTH, WIDTH, HEIGHT, RS_FORMAT_Z16, 60, &e );
    if( e )
    {
        print_realsense_error( e );
        ret = false;
    }

    if( ret )
    {
        rs_enable_stream( camera, RS_STREAM_COLOR, WIDTH, HEIGHT, RS_FORMAT_RGB8, 60, &e );
        if( e )
        {
            print_realsense_error( e );
            ret = false;
        }
    }

    if( ret && camera )
    {
        rs_start_device( camera, &e );
        if( e )
        {
            print_realsense_error( e );
            ret = false;
        }
    }

    return ret;
}

//=======================================================================================
// Frame by frame, create our own depth image by taking the RGB data
// and modifying the red channel's intensity value.
//=======================================================================================
static void cb_need_data( GstAppSrc *appsrc, guint unused_size, gpointer user_data )
{
    GstFlowReturn ret;

    // Get the camera and wait for it to process the frame.
    rs_device *camera = ( rs_device* ) user_data;
    rs_wait_for_frames( camera, NULL );

    // Pull the depth and RGB data from the LibRealSense streams.
    uint16_t   *depth = ( uint16_t* )   rs_get_frame_data( camera, RS_STREAM_DEPTH, NULL );
    struct rgb *rgb   = ( struct rgb* ) rs_get_frame_data( camera, RS_STREAM_COLOR, NULL );

    // Create a new buffer that wraps the RGB frame so it can be sent to the GStreamer pipeline.
    GstBuffer *buffer = gst_buffer_new_wrapped_full( ( GstMemoryFlags )0, rgb, SIZE, 0, SIZE, NULL, NULL );

    // Merge the depth data onto the RGB data. Basically, we create our own
    // depth image here by varying the red intensity.
    for( int i = 0, end = WIDTH * HEIGHT; i < end; ++i )
    {
        rgb[ i ].r  = depth[ i ] * 60 / ONE_METER;
        rgb[ i ].g /= 4;
        rgb[ i ].b /= 4;
    }

    // Push the buffer back out to appsrc so GStreamer can send it down the pipeline and out over Wi-Fi.
    g_signal_emit_by_name( appsrc, "push-buffer", buffer, &ret );
}
My Workflow
A quick note about my workflow. I developed on my Ubuntu machine using Eclipse Neon and compiled the app there to make sure there were no compilation errors. I then transferred the files over to Aero using shell scripts, compiled the application on Aero, and ran it for testing.
Some Initial Thoughts
Again, I want to mention that this document is not intended to teach GStreamer; rather, it is meant to highlight a real, working sample application. It only touches the surface of how you can construct streams in GStreamer.
If you read my other article on streaming RGB data from Aero, you will notice that this example is a lot more complex and begins to show off more capabilities of GStreamer as well as making use of the Intel LibRealSense library.
In the simple RGB example, GStreamer automatically connected to the Intel RealSense camera (R200) through the v4l2src source GstElement; this example, however, supplies the pipeline source in a completely different manner. The source code shows how you can inject R200 RGB and depth data into the GStreamer pipeline by using the LibRealSense library.
Let’s dig in!
gint main(gint argc, gchar *argv[])
We start off by making a few constants that define the size of the window we are planning on targeting. Next is a structure that will be used for holding the RGB data coming back from the R200 and a few function declarations.
We start off by ensuring that an IP address has been supplied as an input parameter. In a real-world application, it may be desirable to parse the input string to ensure it is a well-formed IP address; for the sake of simplicity, I'm not worrying about that here. This is the IP address of a client computer running QGroundControl, and that client computer MUST be attached to the Aero's Wi-Fi network for this to work.
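If you do want that extra check, here is a minimal sketch of one way to do it with inet_pton(); the helper is my own addition for illustration and is not part of the original AeroStream sample.

#include <arpa/inet.h>

/* Sketch only: validate argv[ 1 ] before building the pipeline (not in the original sample). */
static int is_valid_ipv4( const char *str )
{
    struct in_addr addr;
    /* inet_pton() returns 1 only for a well-formed dotted-quad IPv4 address. */
    return inet_pton( AF_INET, str, &addr ) == 1;
}

/* Example use inside main(), right after the argc check:
 *     if( !is_valid_ipv4( argv[ 1 ] ) )
 *     {
 *         printf( "'%s' is not a valid IPv4 address.\n", argv[ 1 ] );
 *         exit( EXIT_FAILURE );
 *     }
 */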
Next, we declare some variables. The pipeline will be populated with all the GstElements needed to run our sample code. The GMainLoop is not a GStreamer construct; rather, it is part of the GNOME project, and it keeps the application alive instead of letting it fall through to the end of the code. I won't go into detail here; the comments in the code explain what each variable is for.
GStreamer must be initialized, and that is what gst_init does: it takes the same input parameters as main and initializes GStreamer. We also create the GMainLoop here, which we start running later, after all the other setup.
Now we build our GStreamer pipeline description string. The gst_parse_launch command parses the GStreamer string; behind the scenes it analyzes all the elements in the string and constructs the GstElements along with the other pieces GStreamer needs. After checking that there are no errors, we go on to configure the appsrc element and, later, set the pipeline's state to playing.
Remember… This code runs on Aero. You are sending this data FROM Aero to a client machine somewhere on the same network.
I want to bring attention to a couple of critical aspects of the GStreamer string:
appsrc name=mysource
This tells the GStreamer pipeline where it will get its data. If you read my other article, Intel® Aero Compute Board and Intel® RealSense™ Technology for Wi-Fi Streaming of RGB Data, you will see that the source there was specified with v4l2src device=/dev/video13. We are not doing that this time. If you recall, we want to inject the stream ourselves using the functionality of the LibRealSense library, which can give us depth data. So essentially, we are specifying the source of the stream as an appsrc element that we feed at runtime.
udpsink host=%s port=5600
This is telling GStreamer to use UDP and send the video stream via Wi-Fi to a particular IP address and port. Remember, the IP address is coming in via command-line parameter. There’s no reason not to include the port number on the command line as well if you wish.
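If you do pass the port on the command line, a minimal sketch of that variation could look like the following; the optional second argument and its fallback value are my additions, not part of the original sample.

/* Sketch: take the UDP port from an optional second argument, defaulting to 5600 (assumption). */
const char *port = ( argc > 2 ) ? argv[ 2 ] : "5600";
snprintf( str_pipeline, sizeof( str_pipeline ),
          "appsrc name=mysource ! videoconvert ! "
          "video/x-raw,width=640,height=480,format=NV12 ! vaapih264enc ! h264parse ! rtph264pay ! "
          "udpsink host=%s port=%s", argv[ 1 ], port );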
After the pipeline has been parsed and created, we move on to setting up the appsrc element. We didn’t manually construct this ourselves; rather, we let GStreamer construct it for us behind the scenes and we need to go get it. This is done by the call to gst_bin_get_by_name(…). Once we access the real structure we can start modifying it.
We need to specify the capabilities of the appsrc element: the video format, width, and height that we require. After we have created the GstCaps structure, we assign it to the appsrc element.
Each GstElement also has properties that can be set; this is done using g_object_set(…). This function is not part of GStreamer; rather, it is part of the GNOME project, which GStreamer relies on heavily. So, let's set the is-live and format properties.
OK, now we initialize the Intel RealSense camera (R200) and start the RGB and depth streams. I'll talk about those in their respective functions later. Notice that I'm doing this BEFORE telling the pipeline to start playing. If I don't, I get unexpected results; in my case, AeroStream would simply crash. At this point Intel RealSense is streaming, so we can go ahead and start the pipeline.
You should notice something here. Pay attention to the order of things. You may wonder why I didn't take care of all the appsrc setup BEFORE starting RealSense and the pipeline. The simple answer is: it crashes if I do. RealSense had to be running BEFORE setting the pipeline to playing, and the pipeline had to be set to playing BEFORE setting the callbacks on appsrc. I'm not an expert on GStreamer, and I don't know how the guts of the pipelines run behind the scenes. However, I do know that the order is important; if things are not in this order, AeroStream crashes. Your mileage may vary!
A word about the callbacks. The appsrc element uses callbacks for event notifications. In this case, cb_need_data is fired when the appsrc queue is empty and we need to pull more data from Intel RealSense. The GstAppSrcCallbacks structure contains three callback functions we can set; we are only setting one of them. After setting the callback, we attach the callbacks to the appsrc element by calling gst_app_src_set_callbacks.
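For reference, here is a sketch of what filling in all three callbacks could look like. Only need_data is used in this sample; the enough_data and seek_data handlers below are placeholders of mine for illustration.

/* Illustrative placeholders only; AeroStream sets need_data and nothing else. */
static void cb_enough_data( GstAppSrc *src, gpointer user_data )
{
    /* Fired when the internal queue is full; a producer could pause here. */
}

static gboolean cb_seek_data( GstAppSrc *src, guint64 offset, gpointer user_data )
{
    /* Only meaningful for seekable streams; a live camera feed is not. */
    return FALSE;
}

GstAppSrcCallbacks cbs = { 0 };
cbs.need_data   = cb_need_data;
cbs.enough_data = cb_enough_data;
cbs.seek_data   = cb_seek_data;
gst_app_src_set_callbacks( GST_APP_SRC_CAST( appsrc ), &cbs, camera, NULL );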
Now we need to get the loop running so we don't fall out of the app. While this loop is running, the app stays alive and GStreamer does its processing: pulling data from the camera, processing the data, and sending it out over Wi-Fi.
When the app ends, we do some simple clean up.
NOTE: I think it's important to mention that while I work on my Ubuntu machine and compile the source code there, the code still has to be compiled on Aero.
static void print_realsense_error( rs_error *err )
This is a convenience function to display some LibRealSense error messages.
static gboolean init_realsense( rs_device **camera )
This function initializes the Intel RealSense camera (R200). The library uses the concept of a context, which we create here and then check for errors; if an error appears, we print it out.
If we have a good context, we attempt to create the camera device by calling the rs_get_device function and again check for errors. If everything is good, the variable ret will still contain a value of true.
static gboolean init_streams( rs_device *camera )
This function accepts an rs_device, which I'm calling a camera, because that's what it really is. The first thing we try to do is get the depth stream running by calling rs_enable_stream with depth-type parameters. Again, a little error checking.
If we are still good to go, we attempt to generate an RGB stream using the same function, but this time specifying color stream type parameters and again, doing some error checking.
If we are STILL good to go, we start the streaming via rs_start_device and…you guessed it, check for an error. If everything is good, the variable ret will still contain a value of true.
static void cb_need_data ( GstAppSrc *appsrc, guint unused_size, gpointer user_data )
OK, we are at the last function to discuss, and it's an interesting one: the meat-and-potatoes function.
The GstFlowReturn is actually an enum. It will contain a value that we could have done something with but didn’t in this case. It gets used by the g_signal_emit_by_name function.
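If you did want to act on it, a minimal sketch (my addition, not in the original sample) could check the value after the push and log anything other than GST_FLOW_OK:

/* Sketch: react to the flow return after the push (not part of the original sample). */
g_signal_emit_by_name( appsrc, "push-buffer", buffer, &ret );
if( ret != GST_FLOW_OK )
{
    g_printerr( "push-buffer returned %s; the pipeline is probably flushing or shutting down.\n",
                gst_flow_get_name( ret ) );
}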
We need to typecast user_data into an rs_device so that we can use it with LibRealSense. The first thing that happens is a call to rs_wait_for_frames; in other words, we block until LibRealSense has processed the current frame and has data ready for us.
After that is done, we get both the depth and RGB data. We pass the RGB data into gst_buffer_new_wrapped_full, which wraps it in a GstBuffer large enough to hold the current image.
Next, we jump into a for loop that iterates over every pixel of the image to generate a composite image, one that carries both RGB and depth information at the same time by using the red channel to represent depth. Based on the depth of a given pixel, a calculation alters the red channel's value accordingly.
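To make the mapping concrete, here is a small sketch of the same per-pixel calculation pulled into a helper, with a clamp added so that very distant pixels don't overflow the 8-bit red channel; the helper and the clamp are my additions, not part of the original loop.

/* Sketch: map an R200 depth value (~999 units per meter) to a red intensity.
 * The clamp to 255 is an addition for illustration; the original loop does not clamp. */
static unsigned char depth_to_red( uint16_t depth_units )
{
    unsigned int red = ( unsigned int )depth_units * 60 / ONE_METER;   /* ~60 per meter */
    return ( red > 255 ) ? 255 : ( unsigned char )red;
}

/* Example: a pixel about one meter away (depth ~999) maps to red ~60;
 * a pixel about four meters away maps to red ~240. */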
Once that is done, the newly created image is then pushed out of appsrc onto the application’s GStreamer pipeline to be processed and sent out via Wi-Fi.
Intel® Aero Compute Board Setup
At this point, if you’re following along in order, you have a project set up in your IDE. You’ve compiled the source code. The next step is to get your board connected.
The following images show you how to connect your board.
Now you can power up your board. Once the board is fully powered up, it will automatically start a Wi-Fi access point. The next step helps walk you through setting up connectivity on Ubuntu.
Connecting Wirelessly to the Intel Aero Compute Board
Once your board has been powered up, you can now connect to it via Wi-Fi. In Ubuntu 16.04, you will see a derivative of CR_AP-xxxxx. This is the network connection you will be connecting to.
The password is 1234567890.
Troubleshooting
If you do not see this network connection, and provided you have hooked up a keyboard and monitor to your Intel Aero Compute Board, run the following command on the board:
sh-4.3# lspci
This shows you a list of PCI devices. Check for the following device:
01:00.0 Network controller: Intel Corporation Wireless 8260 (rev3a)
If you do not see this connection, do a “warm” boot:
sh-4.3# reboot
Wait for the Intel Aero Compute Board to reboot. You should now see the network controller if you run lspci a second time. Attempt once again to connect via the wireless network settings in Ubuntu.
At times, I have seen an error message in Ubuntu saying:
Out of range
If you get this error, try the following:
- Make sure there are no other active network connections; if there are, disconnect from them.
- Reboot Ubuntu.
More on the Intel Aero Compute Board Wi-Fi functionality can be found at the Intel Aero Meta Wiki.
Useful Bash Shell Scripts
Now that you’ve got the code compiled on Ubuntu, it’s time to move it over to the Intel Aero Compute Board. Remember that even though you might compile on your Ubuntu machine, you will still need to compile on the Intel Aero Compute Board as well. What I found was that if I skip this step, Yocto gives me an error saying that AeroStream is not a program.
To help expedite productivity, I’ve created a couple of small shell scripts. They aren’t necessary or required; I just got tired of typing the same things over and over.
migrateAero
First, it should be obvious that you must have a Wi-Fi connection to the Intel Aero Compute Board for this script to run.
This script runs from your Ubuntu machine. I keep it at the root of my Eclipse working folder. After I’ve made changes to the AeroStream project, I run this to migrate files over to the Intel Aero Compute Board. Technically, I don’t need to push the ‘makeAero’ script every time. But because I never know when I might change it, I always copy it over.
#!/bin/bash
# Clean up these files or the files won't get compiled on the Aero board.
# At least this is what I've found to be the case.
rm AeroStream/Debug/src/AeroStream*
rm AeroStream/Debug/AeroStream
# Now push the entire AeroStream Eclipse Neon project to the Aero board.
# This will create the folder /home/AeroStream on the Aero board.
scp -r AeroStream root@192.168.1.1:/home
# The makeAero script essentially runs make and executes AeroStream.
scp makeAero root@192.168.1.1:/home
makeAero
This script runs on the Intel Aero Compute Board itself. It gets migrated with the project and ends up at the root of /home. All it does is navigate into the Debug directory, run the makefile, and then launch AeroStream.
#!/bin/bash
# Created a shortcut script because I'm tired of typing this in every time I migrate.
cd /home/AeroStream/Debug
make
./AeroStream
Instead of pushing the entire project over, you could create your own makefile(s) and push only the source code; however, this approach worked for me.
Also, you don’t even need to create a project on Ubuntu using Eclipse. Instead, if you feel confident enough you can just develop right there on the board itself.
How to Configure QGroundControl
There is one last step to complete: configuring QGroundControl. Downloading and installing QGroundControl is out of the scope of this document; however, I do need to show you how to set up QGroundControl to receive the GStreamer stream from the Intel Aero Compute Board over Wi-Fi.
Note that QGroundControl also uses GStreamer for its video streaming capabilities; this is how the connection is actually made. GStreamer can send a stream out over Wi-Fi from one machine and listen for that stream on another, and that is how QGroundControl receives the video.
NOTE: Make sure you are using the SAME port that you have configured in your GStreamer pipeline.
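If you want to sanity-check the stream without QGroundControl, you could build a small receiver of your own. The sketch below is an assumption about one workable receive-side pipeline (the element choices such as avdec_h264 and autovideosink are mine, not something QGroundControl documents); it listens on the same UDP port as the sender.

/* Sketch of a stand-alone receiver for the stream (assumption; QGroundControl's internal
 * pipeline may differ). It listens on UDP port 5600, matching the sender's udpsink. */
#include <gst/gst.h>

int main( int argc, char *argv[] )
{
    gst_init( &argc, &argv );

    GError *error = NULL;
    GstElement *rx = gst_parse_launch(
        "udpsrc port=5600 caps=\"application/x-rtp, media=video, encoding-name=H264, payload=96\" ! "
        "rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! autovideosink",
        &error );
    if( !rx )
    {
        g_printerr( "Parse error: %s\n", error->message );
        return 1;
    }

    GMainLoop *loop = g_main_loop_new( NULL, FALSE );
    gst_element_set_state( rx, GST_STATE_PLAYING );
    g_main_loop_run( loop );
    return 0;
}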
Step 1
When you launch QGroundControl, it opens into flight path mode. You need to click the QGroundControl icon to get to the configuration area.
Step 2
Click the Comm Links button. This displays the Comm Links configuration page.
Click Add.
Step 3
This displays the Create New Link Configuration page.
- Give the configuration a name. Any name is OK.
- For Type, select UDP.
- Select the Listening Port number. This port number must match the port used in the GStreamer pipeline.
- Click OK.
Step 4
You will now see the new Comm Link in QGroundControl.
Launch the App
NOTE: QGroundControl MUST be running first. It has to be put into listening mode. If you launch your streaming server application first, the connection will not be made. This is just an artifact of GStreamer.
- Launch QGroundControl.
- Launch AeroStream from the Intel Aero Compute Board. If everything has gone according to plan, you will see your video stream show up in QGroundControl.
Intel® Aero Compute Board and GitHub*
Visit the Intel Aero GitHub for various software code bases to keep your Aero up to date:
https://GitHub.com/intel-aero/meta-intel-aero/wiki
Other Resources
For an overview of Intel® aerial technology, visit:
http://www.intel.com/content/www/us/en/technology-innovation/aerial-technology-overview.html
Summary
This article helped get you up and running with streaming capabilities with the Intel Aero Compute Board. I gave you an overview of the board itself and showed you how to connect to it. I also showed you which libraries are needed, how I set up Eclipse for my own project, and how to get Wi-Fi up, transfer files, and set up QGroundControl. At this point you are ready to explore other capabilities of the board and streaming.
Winner of the Intel® Edison Developer Challenge
In partnership with HW Academy, Intel launched the Intel® Edison Developer Challenge (an online competition) in the UK earlier this summer. Contestants from around the world, ambitious developers looking to create the next innovative Intel® Edison IoT solution, submitted entries for projects that combined the Intel® Edison board with sensors, cloud technology, and a lot of imagination to build a new working IoT prototype. Online applications closed in late September, and the winning project was announced on December 6, 2016.
And the winner is… UK-based developer Numaan Chaudhry, who was awarded £1000 for his winning IoT prototype, a Robust Remote Solar Monitoring solution.
Numaan's winning project combined the Intel® Edison board, sensors, the IBM Bluemix* cloud platform, and developer know-how to create a solar monitoring solution intended to monitor the health and usage of solar energy systems deployed in some of the most rural areas of the world.
Solar power is an inexpensive way to deliver electricity to rural communities not served by an electric grid. However, the maintenance and repair of solar deployment sites in rural areas presents a challenge. Often, there may be no reliable communication (a dependable way to transmit and receive data) between a solar site and a central location (e.g., headquarters of an organization responsible for maintaining and serving those sites).
With a robust solution for transmitting data from a solar deployment site to a central location, headquarters can be promptly notified of failures and can determine their root cause (here, sensor data is sent to a cloud platform for conversion, analysis, and pattern recognition).
To learn more and for complete steps on how to build the project, check out the Robust Remote Solar Monitoring Instructables* page here.
Issue Classification
The Issue Classification dashboard provides an in-depth way to find similar issues and possible duplicates by grouping similar items in a cluster relationship. Once grouped, issues in the clusters can be further investigated.
The bubble chart below displays clusters in a given time range. Items in a cluster range from somewhat similar to mutually similar; the degree of similarity is indicated by the color of the bubble.
The selected cluster above includes two issues, #smb-148 and #smb-152. Click an issue to reveal how it is related to other issues. As shown above, the dashboard returns the following information for issue #smb-148:
- the similar issue (smb-152)
- matching attributes (matches) or terms that are common to both smb-148 and smb-152
- a rank, which indicates how relevant smb-152 is to smb-148
- a score, which indicates how relevant smb-152 is to the cluster
Group Details shows supporting data that issues in a cluster share which makes them similar. As you click a different cluster, items in Group Details change.
Clusters in the example are based on a Similarity Score Cutoff of 0.33, which indicates a more general relationship among issues. Adjust the Similarity Score Cutoff to reveal isolated clusters. A lower cutoff score (0.1) will show relationships among all items on a general level, which might not be meaningful.
All information can be saved to a CSV file by clicking Export CSV.
Overview of Intel Protected File System Library Using Software Guard Extensions
Intel® Protected File System Library
- The Intel® Protected File System Library is a new feature introduced in the Intel SGX 1.7 release. This library is used to create, operate on, and delete files from inside an enclave. To make use of the Intel Protected File System Library, we need the following:
- Visual Studio 2015
- Intel SGX SDK version 1.7
- Intel SGX PSW version 1.7
- The above requirements are essential for implementing the Intel SGX Protected File System. In this document we discuss the architecture, APIs, implementation, and limitations of the Intel Protected File System Library.
Overview of Intel® Protected File System Library:
- Intel® Protected File System Library provides protected files API for Intel® SGX enclaves. It supports a basic subset of the regular C file API and enables you to create files and work with them as you would normally do from a regular application.
- Intel SGX provides 15 file-operation APIs, which work almost the same as the regular C file API.
- With this API, the files are encrypted and saved on the untrusted disk during a write operation, and they are verified for confidentiality and integrity during a read operation.
- To encrypt a file, you should provide a file encryption key. This is a 128-bit key that is used as a key derivation key to generate multiple encryption keys.
- The key derivation key used as an input to one of the key derivation functions can be generated by an approved cryptographic random bit generator or by an approved automated key establishment process. Another option is to use automatic keys derived from the enclave sealing key.
- This way we can keep our files safe and secure: they are always stored encrypted, and they can only be created and read through the enclave.
Intel Protected File System API:
The Intel Protected File System Library provides the following functionalities.
- sgx_fopen
- sgx_fopen_auto_key
- sgx_fclose
- sgx_fread
- sgx_fwrite
- sgx_fflush
- sgx_ftell
- sgx_fseek
- sgx_feof
- sgx_ferror
- sgx_clearerr
- sgx_remove
- sgx_fexport_auto_key
- sgx_fimport_auto_key
- sgx_fclear_cache
The above-mentioned APIs are part of the SGX Protected FS trusted library and can be called only from trusted enclave code, which is what keeps our files secure.
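As a hedged illustration of how these calls fit together inside enclave code, the sketch below writes a small secret to a protected file and reads it back using the automatic-key variants. The file name, buffer contents, and wrapper function are placeholders of mine, error handling is kept to a minimum, and the sgx_tprotected_fs.h header is assumed to be the one shipped with the SDK's trusted library.

/* Sketch (enclave-side code): write a secret to a protected file and read it back.
 * "secret.dat" and the buffer contents are illustrative placeholders. */
#include "sgx_tprotected_fs.h"
#include <string.h>

int protected_file_demo( void )
{
    const char data[] = "enclave secret";
    char readback[ sizeof( data ) ] = { 0 };

    /* Create/open with an automatically derived key (based on the enclave sealing key). */
    SGX_FILE *fp = sgx_fopen_auto_key( "secret.dat", "w" );
    if( fp == NULL )
        return -1;
    sgx_fwrite( data, 1, sizeof( data ), fp );
    sgx_fclose( fp );

    /* Reopen for reading; confidentiality and integrity are verified on read. */
    fp = sgx_fopen_auto_key( "secret.dat", "r" );
    if( fp == NULL )
        return -1;
    sgx_fread( readback, 1, sizeof( readback ), fp );
    sgx_fclose( fp );

    return memcmp( data, readback, sizeof( data ) ) == 0 ? 0 : -1;
}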
Limitation of Protected File System
- Protected files have metadata embedded in them. Because of this, only one file handle can be opened for writing at a time, although many file handles can be open for reading.
- The operating system's protection mechanism is used to guard against accidentally opening more than one 'write' file handle. If this protection is bypassed, the file will get corrupted.
- An open file handle can be used by many threads inside the same enclave; the APIs include internal locks to handle this, and the operations are executed one at a time.
Detailed information can be found in the PDF, and I have also shared sample code for the Intel Protected File System Library using SGX.
Installing Intel® Parallel Studio XE Runtime 2017 Using APT Repository
Intel® Parallel Studio XE Runtime includes everything you need to run applications built with Intel® Parallel Studio XE. The packages are available, through an APT package manager, free of charge, to users who already have applications enabled with Intel® Parallel Studio XE.
Setting up the Repository
Install the GPG key for the repository
Download the public key from http://apt.repos.intel.com/2017/GPG-PUB-KEY-INTEL-PSXE-RUNTIME-2017 and save it to a file. Then add this key to the system keyring:
sudo apt-key add <PATH_TO_DOWNLOADED_GPG_KEY>
Add APT Repository
cd /etc/apt/sources.list.d
sudo vi intel-psxe-runtime-2017.list
Append the following lines:
deb http://apt.repos.intel.com/2017 intel-psxe-runtime main
deb [arch=all] http://apt.repos.intel.com/2017 intel-psxe-runtime main
Update:
sudo apt-get update
Intel® Parallel Studio XE Runtime versions available in the repository
For additional versions see: Intel® Parallel Studio XE Runtime by Version
Installing the runtime packages using the APT Package Manager
The following variables are used in the installation commands:
<VERSION>: 2017, 2018, ...
<UPDATE>: 0, 1, 2, ...
<COMPONENT>: a component name from the list of available components below
Component | <COMPONENT>
All components | intel-psxe-runtime
C/C++ Compiler | intel-icc-runtime
Fortran Compiler | intel-ifort-runtime
IPP | intel-ipp-runtime
MKL | intel-mkl-runtime
DAAL | intel-daal-runtime
TBB | intel-tbb-runtime
MPI | intel-mpi-runtime
How do I install the latest version?
- Installation of full runtime package
sudo apt-get install intel-psxe-runtime
- Installation of particular component from a full runtime package
sudo apt-get install <COMPONENT>
Example:
sudo apt-get install intel-ipp-runtime
How do I install a particular version?
- To install a particular version of a full runtime package
sudo aptitude install intel-psxe-runtime=<VERSION>.<UPDATE>-<BUILD_NUM>
Example:
sudo aptitude install intel-psxe-runtime=2017.1-132
- To install a particular version of a particular component
- Get the list of all available component versions
apt-cache madison <COMPONENT>
Example:
apt-cache madison intel-ipp-runtime
- Use the desired component version in the command
sudo aptitude install <COMPONENT>=<VERSION>.<UPDATE>-<BUILD_NUM>
Example:
sudo aptitude install intel-ipp-runtime=2017.1-132
How do I upgrade the installed packages?
- Upgrade a full runtime package to the latest version
sudo apt-get install intel-psxe-runtime
- Upgrade a particular component to the latest version
sudo apt-get install <COMPONENT>
Example:
sudo apt-get install intel-ipp-runtime
How do I uninstall a particular version?
- Uninstall a full runtime package
sudo apt-get purge --auto-remove intel-psxe-runtime
- Uninstall a particular component
sudo apt-get purge --auto-remove <COMPONENT>
Example:
sudo apt-get purge --auto-remove intel-ipp-runtime
How do I install a 32-bit runtime package on a 64-bit system?
- Install a 32-bit package of a full runtime
- Add i386 architecture
sudo dpkg --add-architecture i386
- Update available packages list
sudo apt-get update
- Install the 32-bit runtime package
sudo apt-get install intel-psxe-runtime:i386
- Install 32-bit version of particular component:
sudo dpkg --add-architecture i386
sudo apt-get update
sudo apt-get install <COMPONENT>:i386
Example:
sudo apt-get install intel-ipp-runtime:i386
By downloading Intel® Parallel Studio XE Runtime you agree to the terms and conditions stated in the End-User License Agreement (EULA).
Intel® Parallel Studio XE Runtime location on a local system
/opt/intel/psxe_runtime_<VERSION>.<UPDATE>.<BUILD_NUM>
Links to the last installed runtime package:
/opt/intel/psxe_runtime_<VERSION>
/opt/intel/psxe_runtime
Have Questions?
Check out the FAQ
Or ask in our User Forums