Channel: Intel Developer Zone Articles

Smarter Security Camera: a POC using the Intel® IoT Gateway


Intro

The Internet of Things is enabling our lives in new and interesting ways, but with it comes the challenge of analyzing and bringing meaning to all the continuously generated data. One IoT trend in the home is the rise of security cameras: not just one or two, but multiple cameras around the house and in each room to monitor activity. Saving images or movie files from all of them creates massive amounts of data. Taking one house as an example, its 12 cameras generate around 5 GB per day, totaling over 180,000 images. That is far too much data to look through manually. Some cameras have built-in motion sensors that capture images only when a change is detected, and while this helps reduce the noise, light changes, pets, fans, and things moving in the wind will still be picked up and have to be sorted through. To monitor only for what is actually wanted, which for the purposes of this paper is people and faces, OpenCV presents a promising solution. OpenCV already has a number of pre-defined algorithms to search images for faces, people, and objects, and it can also be trained to recognize new ones.

This article is a proof of concept to explore quickly prototyping an analytics solution at the edge using the Intel IoT Gateway computing power to create a Smarter Security Camera.


Figure 1: Analyzed image from webcam with OpenCV detection markers

Set-up

It all starts with a Logitech C270 webcam with HD 720p resolution. This webcam plugs into the USB port of the Intel Edison, which turns it into an IP webcam streaming the video to a website. Using the webcam with the Intel Edison makes it easy to duplicate the camera "sensor" and propagate it to different locations around a home. The Intel IoT Gateway then captures images from the stream and uses OpenCV to analyze them. If the algorithms detect a face or a person in view, the gateway uploads the image to Twitter*.


Figure 2: Intel Edison and Webcam setup


Figure 3: Intel® IoT Gateway

Capturing the image

The webcam must be UVC compliant to ensure that it is compatible with the Intel Edison's USB drivers; in this case the Logitech C270 webcam is used. For a list of UVC-compliant devices, see http://www.ideasonboard.org/uvc/#devices. To use the USB slot, the Intel Edison's micro switch must be toggled up towards the USB slot. Note that this disables the micro-USB port next to it, and with it Ethernet, power over micro-USB (the external power supply must now be plugged in instead), and Arduino sketch uploads. Also connect the Intel Edison to the gateway's Wi-Fi hotspot so the gateway can reach the webcam.

To ensure the USB webcam is working, type the following into a serial connection.

ls -l /dev/video0

A line similar to this one should appear:

crw-rw---- 1 root video 81, 0 May  6 22:36 /dev/video0

Otherwise, this line will appear, indicating that the camera was not found:

ls: cannot access /dev/video0: No such file or directory
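The same check can be done programmatically before starting the stream. This is a minimal Python sketch of the `ls -l /dev/video0` test above; the default path assumes the webcam enumerates as /dev/video0.

```python
import os
import stat

def webcam_present(device="/dev/video0"):
    """Return True if the device node exists and is a character device,
    as a V4L2 webcam should be."""
    try:
        mode = os.stat(device).st_mode
    except OSError:
        # Device node does not exist: camera not found
        return False
    return stat.S_ISCHR(mode)
```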

In the early stages of the project, the Intel Edison used the FFmpeg library to capture an image and then send it over MQTT to the gateway. This method had some drawbacks: each image took a few seconds just to be saved, which was far too slow for practical application. To combat this and make images available to the gateway on-demand, the setup switched to having the Intel Edison continuously stream a feed that the gateway could capture from at any time. This was done using the mjpg-streamer library; to install it on the Intel Edison:

Append the following repository entries to /etc/opkg/base-feeds.conf:

echo "src/gz all http://repo.opkg.net/edison/repo/all
src/gz edison http://repo.opkg.net/edison/repo/edison
src/gz core2-32 http://repo.opkg.net/edison/repo/core2-32">> /etc/opkg/base-feeds.conf

Update the repository index:

opkg update

And install:

opkg install mjpg-streamer

To start the stream (-f sets the frame rate, -r the resolution, -p the HTTP port, and -w the web root):

mjpg_streamer -i "input_uvc.so -n -f 30 -r 800x600" -o "output_http.so -p 8080 -w ./www"

It was decided to use the compressed MJPEG format to keep the frame rate high. However, the YUV format is uncompressed, which leaves more detail for OpenCV, so experiment with the tradeoffs.

To view the stream while on the same Wi-Fi network, visit http://localhost:8080/?action=stream; a still image of the feed can also be viewed at http://localhost:8080/?action=snapshot, where localhost is replaced by the IP address of the Intel Edison on the gateway's Wi-Fi. On the gateway side, the flow sends an HTTP request to the snapshot URL and then saves the image to disk.
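The gateway-side capture step can be sketched in a few lines of Python. EDISON_IP and the output path here are assumptions; substitute the Intel Edison's actual address on the gateway's Wi-Fi and wherever the flow expects the image.

```python
import shutil
import urllib.request

EDISON_IP = "192.168.1.100"  # hypothetical Edison address

def snapshot_url(host, port=8080):
    """Build the mjpg-streamer still-image URL for a given host."""
    return "http://{}:{}/?action=snapshot".format(host, port)

def save_snapshot(host=EDISON_IP, path="incoming.jpg", timeout=5):
    """Fetch one frame from the stream and write it to disk;
    raises on HTTP or network errors."""
    with urllib.request.urlopen(snapshot_url(host), timeout=timeout) as resp:
        with open(path, "wb") as out:
            shutil.copyfileobj(resp, out)
```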

 

Gateway

The brains of the whole security camera are on the gateway. OpenCV was installed into a virtual Python environment to keep it clean and segmented and to avoid interfering with the system Python and its packages. Basic install instructions for OpenCV on Linux can be found here: http://docs.opencv.org/2.4/doc/tutorials/introduction/linux_install/linux_install.html. These instructions need to be modified to install OpenCV and its dependencies on the Intel Wind River Gateway.

GCC, Git, and python2.7-dev are already installed.

Install CMake 2.6 or higher:

wget http://www.cmake.org/files/v3.2/cmake-3.2.2.tar.gz
tar xf cmake-3.2.2.tar.gz
cd cmake-3.2.2
./configure
make
make install

As the Wind River Linux environment has no apt-get command, it can quickly become a challenge to install the needed development packages. An easy way around this is to first install them on another 64-bit Linux machine (running Ubuntu in this case) and then manually copy the files to the gateway. Package file lists can be found on the Ubuntu site: http://packages.ubuntu.com/. For example, for the libtiff4-dev package, files in /usr/include/<file> should go to the same location on the gateway, and files in /usr/lib/x86_64-linux-gnu/<file> should go into /usr/lib/<file>. The full list of files can be found here: http://packages.ubuntu.com/precise/amd64/libtiff4-dev/filelist. Install and copy the files over for the packages listed below.

sudo apt-get install  libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install libjpeg8-dev libpng12-dev libtiff4-dev libjasper-dev  libv4l-dev
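The header/library remapping described above can be sketched as a small copy helper, assuming the Ubuntu files have been gathered into a staging directory. The function and its paths are illustrative, not part of any Intel tooling.

```python
import os
import shutil

def stage_package_files(src_root, dest_root):
    """Copy staged dev-package files into the gateway layout:
    /usr/include files keep their path, while the multiarch
    /usr/lib/x86_64-linux-gnu directory is remapped to /usr/lib.
    Returns the relative destination paths that were copied."""
    copied = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            # Remap the Ubuntu multiarch lib dir to the gateway's flat /usr/lib
            rel = rel.replace("usr/lib/x86_64-linux-gnu", "usr/lib")
            dest = os.path.join(dest_root, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(rel)
    return copied
```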

Install pip; this will help install a number of other dependencies.

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

Install virtualenv; this will create a separate environment for OpenCV.

pip install virtualenv virtualenvwrapper

Once the virtualenv has been installed, create one called “cv.”

export WORKON_HOME=$HOME/.virtualenvs
mkvirtualenv cv

Note that all the following steps are done while the "cv" environment is activated. Once "cv" has been created, the environment is activated automatically in the current session. This can be seen at the beginning of the command prompt, e.g. (cv) root@WR-IDP-NAME. For future sessions it can be activated with the following command:

. ~/.virtualenvs/cv/bin/activate

And similarly deactivated (do not deactivate it yet):

deactivate

Install numpy:

pip install numpy

Get the OpenCV Source Code:

cd ~
git clone https://github.com/Itseez/opencv.git
cd opencv
git checkout 3.0.0

And make it:

mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D INSTALL_C_EXAMPLES=ON \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
-D BUILD_EXAMPLES=ON \
-D PYTHON_INCLUDE_DIR=/usr/include/python2.7/ \
-D PYTHON_INCLUDE_DIR2=/usr/include/python2.7 \
-D PYTHON_LIBRARY=/usr/lib64/libpython2.7.so \
-D PYTHON_PACKAGES_PATH=/usr/lib64/python2.7/site-packages/ \
-D BUILD_NEW_PYTHON_SUPPORT=ON \
-D PYTHON2_LIBRARY=/usr/lib64/libpython2.7.so \
-D BUILD_opencv_python3=OFF \
-D BUILD_opencv_python2=ON ..

It may be the case that the cv2.so file is not created. If so, build OpenCV on the host Linux machine as well and copy the resulting file over to /usr/lib64/python2.7/site-packages.


Figure 4: Webcam capture of people outside with OpenCV detection markers

To quickly create a program and connect a large number of capabilities and services together, as in this project, Node-RED* was used. Node-RED is a rapid prototyping tool that lets the user visually wire together hardware devices, APIs, and various services. It also comes pre-installed on the gateway; just make sure to update to the latest version.


Figure 5: Node-RED Flow

Once a message is injected at the "Start" node, the flow loops continuously, repeating after each image is processed or an error is encountered. A few nodes of note are the http request node, the python script exec node, and the function node that builds the tweet. The "Repeat" node visually simplifies the repeat flow into one node instead of pointing all three flows back to the beginning.

The "http request" node sends a GET message to the Intel Edison IP webcam's snapshot URL. If the request succeeds, the flow saves the image; otherwise it tweets an error message about the webcam.


Figure 6: Node-RED http GET request node details

To run the python script, create an “exec” node (it will be in the advanced section in Node-RED) with the command “/root/.virtualenvs/cv/bin/python2.7 /root/PeopleDetection.py”. This allows the script to run in the virtual python environment where OpenCV is installed.


Figure 7: Node-RED exec node details

The Python script itself is fairly simple: it checks the image for people using the HOG algorithm and then looks for faces using the haarcascade_frontalface_alt classifier that comes installed with OpenCV. It also saves out an image with boxes drawn around found people and faces. The code below is by no means optimized for our proof of concept, beyond tweaking some of the algorithm inputs to suit our purposes. There is the option of scaling the image down before analyzing it to reduce processing time; see the commented line in the code below for how to do that. It takes the gateway roughly 0.33 seconds to process an image. For comparison, an Intel Edison takes around 10 seconds to process the same image. Depending on where the camera is located and how far or close people are expected to be, the OpenCV algorithm parameters may need to change to better fit the situation.

import cv2
import datetime

def draw_detections(img, rects, rects2, thickness = 2):
  # Draw green boxes around detected people and blue boxes around faces
  for x, y, w, h in rects:
    pad_w, pad_h = int(0.15*w), int(0.05*h)
    cv2.rectangle(img, (x+pad_w, y+pad_h), (x+w-pad_w, y+h-pad_h), (0, 255, 0), thickness)
    print("Person Detected")
  for (x, y, w, h) in rects2:
    cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), thickness)
    print("Face Detected")

total = datetime.datetime.now()

img = cv2.imread('/root/incoming.jpg')
# Optional resize of the image to make processing faster
#img = cv2.resize(img, (0,0), fx=0.5, fy=0.5)

# People detection: HOG descriptor with the default people-detector SVM
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
peopleFound, weights = hog.detectMultiScale(img, winStride=(8,8), padding=(16,16), scale=1.3)

# Face detection: Haar cascade classifier shipped with OpenCV
faceCascade = cv2.CascadeClassifier('/root/haarcascade_frontalface_alt.xml')
facesFound = faceCascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5, minSize=(30,30), flags=cv2.CASCADE_SCALE_IMAGE)

draw_detections(img, peopleFound, facesFound)

cv2.imwrite('/root/out_faceandpeople.jpg', img)

print("[INFO] total took: {}s".format(
 (datetime.datetime.now() - total).total_seconds()))

To send an image to Twitter, the tweet is constructed in a function node using the msg.media as the image variable and the msg.payload as the tweet string.


Figure 8: Node-RED function message node details

And of course, the system needs to be able to take pictures on demand as well. Node-RED monitors the same Twitter feed for posts that contain "spy" or "Spy" and posts a current picture to Twitter in response. So posting a tweet with the word "spy" in it triggers the gateway to take a picture.


Figure 9: Node-RED flow for taking pictures on demand
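The trigger check in that monitoring flow can be sketched as a tiny helper. This is a hypothetical illustration, not the actual Node-RED logic: it matches the two literal variants "spy" and "Spy" described above rather than doing a full case-insensitive match.

```python
def is_spy_request(tweet_text):
    """Return True when a monitored tweet should trigger
    an on-demand snapshot."""
    return "spy" in tweet_text or "Spy" in tweet_text
```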

Summary

This concludes the proof of concept for a smarter security camera that computes at the edge on the gateway. The Wind River Linux gateway comes with a number of tools pre-installed, ready for quick prototyping. From here the project can be further optimized, made more robust with security features, and even expanded, for example to turn on smart lighting in rooms when a person is detected.

 

About the author

Whitney Foster is a software engineer at Intel in the Software Solutions Group working on scale enabling projects for Internet of Things.

 

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries.

 

*Other names and brands may be claimed as the property of others

© 2016 Intel Corporation.

 


Intel® XDK FAQs - General


How can I get started with Intel XDK?

There are plenty of videos and articles that you can go through here to get started. You could also start with some of our demo apps. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

Having prior understanding of how to program using HTML, CSS and JavaScript* is crucial to using the Intel XDK. The Intel XDK is primarily a tool for visualizing, debugging and building an app package for distribution.

You can do the following to access our demo apps:

  • Select Project tab
  • Select "Start a New Project"
  • Select "Samples and Demos"
  • Create a new project from a demo

If you have specific questions following that, please post it to our forums.

How do I convert my web app or web site into a mobile app?

The Intel XDK creates Cordova mobile apps (aka PhoneGap apps). Cordova web apps are driven by HTML5 code (HTML, CSS and JavaScript). There is no web server on the mobile device to "serve" the HTML pages in your Cordova web app; the main program resources required by your Cordova web app are file-based, meaning all of your web app resources are located within the mobile app package and reside on the mobile device. Your app may also require resources from a server. In that case, you will need to connect with that server using AJAX or similar techniques, usually via a collection of RESTful APIs provided by that server. However, your app is not integrated into that server; the two entities are independent and separate.

Many web developers believe they should be able to include PHP or Java code or other "server-based" code as an integral part of their Cordova app, just as they do in a "dynamic web app." This technique does not work in a Cordova web app, because your app does not reside on a server and there is no "backend"; your Cordova web app is a "front-end" HTML5 web app that runs independently of any servers. See the following articles for more information on how to move from writing "multi-page dynamic web apps" to "single-page Cordova web apps":

Can I use an external editor for development in Intel® XDK?

Yes, you can open your files and edit them in your favorite editor. However, note that you must use Brackets* to use the "Live Layout Editing" feature. Also, if you are using App Designer (the UI layout tool in Intel XDK) it will make many automatic changes to your index.html file, so it is best not to edit that file externally at the same time you have App Designer open.

Some popular editors among our users include:

  • Sublime Text* (Refer to this article for information on the Intel XDK plugin for Sublime Text*)
  • Notepad++* for a lightweight editor
  • Jetbrains* editors (Webstorm*)
  • Vim* the editor

How do I get code refactoring capability in Brackets* (the Intel XDK code editor)?

...to be written...

Why doesn’t my app show up in Google* play for tablets?

...to be written...

What is the global-settings.xdk file and how do I locate it?

global-settings.xdk contains information about all your projects in the Intel XDK, along with many of the settings related to panels under each tab (Emulate, Debug, etc.). For example, you can set the emulator to auto-refresh or no-auto-refresh. Modify this file at your own risk, and always keep a backup of the original!

You can locate global-settings.xdk here:

  • Mac OS X*
    ~/Library/Application Support/XDK/global-settings.xdk
  • Microsoft Windows*
    %LocalAppData%\XDK
  • Linux*
    ~/.config/XDK/global-settings.xdk

If you are having trouble locating this file, you can search for it on your system using something like the following:

  • Windows:
    > cd /
    > dir /s global-settings.xdk
  • Mac and Linux:
    $ sudo find / -name global-settings.xdk
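The per-OS locations listed above can be sketched as a small helper. The mapping simply mirrors the paths in this FAQ and is an illustration, not an official Intel XDK API.

```python
import os
import sys

def global_settings_path(platform=None):
    """Return the expected global-settings.xdk location for a platform
    string ('darwin', 'win32', or anything else treated as Linux)."""
    platform = platform or sys.platform
    home = os.path.expanduser("~")
    if platform == "darwin":
        return os.path.join(home, "Library", "Application Support",
                            "XDK", "global-settings.xdk")
    if platform.startswith("win"):
        base = os.environ.get("LOCALAPPDATA", "")
        return os.path.join(base, "XDK", "global-settings.xdk")
    return os.path.join(home, ".config", "XDK", "global-settings.xdk")
```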

When do I use the intelxdk.js, xhr.js and cordova.js libraries?

The intelxdk.js and xhr.js libraries were only required for use with the Intel XDK legacy build tiles (which have been retired). The cordova.js library is needed for all Cordova builds. When building with the Cordova tiles, any references to intelxdk.js and xhr.js libraries in your index.html file are ignored.

How do I get my Android (and Crosswalk) keystore file?

New with release 3088 of the Intel XDK, you may now download your build certificates (aka keystore) using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Convert a Legacy Android Certificate" in that document, for details regarding how to do this.

It may also help to review this short, quick overview video (there is no audio) that shows how you convert your existing "legacy" certificates to the "new" format that allows you to directly manage your certificates using the certificate management tool that is built into the Intel XDK. This conversion process is done only once.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I rename my project that is a duplicate of an existing project?

See this FAQ: How do I make a copy of an existing Intel XDK project?

How do I recover when the Intel XDK hangs or won't start?

  • If you are running Intel XDK on Windows* it must be Windows* 7 or higher. It will not run reliably on earlier versions.
  • Delete the "project-name.xdk" file from the project directory that Intel XDK is trying to open when it starts (it will try to open the project that was open during your last session), then try starting Intel XDK. You will have to "import" your project into Intel XDK again. Importing merely creates the "project-name.xdk" file in your project directory and adds that project to the "global-settings.xdk" file.
  • Rename the project directory Intel XDK is trying to open when it starts. Create a new project based on one of the demo apps. Test Intel XDK using that demo app. If everything works, restart Intel XDK and try it again. If it still works, rename your problem project folder back to its original name and open Intel XDK again (it should now open the sample project you previously opened). You may have to re-select your problem project (Intel XDK should have forgotten that project during the previous session).
  • Clear Intel XDK's program cache directories and files.

    On a Windows machine this can be done using the following on a standard command prompt (administrator is not required):

    > cd %AppData%\..\Local\XDK
    > del *.* /s/q

    To locate the "XDK cache" directory on [OS X*] and [Linux*] systems, do the following:

    $ sudo find / -name global-settings.xdk
    $ cd <dir found above>
    $ sudo rm -rf *

    You might want to save a copy of the "global-settings.xdk" file before you delete that cache directory and copy it back before you restart Intel XDK. Doing so will save you the effort of rebuilding your list of projects. Please refer to this question for information on how to locate the global-settings.xdk file.
  • If you save the "global-settings.xdk" file and restored it in the step above and you're still having hang troubles, try deleting the directories and files above, along with the "global-settings.xdk" file and try it again.
  • Do not store your project directories on a network share (Intel XDK currently has issues with network shares that have not yet been resolved). This includes folders shared between a Virtual machine (VM) guest and its host machine (for example, if you are running Windows* in a VM running on a Mac* host). This network share issue is a known issue with a fix request in place.
  • There have also been issues with running behind a corporate network proxy or firewall. To check them try running Intel XDK from your home network where, presumably, you have a simple NAT router and no proxy or firewall. If things work correctly there then your corporate firewall or proxy may be the source of the problem.
  • Issues with Intel XDK account logins can also cause Intel XDK to hang. To confirm that your login is working correctly, go to the Intel XDK App Center and confirm that you can login with your Intel XDK account. While you are there you might also try deleting the offending project(s) from the App Center.

If you can reliably reproduce the problem, please send us a copy of the "xdk.log" file that is stored in the same directory as the "global-settings.xdk" file to html5tools@intel.com.

Is Intel XDK an open source project? How can I contribute to the Intel XDK community?

No, it is not an open source project. However, it utilizes many open source components that are then assembled into Intel XDK. While you cannot contribute directly to the Intel XDK integration effort, you can contribute to the many open source components that make up Intel XDK.

The following open source components are the major elements that are being used by Intel XDK:

  • Node-Webkit
  • Chromium
  • Ripple* emulator
  • Brackets* editor
  • Weinre* remote debugger
  • Crosswalk*
  • Cordova*
  • App Framework*

How do I configure Intel XDK to use 9 patch png for Android* apps splash screen?

Intel XDK does support the use of 9 patch png for Android* apps splash screen. You can read more at https://software.intel.com/en-us/xdk/articles/android-splash-screens-using-nine-patch-png about how to create a 9 patch png image; that article also links to an Intel XDK sample using 9 patch png images.

How do I stop AVG from popping up the "General Behavioral Detection" window when Intel XDK is launched?

You can try adding nw.exe as the app that needs an exception in AVG.

What do I specify for "App ID" in Intel XDK under Build Settings?

Your app ID uniquely identifies your app. For example, it can be used to identify your app within Apple’s application services allowing you to use things like in-app purchasing and push notifications.

Here are some useful articles on how to create an App ID:

Is it possible to modify the Android Manifest or iOS plist file with the Intel XDK?

You cannot modify the AndroidManifest.xml file directly with our build system, as it only exists in the cloud. However, you may do so by creating a dummy plugin that only contains a plugin.xml file containing directives that can be used to add lines to the AndroidManifest.xml file during the build process. In essence, you add lines to the AndroidManifest.xml file via a local plugin.xml file. Here is an example of a plugin that does just that:

<?xml version="1.0" encoding="UTF-8"?><plugin xmlns="http://apache.org/cordova/ns/plugins/1.0" id="my-custom-intents-plugin" version="1.0.0"><name>My Custom Intents Plugin</name><description>Add Intents to the AndroidManifest.xml</description><license>MIT</license><engines><engine name="cordova" version=">=3.0.0" /></engines><!-- android --><platform name="android"><config-file target="AndroidManifest.xml" parent="/manifest/application"><activity android:configChanges="orientation|keyboardHidden|keyboard|screenSize|locale" android:label="@string/app_name" android:launchMode="singleTop" android:name="testa" android:theme="@android:style/Theme.Black.NoTitleBar"><intent-filter><action android:name="android.intent.action.SEND" /><category android:name="android.intent.category.DEFAULT" /><data android:mimeType="*/*" /></intent-filter></activity></config-file></platform></plugin>

You can inspect the AndroidManifest.xml created in an APK, using apktool with the following command line:

$ apktool d my-app.apk
$ cd my-app
$ more AndroidManifest.xml

This technique exploits the config-file element that is described in the Cordova Plugin Specification docs and can also be used to add lines to iOS plist files. See the Cordova plugin documentation link for additional details.

Here is an example of such a plugin for modifying the iOS plist file, specifically for adding a BIS key to the plist file:

<?xml version="1.0" encoding="UTF-8"?><plugin
    xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-bis-plugin"
    version="0.0.2"><name>My Custom BIS Plugin</name><description>Add BIS info to iOS plist file.</description><license>BSD-3</license><preference name="BIS_KEY" /><engines><engine name="cordova" version=">=3.0.0" /></engines><!-- ios --><platform name="ios"><config-file target="*-Info.plist" parent="CFBundleURLTypes"><array><dict><key>ITSAppUsesNonExemptEncryption</key><true/><key>ITSEncryptionExportComplianceCode</key><string>$BIS_KEY</string></dict></array></config-file></platform></plugin>

How can I share my Intel XDK app build?

You can send a link to your project via an email invite from your project settings page. However, a login to your account is required to access the file behind the link. Alternatively, you can download the build from the build page, onto your workstation, and push that built image to some location from which you can send a link to that image.

Why does my iOS build fail when I am able to test it successfully on a device and the emulator?

Common reasons include:

  • The App ID specified in the project settings does not match the one you specified in Apple's developer portal.
  • The provisioning profile does not match the cert you uploaded. Double check with Apple's developer site that you are using the correct and current distribution cert and that the provisioning profile is still active. Download the provisioning profile again and add it to your project to confirm.
  • In Project Build Settings, your App Name is invalid. It should be modified to include only alphanumeric characters and spaces.

How do I add multiple domains in Domain Access?

Here is the primary doc source for that feature.

If you need to insert multiple domain references, then you will need to add the extra references in the intelxdk.config.additions.xml file. This StackOverflow entry provides a basic idea and you can see the intelxdk.config.*.xml files that are automatically generated with each build for the <access origin="xxx" /> line that is generated based on what you provide in the "Domain Access" field of the "Build Settings" panel on the Project Tab.
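For illustration only, the extra references added to intelxdk.config.additions.xml would look something like the following; the origin values here are placeholders, not real domains.

```xml
<!-- intelxdk.config.additions.xml: extra domain references (placeholder origins) -->
<access origin="https://api.example.com" />
<access origin="https://cdn.example.net" />
```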

How do I build more than one app using the same Apple developer account?

On Apple developer, create a distribution certificate using the "iOS* Certificate Signing Request" key downloaded from Intel XDK Build tab only for the first app. For subsequent apps, reuse the same certificate and import this certificate into the Build tab like you usually would.

How do I include search and spotlight icons as part of my app?

Please refer to this article in the Intel XDK documentation. Create an intelxdk.config.additions.xml file in your top level directory (same location as the other intelxdk.*.config.xml files) and add the following lines to support icons in Settings and other areas in iOS*.

<!-- Spotlight Icon -->
<icon platform="ios" src="res/ios/icon-40.png" width="40" height="40" />
<icon platform="ios" src="res/ios/icon-40@2x.png" width="80" height="80" />
<icon platform="ios" src="res/ios/icon-40@3x.png" width="120" height="120" />
<!-- iPhone Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-small.png" width="29" height="29" />
<icon platform="ios" src="res/ios/icon-small@2x.png" width="58" height="58" />
<icon platform="ios" src="res/ios/icon-small@3x.png" width="87" height="87" />
<!-- iPad Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-50.png" width="50" height="50" />
<icon platform="ios" src="res/ios/icon-50@2x.png" width="100" height="100" />

For more information related to these configurations, visit http://cordova.apache.org/docs/en/3.5.0/config_ref_images.md.html#Icons%20and%20Splash%20Screens.

For accurate information related to iOS icon sizes, visit https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHIG/IconMatrix.html

NOTE: The iPhone 6 icons will only be available if iOS* 7 or 8 is the target.

Cordova iOS* 8 support JIRA tracker: https://issues.apache.org/jira/browse/CB-7043

Does Intel XDK support Modbus TCP communication?

No, since Modbus is a specialized protocol, you need to write either some JavaScript* or native code (in the form of a plugin) to handle the Modbus transactions and protocol.

How do I sign an Android* app using an existing keystore?

New with release 3088 of the Intel XDK, you may now import your existing keystore into Intel XDK using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Import an Android Certificate Keystore" in that document, for details regarding how to do this.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I build separately for different Android* versions?

Under the Projects Panel, you can select the Target Android* version under the Build Settings collapsible panel. You can change this value and build your application multiple times to create numerous versions of your application that are targeted for multiple versions of Android*.

How do I display the 'Build App Now' button if my display language is not English?

If your display language is not English and the 'Build App Now' button is proving to be troublesome, you may change your display language to English, which can be downloaded via a Windows* update. Once you have installed the English language pack, go to Control Panel > Clock, Language and Region > Region and Language > Change Display Language.

How do I update my Intel XDK version?

When an Intel XDK update is available, an Update Version dialog box lets you download the update. After the download completes, a similar dialog lets you install it. If you did not download or install an update when prompted (or on older versions), click the package icon next to the orange (?) icon in the upper-right to download or install the update. The installation removes the previous Intel XDK version.

How do I import my existing HTML5 app into the Intel XDK?

If your project contains an Intel XDK project file (<project-name>.xdk) you should use the "Open an Intel XDK Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round green "eject" icon, on the Projects tab). This would be the case if you copied an existing Intel XDK project from another system or used a tool that exported a complete Intel XDK project.

If your project does not contain an Intel XDK project file (<project-name>.xdk) you must "import" your code into a new Intel XDK project. To import your project, use the "Start a New Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round blue "plus" icon, on the Projects tab). This will open the "Samples, Demos and Templates" page, which includes an option to "Import Your HTML5 Code Base." Point to the root directory of your project. The Intel XDK will attempt to locate a file named index.html in your project and will set the "Source Directory" on the Projects tab to point to the directory that contains this file.

If your imported project did not contain an index.html file, your project may be unstable. In that case, it is best to delete the imported project from the Intel XDK Projects tab ("x" icon in the upper right corner of the screen), rename your "root" or "main" html file to index.html and import the project again. Several components in the Intel XDK depend on the assumption that the main HTML file in your project is named index.html. See Introducing Intel® XDK Development Tools for more details.

It is highly recommended that your "source directory" be located as a sub-directory inside your "project directory." This ensures that non-source files are not included as part of your build package when building your application. If the "source directory" and "project directory" are the same, the result is longer upload times to the build server and unnecessarily large application executable files returned by the build system. See the following images for the recommended project file layout.

I am unable to login to App Preview with my Intel XDK password.

On some devices you may have trouble entering your Intel XDK login password directly on the device in the App Preview login screen. In particular, sometimes you may have trouble with the first one or two letters getting lost when entering your password.

Try the following if you are having such difficulties:

  • Reset your password, using the Intel XDK, to something short and simple.

  • Confirm that this new short and simple password works with the XDK (logout and login to the Intel XDK).

  • Confirm that this new password works with the Intel Developer Zone login.

  • Make sure you have the most recent version of Intel App Preview installed on your devices. Go to the store on each device to confirm you have the most recent copy of App Preview installed.

  • Try logging into Intel App Preview on each device with this short and simple password. Check the "show password" box so you can see your password as you type it.

If the above works, it confirms that you can log into your Intel XDK account from App Preview (because App Preview and the Intel XDK go to the same place to authenticate your login). When the above works, you can go back to the Intel XDK and reset your password to something else, if you do not like the short and simple password you used for the test.

If you are having trouble logging into any pages on the Intel web site (including the Intel XDK forum), please see the Intel Sign In FAQ for suggestions and contact info. That login system is the backend for the Intel XDK login screen.

How do I completely uninstall the Intel XDK from my system?

Take the following steps to completely uninstall the XDK from your Windows system:

  • From the Windows Control Panel, remove the Intel XDK, using the Windows uninstall tool.

  • Then:
    > cd %LocalAppData%\Intel\XDK
    > del *.* /s/q

  • Then:
    > cd %LocalAppData%\XDK
    > copy global-settings.xdk %UserProfile%
    > del *.* /s/q
    > copy %UserProfile%\global-settings.xdk .

  • Then:
    -- Go to xdk.intel.com and select the download link.
    -- Download and install the new XDK.

To do the same on a Linux or Mac system:

  • On a Linux machine, run the uninstall script, typically /opt/intel/XDK/uninstall.sh.
     
  • Remove the directory into which the Intel XDK was installed.
    -- Typically /opt/intel or your home (~) directory on a Linux machine.
    -- Typically in the /Applications/Intel XDK.app directory on a Mac.
     
  • Then:
    $ find ~ -name global-settings.xdk
    $ cd <result-from-above> (for example ~/Library/Application Support/XDK/ on a Mac)
    $ cp global-settings.xdk ~
    $ rm -Rf *
    $ mv ~/global-settings.xdk .

     
  • Then:
    -- Go to xdk.intel.com and select the download link.
    -- Download and install the new XDK.

Is there a tool that can help me highlight syntax issues in Intel XDK?

Yes, you can use the various linting tools that can be added to the Brackets editor to review any syntax issues in your HTML, CSS and JS files. Go to the "File > Extension Manager..." menu item and add the following extensions: JSHint, CSSLint, HTMLHint, XLint for Intel XDK. Then, review your source files by monitoring the small yellow triangle at the bottom of the edit window (a green check mark indicates no issues).

How do I delete built apps and test apps from the Intel XDK build servers?

You can manage them by logging into: https://appcenter.html5tools-software.intel.com/csd/controlpanel.aspx. This functionality will eventually be available within Intel XDK after which access to app center will be removed.

I need help with the App Security API plugin; where do I find it?

Visit the primary documentation book for the App Security API and see this forum post for some additional details.

When I install my app or use the Debug tab Avast antivirus flags a possible virus, why?

If you are receiving a "Suspicious file detected - APK:CloudRep [Susp]" message from the Avast anti-virus tool installed on your Android device, it is because you are side-loading the app (or the Intel XDK debug modules) onto your device (using a download link after building, or by using the Debug tab to debug your app), or because your app has been installed from an "untrusted" Android store. See the following official explanation from Avast:

Your application was flagged by our cloud reputation system. "Cloud rep" is a new feature of Avast Mobile Security, which flags apks when the following conditions are met:

  1. The file is not prevalent enough; meaning not enough users of Avast Mobile Security have installed your APK.
  2. The source is not an established market (Google Play is an example of an established market).

If you distribute your app using Google Play (or any other trusted market) your users should not see any warning from Avast.

Following are some of the Avast anti-virus notification screens you might see on your device. All of these are perfectly normal. They appear because you must enable the installation of "non-market" apps in order to use your device for debug, and because the App IDs associated with your never-published app (or with the custom debug modules that the Debug tab in the Intel XDK builds and installs on your device) will not be found in an "established" (aka "trusted") market, such as Google Play.

If you choose to ignore the "Suspicious app activity!" threat you will not receive a threat for that debug module any longer. It will show up in the Avast 'ignored issues' list. Updates to an existing, ignored, custom debug module should continue to be ignored by Avast. However, new custom debug modules (due to a new project App ID or a new version of Crosswalk selected in your project's Build Settings) will result in a new warning from the Avast anti-virus tool.

How do I add a Brackets extension to the editor that is part of the Intel XDK?

The number of Brackets extensions provided in the built-in edition of the Brackets editor is limited, to ensure stability of the Intel XDK product. Not all extensions are compatible with the edition of Brackets that is embedded within the Intel XDK. Adding incompatible extensions can cause the Intel XDK to stop working.

Despite this warning, there are useful extensions that have not been included in the editor and which can be added to the Intel XDK. Adding them is temporary: each time you update the Intel XDK (or if you reinstall it) you will have to re-add your Brackets extensions. To add a Brackets extension, use the following procedure:

  • exit the Intel XDK
  • download a ZIP file of the extension you wish to add
  • on Windows, unzip the extension here:
    %LocalAppData%\Intel\XDK\xdk\brackets\b\extensions\dev
  • on Mac OS X, unzip the extension here:
    /Applications/Intel\ XDK.app/Contents/Resources/app.nw/brackets/b/extensions/dev
  • start the Intel XDK

Note that the locations given above are subject to change with new releases of the Intel XDK.

Why does my app or game require so many permissions on Android when built with the Intel XDK?

When you build your HTML5 app using the Intel XDK for Android or Android-Crosswalk you are creating a Cordova app. It may seem like you're not building a Cordova app, but you are. In order to package your app so it can be distributed via an Android store and installed on an Android device, it needs to be built as a hybrid app. The Intel XDK uses Cordova to create that hybrid app.

A pure Cordova app requires the NETWORK permission; it's needed to "jump" between your HTML5 environment and the native Android environment. Additional permissions will be added by any Cordova plugins you include with your application; which permissions are included is a function of what each plugin does and requires.

Crosswalk for Android builds also require the NETWORK permission, because the Crosswalk image built by the Intel XDK includes support for Cordova. In addition, current versions of Crosswalk (12 and 14 at the time this FAQ was written) also require NETWORK STATE and WIFI STATE. There is an extra permission in some versions of Crosswalk (WRITE EXTERNAL STORAGE) that is only needed by the shared model library of Crosswalk; we have asked the Crosswalk project to remove this permission in a future Crosswalk version.

If you are seeing more than the following four permissions in your XDK-built Crosswalk app:

  • android.permission.INTERNET
  • android.permission.ACCESS_NETWORK_STATE
  • android.permission.ACCESS_WIFI_STATE
  • android.permission.WRITE_EXTERNAL_STORAGE

then you are seeing permissions that have been added by some plugins. Each plugin is different, so there is no hard rule of thumb. The two "default" core Cordova plugins that are added by the Intel XDK blank templates (device and splash screen) do not require any Android permissions.

BTW: the permission list above comes from a Crosswalk 14 build. Crosswalk 12 builds do not include the last permission; it was added when the Crosswalk project introduced the shared model library option, which started with Crosswalk 13 (the Intel XDK does not support Crosswalk 13 builds).

How do I make a copy of an existing Intel XDK project?

If you just need to make a backup copy of an existing project, and do not plan to open that backup copy as a project in the Intel XDK, do the following:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)

If you want to use an existing project as the starting point of a new project in the Intel XDK, use the process described below. It ensures that the build system does not confuse the ID in your old project with the one stored in your new project. If you do not follow this procedure you will have multiple projects using the same project ID (a special GUID that is stored inside the Intel XDK <project-name>.xdk file in the root directory of your project). Each project in your account must have a unique project ID.

  • Exit the Intel XDK.
  • Make a copy of your existing project using the process described above.
  • Inside the new project that you made (that is, your new copy of your old project), make copies of the <project-name>.xdk file and <project-name>.xdke files and rename those copies to something like project-new.xdk and project-new.xdke (anything you like, just something different than the original project name, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open your new "project-new.xdk" file (whatever you named it) and find the projectGuid line, it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-0000-000000000000"
  • Save the modified "project-new.xdk" file.
  • Open the Intel XDK.
  • Go to the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-new.xdk" file inside the new project folder you copied above.
  • Don't forget to change the App ID in your new project. This is necessary to avoid conflicts with the project you copied from, in the store and when side-loading onto a device.

My project does not include a www folder. How do I fix it so it includes a www or source directory?

The Intel XDK HTML5 and Cordova project file structures are meant to mimic a standard Cordova project. In a Cordova (or PhoneGap) project there is a subdirectory (or folder) named www that contains all of the HTML5 source code and asset files that make up your application. For best results, it is advised that you follow this convention, of putting your source inside a "source directory" inside of your project folder.

This most commonly happens as the result of exporting a project from an external tool, such as Construct2, or as the result of importing an existing HTML5 web app that you are converting into a hybrid mobile application (e.g., an Intel XDK Cordova app). If you would like to convert an existing Intel XDK project into this format, follow the steps below:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)
  • Create a "www" directory inside the new duplicate project you just created above.
  • Move your index.html and other source and asset files to the "www" directory you just created -- this is now your "source" directory, located inside your "project" directory (do not move the <project-name>.xdk and xdke files and any intelxdk.config.*.xml files, those must stay in the root of the project directory)
  • Inside the new project that you made above (by making a copy of the old project), rename the <project-name>.xdk file and <project-name>.xdke files to something like project-copy.xdk and project-copy.xdke (anything you like, just something different than the original project, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open the new "project-copy.xdk" file (whatever you named it) and find the line named projectGuid, it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-0000-000000000000"
  • A few lines down find: "sourceDirectory": "",
  • Change it to this: "sourceDirectory": "www",
  • Save the modified "project-copy.xdk" file.
  • Open the Intel XDK.
  • Go to the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-copy.xdk" file inside the new project folder you copied above.

Can I install more than one copy of the Intel XDK onto my development system?

Yes, you can install more than one version onto your development system. However, you cannot run multiple instances of the Intel XDK at the same time. Be aware that new releases sometimes change the project file format, so it is a good idea, in these cases, to make a copy of your project if you need to experiment with a different version of the Intel XDK. See the instructions in a FAQ entry above regarding how to make a copy of your Intel XDK project.

Follow the instructions in this forum post to install more than one copy of the Intel XDK onto your development system.

On Apple OS X* and Linux* systems, does the Intel XDK need the OpenSSL* library installed?

Yes. Several features of the Intel XDK require the OpenSSL library, which typically comes pre-installed on Linux and OS X systems. If the Intel XDK reports that it could not find libssl, go to https://www.openssl.org to download and install it.

I have a web application that I would like to distribute in app stores without major modifications. Is this possible using the Intel XDK?

Yes, if you have a true web app or "client app" that only uses HTML, CSS and JavaScript, it is usually not too difficult to convert it to a Cordova hybrid application (this is what the Intel XDK builds when you create an HTML5 app). If you rely heavily on PHP or other server scripting languages embedded in your pages you will have more work to do. Because your Cordova app is not associated with a server, you cannot rely on server-based programming techniques; instead, you must rewrite any such code to use RESTful APIs that your app interacts with using, for example, AJAX calls.
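For example, where a PHP page might have embedded data directly into the HTML it served, a Cordova app has to fetch that data at runtime over HTTP. A hedged sketch follows; the endpoint URL is a placeholder and the function names are illustrative, not part of any real API:

```javascript
// Sketch: fetch data from a REST endpoint instead of relying on server-side
// page generation. The URL below is a placeholder for illustration only.
function parseGreeting(responseText) {
  // pure helper: extract the message field from a JSON response body
  return JSON.parse(responseText).message;
}

function loadGreeting(onDone, onError) {
  var xhr = new XMLHttpRequest(); // available in the Cordova webview
  xhr.open('GET', 'https://api.example.com/greeting'); // hypothetical endpoint
  xhr.onload = function () {
    if (xhr.status === 200) {
      onDone(parseGreeting(xhr.responseText));
    } else {
      onError(new Error('HTTP ' + xhr.status));
    }
  };
  xhr.onerror = function () { onError(new Error('network error')); };
  xhr.send();
}
```

The same pattern applies to form submissions and database lookups: anything the server used to do at page-render time becomes an explicit request from the app to a REST service.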

What is the best training approach to using the Intel XDK for a newbie?

First, become well-versed in the art of client web apps, apps that rely only on HTML, CSS and JavaScript and utilize RESTful APIs to talk to network services. With that you will have mastered 80% of the problem. After that, it is simply a matter of understanding how Cordova plugins are able to extend the JavaScript API for access to features of the platform. For HTML5 training there are many sites providing tutorials. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

What is the best platform to start building an app with the Intel XDK? And what are the important differences between the Android, iOS and other mobile platforms?

There is no one most important difference between the Android, iOS and other platforms. It is important to understand that the HTML5 runtime engine that executes your app on each platform will vary as a function of the platform. Just as there are differences between Chrome and Firefox and Safari and Internet Explorer, there are differences between iOS 9 and iOS 8 and Android 4 and Android 5, etc. Android has the most significant differences between vendors and versions of Android. This is one of the reasons the Intel XDK offers the Crosswalk for Android build option, to normalize and update the Android issues.

In general, if you can get your app working well on Android (or Crosswalk for Android) first you will generally have fewer issues to deal with when you start to work on the iOS and Windows platforms. In addition, the Android platform has the most flexible and useful debug options available, so it is the easiest platform to use for debugging and testing your app.

Is my password encrypted and why is it limited to fifteen characters?

Yes, your password is stored encrypted and is managed by https://signin.intel.com. Your Intel XDK userid and password can also be used to log into the Intel XDK forum as well as the Intel Developer Zone. The Intel XDK does not store or manage your userid and password.

The rules regarding allowed userids and passwords are answered on this Sign In FAQ page, where you can also find help on recovering and changing your password.

Why does the Intel XDK take a long time to start on Linux or Mac?

...and why am I getting this error message? "Attempt to contact authentication server is taking a long time. You can wait, or check your network connection and try again."

At startup, the Intel XDK attempts to automatically determine the proxy settings for your machine. Unfortunately, on some system configurations it is unable to reliably detect your system proxy settings. As an example, you might see something like this image when starting the Intel XDK.

On some systems you can get around this problem by setting some proxy environment variables and then starting the Intel XDK from a command-line that includes those configured environment variables. To set those environment variables, use commands similar to the following:

$ export no_proxy="localhost,127.0.0.1/8,::1"
$ export NO_PROXY="localhost,127.0.0.1/8,::1"
$ export http_proxy=http://proxy.mydomain.com:123/
$ export HTTP_PROXY=http://proxy.mydomain.com:123/
$ export https_proxy=http://proxy.mydomain.com:123/
$ export HTTPS_PROXY=http://proxy.mydomain.com:123/

IMPORTANT! The name of your proxy server and the port (or ports) that your proxy server requires will be different than those shown in the example above. Please consult with your IT department to find out what values are appropriate for your site. Intel has no way of knowing what configuration is appropriate for your network.

If you use the Intel XDK in multiple locations (at work and at home), you may have to change the proxy settings before starting the Intel XDK after switching to a new network location. For example, many work networks use a proxy server, but most home networks do not require such a configuration. In that case, you need to be sure to "unset" the proxy environment variables before starting the Intel XDK on a non-proxy network.

After you have successfully configured your proxy environment variables, you can start the Intel XDK manually, from the command-line.

On a Mac, where the Intel XDK is installed in the default location, type the following (from a terminal window that has the above environment variables set):

$ open /Applications/Intel\ XDK.app/

On a Linux machine, assuming the Intel XDK has been installed in the ~/intel/XDK directory, type the following (from a terminal window that has the above environment variables set):

$ ~/intel/XDK/xdk.sh &

In the Linux case, you will need to adjust the directory name that points to the xdk.sh file in order to start. The example above assumes a local install into the ~/intel/XDK directory. Since Linux installations have more options regarding the installation directory, you will need to adjust the above to suit your particular system and install directory.

How do I generate a P12 file on a Windows machine?

See these articles:

How do I change the default dir for creating new projects in the Intel XDK?

You can change the default new project location manually by modifying a field in the global-settings.xdk file. Locate the global-settings.xdk file on your system (the precise location varies as a function of the OS) and find this JSON object inside that file:

"projects-tab": {
    "defaultPath": "/Users/paul/Documents/XDK",
    "LastSortType": "descending|Name",
    "lastSortType": "descending|Opened",
    "thirdPartyDisclaimerAcked": true
},

The example above came from a Mac. On a Mac the global-settings.xdk file is located in the "~/Library/Application Support/XDK" directory.

On a Windows machine the global-settings.xdk file is normally found in the "%LocalAppData%\XDK" directory. The part you are looking for will look something like this:

"projects-tab": {
    "thirdPartyDisclaimerAcked": false,
    "LastSortType": "descending|Name",
    "lastSortType": "descending|Opened",
    "defaultPath": "C:\\Users\\paul/Documents"
},

Obviously, it's the defaultPath part you want to change.

BE CAREFUL WHEN YOU EDIT THE GLOBAL-SETTINGS.XDK FILE!! You've been warned...

Make sure the result is proper JSON when you are done, or it may cause your XDK to cough and hack loudly. Make a backup copy of global-settings.xdk before you start, just in case.

Where can I find a list of recent and upcoming webinars?

How can I change the email address associated with my Intel XDK login?

Login to the Intel Developer Zone with your Intel XDK account userid and password and then locate your "account dashboard." Click the "pencil icon" next to your name to open the "Personal Profile" section of your account, where you can edit your "Name & Contact Info," including the email address associated with your account, under the "Private" section of your profile.

What network addresses must I enable in my firewall to ensure the Intel XDK will work on my restricted network?

Normally, access to the external servers that the Intel XDK uses is handled automatically by your proxy server. However, if you are working in an environment that has restricted Internet access and your IT department needs a list of the URLs the Intel XDK must reach, provide them with the following list of domain names:

  • appcenter.html5tools-software.intel.com (for communication with the build servers)
  • s3.amazonaws.com (for downloading sample apps and built apps)
  • download.xdk.intel.com (for getting XDK updates)
  • debug-software.intel.com (for using the Test tab weinre debug feature)
  • xdk-feed-proxy.html5tools-software.intel.com (for receiving the tweets in the upper right corner of the XDK)
  • signin.intel.com (for logging into the XDK)
  • sfederation.intel.com (for logging into the XDK)

Normally this should be handled by your network proxy (if you're on a corporate network) or should not be an issue if you are working on a typical home network.

I cannot create a login for the Intel XDK, how do I create a userid and password to use the Intel XDK?

If you have downloaded and installed the Intel XDK but are having trouble creating a login, you can create the login outside the Intel XDK. To do this, go to the Intel Developer Zone and push the "Join Today" button. After you have created your Intel Developer Zone login you can return to the Intel XDK and use that userid and password to login to the Intel XDK. This same userid and password can also be used to login to the Intel XDK forum.

Installing the Intel XDK on Windows fails with a "Package signature verification failed." message.

If you receive a "Package signature verification failed" message (see image below) when installing the Intel XDK on your system, it is likely due to one of the following two reasons:

  • Your system does not have a properly installed "root certificate" file, which is needed to confirm that the install package is good.
  • The install package is corrupt and failed the verification step.

The first case can happen if you are attempting to install the Intel XDK on an unsupported version of Windows. The Intel XDK is only supported on Microsoft Windows 7 and higher. If you attempt to install on Windows Vista (or earlier) you may see this verification error. The workaround is to install the Intel XDK on a Windows 7 or greater machine.

The second case is likely due to a corruption of the install package during download or due to tampering. The workaround is to re-download the install package and attempt another install.

If you are installing on a Windows 7 (or greater) machine and you see this message it is likely due to a missing or bad root certificate on your system. To fix this you may need to start the "Certificate Propagation" service. Open the Windows "services.msc" panel and then start the "Certificate Propagation" service. Additional links related to this problem can be found here > https://technet.microsoft.com/en-us/library/cc754841.aspx

See this forum thread for additional help regarding this issue > https://software.intel.com/en-us/forums/intel-xdk/topic/603992

Troubles installing the Intel XDK on a Linux or Ubuntu system, which option should I choose?

Choose the local user option, not root or sudo, when installing the Intel XDK on your Linux or Ubuntu system. This is the most reliable and trouble-free option and is the default installation option. It ensures that the Intel XDK has all the permissions necessary to execute properly on your Linux system. The Intel XDK will be installed in a subdirectory of your home (~) directory.

Inactive account, login issue, or problem updating an APK in a store: how do I request an account transfer?

As of June 26, 2015 we migrated all of Intel XDK accounts to the more secure intel.com login system (the same login system you use to access this forum).

We have migrated nearly all active users to the new login system. Unfortunately, there are a few active user accounts that we could not automatically migrate to intel.com, primarily because the intel.com login system does not allow the use of some characters in userids that were allowed in the old login system.

If you have not used the Intel XDK for a long time prior to June 2015, your account may not have been automatically migrated. If you own an "inactive" account it will have to be manually migrated -- please try logging into the Intel XDK with your old userid and password, to determine if it no longer works. If you find that you cannot login to your existing Intel XDK account, and still need access to your old account, please send a message to html5tools@intel.com and include your userid and the email address associated with that userid, so we can guide you through the steps required to reactivate your old account.

Alternatively, you can create a new Intel XDK account. If you have submitted an app to the Android store from your old account you will need access to that old account to retrieve the Android signing certificates in order to upgrade that app on the Android store; in that case, send an email to html5tools@intel.com with your old account username and email and new account information.

Connection Problems? -- Intel XDK SSL certificates update

On January 26, 2016 we updated the SSL certificates on our back-end systems to SHA2 certificates. The existing certificates were due to expire in February of 2016. We have also disabled support for obsolete protocols.

If you are experiencing persistent connection issues (since Jan 26, 2016), please post a problem report on the forum and include in your problem report:

  • the operation that failed
  • the version of your XDK
  • the version of your operating system
  • your geographic region
  • and a screen capture

How do I resolve build failure: "libpng error: Not a PNG file"?  

If you are experiencing build failures with CLI 5 Android builds, and the detailed error log includes a message similar to the following:

Execution failed for task ':mergeArmv7ReleaseResources'.
> Error: Failed to run command: /Developer/android-sdk-linux/build-tools/22.0.1/aapt s -i .../platforms/android/res/drawable-land-hdpi/screen.png -o .../platforms/android/build/intermediates/res/armv7/release/drawable-land-hdpi-v4/screen.png

Error Code: 42

Output: libpng error: Not a PNG file

You need to change the format of your icon and/or splash screen images to PNG format.

The error message refers to a file named "screen.png" -- which is what each of your splash screens were renamed to before they were moved into the build project resource directories. Unfortunately, JPG images were supplied for use as splash screen images, not PNG images. So the files were renamed and found by the build system to be invalid.

Convert your splash screen images to PNG format. Renaming JPG images to PNG will not work! You must convert your JPG images into PNG format images using an appropriate image editing tool. The Intel XDK does not provide any such conversion tool.
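As one possible approach (a sketch assuming the ImageMagick convert tool is installed; the file names and glob are placeholders), the conversion can be scripted:

```shell
# Batch-convert every JPG in the current folder to a real PNG file.
# "${f%.jpg}.png" strips the .jpg suffix and appends .png instead.
for f in *.jpg; do
  convert "$f" "${f%.jpg}.png"
done
```

You can then confirm the result with `file splash.png`, which should report "PNG image data" rather than "JPEG image data".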

Beginning with Cordova CLI 5, all icons and splash screen images must be supplied in PNG format. This applies to all supported platforms. This is an undocumented "new feature" of the Cordova CLI 5 build system that was implemented by the Apache Cordova project.

Why do I get a "Parse Error" when I try to install my built APK on my Android device?

Because you have built an "unsigned" Android APK. You must click the "signed" box in the Android Build Settings section of the Projects tab if you want to install an APK on your device. The only reason you would choose to create an "unsigned" APK is if you need to sign it manually. This is very rare and not the normal situation.

My converted legacy keystore does not work. Google Play is rejecting my updated app.

The keystore you converted when you updated to 3088 (now 3240 or later) is the same keystore you were using in 2893. When you upgraded to 3088 (or later) and "converted" your legacy keystore, you re-signed and renamed your legacy keystore and it was transferred into a database to be used with the Intel XDK certificate management tool. It is still the same keystore, but with an alias name and password assigned by you and accessible directly by you through the Intel XDK.

If you kept the converted legacy keystore in your account following the conversion you can download that keystore from the Intel XDK for safe keeping (do not delete it from your account or from your system). Make sure you keep track of the new password(s) you assigned to the converted keystore.

There are two problems we have experienced with converted legacy keystores at the time of the 3088 release (April, 2016):

  • Foreign (non-ASCII) characters used in the new alias name and passwords were being corrupted.
  • Final signing of your APK by the build system was being done with RSA256 rather than SHA1.

Both of the above items have been resolved and should no longer be an issue.

If you are currently unable to complete a build with your converted legacy keystore (i.e., builds fail when you use the converted legacy keystore but they succeed when you use a new keystore) the first bullet above is likely the reason your converted keystore is not working. In that case we can reset your converted keystore and give you the option to convert it again. You do this by requesting that your legacy keystore be "reset" by filling out this form. For 100% surety during that second conversion, use only 7-bit ASCII characters in the alias name you assign and for the password(s) you assign.

IMPORTANT: using the legacy certificate to build your Android app is ONLY necessary if you have already published an app to an Android store and need to update that app. If you have never published an app to an Android store using the legacy certificate you do not need to concern yourself with resetting and reconverting your legacy keystore. It is easier, in that case, to create a new Android keystore and use that new keystore.

If you ARE able to successfully build your app with the converted legacy keystore, but your updated app (in the Google store) does not install on some older Android 4.x devices (typically a subset of Android 4.0-4.2 devices), the second bullet cited above is likely the reason for the problem. The solution, in that case, is to rebuild your app and resubmit it to the store (that problem was a build-system problem that has been resolved).

How can I have others beta test my app using Intel App Preview?

Apps that you sync to your Intel XDK account, using the Test tab's green "Push Files" button, can only be accessed by logging into Intel App Preview with the same Intel XDK account credentials that you used to push the files to the cloud. In other words, you can only download and run your app for testing with Intel App Preview if you log into the same account that you used to upload that test app. This restriction applies to downloading your app into Intel App Preview via the "Server Apps" tab, at the bottom of the Intel App Preview screen, or by scanning the QR code displayed on the Intel XDK Test tab using the camera icon in the upper right corner of Intel App Preview.

If you want to allow others to test your app, using Intel App Preview, it means you must use one of two options:

  • give them your Intel XDK userid and password
  • create an Intel XDK "test account" and provide your testers with that userid and password

For security sake, we highly recommend you use the second option (create an Intel XDK "test account"). 

A "test account" is simply a second Intel XDK account that you do not plan to use for development or builds. Do not use the same email address for your "test account" as you are using for your main development account. You should use a "throw away" email address for that "test account" (an email address that you do not care about).

Assuming you have created an Intel XDK "test account" and have instructed your testers to download and install Intel App Preview; have provided them with your "test account" userid and password; and you are ready to have them test:

  • sign out of your Intel XDK "development account" (using the little "man" icon in the upper right)
  • sign into your "test account" (again, using the little "man" icon in the Intel XDK toolbar)
  • make sure you have selected the project that you want users to test, on the Projects tab
  • goto the Test tab
  • make sure "MOBILE" is selected (upper left of the Test tab)
  • push the green "PUSH FILES" button on the Test tab
  • log out of your "test account"
  • log into your development account

Then, tell your beta testers to log into Intel App Preview with your "test account" credentials and instruct them to choose the "Server Apps" tab at the bottom of the Intel App Preview screen. From there they should see the name of the app you synced using the Test tab and can simply start it by touching the app name (followed by the big blue and white "Launch This App" button). Starting the app this way is actually easier than sending them a copy of the QR code. The QR code is very dense and can be hard to read on some devices, depending on the quality of the camera in the device.

Note that when running your test app inside of Intel App Preview they cannot test any features associated with third-party plugins, only core Cordova plugins. Thus, you need to ensure that those parts of your app that depend on non-core Cordova plugins have been disabled or have exception handlers to prevent your app from either crashing or freezing.

I'm having trouble making Google Maps work with my Intel XDK app. What can I do?

There are many reasons that can cause your attempt to use Google Maps to fail. Mostly it is due to the fact that you need to download the Google Maps API (JavaScript library) at runtime to make things work. However, there is no guarantee that you will have a good network connection, so if you do it the way you are used to doing it, in a browser...

<script src="https://maps.googleapis.com/maps/api/js?key=API_KEY&sensor=true"></script>

...you may get yourself into trouble, in an Intel XDK Cordova app. See Loading Google Maps in Cordova the Right Way for an excellent tutorial on why this is a problem and how to deal with it. Also, it may help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, especially item #3, to get a better understanding of why you shouldn't use the "browser technique" you're familiar with.

An alternative is to use a mapping tool that allows you to include the JavaScript directly in your app, rather than downloading it over the network each time your app starts. Several Intel XDK developers have reported very good luck with the open-source JavaScript library named LeafletJS that uses OpenStreetMap as its map data source.

You can also search the Cordova Plugin Database for Cordova plugins that implement mapping features, in some cases using native SDKs and libraries.

How do I fix "Cannot find the Intel XDK. Make sure your device and intel XDK are on the same wireless network." error messages?

You can either disable your firewall or allow access through the firewall for the Intel XDK. To allow access through the Windows firewall, go to the Windows Control Panel and search for the Firewall (Control Panel > System and Security > Windows Firewall > Allowed Apps) and enable Node Webkit (nw or nw.exe) through the firewall.

See the image below (this image is from a Windows 8.1 system).

Google Services needs my SHA1 fingerprint. Where do I get my app's SHA fingerprint?

Your app's SHA fingerprint is part of your build signing certificate. Specifically, it is part of the signing certificate that you used to build your app. The Intel XDK provides a way to download your build certificates directly from within the Intel XDK application (see the Intel XDK documentation for details on how to manage your build certificates). Once you have downloaded your build certificate you can use these instructions provided by Google, to extract the fingerprint, or simply search the Internet for "extract fingerprint from android build certificate" to find many articles detailing this process.
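For example (a sketch assuming a JDK is installed and that your downloaded certificate is a Java keystore named my-app.keystore, a placeholder name), keytool can print the fingerprint directly:

```shell
# List the certificate details and keep only the SHA1 fingerprint line.
# keytool will prompt for the keystore password.
keytool -list -v -keystore my-app.keystore | grep 'SHA1:'
```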

Why am I unable to test or build or connect to the old build server with Intel XDK version 2893?

This is an Important Note Regarding the use of Intel XDK Versions 2893 and Older!!

As of June 13, 2016, versions of the Intel XDK released prior to March 2016 (2893 and older) can no longer use the Build tab, the Test tab or Intel App Preview; and can no longer create custom debug modules for use with the Debug and Profile tabs. This change was necessary to improve the security and performance of our Intel XDK cloud-based build system. If you are using version 2893 or older, of the Intel XDK, you must upgrade to version 3088 or greater to continue to develop, debug and build Intel XDK Cordova apps.

The error message you see below, "NOTICE: Internet Connection and Login Required," when trying to use the Build tab is due to the fact that the cloud-based component used by those older versions of the Intel XDK has been retired and is no longer present. The error message appears to be misleading, but is the easiest way to identify this condition.

How do I run the Intel XDK on Fedora Linux?

See the instructions below, copied from this forum post:

$ sudo find xdk/install/dir -name libudev.so.0
$ cd dir/found/above
$ sudo rm libudev.so.0
$ sudo ln -s /lib64/libudev.so.1 libudev.so.0

Note the "xdk/install/dir" is the name of the directory where you installed the Intel XDK. This might be "/opt/intel/xdk" or "~/intel/xdk" or something similar. Since the Linux install is flexible regarding the precise installation location you may have to search to find it on your system.

Once you find that libudev.so file in the Intel XDK install directory you must "cd" to that directory to finish the operations as written above.

Additional instructions have been provided in the related forum thread; please see that thread for the latest information regarding hints on how to make the Intel XDK run on a Fedora Linux system.

The Intel XDK generates a path error for my launch icons and splash screen files.

If you have an older project (created prior to August of 2016 using a version of the Intel XDK older than 3491) you may be seeing a build error indicating that some icon and/or splash screen image files cannot be found. This is likely due to the fact that some of your icon and/or splash screen image files are located within your source folder (typically named "www") rather than in the new package-assets folder. For example, inspecting one of the auto-generated intelxdk.config.*.xml files you might find something like the following:

<icon platform="windows" src="images/launchIcon_24.png" width="24" height="24"/>
<icon platform="windows" src="images/launchIcon_434x210.png" width="434" height="210"/>
<icon platform="windows" src="images/launchIcon_744x360.png" width="744" height="360"/>
<icon platform="windows" src="package-assets/ic_launch_50.png" width="50" height="50"/>
<icon platform="windows" src="package-assets/ic_launch_150.png" width="150" height="150"/>
<icon platform="windows" src="package-assets/ic_launch_44.png" width="44" height="44"/>

where the first three images are not being found by the build system because they are located in the "www" folder and the last three are being found, because they are located in the "package-assets" folder.

This problem usually comes about because the UI does not include the appropriate "slots" to hold those images. This results in some "dead" icon or splash screen images inside the <project-name>.xdk file which need to be removed. To fix this, make a backup copy of your <project-name>.xdk file and then, using a CODE or TEXT editor (e.g., Notepad++ or Brackets or Sublime Text or vi, etc.), edit your <project-name>.xdk file in the root of your project folder.

Inside of your <project-name>.xdk file you will find entries that look like this:

"icons_": [
  {"relPath": "images/launchIcon_24.png","width": 24,"height": 24
  },
  {"relPath": "images/launchIcon_434x210.png","width": 434,"height": 210
  },
  {"relPath": "images/launchIcon_744x360.png","width": 744,"height": 360
  },

Find all the entries that are pointing to the problem files and remove those problem entries from your <project-name>.xdk file. Obviously, you need to do this when the XDK is closed and only after you have made a backup copy of your <project-name>.xdk file, just in case you end up with a missing comma. The <project-name>.xdk file is a JSON file and needs to be in proper JSON format after you make changes or it will not be read properly by the XDK when you open it.
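One quick sanity check before reopening the project (assuming Python 3 is available; my-project.xdk is a placeholder file name) is to run the edited file through a JSON parser:

```shell
# Exit status 0 means the file is valid JSON; a non-zero status points
# at a syntax error such as a leftover trailing comma.
python3 -m json.tool my-project.xdk > /dev/null && echo "valid JSON" || echo "syntax error"
```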

Then move your problem icons and splash screen images to the package-assets folder and reference them from there. Use this technique (below) to add additional icons by using the intelxdk.config.additions.xml file.

<!-- alternate way to add icons to Cordova builds, rather than using XDK GUI -->
<!-- especially for adding icon resolutions that are not covered by the XDK GUI -->
<!-- Android icons and splash screens -->
<platform name="android">
  <icon src="package-assets/android/icon-ldpi.png" density="ldpi" width="36" height="36" />
  <icon src="package-assets/android/icon-mdpi.png" density="mdpi" width="48" height="48" />
  <icon src="package-assets/android/icon-hdpi.png" density="hdpi" width="72" height="72" />
  <icon src="package-assets/android/icon-xhdpi.png" density="xhdpi" width="96" height="96" />
  <icon src="package-assets/android/icon-xxhdpi.png" density="xxhdpi" width="144" height="144" />
  <icon src="package-assets/android/icon-xxxhdpi.png" density="xxxhdpi" width="192" height="192" />
  <splash src="package-assets/android/splash-320x426.9.png" density="ldpi" orientation="portrait" />
  <splash src="package-assets/android/splash-320x470.9.png" density="mdpi" orientation="portrait" />
  <splash src="package-assets/android/splash-480x640.9.png" density="hdpi" orientation="portrait" />
  <splash src="package-assets/android/splash-720x960.9.png" density="xhdpi" orientation="portrait" />
</platform>

Back to FAQs Main

Jumbo Frames in Open vSwitch* with DPDK


This article describes the concept of jumbo frames and how support for that feature is implemented in Open vSwitch* with the Data Plane Development Kit (OvS-DPDK). It outlines how to configure jumbo frame support for DPDK-enabled ports on an OvS bridge and also provides insight into how OvS-DPDK memory management for jumbo frames works. Finally, it details two tests that demonstrate jumbo frames in action on an OvS-DPDK deployment and looks at another that demonstrates performance gains achieved through the use of jumbo frames. This guide was written with general OvS users in mind, who want to know more about the jumbo frame feature and apply it in their OvS-DPDK deployment. 

At the time of this writing, jumbo frame support for OvS-DPDK is available on the OvS master branch, and also the 2.6 branch. Installation steps for OvS with DPDK can be found here.

Jumbo Frames

A jumbo frame is distinguished from a “standard” frame by its size: any frame larger than the standard Ethernet MTU (Maximum Transmission Unit) of 1500B is characterized as a jumbo frame. The MTU is the largest amount of data that a network interface can send in a single unit. If the network interface wants to transmit a large block of data, it needs to fragment the data into multiple units of size MTU, each unit containing part of the data, plus the required network layer encapsulation headers. If instead, the network devices take advantage of jumbo frames, a significantly larger amount of application data can be carried in a single frame, eliminating much of the overhead incurred by duplication of encapsulation headers.
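To put numbers on this, here is a rough sketch of how many frames are needed to carry 64KB of data at different MTUs (illustrative figures only; it ignores the rule that non-final IPv4 fragment payloads must be multiples of 8 bytes, which does not change the frame counts here):

```shell
data=65536                # 64KB of application data
for mtu in 1500 9000; do
  per_frag=$(( mtu - 20 ))                        # payload left after the 20B IPv4 header
  frags=$(( (data + per_frag - 1) / per_frag ))   # ceiling division
  echo "MTU $mtu: $frags frames"
done
```

This prints 45 frames at the standard 1500B MTU versus 8 frames at a 9000B MTU; each frame saved also saves its encapsulation headers.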

Thus, the primary benefit of using jumbo frames is the improved data-to-overhead ratio that they provide: the same amount of data can be communicated with significantly less overhead. As a corollary, the reduced packet count also means that the kernel needs to handle fewer interrupts, which reduces CPU load (this particular benefit does not apply to DPDK, whose poll mode drivers poll for packets rather than relying on interrupts).

Usage of Jumbo Frames

Jumbo frames are typically beneficial in environments in which large amounts of data need to be transferred, such as Storage Area Networks (SANs), where they improve transfer rates for large files. Many SANs use the Fibre Channel over Ethernet (FCoE) protocol to consolidate their storage and network traffic on a single network; FCoE frames have a minimum payload size of 2112B, so jumbo frames are crucial if fragmentation is to be avoided. Jumbo frames are also useful in overlay networks, where the amount of data that a frame can carry is reduced below the standard Ethernet MTU, as a result of the addition of tunneling headers; boosting the MTU can negate the effects of the additional encapsulation overhead.

Jumbo Frames in OVS

Network devices (netdevs) generally don’t support jumbo frames by default but can be easily configured to do so. Jurisdiction over the MTU of traditional logical network devices is typically beyond the remit of OvS and is instead governed by the kernel’s network stack. A netdev’s MTU can be queried and modified using standard network management tools, such as ifconfig in Linux*. Figure 1 illustrates how ifconfig may be used to increase the MTU of network device p3p3 from 1500 to 9000. The MTU of kernel-governed netdevs is subsequently honored by OVS when those devices are added to an OvS bridge. 


Figure 1: Configuring the MTU of a network device using ifconfig.

OvS-DPDK devices cannot be managed with ifconfig, however, as control of DPDK-enabled netdevs is maintained by DPDK poll mode drivers (PMDs) and not standard kernel drivers. The OvS-DPDK jumbo frames feature provides a mechanism that OvS employs to modify the MTU of OvS-DPDK netdevs, thus increasing their maximum supported frame size.

Jumbo Frames in OvS-DPDK

This section provides an overview of how frames are represented in both OvS and DPDK, and how DPDK manages packet buffer memory. It then describes how support for jumbo frames is actually implemented in OvS-DPDK.

In OvS, frames are represented in the OvS datapath (dpif) layer as dp_packets (datapath packets), as illustrated in Figure 2. A dp_packet contains a reference to the packet buffer itself, as well as some additional metadata and offsets that OvS uses to process the frame as it traverses the vSwitch.


Figure 2: Simplified view of Open vSwitch* datapath packet buffer.

In DPDK, a frame is represented by the message buffer data structure (rte_mbuf, or just mbuf for short), as illustrated in Figure 3. An mbuf contains metadata which DPDK uses to process the frame, and a pointer to the message buffer itself, which is stored in contiguous memory just after the mbuf. The mbuf’s buf_addr attribute points to the start of the message buffer, but the frame data itself actually begins at an offset of data_off from buf_addr. The additional data_off bytes, which is typically RTE_PKTMBUF_HEADROOM (128 bytes) long, are allocated in case additional headers need to be prepended before the packet during processing.


Figure 3: Data Plane Development Kit message buffer (‘mbuf’).

Unsurprisingly then, in OvS-DPDK, a frame is represented by a dp_packet, which contains an rte_mbuf. The resultant packet buffer memory layout is shown in Figure 4.


Figure 4: Open vSwitch Data Plane Development Kit packet buffer.

DPDK is targeted for optimized packet processing applications; for such applications, allocation of packet buffer memory from the heap at runtime is much too slow. Instead, DPDK allocates application memory upfront during initialization. To do this, it creates one or more memory pools (mempools) that DPDK processes can subsequently use to create mbufs at runtime with minimum overhead. Mempools are created with the DPDK rte_mempool_create function.

struct rte_mempool *
rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
unsigned cache_size, unsigned private_data_size,
rte_mempool_ctor_t *mp_init, void *mp_init_arg,
rte_mempool_obj_cb_t *obj_init, void *obj_init_arg,
int socket_id, unsigned flags)

The function returns a reference to a mempool containing a fixed number of fixed-size elements; the number of elements and their size are determined by the respective values of the n and elt_size parameters provided to rte_mempool_create.

In the case of OvS-DPDK, elt_size needs to be big enough to store all of the data that we observed in Figure 4: Open vSwitch Data Plane Development Kit packet buffer; this includes the dp_packet (and the mbuf that it contains), the L2 header and CRC, the IP payload, and the mbuf headroom (and tailroom, if this is required). By default, the value of elt_size is only large enough to accommodate standard-sized frames (i.e., 1518B or less); however, if it were possible to specify a much larger value, it would allow OvS-DPDK to support jumbo frames in a single mbuf segment.
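As a rough sketch of that sizing (buffer portion only; the additional dp_packet and mbuf metadata sizes depend on the build, so they are left symbolic here):

```shell
mtu=9000              # desired Layer 3 MTU
l2_overhead=18        # 14B Ethernet header + 4B CRC
headroom=128          # RTE_PKTMBUF_HEADROOM
echo "buffer bytes per element: $(( headroom + mtu + l2_overhead )) (plus dp_packet/mbuf metadata)"
```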

In OvS, a subset of a net device’s properties can be modified on the command line using the ovs-vsctl utility; OvS 2.6 introduces a new Interface attribute, mtu_request, which users can leverage to adjust the MTU of DPDK devices. For example, to add a physical DPDK port (termed dpdk port in OvS-DPDK) with a Layer 3 MTU of 9000B to OvS bridge br0:

ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000

Alternatively, to reduce the MTU of the same port to 6000 after it has been added to the bridge:

ovs-vsctl -- set Interface dpdk0 mtu_request=6000

Note that mtu_request refers to the Layer 3 MTU; OvS-DPDK allows an additional 18B for Layer 2 header and CRC, so the maximum permitted frame size in the above examples is 9018B and 6018B, respectively. Additionally, ports that use the same MTU share the same mempool; if a port has a different MTU than existing ports, OvS creates an additional mempool for it (assuming that there is sufficient memory to do so). Mempools for MTUs that are no longer used are freed.

Functional Test Configuration

This section outlines two functional tests that demonstrate jumbo frame support across OvS-DPDK physical and guest (dpdkvhostuser) ports. The first test simply demonstrates support for jumbo frames across disparate DPDK port types, while the second additionally shows the effects of dynamically altering a port’s MTU at runtime. Both tests utilize a “hairpin” traffic path, as illustrated in Figure 5. During testing, validation of jumbo frame traffic integrity occurs in two places: (1) in the guest’s network stack via tcpdump, and (2) on the traffic generator’s RX interface, via packet capture and inspection. 


Figure 5: Jumbo frame test configuration.

Test Environment

The DUT used during jumbo frame testing is configured as per Table 1. Where applicable, the software component used is listed with its corresponding commit ID or tag.


Table 1: DUT jumbo frame test environment.

Traffic Configuration

Dummy TCP traffic for both tests is produced by a physical generator; salient traffic attributes are outlined below in Table 2.


Table 2: Jumbo frame test traffic configuration.

9018B frames are used during testing. Note the IP packet size of 9000B and the data size of 8960B, as described in Figure 6; they’ll be important later on during testing.
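Those two figures follow directly from the 9018B frame size, assuming 20B IPv4 and 20B TCP headers with no options:

```shell
frame=9018
ip_len=$(( frame - 18 ))           # strip 14B Ethernet header + 4B CRC -> 9000B IP packet
tcp_data=$(( ip_len - 20 - 20 ))   # strip IPv4 and TCP headers -> 8960B of data
echo "IP packet: ${ip_len}B, TCP data: ${tcp_data}B"
```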


Figure 6: Jumbo frame test traffic breakdown.

NIC Configuration

No specific configuration of the NIC is necessary in order to support jumbo frames, as the DPDK PMD configures the NIC to support oversized frames as per the user-supplied MTU (mtu_request). The only limitation is that the user-supplied MTU must not exceed the maximum frame size that the hardware itself supports. Consult your NIC datasheet for details. At the time of writing, the maximum frame size supported by the Intel® Ethernet Controller XL710 network adapter is 9728B, which yields a maximum mtu_request value of 9710.
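The 9710 figure is simply the NIC's maximum frame size minus the Layer 2 overhead that OvS-DPDK allows on top of mtu_request; the same arithmetic applies to any other NIC:

```shell
nic_max_frame=9728    # maximum frame size of the Intel XL710 (per its datasheet)
l2_overhead=18        # 14B Ethernet header + 4B CRC
echo "maximum mtu_request: $(( nic_max_frame - l2_overhead ))"
```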

vSwitch Configuration

Compile DPDK and OvS, mount hugepages, and start up the switch as normal, ensuring that the dpdk-init, dpdk-lcore-mask, and dpdk-socket-mem parameters are set. Note that in order to accommodate jumbo frames at the upper end of the size spectrum, ovs-vswitchd may need additional memory; in this test, 4 GB of hugepages are used.

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=4096,0

Create an OvS bridge of datapath_type netdev, and add 2 x DPDK phy ports, and 2 x guest ports. When adding the ports, specify the mtu_request parameter as 9000. This will allow frames up to a maximum of 9018B to be supported. Incidentally, the value of mtu_request may be modified dynamically at runtime, as we'll observe later in Test Case #2.

ovs-vsctl add-br br0 -- set Bridge br0 datapath_type=netdev
ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=6
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000
ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk -- set Interface dpdk1 mtu_request=9000
ovs-vsctl add-port br0 dpdkvhostuser0 -- set Interface dpdkvhostuser0 type=dpdkvhostuser -- set Interface dpdkvhostuser0 mtu_request=9000
ovs-vsctl add-port br0 dpdkvhostuser1 -- set Interface dpdkvhostuser1 type=dpdkvhostuser -- set Interface dpdkvhostuser1 mtu_request=9000

Inspect the bridge to ensure that MTU has been set appropriately for all ports. Note that all of the ports listed in Figure 7 display an MTU of 9000.

ovs-appctl dpctl/show


Figure 7: Open vSwitch* ports configured with 9000B MTU.

Alternatively, inspect the MTU of each port in turn.

ovs-vsctl get Interface [dpdk0|dpdk1|dpdkvhostuser0|dpdkvhostuser1] mtu

Sample output for this command is displayed in Figure 8.


Figure 8: 9000B MTU for port 'dpdkvhostuser0'.

Start the Guest

sudo -E $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 -name us-vhost-vm1 -cpu host -enable-kvm \
-m $MEM -object memory-backend-file,id=mem,size=$MEM,mem-path=$HUGE_DIR,share=on -numa node,memdev=mem -mem-prealloc -smp 2 -drive file=/$VM1 \
-chardev socket,id=char0,path=$SOCK_DIR/dpdkvhostuser0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on \
-chardev socket,id=char1,path=$SOCK_DIR/dpdkvhostuser1 \
-netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=on \
--nographic -vnc :1

Guest Configuration

Jumbo frames require end-to-end configuration, so we'll need to set the MTU of the relevant guest network devices to 9000 to avoid fragmentation of jumbo frames in the VM's network stack.

ifconfig eth1 mtu 9000
ifconfig eth2 mtu 9000

Configure IP addresses for the network devices, and then bring them up.

ifconfig eth1 5.5.5.2/24 up
ifconfig eth2 7.7.7.2/24 up

Enable IP forwarding; traffic destined for the 7.7.7.0/24 network will be returned to the vSwitch via the guest’s network stack.

sysctl net.ipv4.ip_forward=1

Depending on the traffic generator setup, a static ARP entry for the traffic destination IP address may be required:

arp -s 7.7.7.3 00:00:de:ad:be:ef

Test Case #1

This test simply demonstrates the jumbo frame feature on OvS-DPDK for dpdk and dpdkvhostuser port types.

Initial setup is as previously described. Simply start continuous traffic to begin the test.

In the guest, turn on tcpdump for the relevant network devices while traffic is live. The output from the tool confirms the presence of jumbo frames in the guest’s network stack. In the sample command lines below, tcpdump output is limited to 20 frames on each port to prevent excessive log output.

tcpdump -i eth1 -v -c 20 # view ingress traffic
tcpdump -i eth2 -v -c 20 # view egress traffic

The output of tcpdump is demonstrated in Figure 9: tcpdump of guest network interfaces. It shows that the length of the IP packets received and subsequently transmitted by the guest is 9000B (circled in blue) and the length of the corresponding data in the TCP segment is 8960B (circled in green). Note that these figures match the traffic profile described in Figure 6: Jumbo frame test traffic breakdown.


Figure 9: tcpdump of guest network interfaces, demonstrating ingress/egress 9000B IP packets containing 8960B of data.

Figure 10 shows the contents of a packet captured at the test endpoint, the traffic generator’s RX port. Note that the Ethernet frame length is 9018B as expected (circled in orange). Additionally, the IP packet length and data length remain 9000B and 8960B, respectively. Since these values remain unchanged for frames that traverse the vSwitch and through a guest, we can conclude that the 9018B frames sent by the generator were not fragmented, thus demonstrating support for jumbo frames for OVS-DPDK dpdk and vhostuser ports.


Figure 10: Packet capture at traffic generator Rx endpoint, demonstrating receipt of 9000B IP packets, containing 8960B of data.

Test Case #2

This test demonstrates runtime modification of a DPDK-based netdev’s MTU, using the ovs-vsctl mtu_request parameter.

Setup is identical to the previous test case; to kick off the test, just start traffic (9018B frames, as per Table 2: Jumbo frame test traffic configuration) on the generator’s Tx interface.

Observe that 9k frames are supported throughout the entire traffic path, as per Test Case #1.

Now reduce the MTU of one of the dpdk (that is, Phy) ports to 6000. This configures the NIC's Rx port to accept frames with a maximum size of 6018B.

ovs-vsctl set Interface dpdk0 mtu_request=6000

Verify that the MTU was set correctly for dpdk0 and that the MTUs of the remaining ports remain unchanged, as per Figure 11.

ovs-appctl dpctl/show


Figure 11: 6000B MTU for port ‘dpdk0’.

Observe that traffic is no longer received by the vSwitch: frames larger than the NIC's newly configured maximum size are dropped at the NIC, as per Figure 12. The empty set of flows installed in the datapath confirms that the vSwitch is not currently processing any traffic.

ovs-appctl dpctl/dump-flows


Figure 12: Empty set of flows processed by OvS userspace datapath.

Running tcpdump in the guest provides additional confirmation that packets are not reaching the guest.

Next reduce the traffic frame size to 6018B in the generator; this frame size is permitted by the NIC’s configuration, as per the previously supplied value of mtu_request. Observe that these frames now pass through to the guest; as expected, the IP packet size is 6000B, and the TCP segment contains 5960B of data (Figure 13).


Figure 13: tcpdump of guest network interfaces, demonstrating ingress/egress 6000B IP packets containing 5960B of data.

Examining the traffic captured at the test endpoint confirms that 6018B frames were received, with IP packet and data lengths as expected.


Figure 14: Packet capture at traffic generator Rx endpoint, demonstrating receipt of 6000B IP packets, containing 5960B of data.

Performance Test Configuration

This section demonstrates the performance benefits of jumbo frames in OVS-DPDK. In the described sample test, two VMs are spawned on the same host, and traffic is transmitted between them. One VM runs an iperf3 server, while the other runs an iperf3 client. iperf3 initiates a TCP connection between the client and server, and transfers large blocks of TCP data between them. Test setup is illustrated in Figure 15.


Figure 15: VM-VM jumbo frame test setup.

Test Environment

The host environment is as described previously, in the “Functional Test Configuration” section.

The guest environment is as described below, in Figure 16.


Figure 16: Jumbo frame test guest environment

vSwitch Configuration

Start OVS, ensuring that the relevant OVSDB DPDK fields are set appropriately.

sudo -E $OVS_DIR/utilities/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
sudo -E $OVS_DIR/utilities/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
sudo -E $OVS_DIR/utilities/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
sudo -E $OVS_DIR/vswitchd/ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file &

Create an OVS bridge, and add two dpdkvhostuser ports.

sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev -- br-set-external-id br0 bridge-id br0 -- set bridge br0 fail-mode=standalone
sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 set Open_vSwitch . other_config:pmd-cpu-mask=6
sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 add-port br0 $PORT0_NAME -- set Interface $PORT0_NAME type=dpdkvhostuser
sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 add-port br0 $PORT1_NAME -- set Interface $PORT1_NAME type=dpdkvhostuser

Start the guests, ensuring that mergeable buffers are enabled.

VM1

sudo -E taskset 0x60 $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 -name us-vhost-vm1 -cpu host -enable-kvm -m 4096M -object memory-backend-file,id=mem,size=4096M,mem-path=$HUGE_DIR,share=on -numa node,memdev=mem -mem-prealloc -smp 2 -drive file=$VM1 -chardev socket,id=char0,path=$SOCK_DIR/dpdkvhostuser0 -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on,csum=off,gso=off,guest_csum=off,guest_tso4=off,guest_tso6=off,guest_ecn=off --nographic -vnc :1

VM2

sudo -E taskset 0x180 $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 -name us-vhost-vm2 -cpu host -enable-kvm -m 4096M -object memory-backend-file,id=mem,size=4096M,mem-path=$HUGE_DIR,share=on -numa node,memdev=mem -mem-prealloc -smp 2 -drive file=$VM2 -chardev socket,id=char1,path=$SOCK_DIR/dpdkvhostuser1 -netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=on,csum=off,gso=off,guest_csum=off,guest_tso4=off,guest_tso6=off,guest_ecn=off --nographic -vnc :2

Guest Configuration

Set an IP address for, and bring up, the virtio network device in each guest.

VM1

ifconfig eth1 5.5.5.1/24 up

VM2

ifconfig eth2 5.5.5.2/24 up

Establish Performance Baseline

Start an iperf3 server on VM2.

iperf3 -s

Start an iperf3 client on VM1 and point it to the iperf3 server on VM2.

iperf3 -c 5.5.5.2

Observe the performance of both server and client. Figure 17 demonstrates an average TX rate of 6.98Gbps for data transfers between client and server, which serves as our baseline performance.


Figure 17: Guest iperf3 transfer rates, using standard Ethernet MTU.
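For scripted runs, iperf3 can emit machine-readable results via its -J (JSON) flag; the sketch below extracts the average TX rate from such output. The field names follow iperf3's JSON schema, and the sample string is a hypothetical stand-in for a real capture:

```python
# Sketch: extract the average TX rate (Gbps) from iperf3 JSON output,
# e.g. produced with `iperf3 -c 5.5.5.2 -J > result.json`.
import json

def tx_gbps(iperf3_json):
    result = json.loads(iperf3_json)
    return result["end"]["sum_sent"]["bits_per_second"] / 1e9

# Minimal stand-in for a real capture, for illustration only:
sample = '{"end": {"sum_sent": {"bits_per_second": 6980000000.0}}}'
print(f"{tx_gbps(sample):.2f} Gbps")  # 6.98 Gbps
```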

Measure Performance with Jumbo Frames

Note: This test can be done after the previous test. It’s not necessary to tear down the existing setup.

Additional Host Configuration

Increase the MTU for the dpdkvhostuser ports to 9710B (max supported mtu_request).

ovs-vsctl set Interface dpdkvhostuser0 mtu_request=9710
ovs-vsctl set Interface dpdkvhostuser1 mtu_request=9710

Check the bridge to verify that the MTU for each port has increased to 9710B, as per Figure 18.

ovs-appctl dpctl/show


Figure 18: dpdkvhostuser ports with 9710B MTU.

Additional Guest Configuration

In each VM, increase the MTU of the relevant network interface to 9710B, as per Figure 19 and Figure 20.

ifconfig eth1 mtu 9710
ifconfig eth1 | grep mtu


Figure 19: Set 9710B MTU for eth1 on VM1 with ifconfig.

ifconfig eth2 mtu 9710
ifconfig eth2 | grep mtu


Figure 20: Set 9710B MTU for eth2 on VM2 with ifconfig.

Start the iperf3 server in VM2 and kick off the client in VM1, as before. Observe that throughput has more than doubled, from its initial rate of ~7 Gbps to 15.6 Gbps (Figure 21: guest iperf3 transfer rates using 9710B MTU).


Figure 21: guest iperf3 transfer rates using 9710B MTU.

Conclusion

In this article, we have described the concept of jumbo frames and observed how they may be enabled at runtime for DPDK-enabled ports in OvS. We’ve also seen how packet buffer memory is organized in OVS-DPDK and learned how to set up and test OVS-DPDK jumbo frame support. Finally, we’ve observed how enabling jumbo frames in OVS-DPDK can dramatically improve throughput for specific use cases.

About the Author

Mark Kavanagh is a network software engineer with Intel. His work is primarily focused on accelerated software switching solutions in user space running on Intel® architecture. His contributions to Open vSwitch with DPDK include incremental DPDK version enablement, Jumbo Frame support3, and TCP Segmentation Offload (TSO) RFC4.

References

  1. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf, p.72
  2. 6000B IP packet + 14B L2 header + 4B L2 CRC
  3. http://openvswitch.org/pipermail/dev/2016-August/077585.html
  4. http://openvswitch.org/pipermail/dev/2016-June/072871.html

Intel® Xeon Phi™ Delivers Competitive Performance For Deep Learning—And Getting Better Fast


Baidu’s recently announced deep learning benchmark, DeepBench, documents performance for the lowest-level compute and communication primitives for deep learning (DL) applications. The goal is to provide a standard benchmark to evaluate different hardware platforms using the vendor’s DL libraries.

Intel continues to optimize its Intel® Xeon and Intel® Xeon Phi™ processors for DL via the Intel® Math Kernel Library (Intel® MKL). Intel MKL 2017 includes a collection of performance primitives for DL applications. The library supports the primitives most commonly needed to accelerate image recognition topologies, as well as the GEMM primitives needed to accelerate various types of RNNs. The functionality includes convolution, inner product, pooling, normalization, and activation primitives, with support for forward (inference) and backward (gradient propagation) operations. The MKL 2017 library is freely available with a community license, and some of these optimizations are also available as part of the open source Intel MKL-DNN project.

Intel® Xeon Phi™ processors were used for this benchmark. In this paper, we point out the DL operations where Intel® Xeon Phi™ shines—and is rapidly improving.

DeepBench background

DeepBench aims to include primitives such as GEMMs (General Matrix-Matrix Multiplication), convolution layers, and recurrent layers with specific configurations used across different types of networks and applications. The current release is a first attempt at this and is not yet a complete set—the hope is that, with active participation from the community, this will become a comprehensive benchmark of primitives used in deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs) across a number of applications such as image and speech recognition, and natural language processing (NLP).

Further, DeepBench includes cases with varying mini-batch sizes to capture the impact of scaling (data parallelism). DeepBench is primarily intended as a tool for comparing different hardware platforms. The metric Baidu published is absolute performance (in TeraFLOPS).

In the remainder of this paper, we address the performance and suitability of Intel Xeon Phi processors for various DL operations.

GEMM

The majority of the computations involved in deep learning can be formulated as matrix-matrix multiplication operations, making GEMM a core compute primitive. Typically, the matrices coming from DL applications – fully connected (FC) layers, recurrent neural networks (RNNs), and long short-term memory networks (LSTMs)—are skewed and small, resulting in dense, small and oddly shaped (tall-skinny, fat-short) GEMM operations. This is captured by the GEMM kernel test cases in DeepBench. Intel Math Kernel Library (Intel MKL) 2017 includes optimized GEMM implementations that achieve high performance on both the Intel Xeon processor and Intel Xeon Phi processor for matrices that are typically seen in DL applications, exposed through the packed GEMM application programming interface (API).

Unlike the conventional GEMM operations (large and square matrices), these dense, small, and oddly shaped operations are particularly challenging due to their limited parallelism and highly skewed dimensions. Conventional methods fail to achieve peak performance, as outlined by this Baidu paper on optimizing GEMM performance for RNNs [1].
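To illustrate why skewed shapes are harder to run at peak, the following sketch (with illustrative shapes, not taken from DeepBench) compares the arithmetic intensity of a square GEMM against a tall-skinny GEMM with the same FLOP count:

```python
# Sketch: FLOP count and arithmetic intensity (FLOPs per byte moved,
# assuming fp32 and that each matrix is touched once) for a square GEMM
# versus a tall-skinny GEMM of equal FLOPs. The skewed shape has far less
# data reuse, which is why it needs the blocking/packing described above.

def gemm_stats(m, n, k, bytes_per_elem=4):
    flops = 2 * m * n * k                       # multiply-accumulate count
    data = bytes_per_elem * (m * k + k * n + m * n)
    return flops, flops / data                  # (FLOPs, arithmetic intensity)

square = gemm_stats(1024, 1024, 1024)
skinny = gemm_stats(2048, 16, 32768)            # tall-skinny, same FLOP count
print(f"square: {square[1]:.1f} FLOPs/byte, skinny: {skinny[1]:.1f} FLOPs/byte")
# square: 170.7 FLOPs/byte, skinny: 7.9 FLOPs/byte
```

With over 20x less reuse per byte, the skinny GEMM is far more sensitive to cache and bandwidth behavior, motivating the packed API's block formulation.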

The specialized packed API implements an optimized block-GEMM operation: a block formulation that increases reuse of blocks without additional data rearrangement, and fine-grained parallelization with minimal on-demand synchronization to increase the available concurrency for small matrices. These optimizations allow designers to effectively exploit the full cache hierarchy of both Intel® Xeon and Intel® Xeon Phi™ processors, extracting sufficient parallelism to keep all cores busy and achieve significantly improved (near-peak) performance for such typical DL matrices.

The DeepBench GEMM kernel results on the Intel Xeon Phi processor include both the conventional Intel MKL GEMM and the new packed GEMM API. These numbers are measured on the Intel Xeon Phi processor 7250 (codenamed Knights Landing, or KNL) with Intel MKL 2017, which is publicly available. The DeepBench GEMM results (Fig. 1; Nvidia performance measured by Baidu) show that Intel Xeon Phi processor performance is higher than that of the Nvidia* M40 GPU (whose peak FLOPS are comparable to the Intel Xeon Phi processor) across almost every configuration, and higher than that of the Nvidia* Pascal TitanX across some smaller and medium (N <= 64) matrices. With the next generation of Intel Xeon Phi processor (codenamed Knights Mill) offering significantly higher raw compute power, we expect to see better performance when it is released next year.

 

Fig. 1  Source: Data from Baidu as of Sept 26, 2016

 

Convolution

The convolution operation is the other primary compute kernel in DL applications. For image-based applications (CNNs), convolution layers contribute the majority of the compute. Increasingly, convolutional layers are also being used in speech and NLP applications (as acoustic models).

The convolutional-layer operation consists of a six-level nested loop over output feature maps (K), input feature maps (C), feature map height and width (H, W), and kernel height and width (R, S). Additionally, this operation is performed over all the samples of the mini-batch (N). Hence, for typical layer configurations, this can result in significant computation.

Written as nested for loops in naïve order, this operation does not leverage the available data reuse (weights remaining unchanged across iterations) and becomes limited by memory bandwidth, leaving much of the available compute on the Intel Xeon Phi processor unused. Intel MKL offers an optimized implementation of the convolution layers using the direct convolution operator, which gets close to achievable peak performance.
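The six-level loop nest described above can be sketched as follows (stride 1, no padding; purely illustrative of the naïve ordering, not the optimized Intel MKL implementation):

```python
# Sketch of the naive direct-convolution loop nest: minibatch N,
# input/output channels C/K, spatial H/W, kernel R/S, stride 1, no
# padding. Real implementations block and reorder these loops for
# cache reuse and vectorization; this shows only the raw operation.
import numpy as np

def conv_naive(inp, wts):
    N, C, H, W = inp.shape
    K, _, R, S = wts.shape
    out = np.zeros((N, K, H - R + 1, W - S + 1))
    for n in range(N):                # minibatch samples
        for k in range(K):            # output feature maps
            for c in range(C):        # input feature maps
                for y in range(H - R + 1):
                    for x in range(W - S + 1):
                        for r in range(R):
                            for s in range(S):
                                out[n, k, y, x] += inp[n, c, y + r, x + s] * wts[k, c, r, s]
    return out

# All-ones input and weights: every output element equals C*R*S.
o = conv_naive(np.ones((1, 3, 5, 5)), np.ones((2, 3, 3, 3)))
print(o.shape)         # (1, 2, 3, 3)
print(o[0, 0, 0, 0])   # 27.0
```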

 

The direct convolution kernel includes the following optimizations:

  • Reformulating the convolution operation to better utilize the cache hierarchy. The loop-over output and input feature maps (K, C) are blocked for on-die caches and allow for inner-loop vectorization over output feature maps and independent fused multiply-add computations

  • Data is laid out so that the innermost loop accesses are contiguous, ensuring better utilization of cache lines and, therefore, bandwidth, while also improving prefetcher performance

  • Register blocking to improve reuse of data in register files, decrease on-core cache traffic, and hide the latency of the FMA operations

  • Optimal work partitioning to ensure that all cores are busy, with minimal load imbalance.

More detailed information on the implementation detail can be found under the machine learning chapter of the Intel Xeon Phi processor reference book [2].

These numbers are measured on the Intel Xeon Phi processor 7250 with Intel MKL 2017. For the DeepBench convolution kernel (Fig. 2), we also include results for the open-source Intel-optimized convolution layer implementation using libxsmm [3]. The absolute performance for the convolution layers is competitive with the Nvidia M40 (which has comparable FLOPS). However, since the current version of Intel MKL supports only the direct convolution operator, the marked kernels with larger differences are those convolution layers where the direct convolution kernels on the Intel Xeon Phi processor are being compared to a Winograd-based implementation. Intel MKL does not provide optimized Winograd convolution implementations at this point; incorporating a Winograd-based implementation into Intel MKL is a work in progress. Although the Winograd convolution algorithm shows significant speedups on certain convolution shapes and sizes, it does not significantly contribute to full-topology performance.

Fig. 2 Source: Data from Baidu as of Sept 26, 2016

AllReduce

AllReduce is the communications primitive in DeepBench that covers message sizes commonly seen in deep learning networks and applications. This benchmark consists of measuring MPI_AllReduce latencies for five different message sizes (in floats):  100K, 3MB, 4MB, 6.4MB, and 16MB on 2, 4, 8, 16, and 32 nodes. This uses the AllReduce benchmark from the Ohio State University micro-benchmarks suite [4], with minor modifications.

We report the MPI_AllReduce time measured on the Intel Xeon Phi processor 7250 on our internal Endeavor cluster with Intel® Omni-Path Architecture (Intel® OPA) series 100 fabric in a fat-tree topology, using Intel MPI 5.1.3.181. The competitive data from Baidu is measured on a GPU cluster with 8 NVIDIA TitanX-Maxwell cards per node (with optimized intra-node communications). This is compared with nodes each consisting of a single Intel Xeon Phi processor, so a 32-node Intel Xeon Phi processor measurement below is comparable to 4 GPGPU nodes, with each node having a maximum of 8 cards.

The Intel Xeon Phi processor DeepBench results are with the stock Intel MPI Library 2017 on the above-mentioned cluster (Fig. 3). The latencies are better for 8 GPUs (in a single node) compared to 8 Intel Xeon Phi processor-based nodes, since the former constitutes within-node (peer-to-peer) communication. However, for communication across nodes, the Intel Xeon Phi processor AllReduce latencies are significantly better. Latencies for 16 Intel Xeon Phi processor-based nodes were better than those for 2 GPU (x8) nodes across most message sizes. Fig. 3 is normalized against the TitanX-Maxwell results, which Baidu measured as the best performer.

Fig. 3 Source: Data from Baidu as of Sept 26, 2016

Further, we also present results using our optimized communication library (which was also presented in Pradeep Dubey’s IDF16 Technical Session) which further improves AllReduce latencies by an average of 3.5X across the message sizes and node counts of interest (Fig. 4). This benchmark only captures the latency for a single MPI_AllReduce operation. Typically, in any application context we can expect to have multiple such operations in flight, and in such situations we can expect to see further performance improvements.

Fig. 4 Source: Intel internal measurements, September 2016 [5]

 

Recurrent Layers – RNN/LSTM

DeepBench also includes recurrent layers—vanilla RNN and LSTM layers, primarily based on the DeepSpeech2 model configurations and for different mini-batch sizes—to capture the impact of scaling. The core compute kernel for these recurrent layers is still the GEMM operation, and the matrix sizes corresponding to these layers are already captured in the GEMM benchmark. For these cases, we can see from Fig. 1 that Intel Xeon Phi consistently performs better than current Nvidia Maxwell GPUs, and in many cases also better than the Nvidia Pascal GPU, which has almost 2x more peak FLOPS. However, these layers are included as independent primitives to showcase RNN/LSTM-specific optimizations. For the current benchmark release, we do not include Intel Xeon Phi processor results for these cases. We are working on an optimized implementation for the RNN layers, which exploits the specific usage pattern to more efficiently leverage available caches. The Intel Xeon Phi processor results for these cases will be added once RNN layer support is introduced to Intel MKL.

While these benchmark results are a snapshot in time, Intel continues to invest in software optimizations that would further improve the performance of Intel Xeon and Intel Xeon Phi family processors especially on the convolution benchmark. Intel will continue to update the results to ensure end customers have a choice of silicon when it comes to deep learning workloads.

Authors

Dheevatsa Mudigere, Dipankar Das, Vadim Pirogov, Murat Guney, Srinivas Sridharan, and Andres Rodriguez, Intel Corporation

 

[1] http://svail.github.io/rnn_perf/

[2] http://lotsofcores.com/KNLbook

[3] https://github.com/hfp/libxsmm

[4] http://mvapich.cse.ohio-state.edu/benchmarks/

[5] FTC Disclaimer:  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. 
 

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.   For more complete information visit: www.intel.com/benchmarks.  

Configuration: Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB DDR4-2400 MHz, Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port, Red Hat* Enterprise Linux 6.7, Intel® ICC version 16.0.3, Intel® MPI Library 5.1.3 for Linux, Intel® Optimized DNN Framework

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice Revision #20110804

 

Bringing Up Arduino 101* (branded Genuino 101* outside the U.S.) on Ubuntu* under VMware*


Introduction

The Arduino 101* (branded Genuino 101* outside the U.S.) is a learning and development platform that uses a low-power Intel® Curie™ module powered by the Intel® Quark™ SE microcontroller. The Intel® Quark™ SE microcontroller contains a single-core 32 MHz x86 CPU (Intel® Quark™ processor core) and a 32 MHz Argonaut RISC Core (ARC)* EM processor. The Arduino 101 platform runs on Windows*, Mac OS X*, and Linux* operating systems. This guide demonstrates how to run the Arduino 101 platform on Ubuntu using VMware* Workstation, virtualization software that lets you run other operating systems in virtual machines on your desktop.

Hardware components

The hardware components used in this project are listed below:

Setting up VMware* workstation on Ubuntu*

Go to the VMware website to download and install the latest VMware workstation player for Windows. Then go to the Ubuntu* website and download the latest version of Ubuntu Desktop.

Open VMware and create a new virtual machine using the downloaded Ubuntu image.

Development board download

Visit https://www.arduino.cc/en/Main/Software to download the Arduino Software IDE version 1.6.7 or later for Linux. As of this writing, the latest Linux Arduino IDE version supported by Arduino 101 is arduino-1.6.11-linux64.tar.xz.

Copy arduino-1.6.11-linux64.tar.xz to the Ubuntu folder in the VMware environment.

Set up the environment for Arduino 101*

Untar arduino-1.6.11-linux64.tar.xz and install the Arduino IDE software.

sudo apt-get update
tar -xvf arduino-1.6.11-linux64.tar.xz
sudo mv arduino-1.6.11 /opt
cd /opt/arduino-1.6.11
sudo ./install.sh

Bring up Arduino on Ubuntu*

1. Connect the Arduino 101 platform to the virtual machine running in the VMware workstation.

cd /opt/arduino-1.6.11
sudo ./arduino

Figure 1: Bringing up the Arduino IDE* on the Ubuntu* command line

2. Choose Tools > Board > Boards Manager to launch the board manager to install the Intel® Curie board.

Figure 2: Launching the Boards Manager

Figure 3: Installing Intel® Curie boards

3. Choose Tools > Port and select the Arduino 101 port.

Figure 4: Selecting the Arduino 101* port

4. Choose Tools > Board and select the Arduino 101 board.

Figure 5: Selecting the Arduino 101* board

5. Choose File > Examples > Basics > Blink to open the Blink sketch, then upload it to the board.

Figure 6: Uploading the Blink sketch

The LED on the Arduino 101 platform should now blink.

Figure 7: Arduino 101* with LED Blinking

Arduino 101* Libraries

The Arduino 101* libraries are a collection of code that provides extra functionality for sketches. They make it easy to connect to Bluetooth* LE, sensors, and timers. To experiment with the built-in Arduino 101 libraries, visit https://www.arduino.cc/en/Guide/Libraries. The Arduino 101 libraries are based on the open source corelibs. If you are interested in experimenting with the corelibs, visit 01.org's GitHub*, but they are not required to use the Arduino 101 libraries.

Summary

We have described how to launch the Arduino 101 platform on Ubuntu in VMware. Experiment with the Arduino 101 libraries, the Grove* Starter Kit Plus, and more sensors and shields to enjoy the power of the Intel Curie module.

Helpful References

About the author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.

Open vSwitch* with DPDK Overview


This article presents a high-level overview of Open vSwitch* with the Data Plane Development Kit (OvS-DPDK)—the high performance, open source virtual switch—and links to further technical articles that dive deeper into individual OvS-DPDK features. This article was written for users of OvS who want to know more about DPDK integration.

Note: Users can download a zip file of the OvS master branch or the 2.6 branch; installation steps are available for both the master branch and the 2.6 branch.

OvS-DPDK High-level Architecture

Open vSwitch is a production quality, multilayer virtual switch licensed under the open source Apache* 2.0 license. It supports SDN control semantics via the OpenFlow* protocol and its OVSDB management interface. It is available from openvswitch.org, GitHub*, and is also consumable through Linux* distributions.

Native Open vSwitch generally forwards packets via the kernel space data path (see Figure 1). In the kernel data path, the switching “fastpath” consists of a simple flow table indicating forwarding/action rules for packets that are received. Exception packets (first packet in a flow) do not match any existing entries in the kernel fastpath table and are sent to the user space daemon for processing (slowpath). After user space handles the first packet in the flow, the daemon will then update the flow table in kernel space so that subsequent packets in the flow can be processed in the fastpath and not sent to user space. Following this approach, native OvS can eliminate the costly context switch between kernel and user space for a large percentage of received packets. However, the achievable packet throughput is limited by the forwarding bandwidth of the Linux network stack, which is not suited for use cases requiring a high rate of packet processing; for example, Telco.

DPDK is a set of user space libraries that enable a user to create optimized, high-performance packet processing applications (information available at DPDK.org). It offers a series of Poll Mode Drivers (PMDs), which enable the direct transfer of packets between user space and the physical interface, bypassing the kernel network stack. This offers a significant performance boost over kernel forwarding through the elimination of both interrupt handling and traversal of the kernel network stack. By integrating OvS with DPDK, the switching fastpath moves to user space, while the exception path remains the same path traversed by packets in the kernel fastpath case. The integration of DPDK with OvS is illustrated at a high level in Figure 1.

Integration of Data Plane Development Kit data plane with native Open vSwitch*

Figure 1: Integration of Data Plane Development Kit data plane with native Open vSwitch*.

Figure 2 below shows the high-level architecture of OvS-DPDK. OvS switching ports are represented by network devices (or netdevs). Netdev-dpdk is a DPDK-accelerated network device that uses DPDK to accelerate switch I/O, through three separate interfaces: one physical interface (handled by the librte_eth library within DPDK), and two virtual interfaces (librte_vhost and librte_ring). These interface with the physical and virtual devices connected to the virtual switch.

Other OvS architectural layers provide further functionality and interface with, for example, the SDN controller. Dpif-netdev provides user space forwarding and ofproto is the OvS library that implements an OpenFlow switch. It talks to OpenFlow controllers over the network and to switch hardware or software through an ofproto provider. The ovsdb server maintains the up-to-date switching table information for this OvS instance and communicates this to the SDN controller. The following section provides details of the switching/forwarding tables, with further information on the OvS architecture available through the openvswitch.org website.

Open vSwitch* with Data Plane Development Kit high-level architecture

Figure 2: Open vSwitch* with Data Plane Development Kit high-level architecture.

OvS-DPDK Switching Table Hierarchy

A packet entering OvS-DPDK from a physical or virtual interface is assigned an identifier (a hash based on its header fields), which is then matched against an entry in one of three main switching tables: the exact match cache (EMC), the data path classifier (dpcls), or the ofproto classifier. A packet's identifier traverses each of these three tables in order unless a match is found, in which case the actions indicated by the matching rule are executed and the packet is forwarded out of the switch upon completion of all actions. This scheme is illustrated in Figure 3.

Open vSwitch* with Data Plane Development Kit switching table hierarchy

Figure 3: Open vSwitch* with Data Plane Development Kit switching table hierarchy.

The three tables have different characteristics and associated throughput performance/latency. The EMC offers fastest processing for a limited number of table entries. The packet’s identifier must exactly match the entry in this table for all fields—the 5-tuple of source IP and port, destination IP and port, and protocol—for highest speed processing or it will “miss” on the EMC and pass through to the dpcls. The dpcls contains many more table entries (arranged in multiple subtables) and enables wildcard matching of the packet identifier (for example, destination IP and port are specified but any source is allowed). This gives approximately half the throughput performance of the EMC and caters to a much larger number of table entries. Packet flows matched in the dpcls are installed in the EMC so that subsequent packets with the same identifier can be processed at the highest speed.

A miss on the dpcls results in the packet identifier being sent to the ofproto classifier so that the OpenFlow controller can decide on the action. This path is the least performant, >10x slower than the EMC. Matches in the ofproto classifier result in new table entries being established in the faster switching tables so that subsequent packets in the same flow can be processed more quickly.
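The lookup cascade can be modeled with a simplified sketch (the flow rules and actions below are made up for illustration; the real EMC/dpcls/ofproto structures are considerably more involved):

```python
# Simplified model of the EMC -> dpcls -> ofproto lookup cascade.
# A flow key is the 5-tuple; the EMC needs an exact match, the dpcls
# allows wildcarded fields, and an ofproto miss falls through to the
# (slow) controller path. All rules here are hypothetical.

emc = {("10.0.0.1", 1234, "10.0.0.2", 80, "tcp"): "output:1"}
dpcls = [({"dst_ip": "10.0.0.2", "dst_port": 80}, "output:1")]  # wildcard rules

def lookup(src_ip, src_port, dst_ip, dst_port, proto):
    key = (src_ip, src_port, dst_ip, dst_port, proto)
    if key in emc:                                    # exact-match cache
        return "emc", emc[key]
    fields = {"src_ip": src_ip, "src_port": src_port,
              "dst_ip": dst_ip, "dst_port": dst_port, "proto": proto}
    for rule, action in dpcls:                        # wildcard classifier
        if all(fields[f] == v for f, v in rule.items()):
            emc[key] = action                         # install in EMC for next time
            return "dpcls", action
    return "ofproto", None                            # slow path / controller

print(lookup("10.0.0.1", 1234, "10.0.0.2", 80, "tcp"))   # ('emc', 'output:1')
print(lookup("10.0.0.9", 5555, "10.0.0.2", 80, "tcp"))   # ('dpcls', 'output:1')
print(lookup("10.0.0.9", 5555, "10.0.0.2", 80, "tcp"))   # ('emc', 'output:1') - installed
```

Note how the dpcls hit installs an exact-match entry so the third lookup hits the EMC, mirroring how OvS promotes flows to the fastest table.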

OvS-DPDK Features and Performance

At the time of this writing, the following high-level OvS-DPDK features are available on the OvS master code branch:

  • DPDK support for v16.07 (supported version increments with each new DPDK release)
  • vHost user support
  • vHost reconnect
  • vHost multiqueue
  • Native tunneling support: VxLAN, GRE, Geneve
  • VLAN support
  • MPLS support
  • Ingress/egress QoS policing
  • Jumbo frame support
  • Connection tracking
  • Statistics: DPDK vHost and extended DPDK stats
  • Debug: DPDK pdump support
  • Link bonding
  • Link status
  • VFIO support
  • ODL/OpenStack detection of DPDK ports
  • vHost user NUMA awareness

A recent performance comparison between native OvS and OvS-DPDK is highlighted in Figure 4. This shows the throughput in packets-per-second for the Phy-OvS-Phy use case, indicating a ~10x performance enhancement for OvS-DPDK over native OvS, increasing to ~12x with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled (labelled 1C2T, or one physical core with two logical threads, in the figure legend). Similarly, the Phy-OvS-VM-OvS-Phy use case demonstrates a ~9x performance enhancement for OvS-DPDK over native OvS.

Performance comparison - native Open vSwitch* (OvS) and OvS with Data Plane Development Kit

Figure 4: Performance comparison - native Open vSwitch* (OvS) and OvS with Data Plane Development Kit.

The hardware and software configuration for this data, along with further use case results, can be found in the Intel® Open Network Platform (Intel® ONP) performance report.

OvS-DPDK Availability

OvS-DPDK is available in the upstream openvswitch.org repository and is also available through Linux distributions as below. The latest milestone release is OvS 2.6 (September 2016), and releases are made with a six-month cadence.

Code is available for download as follows: OvS master branch; OvS 2.6 release branch. Installation steps for the master branch are available as well as installation steps for the 2.6 release branch.

Packaged versions of OvS with DPDK are available from:

Red Hat* OpenStack Platform

Ubuntu*

Mirantis* OpenStack

Open Platform for NFV*

Additional Information

To learn more about OvS-DPDK, check out the following videos and articles on Intel® Developer Zone, 01.org, Intel® Network Builders and Intel® Network Builders University.

User guides:

Developer guides:

Articles:

OvS with DPDK milestone release webinars:

INB university:

White paper:

Have a question? Feel free to follow up with the query on the Open vSwitch discussion mailing thread.

About the Author

Robin Giller is a program manager with the Intel Network Platforms Group.

Getting Started with Intel® Software Optimization for Theano* and Intel® Distribution for Python*



Summary

Theano is a Python* library developed at the LISA lab to define, optimize, and evaluate mathematical expressions, including those involving multi-dimensional arrays (numpy.ndarray). Intel® optimized-Theano is a new version based on Theano 0.8.0rc1 that is optimized for Intel® architecture and enables Intel® Math Kernel Library (Intel® MKL) 2017. The latest version of Intel MKL includes optimizations for the Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® AVX-512 instruction sets, which are supported by Intel® Xeon® and Intel® Xeon Phi™ processors.

Theano can be installed and used with several combinations of development tools and libraries on a variety of platforms. This tutorial provides one such recipe describing steps to build and install Intel optimized-Theano with Intel® compilers and Intel MKL 2017 on CentOS*- and Ubuntu*-based systems. We also verify the installation by running common industry-standard benchmarks like MNIST*, DBN-Kyoto*, LSTM* and ImageNet*.

Prerequisites

Intel® Compilers and Intel® Math Kernel Library 2017

This tutorial assumes that the Intel compilers (C/C++ and Fortran) are already installed and verified. If not, the Intel compilers can be downloaded and installed as part of Intel® Parallel Studio XE or installed independently.

Installing Intel MKL 2017 is optional when using the Intel® Distribution for Python*. For other Python distributions, Intel MKL 2017 can be obtained as part of Intel Parallel Studio XE 2017 or downloaded and installed for free under the community license. To download it, first register here for a free community license and then follow the installation instructions.

Python* Tools

In this tutorial, the Intel® Distribution for Python* is used because it provides ready access to tools and techniques that are enabled and verified for higher performance on Intel architecture. This allows the use of Intel-optimized, precompiled tools such as NumPy* and SciPy* without having to build and install them.

The Intel Distribution for Python is available as part of Intel Parallel Studio XE or can be independently downloaded for free from here.

Instructions to install the Intel Distribution for Python are given below. This article assumes that the Python installation is completed in the local user account.

Python 2.7
tar -xvzf l_python27_p_2017.0.028.tgz
cd l_python27_p_2017.0.028
./install.sh

Python 3.5
tar -xvzf l_python35_p_2017.0.028.tgz
cd l_python35_p_2017.0.028
./install.sh

Using Anaconda, create an independent user environment with the steps given below. The required NumPy, SciPy, and Cython packages are also installed as part of the environment.

Python 2.7
conda create -n pcs_theano_2 -c intel python=2 numpy scipy cython
source activate pcs_theano_2

Python 3.5
conda create -n pcs_theano_2 -c intel python=3 numpy scipy cython
source activate pcs_theano_2

Alternatively, NumPy and SciPy can also be built and installed from source as given in Appendix A. Steps to install other Python development tools are also shown there, which may be required if a non-Intel distribution of Python is used.

 

Building and installing Intel® Software Optimization for Theano*

The branch of Theano optimized for Intel architecture can be checked out and installed from the following git repository:

git clone https://github.com/intel/theano.git theano
cd theano
python setup.py build
python setup.py install
theano-cache clear

An example Theano configuration file is given below for reference. To use the Intel compilers and specify the compiler flags Theano should use, create a copy of this file in the user's home directory.

vi ~/.theanorc

[cuda]
root = /usr/local/cuda

[global]
device = cpu
floatX = float32
cxx = icpc
mode = FAST_RUN
openmp = True
openmp_elemwise_minsize = 10
[gcc]
cxxflags = -qopenmp -march=native -O3 -vec-report3 -fno-alias -opt-prefetch=2 -fp-trap=none
[blas]
ldflags = -lmkl_rt

 

Verify Theano and NumPy Installation

It is important to verify which versions of the Theano and NumPy libraries are referenced once they are imported in Python. The versions of NumPy and Theano referenced in this article are verified as follows:

python -c "import numpy; print (numpy.__version__)"
->1.11.1
python -c "import theano; print (theano.__version__)"
-> 0.9.0dev1.dev-*
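If you want to script this check, a small helper can compare the reported version strings against the minimums verified above. The parsing below is deliberately simple, and the version strings are the ones shown in this article:

```python
# Quick scripted check that the imported NumPy and Theano meet the
# versions verified above (1.11.1 and 0.9.0dev1 respectively).

def version_tuple(v):
    """Turn '1.11.1' or '0.9.0dev1.dev-x' into a comparable numeric tuple."""
    parts = []
    for field in v.split(".")[:3]:
        digits = ""
        for ch in field:
            if ch.isdigit():
                digits += ch
            else:
                break                     # stop at suffixes like 'dev1'
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

assert version_tuple("1.11.1") >= (1, 11), "NumPy too old"
assert version_tuple("0.9.0dev1.dev-x") >= (0, 9), "Theano too old"
print("versions OK")
```

In practice you would pass numpy.__version__ and theano.__version__ to version_tuple instead of the literal strings.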

It is also important to verify that the installed versions of NumPy and Theano are using Intel MKL.

python -c "import theano; print (theano.numpy.show_config())"

Fig 1. Desired output for theano.numpy.show_config()

 

Benchmarks

DBN-Kyoto and ImageNet benchmarks are available in the theano/democase directory.

DBN-Kyoto

Procuring the Dataset for Running DBN-Kyoto

The sample dataset for DBN-Kyoto can be downloaded from Dropbox via the following link: https://www.dropbox.com/s/ocjgzonmxpmerry/dataset1.pkl.7z?dl=0. Unzip the file and save it in the theano/democase/DBN-Kyoto directory.

Prerequisites

Dependencies for training DBN-Kyoto can be installed using Anaconda or built from the sources provided in the tools directory. Due to conflicts between the pandas library and Python 3, this benchmark is validated only for Python 2.7.

Python 2.7
conda install -c intel --override-channels pandas
conda install imaging

Alternatively the dependencies can also be installed from source as given in Appendix B.

Running DBN-Kyoto on CPU

The provided run.sh script can be used to download the dataset (if not already present) and start the training.

cd theano/democase/DBN-Kyoto/
./run.sh

 

MNIST

In this article, we show how to train a neural network on MNIST using Lasagne, which is a lightweight library to build and train neural networks in Theano. The Lasagne library will be built and installed using Intel compilers.

Download the MNIST Database

The MNIST database can be downloaded from http://yann.lecun.com/exdb/mnist/. We downloaded images and labels for both training and validation data. 

Installing Lasagne Library

The latest version of the Lasagne library can be built and installed from the Lasagne git repository as given below:

Python 2.7 and Python 3.5
git clone https://github.com/Lasagne/Lasagne.git
cd Lasagne
python setup.py build
python setup.py install

Training

cd Lasagne/examples
python mnist.py [model [epochs]]
                    --  where model can be mlp - simple multi layer perceptron (default) or
                         cnn - simple convolution neural network.
                         and epochs = 500 (default)

 

AlexNet

Procuring the ImageNet dataset for AlexNet training

The ImageNet dataset can be obtained from the image-net.org website.

Prerequisites

Dependencies for training AlexNet can be installed using Anaconda or installed from the Fedora EPEL source repository. Currently, support for Hickle (a required dependency for preprocessing the data) is only available in Python 2; it is not supported on Python 3.

  • Installing h5py, pyyaml, pyzmq using Anaconda:
conda install h5py
conda install -c intel --override-channels pyyaml pyzmq
  • Installing Hickle (HDF5-based clone of Pickle):
git clone https://github.com/telegraphic/hickle.git
cd hickle
python setup.py build
python setup.py install

Alternatively, the dependencies can also be installed from source as given in Appendix B.

Preprocessing the ImageNet Dataset

Preprocessing is required to dump Hickle files and create labels for training and validation data.

  • Modify the paths.yaml file in the preprocessing directory to update the path for the dataset. One example of paths.yaml file is given below for reference.
cat theano/democase/alexnet_grp1/preprocessing/paths.yaml

train_img_dir: '/mnt/DATA2/TEST/ILSVRC2012_img_train/'
# the dir that contains folders like n01440764, n01443537, ...

val_img_dir: '/mnt/DATA2/TEST/ILSVRC2012_img_val/'
# the dir that contains ILSVRC2012_val_00000001~50000.JPEG

tar_root_dir: '/mnt/DATA2/TEST/parsed_data_toy'  # dir to store all the preprocessed files
tar_train_dir: '/mnt/DATA2/TEST/parsed_data_toy/train_hkl'  # dir to store training batches
tar_val_dir: '/mnt/DATA2/TEST/parsed_data_toy/val_hkl'  # dir to store validation batches
misc_dir: '/mnt/DATA2/TEST/parsed_data_toy/misc'
# dir to store img_mean.npy, shuffled_train_filenames.npy, train.txt, val.txt

meta_clsloc_mat: '/mnt/DATA2/imageNet-2012-images/ILSVRC2014_devkit/data/meta_clsloc.mat'
val_label_file: '/mnt/DATA2/imageNet-2012-images/ILSVRC2014_devkit/data/ILSVRC2014_clsloc_validation_ground_truth.txt'
# although from ILSVRC2014, these 2 files still work for ILSVRC2012

# caffe style train and validation labels
valtxt_filename: '/mnt/DATA2/TEST/parsed_data_toy/misc/val.txt'
traintxt_filename: '/mnt/DATA2/TEST/parsed_data_toy/misc/train.txt'
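As a sanity check, the required entries in paths.yaml can be verified with a short script. To keep the example dependency-free it parses key: value lines directly instead of using a YAML library; the sample text and key list mirror the example above:

```python
# Minimal sanity check for the paths.yaml entries shown above.

REQUIRED = ["train_img_dir", "val_img_dir", "tar_root_dir",
            "tar_train_dir", "tar_val_dir", "misc_dir"]

def parse_simple_yaml(text):
    """Parse flat 'key: value' lines, ignoring comments and quotes."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if ":" in line:
            key, _, value = line.partition(":")
            entries[key.strip()] = value.strip().strip("'\"")
    return entries

sample = """
train_img_dir: '/mnt/DATA2/TEST/ILSVRC2012_img_train/'  # training images
val_img_dir: '/mnt/DATA2/TEST/ILSVRC2012_img_val/'
tar_root_dir: '/mnt/DATA2/TEST/parsed_data_toy'
tar_train_dir: '/mnt/DATA2/TEST/parsed_data_toy/train_hkl'
tar_val_dir: '/mnt/DATA2/TEST/parsed_data_toy/val_hkl'
misc_dir: '/mnt/DATA2/TEST/parsed_data_toy/misc'
"""

cfg = parse_simple_yaml(sample)
missing = [k for k in REQUIRED if k not in cfg]
assert not missing, "missing keys: %s" % missing
print("paths.yaml keys look complete")
```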

A toy data set can be created using the provided script, generate_toy_data.sh.

cd theano/democase/alexnet_grp1/preprocessing
chmod u+x make_hkl.py make_labels.py make_train_val_txt.py
./generate_toy_data.sh

AlexNet training on CPU

  • Modify the config.yaml file to update the path to the preprocessed dataset:
cd theano/democase/alexnet_grp1/

# Sample changes to the path for input(label_folder, mean_file) and output(weights_dir)
label_folder: /mnt/DATA2/TEST/parsed_data_toy/labels/
mean_file: /mnt/DATA2/TEST/parsed_data_toy/misc/img_mean.npy
weights_dir: ./weight/  # directory for saving weights and results
  • Similarly, modify the spec.yaml file to update the path to the parsed toy data set:
# Directories
train_folder: /mnt/DATA2/TEST/parsed_data_toy/train_hkl_b256_b256_bchw/
val_folder: /mnt/DATA2/TEST/parsed_data_toy/val_hkl_b256_b256_bchw/
  • Start the training:
./run.sh

Large Movie Review Dataset (IMDB)

The Large Movie Review Dataset is an example of a recurrent neural network using a Long Short-Term Memory (LSTM) model. The IMDB dataset is used for sentiment analysis of movie reviews using the LSTM model.

Procuring the dataset:

Obtain the imdb.pkl file from http://www-labs.iro.umontreal.ca/~lisa/deep/data/ and extract the file to a local folder.

Preprocessing

The http://deeplearning.net/tutorial/lstm.html page provides two scripts:

imdb.py – Handles the loading and preprocessing of the IMDB dataset.

lstm.py – The primary script that defines and trains the model.

Copy both of the above files into the same folder as the imdb.pkl file.

Training

Training can be started using the following command:

THEANO_FLAGS="floatX=float32" python lstm.py

Troubleshooting

Error 1: In some cases, you might get errors saying that libmkl_rt.so or libimf.so cannot be opened. In that case, locate the library:

find /opt/intel -name library_name.so

Add the paths to the /etc/ld.so.conf file and run the ldconfig command to link the libraries. Also make sure the Intel MKL installation paths are set correctly in the LD_LIBRARY_PATH environment variable.

Error 2: AlexNet preprocessing error for toy data

python make_hkl.py toy
generating toy dataset ...
Traceback (most recent call last):
  File "make_hkl.py", line 293, in <module>
    train_batchs_per_core)
ValueError: xrange() arg 3 must not be zero

The default number of processes used to preprocess ImageNet is currently set to 16. For the toy dataset this creates more processes than there are batches to split, causing the application to crash. To resolve this issue, change the number of processes in file Alexnet_CPU/preprocessing/make_hkl.py:258 from 16 to 2. When preprocessing the full data set, however, a higher value of num_process is recommended for faster preprocessing.

num_process = 2
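The crash above can be reproduced in isolation: the preprocessing script divides the number of batches across processes with integer division, and a range step of zero raises the ValueError. The numbers and function below are illustrative, not taken from make_hkl.py:

```python
# Why the toy set crashes with num_process = 16: integer division gives a
# per-process batch count of 0, and range() rejects a step of zero.

def split_batches(n_batches, num_process):
    batches_per_core = n_batches // num_process   # 0 when n_batches < num_process
    return list(range(0, n_batches, batches_per_core))  # raises if step is 0

try:
    split_batches(8, 16)          # toy dataset: fewer batches than processes
except ValueError as e:
    print("crash:", e)

print(split_batches(8, 2))        # num_process = 2 works for the toy set
```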

Error 3: Referencing the current version of NumPy when installing the Intel(R) Distribution for Python* through Conda

If installing the Intel(R) Distribution for Python from within Conda instead of through the Intel(R) Distribution for Python installer, make sure that you set the PYTHONNOUSERSITE environment variable to True. This enables the Conda environment to reference the correct version of NumPy. This is a known error in Conda; more information can be found here.

export PYTHONNOUSERSITE=True

Resources

Appendix A

Installing Python* Tools For Other Python Distribution

CentOS:
Python 2.7 - sudo yum install python-devel python-setuptools
Python 3.5 - sudo yum install python35-libs python35-devel python35-setuptools
//Note - Python 3.5 packages can be obtained from Fedora EPEL source repository
Ubuntu:
Python 2.7 - sudo apt-get install python-dev python-setuptools
Python 3.5 - sudo apt-get install libpython3-dev python3-dev python3-setuptools
  • In case pip and Cython are not installed on the system, they can be installed using the following commands:
sudo -E easy_install pip
sudo -E pip install cython

 

Installing NumPy

NumPy is the fundamental package needed for scientific computing with Python. This package contains:

  1. A powerful N-dimensional array object
  2. Sophisticated (broadcasting) functions
  3. Tools for integrating C/C++ and Fortran code
  4. Useful linear algebra, Fourier transform, and random number capabilities.

Note: An older version of the NumPy library can be removed by verifying its existence and deleting the related files. However, in this tutorial all the remaining libraries are installed in the user's local directory, so this step is optional. If required, old versions can be cleaned up as follows:

  • Verify if old version exists:
python -c "import numpy; print numpy.version"
<module 'numpy.version' from '/home/plse/.local/lib/python2.7/site-packages/numpy-1.11.0rc1-py2.7-linux-x86_64.egg/numpy/version.pyc'>
  • Delete any previously installed NumPy packages:
rm -r /home/plse/.local/lib/python2.7/site-packages/numpy-1.11.0rc1-py2.7-linux-x86_64.egg
  • Building and installing NumPy optimized for Intel architecture:
git clone https://github.com/pcs-theano/numpy.git
//update the site.cfg file to point to the required MKL directory. This step is optional if Parallel Studio or Intel MKL was installed in the default /opt/intel directory.
python setup.py config --compiler=intelem build_clib --compiler=intelem build_ext --compiler=intelem install --user

 

Installing SciPy

SciPy is an open source Python library used for scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.

  • Building and installing SciPy:
tar -xvzf scipy-0.16.1.tar.gz    (can be downloaded from: https://sourceforge.net/projects/scipy/files/scipy/0.16.1/  or
     obtain the latest sources from https://github.com/scipy/scipy/releases)
cd scipy-0.16.1/
python setup.py config --compiler=intelem --fcompiler=intelem build_clib --compiler=intelem --fcompiler=intelem build_ext --compiler=intelem --fcompiler=intelem install --user

Appendix B

Building and installing benchmark dependencies from source

DBN-Kyoto

//Untar and install all the provided tools:

cd theano/democase/DBN-Kyoto/tools
tar -xvzf Imaging-1.1.7.tar.gz
cd Imaging-1.1.7
python setup.py build
python setup.py install --user

cd theano/democase/DBN-Kyoto/tools
tar -xvzf python-dateutil-2.4.1.tar.gz
cd python-dateutil-2.4.1
python setup.py build
python setup.py install --user

cd theano/democase/DBN-Kyoto/tools
tar -xvzf pytz-2014.10.tar.gz
cd pytz-2014.10
python setup.py build
python setup.py install --user

cd theano/democase/DBN-Kyoto/tools
tar -xvzf pandas-0.15.2.tar.gz
cd pandas-0.15.2
python setup.py build
python setup.py install --user

 

AlexNet

  • Installing dependencies for AlexNet from source

Access to some of the add-on packages from the Fedora EPEL source repository may be required for running AlexNet on the CPU.

wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-8.noarch.rpm
sudo rpm -ihv epel-release-7-8.noarch.rpm
sudo yum install hdf5-devel
sudo yum install zmq-devel
sudo yum install zeromq-devel
sudo yum install python-zmq
  • Installing Hickle (HDF5-based clone of Pickle):
git clone https://github.com/telegraphic/hickle.git
python setup.py build install --user
  • Installing h5py (Python interface to HDF5 binary data format):
git clone https://github.com/h5py/h5py.git
python setup.py build install --user

 

References

 

About The Authors

Sunny Gogar
Software Engineer

Sunny Gogar received a Master’s degree in Electrical and Computer Engineering from the University of Florida, Gainesville and a Bachelor’s degree in Electronics and Telecommunications from the University of Mumbai, India.  He is currently a software engineer with Intel Corporation's Software and Services Group. His interests include parallel programming and optimization for Multi-core and Many-core Processor Architectures.

 Meghana Rao received a Master’s degree in Engineering and Technology Management from Portland State University and a Bachelor’s degree in Computer Science and Engineering from Bangalore University, India.  She is a Developer Evangelist with the Software and Services Group at Intel focused on Machine Learning and Deep Learning.

 

How to Emulate Persistent Memory on an Intel® Architecture Server


Introduction

This tutorial provides a method for setting up persistent memory (PMEM) emulation using regular dynamic random access memory (DRAM) on an Intel® processor using a Linux* kernel version 4.3 or higher. The article covers the hardware configuration and walks you through setting up the software. After following the steps in this article, you'll be ready to try the PMEM programming examples in the NVM Library at pmem.io.

Why do this?

If you’re a software developer who wants to get started early developing software or preparing your applications to have PMEM awareness, you can use this emulation for development before PMEM hardware is widely available.

What is persistent memory?

Traditional applications organize their data between two tiers: memory and storage. Emerging PMEM technologies introduce a third tier. This tier can be accessed like volatile memory, using processor load and store instructions, but it retains its contents across power loss, like storage. Because this emulation uses DRAM, however, data will not be retained across power cycles.

Hardware and System Requirements

Emulation of persistent memory is based on DRAM memory that will be seen by the operating system (OS) as a Persistent Memory region. Because it is a DRAM-based emulation it is very fast, but will lose all data upon powering down the machine. The following hardware was used for this tutorial:

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • # of cores per chip: 22 (only used single core)
  • # of sockets: 2
  • Chipset: Intel® C610 chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect

Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron* – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital* (WD1002FAEX)

Operating system

CentOS* 7.2 with kernel 4.5.3

Table 1 - System configuration used for the PMEM emulation.

Linux* Kernel

Linux kernel 4.5.3 was used during development of this tutorial. Support for persistent memory devices and emulation has been present in the kernel since version 4.0; however, a kernel newer than 4.2 is recommended for easier configuration. The emulation should work with any Linux distribution able to handle an official kernel. To configure the proper driver installation, run make nconfig and enable the driver as shown below. Figures 1 to 5 show the correct settings for NVDIMM Support in the Kernel Configuration menu.

$ make nconfig

        -> Device Drivers -> NVDIMM Support -><M>PMEM; <M>BLK; <*>BTT

Figure 1: Set up device drivers.

Figure 2: Set up the NVDIMM device.

Figure 3: Set up the file system for Direct Access support.

Figure 4: Set up for Direct Access (DAX) support.

Figure 5: NVDIMM Support property.

The kernel will offer these regions to the PMEM driver so they can be used for persistent storage. Figures 6 and 7 show the correct setting for the processor type and features in the Kernel Configuration menu.

$ make nconfig

        -> Processor type and features<*>Support non-standard NVDIMMs and ADR protected memory

Figures 6 and 7 show these selections in the Kernel Configuration menu.

Figure 6: Set up the processor to support NVDIMMs.

Figure 7: Enable non-standard NVDIMMs and ADR protected memory.

Now you are ready to build your kernel using the instructions below.

$ make -jX

        Where X is the number of cores on the machine

During the new kernel build process, there is a performance benefit to compiling the kernel in parallel. An experiment scaling from one thread to multiple threads showed that compilation can be up to 95 percent faster than with a single thread. With the time saved by multi-threaded compilation, the whole new kernel setup goes much faster. Figures 8 and 9 show the CPU utilization and the performance gain chart for compiling with different numbers of threads.

Figure 8: Compiling the kernel sources.

Figure 9: Performance gain for compiling the source in parallel.

Install the Kernel

# make modules_install install

Figure 10: Installing the kernel.

Reserve a memory region by modifying the kernel command-line parameters so that it appears to the OS as a persistent memory location. The region of memory used runs from ss to ss+nn, where [KMG] refers to kilo, mega, or giga:

memmap=nn[KMG]!ss[KMG]

For example, memmap=4G!12G reserves 4 GB of memory between the 12th and 16th GB. Configuration is done within GRUB and varies between Linux distributions. Here is an example of a GRUB configuration.
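The arithmetic behind the memmap= parameter can be captured in a tiny helper (the function name is invented for illustration):

```python
# Compose the memmap= kernel parameter described above: reserve size_gb of
# RAM starting at offset start_gb as emulated persistent memory, and report
# where the reserved region ends.

def pmem_memmap(size_gb, start_gb):
    end_gb = start_gb + size_gb
    return "memmap=%dG!%dG" % (size_gb, start_gb), end_gb

arg, end = pmem_memmap(4, 12)
print(arg)   # the example from the text: 4 GB reserved starting at 12 GB
print(end)   # the reservation ends at the 16th GB
```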

Under CentOS 7.0

# vi /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=nn[KMG]!ss[KMG]"
On BIOS-based machines:
# grub2-mkconfig -o /boot/grub2/grub.cfg

Figure 11 shows the added PMEM statement in the GRUB file. Figure 12 shows the instructions to make the GRUB configuration.

Figure 11: Define PMEM regions in the /etc/default/grub file.

Figure 12: Generate the boot configuration file based on the grub template.

After the machine reboots, you should be able to see the emulated device as /dev/pmem0…pmem3. Requesting reserved memory regions for persistent memory emulation can result in split memory ranges defining persistent (type 12) regions, as shown in Figure 13. A general recommendation is to either use memory from the 4 GB+ range (memmap=nnG!4G) or to check the e820 memory map up front and fit within it. If you don't see the device, verify the correctness of the memmap setting in the grub file as shown in Figure 11, followed by dmesg(1) analysis as shown in Figure 13, where you should be able to see the reserved ranges in the dmesg output.

Figure 13: Persistent memory regions are highlighted as (type 12).

You'll see that there can be multiple non-overlapping regions reserved as a persistent memory. Putting multiple memmap="...!..." entries will result in multiple devices exposed by the kernel and visible as /dev/pmem0, /dev/pmem1, /dev/pmem2, …

DAX - Direct Access

The DAX (direct access) extensions to the filesystem create a PMEM-aware environment. Some distros, such as Fedora* 24 and later, already have DAX/PMEM support built in by default and have NVML available as well. One quick way to check whether the kernel has DAX and PMEM built in is to grep the kernel's config file, which is usually provided by the distro under /boot. Use the command below:

# egrep '(DAX|PMEM)' /boot/config-`uname -r`

The result should be something like:

CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_ARCH_HAS_PMEM_API=y

To install a filesystem with DAX (available today for ext4 and xfs):

# mkdir /mnt/pmemdir
# mkfs.ext4 /dev/pmem3
# mount -o dax /dev/pmem3 /mnt/pmemdir

Now files can be created on the freshly mounted partition and given as input to NVML pools.

Figure 14: Persistent memory blocks.

Figure 15: Making a file system.

It is additionally worth mentioning that you can emulate persistent memory with a ramdisk (for example, /dev/shm) or force PMEM-like behavior by setting the environment variable PMEM_IS_PMEM_FORCE=1. This eliminates the performance hit caused by msync(2).
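The load/store programming model that a DAX-mounted PMEM file provides can be previewed with an ordinary memory-mapped file; here a temporary file stands in for a file under /mnt/pmemdir:

```python
# Load/store-style access to a file-backed mapping, the same programming
# model a DAX-mounted PMEM file gives you.

import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "pmem_demo")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)                 # size the "persistent" region

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 4096)
    m[0:5] = b"hello"                     # stores into the mapping, not write()
    m.flush()                             # msync: make the stores durable
    m.close()

with open(path, "rb") as f:               # the data survives the mapping
    print(f.read(5))                      # b'hello'
```

With PMEM_IS_PMEM_FORCE=1 set as described above, NVML would treat such a mapping as true persistent memory and skip the msync step.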

Conclusion

By now, you know how to set up an environment where you can build a PMEM application without actual PMEM hardware. With the additional cores on an Intel® architecture server, you can quickly build a new kernel with PMEM support for your emulation environment.

References

Author(s)

Thai Le is a software engineer focusing on cloud computing and performance analysis at Intel Corporation.


Introduction to Heterogeneous Streams Library


Introduction

To efficiently utilize all available resources for task-concurrent applications on heterogeneous platforms, designers need to understand the memory architecture, the thread utilization on each platform, and the pipeline for offloading workloads to the different platforms, and they must coordinate all of these activities.

To relieve designers of the burden of implementing the necessary infrastructure, the Heterogeneous Streaming (hStreams) library provides a set of well-defined APIs to support a task-based parallelism model on heterogeneous platforms. hStreams uses the Intel® Coprocessor Offload Infrastructure (Intel® COI) to implement this infrastructure. That is, the host decomposes the workload into tasks, one or more tasks are executed on separate targets, and finally the host gathers the results from all of the targets. Note that the host can also be a target.

Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.6 contains the hStreams library, documentation, and sample code. Starting with Intel MPSS 3.7, hStreams was removed from the Intel MPSS software and became an open source project. The current version, 1.0, supports the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor as targets. hStreams binaries version 1.0.0 can be downloaded:

Users can contribute to hStreams development at https://github.com/01org/hetero-streams. The following tables summarize the tools that support hStreams in Linux and Windows:

Name of Tool (Linux*): Supported Versions

  • Intel® Manycore Platform Software Stack: 3.4, 3.5, 3.6, 3.7
  • Intel® C++ Compiler: 15.0, 16.0
  • Intel® Math Kernel Library: 11.2, 11.3

Name of Tool (Windows*): Supported Versions

  • Intel MPSS: 3.4, 3.5, 3.6, 3.7
  • Intel C++ Compiler: 15.0, 16.0
  • Intel Math Kernel Library: 11.2, 11.3
  • Visual Studio*: 11.0 (2012)

This whitepaper briefly introduces hStreams and highlights its concepts. For a full description, readers are encouraged to read the tutorial included in the hStreams package mentioned above.

Execution model concepts

This section highlights some basic concepts of hStreams: source and sink, domains, streams, buffers, and actions:

  • Streams are FIFO queues where actions are enqueued. Streams are associated with logical domains. Each stream has two endpoints: source and sink, which is bound to a logical domain.
  • Source is where the work is enqueued, and sink is where the work is executed. In the current implementation, the source process runs on an Intel Xeon processor-based machine, and the sink process runs on a machine that can be the host itself, an Intel Xeon Phi coprocessor, or, in the future, any other hardware platform. The library allows the source machine to invoke the user's defined functions on the target machine.
  • Domains represent the resources of hetero platforms. A physical domain is the set of all resources available in a platform (memory and computing). For example, an Intel Xeon processor-based machine and an Intel Xeon Phi coprocessor are two different physical domains. A logical domain is a subset of a given physical domain; it uses any subset of available cores in a physical domain. The only restriction is that two logical domains cannot be partially overlapping.
  • Buffers represent memory resources to transfer data between source and sink. In order to transfer data, the user must create a buffer by calling an appropriate API, and a corresponding physical buffer is instantiated at the sink. Buffers can have properties such as memory type (for example, DDR or HBW) and affinity (for example, sub-NUMA clustering).
  • Actions are requests to execute functions at the sinks (compute actions), to transfer data from source to sink or vice versa (memory movement actions), and to synchronize tasks among streams (synchronization actions). Actions enqueued in a stream are processed with first in, first out (FIFO) semantics: the source places the action in the queue and the sink removes it. All actions are non-blocking (asynchronous) and have completion events. Remote invocations can be user-defined functions or optimized convenience functions (for example, dgemm). Thus, a FIFO stream queue handles dependencies within a stream, while synchronization actions handle dependencies among streams.

In a typical scenario, the source-side code allocates stream resources, allocates memory, transfers data to the sink, invokes the sink to execute a predefined function, handles synchronization, and eventually terminates streams. Note that actions such as data transferring, remote invocation, and synchronization are handled in FIFO streams. The sink-side code simply executes the function that the source requested.

For example, consider the pseudo-code of a simple hStreams application that creates two streams, the source transfers data to the sinks, performs remote invocation at the sinks, and then transfers results back to the source host:

Step 1: Initialize two streams 0 and 1

Step 2: Allocate buffers A0, B0, C0, A1, B1, C1

Step 3: Use stream i, transfer memory Ai, Bi to sink (i=0,1)

Step 4: Invoke remote computing in stream i: Ai + Bi -> Ci (i=0,1)

Step 5: Transfer memory Ci back to host (i=0,1)

Step 6: Synchronize

Step 7: Terminate streams
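The seven steps above can be modeled with a short, self-contained sketch of the FIFO stream semantics (standard C++ threads only; the Stream class and all names here are illustrative, not the hStreams API):

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative model of one stream: the source enqueues actions, the sink
// (a worker thread) removes and runs them in FIFO order. Each action's
// completion event is modeled as a std::future<void>.
class Stream {
public:
    Stream() : done_(false), worker_([this] { run(); }) {}
    ~Stream() {                               // Step 7: terminate the stream
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    std::future<void> enqueue(std::function<void()> action) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(action));
        std::future<void> event = task->get_future();
        { std::lock_guard<std::mutex> lk(m_); q_.push([task] { (*task)(); }); }
        cv_.notify_one();
        return event;
    }
private:
    void run() {                              // sink side: strict FIFO removal
        for (;;) {
            std::function<void()> action;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;
                action = std::move(q_.front());
                q_.pop();
            }
            action();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_;
    std::thread worker_;
};

// Source-side flow mirroring Steps 1-7 for two streams: Ci = Ai + Bi.
std::vector<double> run_pipeline() {
    std::vector<double> A[2] = {{1, 2}, {3, 4}};      // Step 2: buffers
    std::vector<double> B[2] = {{10, 20}, {30, 40}};
    std::vector<double> C[2] = {{0, 0}, {0, 0}};
    Stream s0, s1;                                    // Step 1: two streams
    Stream* streams[2] = {&s0, &s1};
    std::future<void> last[2];
    for (int i = 0; i < 2; ++i) {
        streams[i]->enqueue([] { /* Step 3: transfer Ai, Bi (modeled) */ });
        streams[i]->enqueue([&, i] {                  // Step 4: remote compute
            for (std::size_t k = 0; k < C[i].size(); ++k)
                C[i][k] = A[i][k] + B[i][k];
        });
        last[i] = streams[i]->enqueue([] { /* Step 5: transfer Ci back */ });
    }
    for (auto& e : last) e.wait();                    // Step 6: synchronize
    return {C[0][0], C[0][1], C[1][0], C[1][1]};
}
```

Waiting only on the last event per stream works because FIFO ordering guarantees that all earlier actions in the same stream have already completed.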

The following figure illustrates the actions generated at the host: actions are placed in the corresponding streams and removed at the sinks.

hStreams provides two levels of APIs: the app API and the core API. The app API offers simple interfaces; it is targeted at novice users to quickly ramp up on the hStreams library. The core API gives advanced users the full functionality of the library. The app APIs in fact call the core-layer APIs, which in turn use Intel COI and the Symmetric Communication Interface (SCIF). Note that users can mix these two levels of API when writing their applications. For more details on the hStreams API, refer to the document Programming Guide and API Reference. The following figure illustrates the relation between the hStreams app API and the core API.

Refer to the document “Hetero Streams Library 1.0 Programming Guide and API” and the tutorial included in the hStreams download package for more information.

Building and running a sample hStreams program

This section presents sample code that uses the hStreams app API and demonstrates how to build and run the application. The sample code is an MPI program running on an Intel Xeon processor host with two Intel Xeon Phi coprocessors attached.

First, download the package from https://github.com/01org/hetero-streams. Then, follow the instructions to build and install the hStreams library on an Intel Xeon processor-based host machine, which in this case runs Intel MPSS 3.7.2. This host machine has two Intel Xeon Phi coprocessors installed and connects to a remote Intel Xeon processor-based machine. This remote machine (10.23.3.32) also has two Intel Xeon Phi coprocessors.

This sample code creates two streams; each stream runs explicitly on a separate coprocessor. An MPI rank manages these two streams.

The application consists of two parts: The source-side code is shown in Appendix A and the corresponding sink-side code is shown in Appendix B. The sink-side code contains a user-defined function vector_add, which is to be invoked by the source.

This sample MPI program is designed to run with two MPI ranks. Each MPI rank runs on a different domain (Intel Xeon processor host) and initializes two streams; each stream is responsible for communicating with one coprocessor. Each MPI rank enqueues the required actions into the streams in the following order: a memory transfer action from source to sink, a remote invocation action, and a memory transfer action from sink to source. The following app APIs are called in the source-side code:

  • hStreams_app_init: Initialize and create streams across all available Intel Xeon Phi coprocessors. This API assumes one logical domain per physical domain.
  • hStreams_app_create_buf: Create an instantiation of buffers in all currently existing logical domains.
  • hStreams_app_xfer_memory: Enqueue memory transfer action in a stream; depending on the specified direction, memory is transferred from source to sink or sink to source.
  • hStreams_app_invoke: Enqueue a user-defined function in a stream. This function is executed at the stream sink. Note that the user also needs to implement the remote target function in the sink-side program.
  • hStreams_app_event_wait: This sync action blocks until the set of specified events is completed. In this example, waiting on only the last transaction in a stream is sufficient, since FIFO ordering guarantees that all earlier actions have already completed.
  • hStreams_app_fini: Destroy hStreams internal structures and clear the library state.

Intel MPSS 3.7.2 and Intel® Parallel Studio XE 2016 update 3 are installed on the Intel® Xeon® processor E5-2600 based host machine. First, start the Intel MPSS service and set up the compiler environment variables on the host machine:

$ sudo service mpss start

$ source /opt/intel/composerxe/bin/compilervars.sh intel64

To compile the source-side code, link it against the dynamic library hstreams_source, which provides the source-side functionality:

$ mpiicpc hstream_sample_src.cpp -O3 -o hstream_sample -lhstreams_source \
    -I/usr/include/hStreams -qopenmp

The above command generates the executable hstream_sample. To generate the user kernel library for the coprocessor (the sink), compile with the -mmic flag:

$ mpiicpc -mmic -fPIC -O3 hstream_sample_sink.cpp -o ./mic/hstream_sample_mic.so \
    -I/usr/include/hStreams -qopenmp -shared

By convention, the target library is named <exec_name>_mic.so for the Intel Xeon Phi coprocessor and <exec_name>_host.so for the host. The command above generates the library hstream_sample_mic.so in the ./mic folder.

To run this application, set the environment variable SINK_LD_LIBRARY_PATH so that the hStreams runtime can find the user kernel library hstream_sample_mic.so:

$ export SINK_LD_LIBRARY_PATH=/opt/mpss/3.7.2/sysroots/k1om-mpss-linux/usr/lib64:~/work/hStreams/collateral/delivery/mic:$MIC_LD_LIBRARY_PATH

Run this program with two ranks, one rank running on this current host and one rank running on the host whose IP address is 10.23.3.32, as follows:

$ mpiexec.hydra -n 1 -host localhost ~/work/hstream_sample : -n 1 -wdir ~/work -host 10.23.3.32 ~/work/hstream_sample

Hello world! rank 0 of 2 runs on knightscorner5
Hello world! rank 1 of 2 runs on knightscorner0.jf.intel.com
Rank 0: stream 0 moves A
Rank 0: stream 0 moves B
Rank 0: stream 1 moves A
Rank 0: stream 1 moves B
Rank 0: compute on stream 0
Rank 0: compute on stream 1
Rank 0: stream 0 Xtransfer data in C back
knightscorner5-mic0
knightscorner5-mic1
Rank 1: stream 0 moves A
Rank 1: stream 0 moves B
Rank 1: stream 1 moves A
Rank 1: stream 1 moves B
Rank 1: compute on stream 0
Rank 1: compute on stream 1
Rank 1: stream 0 Xtransfer data in C back
knightscorner0-mic0.jf.intel.com
knightscorner0-mic1.jf.intel.com
Rank 0: stream 1 Xtransfer data in C back
Rank 1: stream 1 Xtransfer data in C back
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 1
sink: compute on sink in stream num: 1
C0=97.20 C1=90.20 C0=36.20 C1=157.20 PASSED!

Conclusion

hStreams provides a well-defined set of APIs allowing users to quickly design task-based applications for heterogeneous platforms. Two levels of hStreams API co-exist: The app API offers simple interfaces for novice users to quickly ramp up on the hStreams library, and the core API gives advanced users the full functionality of the rich library. This paper presents some basic hStreams concepts and illustrates how to build and run an MPI program that takes advantage of the hStreams interface.

 

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

libstdc++ Source Files


Please find the libstdc++ sources used by PSET for the Linux* product here.

Celebrating 10 Years of Intel® Threading Building Blocks


What a Journey It's Been.

Intel® Threading Building Blocks (Intel® TBB) has come a long way from where it started in 2006 to its 10-year anniversary in 2016. But on this long and winding journey, we've never lost sight of our core values of innovation and customer satisfaction.

Intel TBB is a powerful tool that lets developers leverage multi-core performance and heterogeneous computing without having to be threading or parallel programming experts. It is:

  • A tool to parallelize computationally intensive work, delivering higher-level and simpler solutions using standard C++.
  • The most feature-rich and comprehensive solution for parallel application development.
  • Highly portable, composable, affordable, and approachable, providing future-proof scalability.
  • Compiler agnostic, supporting multiple operating systems, and optimized for all Intel® architectures.

If you've been with us on this journey, we thank you for your help and support in making Intel TBB the best tool it can be.

If you haven't yet seen what Intel TBB can do for you, now's the time.

We're looking forward to the road ahead.

 

How University of Bristol Accelerated Rational Drug Design


Task-based parallel programming is the future. The University of Bristol Advanced Computing Research Centre wants to be part of that future. It provides advanced computing support to researchers, with a team of research software engineers who work with academics across a range of disciplines to help optimize research software that can be applied in industry.

With help from Intel® Threading Building Blocks (Intel® TBB), the University is able to provide a simple abstraction that will enable research software to adapt to the massively multicore future. To perform some of the calculations needed for drug design, the University uses the LigandSwap* program with a task-based parallel programming approach―with help from Intel TBB and its efficient task scheduling. The researchers found that parallelizing LigandSwap with Intel TBB required fewer than 100 lines of Intel TBB-specific code in a code base of more than 100,000 lines—and enabled a calculation that would ordinarily take 25 days to complete in just one day.

Learn all about it in the new University of Bristol case study.
 

Go-To-Market Strategies for Your Small Business B2B App


In previous articles, we’ve discussed go-to-market strategies for selling your app to consumers, but as you consider the B2B market, how would your go-to-market plan need to differ? Since B2B means Business to Business, the key difference is that you’re selling your app directly to a business, or a person representing a business, rather than selling it to a person who only represents themselves. That’s going to change the way your customer makes decisions, and how you reach them. In this article, we’ll look specifically at go-to-market strategies for B2B apps.

 

Imagine that you’ve created an app to help dentists with scheduling reminders. It automatically generates reminder emails and texts, enables quick patient confirmation, and even allows the office to include additional promotional materials as needed. You can’t just post your app somewhere and hope they’ll find it and download it, and you can’t just market it like you would a game or a utility app—you’ll need to reach out to dental practices, in the places where they’re most likely to listen, and demonstrate that your product can help them to run their business better.

Know Your Customer—Their Responsibilities and Their Journey

In this example, your product is very specific to dental offices, so your primary target will likely be the dentist or an office manager working closely with the dentist. The general principles involved in knowing your customer—defining your audience, picking your channels, and customer acquisition—are mostly the same as they would be with a consumer app, but your B2B customer is more complicated because they have to represent their company, and the company’s interests, beyond their own. They also have a different journey than an individual consumer would have—with more external considerations and more focus on hard numbers. With a consumer app, you may just need to pique someone’s interest in a fun-sounding game, but with a B2B app, you have to understand how the app improves their bottom line or fits into their business plan.

Here are some questions to answer about your target customer:

  • Know the industry:
     
    • Are there particular times of the year that will affect their interest or ability to implement new software?
       
    • Are there industry-specific processes that you should know in order to address their needs?
       
    • Are there conferences or regular industry events that would be a good place to introduce your product?
       
    • How do they usually make decisions about this aspect of the business (for example, patient communication or scheduling)?
       
    • Are there any relevant service providers that might be interested in distributing your app?
  • Know the benefits:
     
    • What pain points does this business have?
       
    • How can your app address those pain points?
       
    • How will your app help them increase sales/reduce cost/improve retention?

ROI Is King

One key thing to remember is that business consumers are extremely interested in the return on investment, or ROI. Your app needs to solve a pain point in order to be worth their time and money, and you’ll need to be able to communicate that clearly to the business. For example, the appointment reminder app could cut down on potentially lost revenue due to missed appointments, while also freeing up the office manager to work on other aspects of the business.

Relationships are Key

B2B apps tend to use a subscription model in which customers pay a monthly or annual fee to use your app within their business. This is great for your bottom line—but because the cost is higher, and because integrating your tool will likely result in procedural changes within the business, the sales effort is also likely to be longer. All this is to say, relationships are a really important part of marketing and selling B2B apps. Business customers expect there to be ongoing support and communication, and you simply have to be able to talk to people and maintain long-term relationships for this business model to work. If the dental office signs up for a one-year contract, you might plan on quarterly updates, and be available to hear feedback and provide support.

Where Can You Find Them?

Finding the audience for a B2B app will really depend on the particular market or business you’re trying to serve, but here are a few ideas to get you started:

  • Industry events/continuing education. It’s a good idea to be wherever members of your target audience will be, like the annual ADA convention, and it’s even better if you can find events that are directly tied to the pain points your product addresses, like office systems management courses geared toward dental offices. Consider a table at a conference, a banner ad on an online course, presenting at a conference, or buying ad space in a catalog.
     
  • Technology service providers/resellers. Some small businesses would prefer to hire a technology service provider to make sure all of their systems are working and up to date—and your app might be something they can include in their offering. A service provider who works with multiple dental offices would be able to sell and distribute your app to multiple customers at once.
     
  • PR. Pitch your business story to industry publications. If you’re able to get a write-up in a leading dental industry magazine, you’ll build name recognition and interest.
     
  • Videos and content. Create materials they can view on their own, and then contact you if they’re interested. Remember, they’re running a business and they're busy, so you want to make it as easy for them to learn about your product as possible.
     
  • Meetups and seminars for industry/new business owners. Beyond big industry events, look for local meet-ups and seminars for new business owners. Your local BBB or Chamber of Commerce can also be a great resource.

The Importance of Word of Mouth

We’ve already discussed the importance of relationships, but with B2B it’s also important to remember another kind of relationship—the one your clients have with one another. Word of mouth is essential, and it’s very likely that people within your targeted industry rely on and trust each other to provide recommendations—and warnings—about products and apps on the market. You might want to give away samples or trials in order to get your app out there and earn good reviews. Start with a few dentists who might want to be early adopters, and offer them incentives for trying your product, and for spreading the word. You might even want to offer a specific referral program, where they can get a month free, or a discounted premium service. Reassurance from peers is important in every industry, and once people start talking about your app, it’s no longer unknown—and business owners will be more likely to try it.

Businesses are always looking to improve their efficiency and performance, so when you're marketing to a business—particularly a small business—make sure you keep those end goals in mind. How can your app solve their pain points? What will the benefits be? The increased vetting and focus on ROI might seem like a lot at first, when you're used to working on consumer apps, but building long-term relationships with targeted customers and developing high-value apps that really meet their needs can be a very satisfying path to success. 

Advanced Bitrate Control Methods in Intel® Media SDK


Introduction

In the world of media, there is a great demand to increase encoder quality but this comes with tradeoffs between quality and bandwidth consumption. This article addresses some of those concerns by discussing advanced bitrate control methods, which provide the ability to increase quality (relative to legacy rate controls) while maintaining the bitrate constant using Intel® Media SDK/ Intel® Media Server Studio tools.

The Intel Media SDK encoder offers many bitrate control methods, which can be divided into legacy and advanced/special-purpose algorithms. This article is the second part of a two-part series on bitrate control methods in Intel® Media SDK. The legacy rate control algorithms are detailed in the first part, Bitrate Control Methods (BRC) in Intel® Media SDK; the advanced rate control methods (summarized in the table below) are explained in this article.

Rate Control | HRD/VBV Compliant | OS Supported | Usage
LA | No | Windows/Linux | Storage transcodes
LA_HRD | Yes | Windows/Linux | Storage transcodes; streaming solutions (where low latency is not a requirement)
ICQ | No | Windows | Storage transcodes (better quality with smaller file size)
LA_ICQ | No | Windows | Storage transcodes

The following tools were used to explain the concepts and generate the performance data for this article: the Intel Media SDK code samples (sample_encode and sample_multi_transcode), Intel® Video Pro Analyzer, and the Video Quality Caliper.

Look Ahead (LA) Rate Control

As the name explains, this bitrate control method looks at successive frames, or the frames to be encoded next, and stores them in a look-ahead buffer. The number of frames or the length of the look ahead buffer can be specified by the LookAheadDepth parameter. This rate control is recommended for transcoding/encoding in a storage solution.

Generally, many parameters can be used to modify the quality/performance of the encoded stream. For this particular rate control, encoding performance can be controlled by changing the size of the look ahead buffer: the LookAheadDepth parameter can be set between 10 and 100 and specifies the number of frames that the SDK encoder analyzes before encoding. As LookAheadDepth increases, so does the number of frames the encoder looks into; this increases the quality of the encoded stream, but the performance (encoding frames per second) decreases. In our experiments, this performance tradeoff was negligible for small input streams such as Sintel 1080p.

Look Ahead rate control is enabled by default in sample_encode and sample_multi_transcode, which are part of the Intel Media SDK code samples. The example below shows how to use this rate control method with the sample_encode application.

sample_encode.exe h264 -i sintel_1080p.yuv -o LA_out.264 -w 1920 -h 1080 -b 10000 -f 30 -lad 100 -la

As the value of LookAheadDepth increases, encoding quality improves, because the look ahead buffer holds more frames and the encoder has more visibility into upcoming frames.

It should be noted that LA is not HRD (Hypothetical Reference Decoder) compliant. The following picture, obtained from Intel® Video Pro Analyzer, shows an HRD buffer fullness view with “Buffer” mode enabled, where the sub-mode “HRD” is greyed out. This means no HRD parameters were passed in the stream headers, indicating that LA rate control is not HRD compliant. The left axis of the plot shows frame sizes and the right axis shows the slice QP (Quantization Parameter) values.

Figure 1: Snapshot of Intel Video Pro Analyzer analyzing an H.264 stream (Sintel 1080p), encoded using the LA rate control method.

 

Sliding Window condition:

Sliding window algorithm is a part of the Look Ahead rate control method. This algorithm is applicable for both LA and LA_HRD rate control methods by defining WinBRCMaxAvgKbps and WinBRCSize through the mfxExtCodingOption3 structure.

Sliding window condition is introduced to strictly constrain the maximum bitrate of the encoder by changing two parameters: WinBRCSize and WinBRCMaxAvgKbps. This helps in limiting the achieved bitrate which makes it a good fit in limited bandwidth scenarios such as live streaming.

  • WinBRCSize parameter specifies the sliding window size in frames. A setting of zero means that sliding window condition is disabled.
  • WinBRCMaxAvgKbps specifies the maximum bitrate averaged over a sliding window specified by WinBRCSize.

In this technique, the average bitrate in a sliding window of WinBRCSize must not exceed WinBRCMaxAvgKbps. The above condition becomes weaker as the sliding window size increases and becomes stronger if the sliding window size value decreases. Whenever this condition fails, the frame will be automatically re-encoded with a higher quantization parameter and performance of the encoder decreases as we keep encountering failures. To reduce the number of failures and to avoid re-encoding, frames within the look ahead buffer will be analyzed by the encoder. A peak will be detected when there is a condition failure by encountering a large frame in the look ahead buffer. Whenever a peak is predicted, the quantization parameter value will be increased, thus reducing the frame size.

Sliding window can be implemented by adding the following code to the pipeline_encode.cpp program in the sample_encode application.

m_CodingOption3.WinBRCMaxAvgKbps = 1.5*TargetKbps;
m_CodingOption3.WinBRCSize = 90; //3*framerate
m_EncExtParams.push_back((mfxExtBuffer *)&m_CodingOption3);

The above values were chosen when encoding sintel_1080p.yuv of 1253 frames with H.264 codec, TargetKbps = 10000, framerate = 30fps. Sliding window parameter values (WinBRCMaxAvgKbps and WinBRCSize) are subject to change when using different input options.

If WinBRCMaxAvgKbps is close to TargetKbps and WinBRCSize almost equals 1, the sliding window will degenerate into the limitation of the maximum frame size (TargetKbps/framerate).

The sliding window condition can be evaluated by checking that, in any WinBRCSize consecutive frames, the total encoded size does not exceed the budget implied by WinBRCMaxAvgKbps: for every window of WinBRCSize consecutive frames, the sum of the encoded frame sizes (in bits) must not exceed WinBRCMaxAvgKbps × 1000 × WinBRCSize / framerate.

This frame-size limiting condition can be checked after the asynchronous encoder run, once the encoded data is written back to the output file in pipeline_encode.cpp.
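The check itself can be sketched in a few lines of self-contained standard C++ (illustrative only, not Media SDK code; the function name and the use of per-frame sizes in bits are assumptions):

```cpp
#include <cstddef>
#include <vector>

// Illustrative sliding-window check: in every window of winBRCSize
// consecutive frames, the total encoded size (bits) must not exceed the
// budget implied by winBRCMaxAvgKbps over the window duration.
bool sliding_window_ok(const std::vector<double>& frameBits,
                       std::size_t winBRCSize,
                       double winBRCMaxAvgKbps,
                       double framerate) {
    // A window of winBRCSize frames spans winBRCSize / framerate seconds,
    // so the bit budget is winBRCMaxAvgKbps * 1000 * winBRCSize / framerate.
    const double budget = winBRCMaxAvgKbps * 1000.0 * winBRCSize / framerate;
    if (frameBits.size() < winBRCSize) return true;
    double window = 0.0;                 // running sum over the sliding window
    for (std::size_t i = 0; i < frameBits.size(); ++i) {
        window += frameBits[i];
        if (i >= winBRCSize) window -= frameBits[i - winBRCSize];
        if (i + 1 >= winBRCSize && window > budget) return false;
    }
    return true;
}
```

For example, with winBRCSize = 3, winBRCMaxAvgKbps = 300, and framerate = 30, each 3-frame window has a budget of 30,000 bits; three 10,000-bit frames pass, while any window exceeding that total fails.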

Look Ahead with HRD Compliance (LA_HRD) Rate Control

As Look Ahead bitrate control is not HRD compliant, there is a dedicated mode to achieve HRD compliance with the LookAhead algorithm, known as LA_HRD mode (MFX_RATECONTROL_LA_HRD). With HRD compliance, the Coded Picture Buffer should neither overflow nor underflow. This rate control is recommended in storage transcoding solutions and streaming scenarios, where low latency is not a major requirement.

To use this rate control in sample_encode, the code changes illustrated below are required.

Add the following statements to the sample_encode.cpp file within the ParseInputString() function:

else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-hrd")))
pParams->nRateControlMethod = MFX_RATECONTROL_LA_HRD;

The LookAheadDepth value can be specified on the command line when executing the sample_encode binary. The example below shows how to use this rate control method with the sample_encode application.

sample_encode.exe h264 -i sintel_1080p.yuv -o LA_out.264 -w 1920 -h 1080 -b 10000 -f 30 -lad 100 -hrd

In the following graph, the LookAheadDepth (lad) value is 100.


Figure 2: A snapshot of Intel® Video Pro Analyzer (VPA) verifying that LA_HRD rate control is HRD compliant. The buffer fullness view is activated by selecting “Buffer” mode with “HRD” chosen as the sub-mode.

The above figure shows the HRD buffer fullness view with “Buffer” mode enabled in Intel VPA, with the sub-mode “HRD” selected. The horizontal red lines show the upper and lower limits of the buffer, and the green line shows the instantaneous buffer fullness. The buffer fullness never crosses the upper or lower limit, which means neither overflow nor underflow occurred with this rate control.

Extended Look Ahead (LA_EXT) Rate Control

For 1:N transcoding scenarios (1 decode and N encode sessions), there is an optimized look ahead algorithm known as the Extended Look Ahead rate control algorithm (MFX_RATECONTROL_LA_EXT), available only in Intel® Media Server Studio (not part of the Intel® Media SDK). This is recommended for broadcasting solutions.

The application must be able to load the plugin ‘mfxplugin64_h264la_hw.dll’ to support MFX_RATECONTROL_LA_EXT. This plugin can be found in the following location on the local system where Intel® Media Server Studio is installed:

  • “\Program Installed\Software Development Kit\bin\x64\588f1185d47b42968dea377bb5d0dcb4”.

The path of this plugin needs to be specified explicitly because it is not part of the standard installation directory. This capability can be used in either of two ways:

  1. Preferred Method - Register the plugin in the registry and provide all necessary attributes, such as API version, plugin type, and path, so that the dispatcher (part of the software stack) can find it through the registry and connect it to a decoding/encoding session.
  2. Place all binaries (Media SDK, plugin, and application) in one directory and execute from that same directory.

The LookAheadDepth parameter must be specified only once; the same value applies to all N transcoded streams. LA_EXT rate control can be exercised using sample_multi_transcode; below is an example command line:

sample_multi_transcode.exe -par file_1.par

The contents of the par file are:

-lad 40 -i::h264 input.264 -join -la_ext -hw_d3d11 -async 1 -n 300 -o::sink
-h 1088 -w 1920 -o::h264 output_1.0.h264 -b 3000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_2.h264 -b 5000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_3.h264 -b 7000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_4.h264 -b 10000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300

Intelligent Constant Quality (ICQ) Rate Control

The ICQ bitrate control algorithm is designed to improve the subjective video quality of an encoded stream: it may or may not improve video quality objectively, depending on the content. ICQQuality is the control parameter that defines the quality factor for this method; it can be set between 1 and 51, where 1 corresponds to the best quality. The achieved bitrate and encoder quality (PSNR) can be adjusted by increasing or decreasing ICQQuality. This rate control is recommended for storage solutions where high quality is required while maintaining a smaller file size.

To use this rate control in sample_encode, the code changes explained below are required.

Add the following statements to sample_encode.cpp within the ParseInputString() function:

else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-icq")))
pParams->nRateControlMethod = MFX_RATECONTROL_ICQ;

ICQQuality is available in the mfxInfoMFX structure. The desired value can be set in the InitMfxEncParams() function, e.g.:

m_mfxEncParams.mfx.ICQQuality = 12;

The example below describes how to use this rate control method using the sample_encode application.

sample_encode.exe h264 -i sintel_1080p.yuv -o ICQ_out.264 -w 1920 -h 1080 -b 10000 -icq
Figure 3: Using Intel Media SDK samples and Video Quality Caliper, compare VBR and ICQ (ICQQuality varied between 13 and 18) with H264 encoding for 1080p, 30fps sintel.yuv of 1253 frames

Using about the same bitrate, ICQ shows improved Peak Signal to Noise Ratio (PSNR) in the above plot. The RD-graph data for the above plot was captured using the Video Quality Caliper, which compares two different streams encoded with ICQ and VBR.

Observation from above performance data:

  • At the same achieved bitrate, ICQ shows much improved quality (PSNR) compared to VBR, while maintaining the same encoding FPS.
  • The encoding bitrate and quality of the stream decreases as the ICQQuality parameter value increases.

The snapshot below shows a subjective comparison between encoded frames using VBR (on the left) and ICQ (on the right). Highlighted sections demonstrate missing details in VBR and improvements in ICQ.

Figure 4: Using Video Quality Caliper, compare encoded frames subjectively for VBR vs ICQ

 

Look Ahead & Intelligent Constant Quality (LA_ICQ) Rate Control

This method combines ICQ with Look Ahead. This rate control is also recommended for storage solutions. ICQQuality and LookAheadDepth are the two control parameters: the quality factor is specified by mfxInfoMFX::ICQQuality, and the look ahead depth is controlled by the mfxExtCodingOption2::LookAheadDepth parameter.

To use this rate control in sample_encode, the code changes explained below are required.

Add the following statements to sample_encode.cpp within the ParseInputString() function:

else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-laicq")))
pParams->nRateControlMethod = MFX_RATECONTROL_LA_ICQ;

ICQQuality is available in the mfxInfoMFX structure. The desired value can be set in the InitMfxEncParams() function:

m_mfxEncParams.mfx.ICQQuality = 12;

LookAheadDepth can be specified on the command line as -lad.

sample_encode.exe h264 -i sintel_1080p.yuv -o LAICQ_out.264 -w 1920 -h 1080 -b 10000 -laicq -lad 100
Figure 5: Using Intel Media SDK samples and Video Quality Caliper, compare VBR and LA_ICQ (LookAheadDepth 100, ICQQuality varied between 20 and 26) with H264 encoding for 1080p, 30fps sintel.yuv of 1253 frames

At a similar bitrate, better PSNR is observed for LA_ICQ compared to VBR, as shown in the above plot. Keeping the LookAheadDepth value at 100, the ICQQuality parameter was varied (between 20 and 26 for this plot). The RD-graph data for this plot was captured using the Video Quality Caliper, which compares two different streams encoded with LA_ICQ and VBR.

Conclusion

There are several advanced bitrate control methods available to experiment with, to determine whether higher quality encoded streams can be achieved while keeping bandwidth requirements constant. Each rate control has its own advantages and fits specific industry use cases depending on the requirements. To implement the bitrate control methods, refer also to the Intel® Media SDK Reference Manual, which comes with an installation of the Intel® Media SDK or Intel® Media Server Studio, and the Intel® Media Developer’s Guide from the documentation website. Visit Intel’s media support forum for further questions.

Resources

Improve Vectorization Performance using Intel® Advanced Vector Extensions 512


This article shows a simple example of a loop that was not vectorized by the Intel® C++ Compiler due to possible data dependencies, but which has now been vectorized using the Intel® Advanced Vector Extensions 512 instruction set on an Intel® Xeon Phi™ processor. We will explore why the compiler using this instruction set automatically recognizes the loop as vectorizable and will discuss some issues about the vectorization performance.

Introduction

When optimizing code, the first efforts should be focused on vectorization. The most fundamental way to efficiently utilize the resources in modern processors is to write code that can run in vector mode by taking advantage of special hardware like vector registers and SIMD (Single Instruction Multiple Data) instructions. Data parallelism in the algorithm/code is exploited in this stage of the optimization process.

Making the most of fine grain parallelism through vectorization will allow the performance of software applications to scale with the number of cores in the processor by using multithreading and multitasking. Efficient use of single-core resources will be critical in the overall performance of the multithreaded application, because of the multiplicative effect of vectorization and multithreading.

The new Intel® Xeon Phi™ processor features 512-bit wide vector registers. The new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set architecture (ISA), which is supported by the Intel Xeon Phi processor (and future Intel® processors), offers support for vector-level parallelism, which allows the software to use two vector processing units (each capable of simultaneously processing 16 single precision (32-bit) or 8 double precision (64-bit) floating point numbers) per core. Taking advantage of these hardware and software features is the key to optimal use of the Intel Xeon Phi processor.

This document describes a way to take advantage of the new Intel AVX-512 ISA in the Intel Xeon Phi processor. An example of an image processing application will be used to show how, with Intel AVX-512, the Intel C++ Compiler now automatically vectorizes a loop that was not vectorized with Intel® Advanced Vector Extensions 2 (Intel® AVX2). We will discuss performance issues arising with this vectorized code.

The full specification of the Intel AVX-512 ISA consists of several subsets. Some of those subsets are available in the Intel Xeon Phi processor. Some subsets will also be available in future Intel® Xeon® processors. A detailed description of the Intel AVX-512 subsets and their presence in different Intel processors is described in (Zhang, 2016).

In this document, the focus will be on the subsets of the Intel AVX-512 ISA, which provides vectorization functionality present both in current Intel Xeon Phi processor and future Intel Xeon processors. These subsets include the Intel AVX-512 Foundation Instructions (Intel AVX-512F) subset (which provides core functionality to take advantage of vector instructions and the new 512-bit vector registers) and the Intel AVX-512 Conflict Detection Instructions (Intel AVX-512CD) subset (which adds instructions that detect data conflicts in vectors, allowing vectorization of certain loops with data dependences).

Vectorization Techniques

There are several ways to take advantage of vectorization capabilities on an Intel Xeon Phi processor core:

  • Use optimized/vectorized libraries, like the Intel® Math Kernel Library (Intel® MKL).
  • Write vectorizable high-level code, so the compiler will create corresponding binary code using the vector instructions available in the hardware (this is commonly called automatic vectorization).
  • Use language extensions (compiler intrinsic functions) or direct calling to vector instructions in assembly language.

Each one of these methods has advantages and disadvantages, and which method to use will depend on the particular case we are working with. This document focuses on writing vectorizable code, which lets our code be more portable and ready for future processors. We will explore a simple example (a histogram) for which the new Intel AVX-512 instruction set will create executable code that will run in vector mode on the Intel Xeon Phi processor. The purpose of this example is to give insight into why, using the Intel AVX-512 ISA, the compiler can now vectorize source code containing data dependencies that was not recognized as vectorizable with previous instruction sets, like Intel AVX2. Detailed information about the Intel® AVX-512 ISA can be found in (Intel, 2016).

In future documents, techniques to explicitly guide vectorization using the language extensions and compiler intrinsics will be discussed. Those techniques will be helpful in complex loops for which the compiler is not able to safely vectorize the code due to complex flow or data dependencies. However the relatively simple example shown in this document will be helpful in understanding how the compiler is using the new features present in the AVX-512 ISA to improve the performance of some common loop structures.

Example: histogram computation in images.

To understand the new features offered by the AVX512F and AVX512CD subsets, we will use the example of computing an image histogram.

An image histogram is a graphical representation of the distribution of pixel values in an image (Wikipedia, n.d.). The pixel values can be single scalars representing grayscale values or vectors containing values representing colors, as in RGB images (where the color is represented using a combination of three values: red, green, and blue).

In this document, we used a 3024 x 4032 grayscale image. The total number of pixels in this image is 12,192,768. The original image and the corresponding histogram (computed using 1-pixel 256 grayscale intensity intervals) are shown in Figure 1.

Figure 1: Image used in this document (image credit: Alberto Villarreal), and its corresponding histogram.

A basic algorithm to compute the histogram is the following:

  1. Read image
  2. Get number of rows and columns in the image
  3. Set image array [1: rows x columns] to image pixel values
  4. Set histogram array [0: 255] to zero
  5. For every pixel in the image
    {
           histogram [ image [ pixel ] ] = histogram [ image [ pixel ] ] + 1
    }

Notice that in this basic algorithm, the image array is used as an index to the histogram array (a type conversion to an integer is assumed). This kind of indirect referencing cannot be unconditionally parallelized, because neighboring pixels in the image might have the same intensity value, in which case the results of processing more than one iteration of the loop simultaneously might be wrong.

In the next sections, this algorithm will be implemented in C++, and it will be shown that the compiler, when using the AVX-512 ISA, will be able to safely vectorize this structure (although only in a partial way, with performance depending on the image data).

It should be noticed that this implementation of a histogram computation is used in this document for pedagogical purposes only. It does not represent an efficient way to perform the histogram computation, for which there are efficient libraries available. Also, our purpose is to show, using a simple code, how the new AVX-512 ISA is adding vectorization opportunities, and to help us understand the new functionality provided by the AVX-512 ISA.

There are other ways to implement parallelism for specific examples of histogram computations. For example in (Colfax International, 2015) the authors describe a way to automatically vectorize a similar algorithm (a binning application) by modifying the code using a strip-mining technique.

Hardware

To test our application, the following system will be used:

Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272

The information above can be checked in a Linux* system using the command

cat /proc/cpuinfo

Notice that when using the command shown above, the "flags" section in the output will include the "avx512f" and "avx512cd" processor flags, which indicate that the Intel AVX-512F and Intel AVX-512CD subsets are supported by this processor. The "avx2" flag is also defined, which means the Intel AVX2 ISA is supported as well (although it does not take advantage of the 512-bit vector registers in this processor).

Vectorization Results Using The Intel® C++ Compiler

This section shows a basic vectorization analysis of a fragment of the histogram code. Specifically, two different loops in this code will be analyzed:

LOOP 1: A loop implementing a histogram computation only. This histogram is computed on the input image, stored in floating point single precision in array image1.

LOOP 2: A loop implementing a convolution filter followed by a histogram computation. The filter is applied to the original image in array image1 and then a new histogram is computed on the filtered image stored in array image2.

The following code section shows the two loops mentioned above (image and histogram data have been placed in aligned arrays):

// LOOP 1

#pragma vector aligned
for (position=cols; position<rows*cols-cols; position++)
{
         hist1[ int(image1[position]) ]++;
}

(…)

// LOOP 2

#pragma vector aligned
for (position=cols; position<rows*cols-cols; position++)
{
    if (position%cols != 0 || position%(cols-1) != 0)
    {
        image2[position] = ( 9.0f*image1[position]
                             - image1[position-1]
                             - image1[position+1]
                             - image1[position-cols-1]
                             - image1[position-cols+1]
                             - image1[position-cols]
                             - image1[position+cols-1]
                             - image1[position+cols+1]
                             - image1[position+cols] );
    }
    if (image2[position] >= 0 && image2[position] <= 255)
        hist2[ int(image2[position]) ]++;
}

This code was compiled using Intel C++ Compiler’s option to generate an optimization report as follows:

icpc histogram.cpp -o histogram -O3 -qopt-report=2 -qopt-report-phase=vec -xCORE-AVX2

Note that, in this case, the -xCORE-AVX2 compiler flag has been used to ask the compiler to use the Intel AVX2 ISA to generate executable code.

 

The section of the optimization report that the compiler created for the loops shown above looks like this:

LOOP BEGIN at histogram.cpp(92,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between  line 94 and  line 94
LOOP END

LOOP BEGIN at histogram.cpp(92,5)
<Remainder>
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between  line 118 and  line 118
LOOP END

As can be seen in the section of the optimization report shown above, the compiler did not vectorize either loop, due to dependences present in the lines of code where the histogram computations take place (lines 94 and 118).

Now let's compile the code using the -xMIC-AVX512 flag, to instruct the compiler to use the Intel AVX-512 ISA:

icpc histogram.cpp -o histogram -O3 -qopt-report=2 -qopt-report-phase=vec -xMIC-AVX512

This creates the following output for the code segment in the optimization report, showing that both loops have now been vectorized:

LOOP BEGIN at histogram.cpp(92,5)
   remark #15300: LOOP WAS VECTORIZED

   LOOP BEGIN at histogram.cpp(94,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(92,5)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED

   LOOP BEGIN at histogram.cpp(94,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
   remark #15300: LOOP WAS VECTORIZED

   LOOP BEGIN at histogram.cpp(118,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED

The results in the compiler reports can be summarized as follows:

  • LOOP 1, which implements a histogram computation, is not being vectorized using the Intel AVX2 flag because of an assumed dependency (which was described in section 3 in this document). However, the loop was vectorized when using the Intel AVX-512 flag, which means that the compiler has solved the dependency using instructions present in the Intel AVX-512 ISA.
  • LOOP 2 gets the same diagnostics as LOOP1. The difference between these two loops is that LOOP 2 adds, on top of the histogram computation, a filter operation that has no dependencies and would be vectorizable otherwise. The presence of the histogram computation is preventing the compiler from vectorizing the entire loop (when using the Intel AVX2 flag).

Note: As can be seen in the section of the optimization report shown above, the compiler split the loop into two sections: the main loop and the remainder loop. The remainder loop contains the last few iterations of the loop (those that do not completely fill the vector unit). The compiler will usually do this unless it knows in advance that the total number of iterations will be a multiple of the vector length.

We will ignore the remainder loop in this document. Ways to improve performance by eliminating the remainder loop are described in the literature.

Analyzing Performance of The Code

Performance of the above code segment was analyzed by adding timing instructions at the beginning and at the end of each one of the two loops, so that the time spent in each loop can be compared between different executables generated using different compiler options.

The table below shows the timing results of executing, on a single core, the vectorized and non-vectorized versions of the code (results are the average of 5 executions) using the input image without preprocessing. Baseline performance is defined here as the performance of the non-vectorized code generated by the compiler when using the Intel AVX2 compiler flag.

Test case       Loop     Baseline (Intel® AVX2)    Speedup Factor with Vectorization (Intel® AVX-512)
Input image     LOOP 1   1                         2.2
                LOOP 2   1                         7.0

To further analyze the performance of the code as a function of the input data, the input image was preprocessed using blurring and sharpening filters. Blurring filters have the effect of smoothing the image, while sharpening filters increase the contrast of the image. Blurring and sharpening filters are available in image processing or computer vision libraries. In this document, we used the OpenCV* library to preprocess the test image.

The table below shows the timing results for the three experiments:

Test case               Loop     Baseline (Intel® AVX2)    Speedup Factor with Vectorization (Intel® AVX-512)
Input image             LOOP 1   1                         2.2
                        LOOP 2   1                         7.0
Input image sharpened   LOOP 1   1                         2.6
                        LOOP 2   1                         7.4
Input image blurred     LOOP 1   1                         1.7
                        LOOP 2   1                         5.6

Looking at the results above, three questions arise:

  1. Why does the compiler vectorize the code when using the Intel AVX-512 flag, but not when using the Intel AVX2 flag?
  2. If the code in LOOP 1 is indeed vectorized with the Intel AVX-512 ISA, why is the improvement in performance relatively small compared to the theoretical speedup of 512-bit vectors?
  3. Why does the performance gain of the vectorized code change when the image is preprocessed? Specifically, why does the performance of the vectorized code increase for a sharpened image and decrease for a blurred one?

In the next section, the above questions will be answered based on a discussion of one of the subsets of the Intel AVX-512 ISA: the Intel AVX-512CD (conflict detection) subset.

The Intel AVX-512CD Subset

The Intel AVX-512CD (Conflict Detection) subset of the Intel AVX512 ISA adds functionality to detect data conflicts in the vector registers. In other words, it provides functionality to detect which elements in a vector operand are identical. The result of this detection is stored in mask vectors, which are used in the vector computations, so that the histogram operation (updating the histogram array) will be performed only on elements of the array (which represent pixel values in the image) that are different.

To further explore how the new instructions from the Intel AVX512CD subset work, it is possible to ask the compiler to generate an assembly code file by using the Intel C++ Compiler –S option:

icpc histogram.cpp -o histogram.s -O3 -xMIC-AVX512 -S …

The above command will create, instead of the executable file, a text file containing the assembly code for our C++ source code. Let’s take a look at part of the section of the assembly code that implements line 94 (the histogram update) in LOOP 1 in the example source code:

vcvttps2dq (%r9,%rax,4), %zmm5                        #94.19 c1
vpxord    %zmm2, %zmm2, %zmm2                         #94.8 c1
kmovw     %k1, %k2                                    #94.8 c1
vpconflictd %zmm5, %zmm3                              #94.8 c3
vpgatherdd (%r12,%zmm5,4), %zmm2{%k2}                 #94.8 c3
vptestmd  %zmm0, %zmm3, %k0                           #94.8 c5
kmovw     %k0, %r10d                                  #94.8 c9 stall 1
vpaddd    %zmm1, %zmm2, %zmm4                         #94.8 c9
testl     %r10d, %r10d                                #94.8 c11
je        ..B1.165      # Prob 30%                    #94.8 c13

In the above code fragment, vpconflictd detects conflicts in the source vector register (containing the pixel values) by comparing the elements with each other, and writes the results of the comparison as a bit vector to the destination. This result is further tested to define which elements in the vector register can be used simultaneously for the histogram update, using a mask vector. (The vpconflictd instruction is part of the Intel AVX-512CD subset, and the vptestmd instruction is part of the Intel AVX-512F subset; specific information about these subsets can be found in the Intel AVX-512 ISA documentation (Intel, 2016).) This process is illustrated in Figures 2 and 3.

Figure 2: Pixel values in array (smooth image).

Figure 3: Pixel values in array (sharp image).

Figure 2 shows the case where some neighboring pixels in the image have the same value. Only the elements in the vector register that have different values in the array image1 will be used to simultaneously update the histogram. In other words, only the elements that will not create a conflict will be used to simultaneously update the histogram. The elements in conflict will still be used to update the histogram, but at a different time.

In this case, the performance will vary depending on how smooth the image is. The worst case scenario would be when all the elements in the vector register are the same, which would decrease the performance considerably, not only because at the end the loop would be processed in scalar mode, but also because of the overhead introduced by the conflict detection and testing instructions.

Figure 3 shows the case where the image was sharpened. In this case it is more likely that neighboring pixels in the vector register will have different values. Most or all of the elements in the vector register will be used to update the histogram, thereby increasing the performance of the loop because more elements will be processed simultaneously in the vector register.

It is clear that the best performance will be obtained when all elements in the array are different. However, the best performance will still be less than the theoretical speedup (16x in this case), because of the overhead introduced by the conflict detection and testing instructions.

The above discussion can be used to get answers to the questions that arose in section 5.

Regarding the first question, about why the compiler generates vectorized code when using the Intel AVX-512 flag but not the Intel AVX2 flag: the Intel AVX-512CD and Intel AVX-512F subsets include new instructions to detect conflicts among the elements in the loop and to create conflict-free subsets that can be safely vectorized. The size of these subsets will be data dependent. Vectorization was not possible when using the Intel AVX2 flag because the Intel AVX2 ISA does not include conflict detection functionality.

The second question, about the reduced performance of the vectorized code compared to the theoretical speedup, can be answered by considering the overhead introduced when the conflict detection and testing instructions are executed. This performance penalty is most noticeable in LOOP 1, where the only computation that takes place is the histogram update.

However, in LOOP 2, where extra work is performed (on top of the histogram update), the performance gain, relative to the baseline, increases. The compiler, using the Intel AVX512 flag, is resolving the dependency created by the histogram computation, increasing the total performance of the loop. In the Intel AVX2 case, the dependency in the histogram computation is preventing other computations in the loop (even if they are dependency-free) from running in vector mode. This is an important result of the use of the Intel AVX512CD subset. The compiler will now be able to generate vectorized code for more complex loops that include histogram-like dependencies, which possibly required code rewriting in order to be vectorized before Intel AVX-512.

For the third question, it should be noticed that the total performance of the vectorized loops becomes data-dependent when the conflict detection mechanism is used. As shown in Figures 2 and 3, the speedup in vector mode depends on how many values in the vector register are not identical (conflict-free). Sharp or noisy images are less likely to have similar or identical values in neighboring pixels than smooth or blurred images.

Conclusions

This article showed a simple example of a loop that, because of possible memory conflicts, was not vectorized by the Intel C++ Compiler using the Intel AVX2 (and earlier) instruction sets, but is now vectorized when using the Intel AVX-512 ISA on an Intel Xeon Phi processor. In particular, the new functionality in the Intel AVX-512CD and Intel AVX-512F subsets (currently available in the Intel Xeon Phi processor and in future Intel Xeon processors) lets the compiler automatically generate vector code for this kind of application, with no changes to the code. However, the performance of vector code created this way will generally be lower than that of an application running in full vector mode, and will also be data dependent, because the compiler vectorizes this application using mask registers whose contents vary with how similar neighboring data is.

The intent in this document is to motivate the use of the new functionality in the Intel AVX-512CD and Intel AVX-512F subsets. In future documents, we will explore more possibilities for vectorization of complex loops by taking explicit control of the logic to update the mask vectors, with the purpose of increasing the efficiency of the vectorization.

References

Colfax International. (2015). "Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization." Retrieved from http://colfaxresearch.com/optimization-techniques-for-the-intel-mic-architecture-part-2-of-3-strip-mining-for-vectorization/

Intel. (2016, February). "Intel® Architecture Instruction Set Extensions Programming Reference." Retrieved from https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Wikipedia. (n.d.). "Image histogram." Retrieved from https://en.wikipedia.org/wiki/Image_histogram

Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."


Accelerating Your NVMe Drives with SPDK


Introduction

The Storage Performance Development Kit (SPDK) is an open source set of tools and libraries hosted on GitHub that helps developers create high-performance and scalable storage applications. This tutorial will focus on the userspace NVMe driver provided by SPDK and will show you a Hello World example running on an Intel® architecture platform. 

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2697 v2 @ 2.7 GHz

  • Number of physical cores per socket: 12 (24 logical cores)
  • Number of sockets: 2
  • Chipset: Intel® C610 (C1 stepping)
  • System bus: 9.6 GT/s QPI

Memory

Memory size: 8 GB (8X8 GB) DDR3 1866

Brand/model: Samsung – M393B1G73BH0*

Storage

Intel® SSD DC P3700 Series

Operating System

CentOS* 7.2.1511 with kernel 3.10.0
 

Why is There a Need for a Userspace NVMe Driver?

Historically, storage devices have been an order of magnitude slower than other parts of a computer system, such as RAM and CPU. This meant the operating system and CPU would interface with disks using interrupts like so:

  1. A request is made to the OS to read data from a disk.
  2. The driver processes the request and communicates with the hardware.
  3. The disk platter is spun up.
  4. The needle is moved across the platter to start reading data.
  5. Data is read and copied into a buffer.
  6. An interrupt is generated, notifying the CPU that the data is now ready.
  7. Finally, the data is read from the buffer.

The interrupt model does incur an overhead; however, traditionally this has been significantly smaller than the latency of disk-based storage devices, and therefore using interrupts has proved effective. Storage devices such as solid state drives (SSDs) and next-generation technology like 3D XPoint™ storage are now significantly faster than disks and the bottleneck has moved away from hardware (e.g., disks) back to software (e.g., interrupts + kernel) as Figure 1 shows:

Figure 1. Solid state drives (SSDs) and 3D XPoint™ storage are significantly faster than disks. Bottlenecks have moved away from hardware.

The userspace NVMe driver addresses the issue of using interrupts by instead polling the storage device when data is being read or written. Additionally and importantly, the NVMe driver operates within userspace, which means the application is able to directly interface with the NVMe device without going through the kernel. Invoking a system call causes a context switch, and this incurs an overhead because the state has to be both stored and then restored when interfacing with the kernel.

Prerequisites and Building SPDK

SPDK has known support for Fedora*, CentOS*, Ubuntu*, Debian*, and FreeBSD*. A full list of prerequisite packages can be found here.

Before building SPDK, you are required to first install the Data Plane Development Kit (DPDK) as SPDK relies on the memory management and queuing capabilities already found in DPDK. DPDK is a mature library typically used for network packet processing and has been highly optimized to manage memory and queue data with low latency.

The source code for SPDK can be cloned from GitHub using the following:

git clone https://github.com/spdk/spdk.git

Building DPDK (for Linux*):

cd /path/to/build/spdk

wget http://fast.dpdk.org/rel/dpdk-16.07.tar.xz

tar xf dpdk-16.07.tar.xz

cd dpdk-16.07 && make install T=x86_64-native-linuxapp-gcc DESTDIR=.

Building SPDK (for Linux):

Now that we have DPDK built inside of the SPDK folder, we need to change directory back to SPDK and build SPDK by passing the location of DPDK to make:

cd /path/to/build/spdk

make DPDK_DIR=./dpdk-16.07/x86_64-native-linuxapp-gcc

Setting Up Your System Before Running an SPDK Application

The command below sets up hugepages as well as unbinds any NVMe and I/OAT devices from the kernel drivers:

sudo scripts/setup.sh

Getting Started with ‘Hello World’

SPDK includes a number of examples as well as quality documentation to quickly get started. We will go through an example of storing ‘Hello World’ to an NVMe device and then reading it back into a buffer.

Before jumping into the code, it is worth noting how NVMe devices are structured and giving a high-level overview of how this example will use the NVMe driver to detect NVMe devices and then write and read data.

An NVMe device (also called an NVMe controller) is structured with the following in mind:

  • A system can have one or more NVMe devices.
  • Each NVMe device consists of a number of namespaces (there may be only one).
  • Each namespace consists of a number of Logical Block Addresses (LBAs).

This example will go through the following steps:

Setup

  1. Create a request buffer pool that is used internally by SPDK to store request data for each I/O request:
    request_mempool = rte_mempool_create("nvme_request", 8192,
                                         spdk_nvme_request_size(), 128, 0,
                                         NULL, NULL, NULL, NULL,
                                         SOCKET_ID_ANY, 0);
  2. Probe the system for NVMe devices:
    rc = spdk_nvme_probe(NULL, probe_cb, attach_cb, NULL);
  3. Enumerate the NVMe devices, returning a boolean value to SPDK as to whether the device should be attached:
    static bool
    probe_cb(void *cb_ctx, struct spdk_pci_device *dev, struct spdk_nvme_ctrlr_opts *opts)
    {
         printf("Attaching to %04x:%02x:%02x.%02x\n",
    		     spdk_pci_device_get_domain(dev),
    		     spdk_pci_device_get_bus(dev),
    		     spdk_pci_device_get_dev(dev),
    		     spdk_pci_device_get_func(dev));
    
         return true;
    }
  4. The device is attached; we can now request information about the number of namespaces:
    static void
    attach_cb(void *cb_ctx, struct spdk_pci_device *dev, struct spdk_nvme_ctrlr *ctrlr,
    	  const struct spdk_nvme_ctrlr_opts *opts)
    {
        int nsid, num_ns;
    	const struct spdk_nvme_ctrlr_data *cdata = spdk_nvme_ctrlr_get_data(ctrlr);
    
    	printf("Attached to %04x:%02x:%02x.%02x\n",
    	       spdk_pci_device_get_domain(dev),
    	       spdk_pci_device_get_bus(dev),
    	       spdk_pci_device_get_dev(dev),
    	       spdk_pci_device_get_func(dev));
    
    	snprintf(entry->name, sizeof(entry->name), "%-20.20s (%-20.20s)", cdata->mn, cdata->sn);
    
    	num_ns = spdk_nvme_ctrlr_get_num_ns(ctrlr);
    	printf("Using controller %s with %d namespaces.\n", entry->name, num_ns);
    	for (nsid = 1; nsid <= num_ns; nsid++) {
    		register_ns(ctrlr, spdk_nvme_ctrlr_get_ns(ctrlr, nsid));
    	}
    }
  5. Enumerate the namespaces to retrieve information such as the size:
    static void
    register_ns(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
    {
    	 printf("  Namespace ID: %d size: %juGB\n", spdk_nvme_ns_get_id(ns),
    		    spdk_nvme_ns_get_size(ns) / 1000000000);
    }
  6. Create an I/O queue pair to submit read/write requests to a namespace:
    ns_entry->qpair = spdk_nvme_ctrlr_alloc_io_qpair(ns_entry->ctrlr, 0);
     

Reading/writing data

  7. Allocate a buffer for the data that will be read/written:
    sequence.buf = rte_zmalloc(NULL, 0x1000, 0x1000);
  8. Copy ‘Hello World’ into the buffer:
    sprintf(sequence.buf, "Hello world!\n");
  9. Submit a write request to a specified namespace providing a queue pair, pointer to the buffer, index of the LBA, a callback for when the data is written, and a pointer to any data that should be passed to the callback:
    rc = spdk_nvme_ns_cmd_write(ns_entry->ns, ns_entry->qpair, sequence.buf,
    						    0, /* LBA start */
    						    1, /* number of LBAs */
    						    write_complete, &sequence, 0);
  10. The write completion callback will be called asynchronously, when completions are polled on the queue pair (see step 13).
  11. Submit a read request to a specified namespace providing a queue pair, pointer to a buffer, index of the LBA, a callback for the data that has been read, and a pointer to any data that should be passed to the callback:
    rc = spdk_nvme_ns_cmd_read(ns_entry->ns, ns_entry->qpair, sequence->buf,
    					       0, /* LBA start */
    						   1, /* number of LBAs */
    					       read_complete, (void *)sequence, 0);
  12. The read completion callback will be called asynchronously, when completions are polled on the queue pair (see step 13).
  13. Poll on a flag that marks the completion of both the read and the write of the data. If a request is still in flight, we can poll for completions on the given queue pair. Although the actual reading and writing of the data is asynchronous, the spdk_nvme_qpair_process_completions function checks for and returns the number of completed I/O requests, and also invokes the read/write completion callbacks described above:
    while (!sequence.is_completed) {
           spdk_nvme_qpair_process_completions(ns_entry->qpair, 0);
    }
  14. Release the queue pair and complete any cleanup before exiting:
    spdk_nvme_ctrlr_free_io_qpair(ns_entry->qpair);

The complete code sample for the Hello World application described here is available on GitHub, and API documentation for the SPDK NVMe driver is available at www.spdk.io.

Running the Hello World example should give the following output:

Other Examples Included with SPDK

SPDK includes a number of examples to help you get started and quickly build an understanding of how SPDK works. Here is the output from the perf example, which benchmarks an NVMe drive:

Developers who require access to NVMe drive information such as features, admin command set attributes, NVMe command set attributes, power management, and health information can use the identify example:


Authors

Steven Briscoe is an Application Engineer who focuses on cloud computing within the Software Services Group at Intel (UK).

Thai Le is a Software Engineer who focuses on cloud computing and performance computing analysis at Intel.

Intel® Software Guard Extensions Tutorial Series: Part 5, Enclave Development


In Part 5 of the Intel® Software Guard Extensions (Intel® SGX) tutorial series, we’ll finish developing the enclave for the Tutorial Password Manager application. In Part 4 of the series, we created a DLL to serve as our interface layer between the enclave bridge functions and the C++/CLI program core, and defined our enclave interface. With those components in place, we can now focus our attention on the enclave itself.

You can find the list of all of the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

There is source code provided with this installment of the series: the completed application with its enclave. This version is hardcoded to run the Intel SGX code path.

The Enclave Components

To identify which components need to be implemented within the enclave, we’ll refer to the class diagram for the application core in Figure 1, which was first introduced in Part 3. As before, the objects that will reside in the enclave are shaded in green while the untrusted components are shaded in blue.


Figure 1. Class diagram for the Tutorial Password Manager with Intel® Software Guard Extensions.

From this we can identify four classes that need to be ported:

  • Vault
  • AccountRecord
  • Crypto
  • DRNG

Before we get started, however, we do need to make a design decision. Our application must function on systems both with and without Intel SGX support, and that means we can’t simply convert our existing classes so that they function within the enclave. We must create two versions of each: one intended for use in enclaves, and one for use in untrusted memory. The question is, how should this dual-support be implemented?

Option 1: Conditional Compilation

The first option is to implement both the enclave and untrusted functionality in the same source module and use preprocessor definitions and #ifdef statements to compile the appropriate code based on the context. The advantage of this approach is that we only need one source file for each class, and thus do not have to maintain changes in two places. The disadvantages are that the code can be more difficult to read, particularly if the changes between the two versions are numerous or significant, and the project structure will be more complex. Two of our Visual Studio* projects, Enclave and PasswordManagerCore, will share source files, and each will need to set a preprocessor symbol to ensure that the correct source code is compiled.
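
The sketch below illustrates this approach with a hypothetical shared helper; the COMPILE_ENCLAVE symbol and the wipe_before_free function are invented for this example and are not part of the tutorial code:

```cpp
#include <cstddef>
#include <cstring>

// Illustrative only: one source file compiled by both projects.
// The Enclave project would define COMPILE_ENCLAVE in its
// preprocessor settings; PasswordManagerCore would not.
static void wipe_before_free(void *buf, size_t len)
{
#ifdef COMPILE_ENCLAVE
	// Enclave memory is encrypted by the CPU, so zero-filling
	// before freeing is unnecessary: compile to a no-op.
	(void) buf;
	(void) len;
#else
	// Untrusted build: clear secrets before the memory is freed.
	// (The real code uses SecureZeroMemory; memset stands in here
	// to keep the sketch portable.)
	memset(buf, 0, len);
#endif
}
```

Each project compiles the same file, but only one branch of the #ifdef survives preprocessing, which is exactly what makes the source harder to read as the number of branches grows.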

Option 2: Separate Classes

The second option is to duplicate each source file that has to go into the enclave. The advantages of this approach are that the enclave has its own copy of the source files which we can modify directly, allowing for a simpler project structure and easier code view. But, these come at a cost: if we need to make changes to the classes, those changes must be made in two places, even if those changes are common to both the enclave and untrusted versions.

Option 3: Inheritance

The third option is to use the C++ feature of class inheritance. The functions common to both versions of the class would be implemented in the base class, and the derived classes would implement the branch-specific methods. The big advantage to this approach is that it is a very natural and elegant solution to the problem, using a feature of the language that is designed to do exactly what we need. The disadvantages are the added complexity required in both the project structure and the code itself.

There is no hard and fast rule here, and the decision does not have to be a global one. A good rule of thumb is that Option 1 is best for modules where the changes are small or easily compartmentalized, and Options 2 and 3 are best when the changes are significant or result in source code that is difficult to read and maintain. However, it really comes down to style and preference, and either approach is fine.

For now, we’ll choose Option 2 because it allows for easy side-by-side comparisons of the enclave and untrusted source files. In a future installment of the tutorial series we may switch to Option 3 in order to tighten up the code.

The Enclave Classes

Each class has its own set of issues and challenges when it comes to adapting it to the enclave, but there is one universal truth that applies to all of them: we no longer have to zero-fill our memory before freeing it. As you recall from Part 3, this was a recommended action when handling secure data in untrusted memory. Because our enclave memory is encrypted by the CPU, using an encryption key that is available only to the CPU itself, the contents of freed memory appear to be random data to other applications. This means we can remove all calls to SecureZeroMemory that are inside the enclave.

The Vault Class

The Vault class is our interface to the password vault operations. All of our bridge functions act through one or more methods in Vault. Its declaration from Vault.h is shown below.

class PASSWORDMANAGERCORE_API Vault
{
	Crypto crypto;
	char m_pw_salt[8];
	char db_key_nonce[12];
	char db_key_tag[16];
	char db_key_enc[16];
	char db_key_obs[16];
	char db_key_xor[16];
	UINT16 db_version;
	UINT32 db_size; // Use get_db_size() to fetch this value so it gets updated as needed
	char db_data_nonce[12];
	char db_data_tag[16];
	char *db_data;
	UINT32 state;
	// Cache the number of defined accounts so that the GUI doesn't have to fetch
	// "empty" account info unnecessarily.
	UINT32 naccounts;

	AccountRecord accounts[MAX_ACCOUNTS];
	void clear();
	void clear_account_info();
	void update_db_size();

	void get_db_key(char key[16]);
	void set_db_key(const char key[16]);

public:
	Vault();
	~Vault();

	int initialize();
	int initialize(const unsigned char *header, UINT16 size);
	int load_vault(const unsigned char *edata);

	int get_header(unsigned char *header, UINT16 *size);
	int get_vault(unsigned char *edata, UINT32 *size);

	UINT32 get_db_size();

	void lock();
	int unlock(const char *password);

	int set_master_password(const char *password);
	int change_master_password(const char *oldpass, const char *newpass);

	int accounts_get_count(UINT32 *count);
	int accounts_get_info_sizes(UINT32 idx, UINT16 *mbname_sz, UINT16 *mblogin_sz, UINT16 *mburl_sz);
	int accounts_get_info(UINT32 idx, char *mbname, UINT16 mbname_sz, char *mblogin, UINT16 mblogin_sz,
		char *mburl, UINT16 mburl_sz);

	int accounts_get_password_size(UINT32 idx, UINT16 *mbpass_sz);
	int accounts_get_password(UINT32 idx, char *mbpass, UINT16 mbpass_sz);

	int accounts_set_info(UINT32 idx, const char *mbname, UINT16 mbname_len, const char *mblogin, UINT16 mblogin_len,
		const char *mburl, UINT16 mburl_len);
	int accounts_set_password(UINT32 idx, const char *mbpass, UINT16 mbpass_len);

	int accounts_generate_password(UINT16 length, UINT16 pwflags, char *cpass);

	int is_valid() { return _VST_IS_VALID(state); }
	int is_locked() { return ((state&_VST_LOCKED) == _VST_LOCKED) ? 1 : 0; }
};

The declaration for the enclave version of this class, which we’ll call E_Vault for clarity, will be identical except for one crucial change: database key handling.

In the untrusted code path, the Vault object must store the database key, decrypted, in memory. Every time we make a change to our password vault we have to encrypt the updated vault data and write it to disk, and that means the key must be at our disposal. We have four options:

  1. Prompt the user for their master password on every change so that the database key can be derived on demand.
  2. Cache the user’s master password so that the database key can be derived on demand without user intervention.
  3. Encrypt, encode, and/or obscure the database key in memory.
  4. Store the key in the clear.

None of these are good solutions, and they highlight the need for technologies like Intel SGX. The first is arguably the most secure, but no user would want to run an application that behaved in this manner. The second could be achieved using the SecureString class in .NET*, but it is still vulnerable to inspection via a debugger, and there is a performance cost associated with the key derivation function that a user might find unacceptable. The third option is effectively as insecure as the second, only without the performance penalty. The fourth option is the worst of the lot.

Our Tutorial Password Manager uses the third option: the database key is XOR’d with a 128-bit value that is randomly generated when a vault file is opened, and it is stored in memory only in this XOR’d form. This is effectively a one-time pad encryption scheme. It is open to inspection for anyone running a debugger, but it does limit the amount of time in which the database key is present in memory in the clear.

void Vault::set_db_key(const char db_key[16])
{
	UINT i, j;
	for (i = 0; i < 4; ++i)
		for (j = 0; j < 4; ++j) db_key_obs[4 * i + j] = db_key[4 * i + j] ^ db_key_xor[4 * i + j];
}

void Vault::get_db_key(char db_key[16])
{
	UINT i, j;
	for (i = 0; i < 4; ++i)
		for (j = 0; j < 4; ++j) db_key[4 * i + j] = db_key_obs[4 * i + j] ^ db_key_xor[4 * i + j];
}

This is obviously security through obscurity, and since we are publishing the source code, it’s not even particularly obscure. We could choose a better algorithm or go to greater lengths to hide both the database key and the pad’s secret key (including how they are stored in memory); but in the end, the method we choose would still be vulnerable to inspection via a debugger, and the algorithm would still be published for anyone to see.

Inside the enclave, however, this problem goes away. The memory is protected by hardware-backed encryption, so even when the database key is decrypted it is not open to inspection by anyone, even a process running with elevated privileges. As a result, we no longer need these class members or methods:

char db_key_obs[16];
char db_key_xor[16];

	void get_db_key(char key[16]);
	void set_db_key(const char key[16]);

We can replace them with just one class member: a char array to hold the database key.

char db_key[16];

The AccountRecord Class

The account data is stored in a fixed-size array of AccountRecord objects as a member of the Vault object. The declaration for AccountRecord is also found in Vault.h, and it is shown below:

class PASSWORDMANAGERCORE_API AccountRecord
{
	char nonce[12];
	char tag[16];
	// Store these in their multibyte form. There's no sense in translating
	// them back to wchar_t since they have to be passed in and out as
	// char * anyway.
	char *name;
	char *login;
	char *url;
	char *epass;
	UINT16 epass_len; // Can't rely on NULL termination! It's an encrypted string.

	int set_field(char **field, const char *value, UINT16 len);
	void zero_free_field(char *field, UINT16 len);

public:
	AccountRecord();
	~AccountRecord();

	void set_nonce(const char *in) { memcpy(nonce, in, 12); }
	void set_tag(const char *in) { memcpy(tag, in, 16); }

	int set_enc_pass(const char *in, UINT16 len);
	int set_name(const char *in, UINT16 len) { return set_field(&name, in, len); }
	int set_login(const char *in, UINT16 len) { return set_field(&login, in, len); }
	int set_url(const char *in, UINT16 len) { return set_field(&url, in, len); }

	const char *get_epass() { return (epass == NULL)? "" : (const char *)epass; }
	const char *get_name() { return (name == NULL) ? "" : (const char *)name; }
	const char *get_login() { return (login == NULL) ? "" : (const char *)login; }
	const char *get_url() { return (url == NULL) ? "" : (const char *)url; }
	const char *get_nonce() { return (const char *)nonce; }
	const char *get_tag() { return (const char *)tag; }

	UINT16 get_name_len() { return (name == NULL) ? 0 : (UINT16)strlen(name); }
	UINT16 get_login_len() { return (login == NULL) ? 0 : (UINT16)strlen(login); }
	UINT16 get_url_len() { return (url == NULL) ? 0 : (UINT16)strlen(url); }
	UINT16 get_epass_len() { return (epass == NULL) ? 0 : epass_len; }

	void clear();
};

We actually don’t need to do anything to this class for it to work inside the enclave. Other than removing the unnecessary calls to SecureZeroMemory, this class is fine as is. However, we are going to change it anyway in order to illustrate a point: within the enclave, we gain some flexibility that we did not have before.

Returning to Part 3, another of our guidelines for securing data in untrusted memory space was to avoid container classes that manage their own memory, specifically the Standard Template Library (STL) std::string class. Inside the enclave this problem goes away, too. For the same reason that we don’t need to zero-fill our memory before freeing it, we don’t have to worry about how the STL containers manage their memory. The enclave memory is encrypted, so even if fragments of our secure data remain there as a result of container operations, they can’t be inspected by other processes.

There’s also a good reason to use the std::string class inside the enclave: reliability. The code behind the STL containers has been through significant peer review over the years, and it can be argued that it is safer to use it than to implement our own high-level string functions when given the choice. For simple code like what’s in the AccountRecord class, it’s probably not a significant issue, but in more complex programs this can be a huge benefit. However, this does come at the cost of a larger DLL due to the added STL code.

The new class declaration, which we’ll call E_AccountRecord, is shown below:

#define TRY_ASSIGN(x) try{x.assign(in,len);} catch(...){return 0;} return 1

class E_AccountRecord
{
	char nonce[12];
	char tag[16];
	// Store these in their multibyte form. There's no sense in translating
	// them back to wchar_t since they have to be passed in and out as
	// char * anyway.
	string name, login, url, epass;

public:
	E_AccountRecord();
	~E_AccountRecord();

	void set_nonce(const char *in) { memcpy(nonce, in, 12); }
	void set_tag(const char *in) { memcpy(tag, in, 16); }

	int set_enc_pass(const char *in, uint16_t len) { TRY_ASSIGN(epass); }
	int set_name(const char *in, uint16_t len) { TRY_ASSIGN(name); }
	int set_login(const char *in, uint16_t len) { TRY_ASSIGN(login); }
	int set_url(const char *in, uint16_t len) { TRY_ASSIGN(url); }

	const char *get_epass() { return epass.c_str(); }
	const char *get_name() { return name.c_str(); }
	const char *get_login() { return login.c_str(); }
	const char *get_url() { return url.c_str(); }

	const char *get_nonce() { return (const char *)nonce; }
	const char *get_tag() { return (const char *)tag; }

	uint16_t get_name_len() { return (uint16_t) name.length(); }
	uint16_t get_login_len() { return (uint16_t) login.length(); }
	uint16_t get_url_len() { return (uint16_t) url.length(); }
	uint16_t get_epass_len() { return (uint16_t) epass.length(); }

	void clear();
};

The tag and nonce members are still stored as char arrays. Our password encryption is done with AES in GCM mode, using a 128-bit key, a 96-bit nonce, and a 128-bit authentication tag. Since the size of the nonce and the tag are fixed there is no reason to store them as anything other than simple char arrays.

Note that this std::string-based approach has allowed us to almost completely define the class in the header file.
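
As a quick, self-contained illustration of the TRY_ASSIGN pattern above (the record_sketch type is invented for this example and is not part of the tutorial code):

```cpp
#include <cstdint>
#include <string>

// Same macro as in E_AccountRecord: evaluates to 1 on success and 0
// if the std::string assignment throws (for example, std::bad_alloc).
#define TRY_ASSIGN(x) try{x.assign(in,len);} catch(...){return 0;} return 1

// Invented stand-in for a single field of the enclave record class.
struct record_sketch
{
	std::string name;

	int set_name(const char *in, uint16_t len) { TRY_ASSIGN(name); }
	const char *get_name() { return name.c_str(); }
	uint16_t get_name_len() { return (uint16_t) name.length(); }
};
```

The integer return value preserves the error-reporting convention of the untrusted AccountRecord class while letting std::string do the memory management.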

The Crypto Class

The Crypto class provides our cryptographic functions. The class declaration is shown below.

class PASSWORDMANAGERCORE_API Crypto
{
	DRNG drng;

	crypto_status_t aes_init (BCRYPT_ALG_HANDLE *halgo, LPCWSTR algo_id, PBYTE chaining_mode, DWORD chaining_mode_len, BCRYPT_KEY_HANDLE *hkey, PBYTE key, ULONG key_len);
	void aes_close (BCRYPT_ALG_HANDLE *halgo, BCRYPT_KEY_HANDLE *hkey);

	crypto_status_t aes_128_gcm_encrypt(PBYTE key, PBYTE nonce, ULONG nonce_len, PBYTE pt, DWORD pt_len, PBYTE ct, DWORD ct_sz, PBYTE tag, DWORD tag_len);
	crypto_status_t aes_128_gcm_decrypt(PBYTE key, PBYTE nonce, ULONG nonce_len, PBYTE ct, DWORD ct_len, PBYTE pt, DWORD pt_sz, PBYTE tag, DWORD tag_len);
	crypto_status_t sha256_multi (PBYTE *messages, ULONG *lengths, BYTE hash[32]);

public:
	Crypto(void);
	~Crypto(void);

	crypto_status_t generate_database_key (BYTE key_out[16], GenerateDatabaseKeyCallback callback);
	crypto_status_t generate_salt (BYTE salt[8]);
	crypto_status_t generate_salt_ex (PBYTE salt, ULONG salt_len);
	crypto_status_t generate_nonce_gcm (BYTE nonce[12]);

	crypto_status_t unlock_vault(PBYTE passphrase, ULONG passphrase_len, BYTE salt[8], BYTE db_key_ct[16], BYTE db_key_iv[12], BYTE db_key_tag[16], BYTE db_key_pt[16]);

	crypto_status_t derive_master_key (PBYTE passphrase, ULONG passphrase_len, BYTE salt[8], BYTE mkey[16]);
	crypto_status_t derive_master_key_ex (PBYTE passphrase, ULONG passphrase_len, PBYTE salt, ULONG salt_len, ULONG iterations, BYTE mkey[16]);

	crypto_status_t validate_passphrase(PBYTE passphrase, ULONG passphrase_len, BYTE salt[8], BYTE db_key[16], BYTE db_iv[12], BYTE db_tag[16]);
	crypto_status_t validate_passphrase_ex(PBYTE passphrase, ULONG passphrase_len, PBYTE salt, ULONG salt_len, ULONG iterations, BYTE db_key[16], BYTE db_iv[12], BYTE db_tag[16]);

	crypto_status_t encrypt_database_key (BYTE master_key[16], BYTE db_key_pt[16], BYTE db_key_ct[16], BYTE iv[12], BYTE tag[16], DWORD flags= 0);
	crypto_status_t decrypt_database_key (BYTE master_key[16], BYTE db_key_ct[16], BYTE iv[12], BYTE tag[16], BYTE db_key_pt[16]);

	crypto_status_t encrypt_account_password (BYTE db_key[16], PBYTE password_pt, ULONG password_len, PBYTE password_ct, BYTE iv[12], BYTE tag[16], DWORD flags= 0);
	crypto_status_t decrypt_account_password (BYTE db_key[16], PBYTE password_ct, ULONG password_len, BYTE iv[12], BYTE tag[16], PBYTE password);

	crypto_status_t encrypt_database (BYTE db_key[16], PBYTE db_serialized, ULONG db_size, PBYTE db_ct, BYTE iv[12], BYTE tag[16], DWORD flags= 0);
	crypto_status_t decrypt_database (BYTE db_key[16], PBYTE db_ct, ULONG db_size, BYTE iv[12], BYTE tag[16], PBYTE db_serialized);

	crypto_status_t generate_password(PBYTE buffer, USHORT buffer_len, USHORT flags);
};

The public methods in this class are modeled to perform various high-level vault operations: unlock_vault, derive_master_key, validate_passphrase, encrypt_database, and so on. Each of these methods invokes one or more cryptographic algorithms in order to complete its task. For example, the unlock_vault method takes the passphrase supplied by the user, runs it through the SHA-256-based key derivation function, and uses the resulting key to decrypt the database key using AES-128 in GCM mode.

These high-level methods do not, however, directly invoke the cryptographic primitives. Instead, they call into a middle layer which implements each cryptographic algorithm as a self-contained function.


Figure 2. Cryptographic library dependencies.

The private methods that make up our middle layer are built on the cryptographic primitives and support functions provided by the underlying cryptographic library, as illustrated in Figure 2. The non-Intel SGX implementation relies on Microsoft’s Cryptography API: Next Generation (CNG) for these, but we can’t use this same library inside the enclave because an enclave cannot have dependencies on external DLLs. To build the Intel SGX version of this class, we need to replace those underlying functions with the ones in the trusted crypto library that is distributed with the Intel SGX SDK. (As you might recall from Part 2, we were careful to choose cryptographic functions that were common to both CNG and the Intel SGX trusted crypto library for this very reason.)

To create our enclave-capable Crypto class, which we’ll call E_Crypto, what we need to do is modify these private methods:

crypto_status_t aes_128_gcm_encrypt(PBYTE key, PBYTE nonce, ULONG nonce_len, PBYTE pt, DWORD pt_len, PBYTE ct, DWORD ct_sz, PBYTE tag, DWORD tag_len);
	crypto_status_t aes_128_gcm_decrypt(PBYTE key, PBYTE nonce, ULONG nonce_len, PBYTE ct, DWORD ct_len, PBYTE pt, DWORD pt_sz, PBYTE tag, DWORD tag_len);
	crypto_status_t sha256_multi (PBYTE *messages, ULONG *lengths, BYTE hash[32]);

A description of each, and the primitives and support functions from CNG upon which they are built, is given in Table 1.

Method

Algorithm

CNG Primitives and Support Functions

aes_128_gcm_encrypt

AES encryption in GCM mode with:

  • A 128-bit key
  • A 128-bit authentication tag
  • No additional authenticated data (AAD)

BCryptOpenAlgorithmProvider
BCryptSetProperty
BCryptGenerateSymmetricKey
BCryptEncrypt
BCryptCloseAlgorithmProvider
BCryptDestroyKey

aes_128_gcm_decrypt

AES encryption in GCM mode with:

  • A 128-bit key
  • A 128-bit authentication tag
  • No AAD

BCryptOpenAlgorithmProvider
BCryptSetProperty
BCryptGenerateSymmetricKey
BCryptDecrypt
BCryptCloseAlgorithmProvider
BCryptDestroyKey

sha256_multi

SHA-256 hash (incremental)

BCryptOpenAlgorithmProvider
BCryptGetProperty
BCryptCreateHash
BCryptHashData
BCryptFinishHash
BCryptDestroyHash
BCryptCloseAlgorithmProvider

Table 1. Mapping Crypto class methods to Cryptography API: Next Generation functions

CNG provides very fine-grained control over its encryption algorithms, as well as several optimizations for performance. Our Crypto class is actually fairly inefficient: each time one of these algorithms is called, it initializes the underlying primitives from scratch and then completely closes them down. This is not a significant issue for a password manager, which is UI-driven and only encrypts a small amount of data at a time. A high-performance server application such as a web or database server would need a more sophisticated approach.

The API for the trusted cryptography library distributed with the Intel SGX SDK more closely resembles our middle layer than CNG. There is less granular control over the underlying primitives, but it does make developing our E_Crypto class much simpler. Table 2 shows the new mapping between our middle layer and the underlying provider.

Method

Algorithm

Intel® SGX Trusted Cryptography Library Primitives and Support Functions

aes_128_gcm_encrypt

AES encryption in GCM mode with:

  • A 128-bit key
  • A 128-bit authentication tag
  • No additional authenticated data (AAD)

sgx_rijndael128GCM_encrypt

aes_128_gcm_decrypt

AES encryption in GCM mode with:

  • A 128-bit key
  • A 128-bit authentication tag
  • No AAD

sgx_rijndael128GCM_decrypt

sha256_multi

SHA-256 hash (incremental)

sgx_sha256_init
sgx_sha256_update
sgx_sha256_get_hash
sgx_sha256_close

Table 2. Mapping Crypto class methods to Intel® SGX Trusted Cryptography Library functions
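
To make the mapping concrete, here is a hypothetical sketch of what the E_Crypto middle-layer encryption method reduces to when built on the trusted library. The CRYPTO_OK and CRYPTO_ERR_ENCRYPT values stand in for the tutorial’s crypto_status_t codes, and the error mapping is simplified; this is an illustration of the sgx_tcrypto call, not the tutorial’s actual implementation:

```c
#include <sgx_tcrypto.h>

/* Sketch only: AES-128-GCM encryption with a 96-bit nonce, a 128-bit
 * authentication tag, and no AAD, via the SGX trusted crypto library. */
static crypto_status_t aes_128_gcm_encrypt_sketch(uint8_t key[16],
	uint8_t nonce[12], uint8_t *pt, uint32_t pt_len,
	uint8_t *ct, uint8_t tag[16])
{
	sgx_status_t rv;

	rv = sgx_rijndael128GCM_encrypt(
		(const sgx_aes_gcm_128bit_key_t *) key,
		pt, pt_len,     /* plaintext in */
		ct,             /* ciphertext out (same length as plaintext) */
		nonce, 12,      /* 96-bit IV */
		NULL, 0,        /* no additional authenticated data */
		(sgx_aes_gcm_128bit_tag_t *) tag);

	return (rv == SGX_SUCCESS) ? CRYPTO_OK : CRYPTO_ERR_ENCRYPT;
}
```

The decrypt path is symmetric, swapping in sgx_rijndael128GCM_decrypt; there the tag becomes an input that the library verifies before releasing the plaintext. Note that this block only compiles inside an enclave project against the Intel SGX SDK.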

The DRNG Class

The DRNG class is the interface to the on-chip digital random number generator, courtesy of Intel® Secure Key. To stay consistent with our previous actions we’ll name the enclave version of this class E_DRNG.

We’ll be making two changes in this class to prepare it for the enclave, but both of these changes are internal to the class methods. The class declaration will stay the same.

The CPUID Instruction

One of our application requirements is that the CPU support Intel Secure Key. Even though Intel SGX is a newer feature than Secure Key, there is no guarantee that every future CPU which supports Intel SGX will also support Intel Secure Key. While it’s hard to conceive of such a situation today, the best practice is not to assume a coupling between features where one does not exist. If a set of features has independent detection mechanisms, then you must assume that the features are independent of one another and check for them separately. This means that, no matter how tempting it may be, we must not assume that a CPU supporting Intel SGX also supports Intel Secure Key.

Further complicating the situation is the fact that Intel Secure Key actually consists of two independent features, both of which must also be checked separately. Our application must determine support for both the RDRAND and RDSEED instructions. For more information on Intel Secure Key, see the Intel Digital Random Number Generator (DRNG) Software Implementation Guide.

The constructor in the DRNG class is responsible for the RDRAND and RDSEED feature detection checks. It makes the necessary calls to the CPUID instruction using the compiler intrinsics __cpuid and __cpuidex, and sets a static, global variable with the results.

static int _drng_support= DRNG_SUPPORT_UNKNOWN;

DRNG::DRNG(void)
{
	int info[4];

	if (_drng_support != DRNG_SUPPORT_UNKNOWN) return;

	_drng_support= DRNG_SUPPORT_NONE;

	// Check our feature support

	__cpuid(info, 0);

	if ( memcmp(&(info[1]), "Genu", 4) ||
		memcmp(&(info[3]), "ineI", 4) ||
		memcmp(&(info[2]), "ntel", 4) ) return;

	__cpuidex(info, 1, 0);

	if ( ((UINT) info[2]) & (1<<30) ) _drng_support|= DRNG_SUPPORT_RDRAND;

#ifdef COMPILER_HAS_RDSEED_SUPPORT
	__cpuidex(info, 7, 0);

	if ( ((UINT) info[1]) & (1<<18) ) _drng_support|= DRNG_SUPPORT_RDSEED;
#endif
}

The problem for the E_DRNG class is that CPUID is not a legal instruction inside of an enclave. To call CPUID, one must use an OCALL to exit the enclave and then invoke CPUID in untrusted code. Fortunately, the Intel SGX SDK designers have created two convenient functions that greatly simplify this task: sgx_cpuid and sgx_cpuidex. These functions automatically perform the OCALL for us, and the OCALL itself is automatically generated. The only requirement is that the EDL file must import the sgx_tstdc.edl header:

enclave {

	/* Needed for the call to sgx_cpuidex */
	from "sgx_tstdc.edl" import *;

    trusted {
        /* define ECALLs here. */

		public int ve_initialize ();
		public int ve_initialize_from_header ([in, count=len] unsigned char *header, uint16_t len);
		/* Our other ECALLs have been omitted for brevity */
	};

    untrusted {
    };
};

The feature detection code in the E_DRNG constructor becomes:

static int _drng_support= DRNG_SUPPORT_UNKNOWN;

E_DRNG::E_DRNG(void)
{
	int info[4];
	sgx_status_t status;

	if (_drng_support != DRNG_SUPPORT_UNKNOWN) return;

	_drng_support = DRNG_SUPPORT_NONE;

	// Check our feature support

	status= sgx_cpuid(info, 0);
	if (status != SGX_SUCCESS) return;

	if (memcmp(&(info[1]), "Genu", 4) ||
		memcmp(&(info[3]), "ineI", 4) ||
		memcmp(&(info[2]), "ntel", 4)) return;

	status= sgx_cpuidex(info, 1, 0);
	if (status != SGX_SUCCESS) return;

	if ( ((UINT) info[2]) & (1<<30) ) _drng_support |= DRNG_SUPPORT_RDRAND;

#ifdef COMPILER_HAS_RDSEED_SUPPORT
	status= sgx_cpuidex(info, 7, 0);
	if (status != SGX_SUCCESS) return;

	if ( ((UINT) info[1]) & (1<<18) ) _drng_support |= DRNG_SUPPORT_RDSEED;
#endif
}


Because calls to the CPUID instruction must take place in untrusted memory, the results of CPUID cannot be trusted! This warning applies whether you execute CPUID yourself or rely on the Intel SGX SDK functions to do it for you. The Intel SGX SDK offers this advice: “Code should verify the results and perform a threat evaluation to determine the impact on trusted code if the results were spoofed.”

In our tutorial password manager, there are three possible outcomes:

  1. RDRAND and/or RDSEED are not detected, but a positive result for one or both is spoofed. This will lead to an illegal instruction fault at runtime, at which point the program will crash.
     
  2. RDRAND is detected, but a negative result is spoofed. This will result in an error at runtime, causing the program to exit gracefully since a required feature is not detected.
     
  3. RDSEED is detected, but a negative result is spoofed. This will cause the program to fall back to the seed-from-RDRAND method for generating random seeds, which has a small performance impact. The program will otherwise function normally.

Since our worst-case scenarios are denial-of-service issues that do not compromise the application’s secrets, we will not attempt to detect spoofing attacks.

Generating Seeds from RDRAND

In the event that the underlying CPU does not support the RDSEED instruction, we need to be able to use the RDRAND instruction to generate random seeds that are functionally equivalent to what we would have received from RDSEED if it were available. The Intel Digital Random Number Generator (DRNG) Software Implementation Guide describes the process of obtaining random seeds from RDRAND in detail, but the short version is that one method for doing this is to generate 512 pairs of 128-bit values and mix the intermediate values together using the CBC-MAC mode of AES to produce a single, 128-bit seed. The process is repeated to generate as many seeds as necessary.

In the non-Intel SGX code path, the method seed_from_rdrand uses CNG to build the cryptographic algorithm. Since the Intel SGX code path can’t depend on CNG, we once again need to turn to the trusted cryptographic library that is distributed with the Intel SGX SDK. The changes are summarized in Table 3.

Algorithm

CNG Primitives and Support Functions

Intel® SGX Trusted Cryptography Library Primitives and Support Functions

aes-cmac

BCryptOpenAlgorithmProvider
BCryptGenerateSymmetricKey
BCryptSetProperty
BCryptEncrypt
BCryptDestroyKey
BCryptCloseAlgorithmProvider

sgx_cmac128_init
sgx_cmac128_update
sgx_cmac128_final
sgx_cmac128_close

Table 3. Cryptographic function changes to the E_DRNG class’s seed_from_rdrand method

Why is this algorithm embedded in the DRNG class and not implemented in the Crypto class with the other cryptographic algorithms? This is simply a design decision. The DRNG class only needs this one algorithm, so we chose not to create a co-dependency between DRNG and Crypto (currently, Crypto does depend on DRNG). The Crypto class is also structured to provide the cryptographic services for vault operations rather than function as a general-purpose cryptographic API.

Why Not Use sgx_read_rand?

The Intel SGX SDK provides the function sgx_read_rand as a means of obtaining random numbers inside of an enclave. There are three reasons why we aren’t using it:

  1. As documented in the Intel SGX SDK, this function is “provided to replace the C standard pseudo-random sequence generation functions inside the enclave, since these standard functions are not supported in the enclave, such as rand, srand, etc.” While sgx_read_rand does call the RDRAND instruction if it is supported by the CPU, it falls back to the trusted C library’s implementation of srand and rand if it is not. The random numbers produced by the C library are not suitable for cryptographic use. It is highly unlikely that this situation will ever occur, but as mentioned in the section on CPUID, we must not assume that it will never occur.
  2. There is no Intel SGX SDK function for calling the RDSEED instruction and that means we still have to use compiler intrinsics in our code. While we could replace the RDRAND intrinsics with calls to sgx_read_rand, it would not gain us anything in terms of code management or structure and it would cost us additional time.
  3. The intrinsics will marginally outperform sgx_read_rand since there is one less layer of function calls in the resulting code.

Wrapping Up

With these code changes, we have a fully functioning enclave! However, there are still some inefficiencies in the implementation and some gaps in functionality, and we’ll revisit the enclave design in Parts 7 and 8 in order to address them.

As mentioned in the introduction, there is sample code provided with this part for you to download. The attached archive includes the source code for the Tutorial Password Manager core, including the enclave and its wrapper functions. This source code should be functionally identical to Part 3, only we have hardcoded Intel SGX support to be on.

Coming Up Next

In Part 6 of the tutorial we’ll add dynamic feature detection to the password manager, allowing it to choose the appropriate code path based on whether or not Intel SGX is supported on the underlying platform. Stay tuned!

Installation Failure Due to Spaces in Path

$
0
0

When attempting to install an Intel® Software Development Product you may encounter the following error message: "The serial number you provided is not valid for product XXXX for Linux*". The issue may be that the path you are using has spaces in it.

This is an issue on Linux* that impacts all products. It may also be an issue on Mac OS*. It is not an issue on Windows*.

Please be advised, do not use spaces in any path (path to package or any path entered in installer including install destination).

Have questions?

Check out the Installation FAQ
Or ask* in our Developer Forums

System Analyzer Utility for Linux

$
0
0

Overview

This article describes a utility to help diagnose system and installation issues for Intel(R) SDK for OpenCL(TM) Applications and Intel(R) Media Server Studio.  It is a simple Python script with full source code available.

It is intended as a reference for the kinds of checks to consider from the command line and possibly from within applications.  However, this implementation should be considered a prototype/proof of concept -- not a productized tool.

Features

When executed, the tool reports back 

  • Platform readiness: check if processor has necessary GPU components
  • OS readiness: check if OS can see GPU, and if it has required glibc/gcc level
  • Install checks for Intel(R) Media Server Studio/Intel(R) SDK for OpenCL Applications components
  • Results from runs of small smoke test programs for Media SDK and OpenCL

System Requirements

The tool is based on Python 2.7.  It should run on a variety of systems with or without necessary components to run GPU applications.  However, it is still a work in progress so it may not always exit cleanly when software components are missing.

 

Using the Software

The display should look like the output below for a successful installation

$ python sys_analyzer_linux.py -v
--------------------------
Hardware readiness checks:
--------------------------
 [ OK ] Processor name: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
 [ INFO ] Intel Processor
 [ INFO ] Processor brand: Core
 [ INFO ] Processor arch: Skylake
--------------------------
OS readiness checks:
--------------------------
 [ INFO ] GPU PCI id     : 1916
 [ INFO ] GPU description: SKL ULT GT2
 [ OK ] GPU visible to OS
 [ INFO ] no nomodeset in GRUB cmdline (good)
 [ INFO ] Linux distro   : Ubuntu 14.04
 [ INFO ] Linux kernel   : 4.4.0
 [ INFO ] glibc version  : 2.19
 [ INFO ] gcc version    : 4.8.4 (>=4.8.2 suggested)
 [ INFO ] /dev/dri/card0 : YES
 [ INFO ] /dev/dri/renderD128 : YES
--------------------------
Intel(R) Media Server Studio Install:
--------------------------
 [ OK ] user in video group
 [ OK ] libva.so.1 found
 [ INFO ] Intel iHD used by libva
 [ OK ] vainfo reports valid codec entry points
 [ INFO ] i915 driver in use by Intel video adapter
 [ OK ] /dev/dri/renderD128 connects to Intel i915

--------------------------
Media SDK Plugins available:
(for more info see /opt/intel/mediasdk/plugins/plugins.cfg)
--------------------------
    H264LA Encoder 	= 588f1185d47b42968dea377bb5d0dcb4
    VP8 Decoder 	= f622394d8d87452f878c51f2fc9b4131
    HEVC Decoder 	= 33a61c0b4c27454ca8d85dde757c6f8e
    HEVC Encoder 	= 6fadc791a0c2eb479ab6dcd5ea9da347
--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] Media SDK HW API level:1.19
 [ OK ] Media SDK SW API level:1.19
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

 

 

Data Plane Development Kit (DPDK) Packet Capture Framework

$
0
0

This article describes how the Data Plane Development Kit (DPDK) packet capture framework is used for capturing packets on the DPDK ports. It is written with users of DPDK in mind who want to know more about the feature and for those who want to monitor traffic on DPDK-controlled devices.

The DPDK packet capture framework was introduced in DPDK v16.07. The DPDK packet capture framework consists of the DPDK pdump library and DPDK pdump tool.

DPDK pdump Library and pdump Tool

The librte_pdump library provides the APIs required to allow users to initialize the packet capture framework and to enable or disable packet capture. The library works on a client/server model and its usage is recommended for debugging purposes.

The ‘dpdk-pdump’ tool runs as a DPDK secondary process and is capable of enabling or disabling packet capture on the DPDK ports. The tool is developed based on the librte_pdump library. The dpdk-pdump tool provides command-line options with which users can request enabling or disabling of the packet capture on DPDK ports. The dpdk-pdump tool can only be used in conjunction with a primary application which has the packet capture framework initialized already.

The application which initializes the packet capture framework will act as a server and the application that enables or disables the packet capture will act as a client. The server sends the Rx and Tx packets from the DPDK ports to the client.

The DPDK ‘testpmd’ application is modified to initialize the packet capture framework and act as a server, and the dpdk-pdump tool acts as a client. To view Rx or Tx packets of testpmd, the application should be launched first, and then the dpdk-pdump tool. Packets from the testpmd will be sent to the tool, which then sends them on to the pcap pmd device and that device writes them to the pcap file or to an external interface depending on the command-line option used.

Test Environment

Figure 1 shows the usage of the dpdk-pdump tool for packet capturing on the DPDK port.

Figure 1

Figure 1:Packet capturing on DPDK port using the dpdk-pdump tool.

Configuration Steps

The following steps demonstrate how to run the dpdk-pdump tool to capture Rx side packets of dpdk_port0 and inspect them using tcpdump.

  1. Build DPDK as described in the installation docs. Make sure DPDK is built with the following configuration options set:
    CONFIG_RTE_LIBRTE_PMD_PCAP=y
    CONFIG_RTE_LIBRTE_PDUMP=y
  2. Launch testpmd as the primary application:
    sudo ./app/testpmd -c 0xf0 -n 4 -- -i --port-topology=chained
  3. Launch the pdump tool  as follows:
    sudo ./build/app/dpdk-pdump -- --pdump 'port=0,queue=*,rx-dev=/tmp/capture.pcap'
  4. Send some traffic to dpdk_port0 via a traffic generator.
  5. Inspect the contents of capture.pcap using a tool that can interpret pcap files, for example tcpdump:
    $tcpdump -nr /tmp/capture.pcap
    reading from file /tmp/capture.pcap, link-type EN10MB (Ethernet)
    11:11:36.891404 IP 4.4.4.4.whois++ > 3.3.3.3.whois++: UDP, length 18
    11:11:36.891442 IP 4.4.4.4.whois++ > 3.3.3.3.whois++: UDP, length 18
    11:11:36.891445 IP 4.4.4.4.whois++ > 3.3.3.3.whois++: UDP, length 18

Conclusion

In this article we described the DPDK pdump library and the pdump tool and how they can be used to capture traffic passing on DPDK ports.

Additional Information

More details on librte_pdump library and dpdk-pdump tool can be found at following links.

The DPDK Programmer's Guide has a section dedicated to the librte_pdump library, which contains more information about how it works.

The Sample Applications Users Guide has a section dedicated to the dpdk-pdump application.

The article DPDK Pdump in Open vSwitch* with DPDK describes use of dpdk-pdump for DPDK devices in OvS-DPDK configurations.

If you have questions regarding usage, feel free to follow up with an email query at users@dpdk.org.

About the Author

Reshma Pattan is a Network Software Engineer at Intel Corporation. Her work primarily focuses on data plane library development for DPDK. Her contributions to DPDK include the addition of the packet reorder library, the packet ordering sample application, the pdump library, and the dpdk-pdump tool.

 

 

 

Viewing all 3384 articles
Browse latest View live


<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>