
Action Classification Using PyTorch*


Abstract

For real-world video classification use cases it is imperative to capture spatiotemporal features. In such cases, the interwoven patterns in an optical flow are expected to hold higher significance. By contrast, most implementations learn individual image representations in isolation from the previous frames of the video. In an attempt to explore more appropriate methods, this case study revolves around video classification that sends an alert the moment any violence is detected. PyTorch*1 is used as the deep learning framework, and training and inference are performed on an Intel® Xeon® Scalable processor for better and faster training and inferencing.

Introduction

In typical contemporary scenarios we frequently observe sudden outbursts of physical altercations, such as road rage or a prison upheaval. To address such scenarios, closed-circuit television (CCTV) cameras have been widely deployed to provide extensive virtual coverage of public places. In the case of any untoward incident, it is common to analyze the footage made available through video surveillance and start an investigation. An intervention by security officials while the violence is taking place could prevent loss of precious lives and minimize destruction of public property. One obvious approach to implementing this solution is to position personnel for continuous manual monitoring of CCTV cameras. This is both burdensome and error-prone due to the tedious nature of the job and human limitations. A more effective method is automatic detection of violence in CCTV footage that triggers alerts to security officials, thus reducing the risk of manual errors. While most appealing to the defense and security industries, this solution can also be of relevance to authorities responsible for public property.

In this experiment, we implemented the proposed solution using 3D convolutional neural networks (CNN) with ResNet-342 as the base topology. The experiments were performed using transfer learning on pretrained 3D residual networks (ResNet) initialized with weights of the Kinetics* human action video dataset.

The dataset for training was taken from Google’s atomic visual action (AVA) dataset. This is a binary classification between fighting and non-fighting classes (explained further in the Dataset Preparation section). Each class contains an approximately equal number of instances. The videos for the non-fighting class comprise instances from the eat, sleep, and drive classes made available with the AVA dataset3.

Hardware Details

The configuration of the Intel Xeon Scalable processor is as follows:

Intel® architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
Non-uniform memory access (NUMA) node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel® Xeon® Gold 6128 processor 3.40 GHz
Stepping: 4
CPU MHz: 1199.960
BogoMIPS: 6800.00
Virtualization type: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

Table 1. Intel® Xeon® Gold processor configuration.

Software Configuration

Prerequisite dependencies to proceed with the development of this use case are outlined below:

PyTorch*: 0.3.1
Python*: 3.6
Operating System: CentOS* 7.3.1
OpenCV: 3.3.1
youtube-dl: 2018.01.21
ffmpeg: 3.4
torchvision: 0.2.0

Table 2. Software configuration.

Solution Design

In addition to being time consuming, a CNN requires millions of data points to be trained from scratch. In this context, with only 545 video clips in the fighting class and 450 in the non-fighting class, training a network from scratch could result in an over-fitted network. Therefore, we opted for transfer learning, which minimizes the training time and helps attain better inference. The experiment uses a pretrained 3D CNN ResNet network, initialized with the weights of the Kinetics video action dataset. Fine-tuning of the network is done by training the final layers with the acquired AVA training dataset customized to the fight classification. This fine-tuned model is later used for inference.

Image-based features extracted using 2D convolutions are not directly suitable for deep learning on video-based classifications. Learning and preserving spatiotemporal features is vital here. One of the alternatives for capturing this information is 3D ConvNet4. In 2D ConvNets, convolution and pooling operations are performed spatially, whereas in 3D ConvNets these are done spatiotemporally. The difference in treatment of multiple frames as input is marked in the figures below:

2D convolution
Figure 1. 2D convolution on multiple frames4.

3D convolution
Figure 2. 3D convolution4.

As shown, 2D convolutions applied on multiple images (treating them as different channels), results in an image (figure 1). Even though input is three dimensional—that is, W, H, L, where L is the number of input channels— the output shape is a 2D matrix. Here, convolutions are calculated across two directions and the filter depth matches the input channels. Consequently, there is a loss of temporal information of the input signal after every convolution.

Input shape = [W,H,L] filter = [k,k,L] output = 2D

On the other hand, 3D convolution preserves the temporal information of the input signal and results in an output volume (figure 2). The same phenomenon is applicable for 2D and 3D pooling operations as well. Here, the convolutions are calculated across three directions, giving the output shape of a 3D volume.

Input shape = [W,H,L] filter = [k,k,d] output = 3D

Note: d<L

3D ConvNet models temporal information better because of its 3D convolution and 3D pooling operations.

In our case, video clips are referred to with a size of c × l × h × w, where c is the number of channels, l is the length in number of frames, and h and w are the height and width of the frame, respectively. We also refer to the 3D convolution and pooling kernel size as d × k × k, where d is the kernel temporal depth and k is the kernel spatial size.
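
As a minimal illustration of this difference, the following PyTorch sketch (the clip dimensions are hypothetical) shows how a 2D convolution collapses the temporal axis, while a 3D convolution with a d × k × k kernel preserves it:

import torch
import torch.nn as nn

# Hypothetical clip: c = 3 channels, l = 16 frames, h = w = 112 pixels
clip = torch.randn(1, 3, 16, 112, 112)  # [batch, c, l, h, w]

# 2D convolution treats the frames as extra input channels (3 x 16 = 48),
# so the temporal dimension is lost after a single layer.
conv2d = nn.Conv2d(in_channels=48, out_channels=64, kernel_size=3, padding=1)
print(conv2d(clip.view(1, 48, 112, 112)).shape)   # torch.Size([1, 64, 112, 112])

# 3D convolution keeps a separate temporal axis (kernel d x k x k = 3 x 3 x 3),
# so the output is still a volume with a frame dimension.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)
print(conv3d(clip).shape)                          # torch.Size([1, 64, 16, 112, 112])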

Dataset Preparation

The dataset for training is acquired from the Google AVA3. The original AVA dataset contains 452 videos split into 242 for training, 66 for validation, and 144 for test. Each video has 15 minutes annotated in one-second intervals, resulting in 900 annotated segments. These annotations are specified by two CSV files, ava_train_v2.0.csv and ava_val_v2.0.csv. The CSV file has the following information.

  • video_id: YouTube* identifier.
  • middle_frame_timestamp: in seconds from the start of the YouTube video.
  • person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top-left, and (1.0, 1.0) corresponds to the bottom-right.
  • action_id: identifier of an action class.

Among the 80 action classes available, only the fighting class (450 samples) is considered for positive samples for the current use case, and an aggregate of 450 samples (150 per class) are taken from the eat, sleep, and drive sub classes to form the non-fighting class. The YouTube videos are downloaded using the command-line utility, youtube-dl.
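
For illustration, a sketch of how the class lists might be assembled from the annotation CSV is shown below; the exact column layout and the action IDs used here are assumptions and should be checked against the AVA label map before use.

import csv

# Hypothetical action IDs; look up the real values in the AVA action label file.
FIGHT_IDS = {64}            # e.g., fight/hit (a person)
NON_FIGHT_IDS = {1, 2, 3}   # e.g., eat, sleep, drive

fight_clips, non_fight_clips = set(), set()
with open("ava_train_v2.0.csv") as f:
    for video_id, timestamp, x1, y1, x2, y2, action_id in csv.reader(f):
        key = (video_id, timestamp)
        if int(action_id) in FIGHT_IDS:
            fight_clips.add(key)
        elif int(action_id) in NON_FIGHT_IDS:
            non_fight_clips.add(key)

print(len(fight_clips), "fighting segments,", len(non_fight_clips), "non-fighting segments")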

Each clip is four seconds long and has approximately 25 frames per second. The frames for each clip are extracted into a separate folder with the folder name as the name of the video clip. These are extracted using the ava extraction script.
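
A simplified stand-in for that extraction step, driving ffmpeg from Python, could look like the sketch below; the helper and folder layout are hypothetical and only mirror the description above.

import subprocess
from pathlib import Path

def extract_frames(clip_path, out_root="frames"):
    """Dump every frame of a clip into a folder named after the clip."""
    out_dir = Path(out_root) / Path(clip_path).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["ffmpeg", "-i", str(clip_path), str(out_dir / "image_%05d.jpg")],
                   check=True)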

Data Conversion

The ffmpeg library is used for converting the available AVA video clips to frames. The frames are then converted to type Float Tensor using the Tensor class provided with PyTorch. This conversion results in efficient memory management as the tensor operations in this class do not make memory copies. The methods either transform the existing tensor or return a new tensor referencing the same storage.
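
A minimal sketch of this conversion (assuming the frames have already been extracted to disk and using OpenCV to read them) might look like the following; the target shape matches the c × l × h × w convention used earlier.

import cv2
import numpy as np
import torch

def clip_to_tensor(frame_paths):
    """Stack extracted frames into a FloatTensor of shape [c, l, h, w]."""
    frames = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in frame_paths]
    clip = np.stack(frames)                        # [l, h, w, c], uint8
    clip = torch.from_numpy(clip).float() / 255.0  # FloatTensor scaled to [0, 1]
    return clip.permute(3, 0, 1, 2)                # [c, l, h, w]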

Network Topology


The architecture followed for the current use case is ResNet based with 3D convolutions. A basic ResNet block consists of two convolutional layers and each convolutional layer is followed by batch normalization and a rectified linear unit (ReLU). A shortcut pass5 connects the top of the block to the layer just before the last ReLU in the block. For our experiments, we use the relatively shallow ResNet-34 that adopts the basic blocks.

Architecture of 3D CNN
Figure 3. Architecture of the 3D CNN ResNet-34.

When the dimensions increase (dotted line shortcuts in the given figure), the following two options are considered:

  • Shortcut performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter.
  • The projection shortcut is used to match dimensions (done by 1×1 convolutions). 

For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

We have used Type A shortcuts with the ResNet-34 basic block to avoid increasing the number of parameters of the relatively shallow network.

The 3D convolutions are used to directly extract the spatiotemporal features from raw videos. A two-channeled approach of using a combination of RGB color space and optical flows as inputs to the 3D CNNs is used on the Kinetics dataset to derive the pretrained network. As pretraining on large-scale datasets is an effective way to achieve good performance levels on small datasets, we expect the 3D ResNet-34 pretrained on Kinetics to perform well for this use case.

Training

The training is performed using the Intel Xeon Scalable processor. The pretrained weights used for this experiment can be downloaded from GitHub*. This model is trained on the Kinetics Video dataset.

A brief description of the pretrained model is provided below:

resnet-34-kinetics-cpu.pth: --model resnet --model_depth 34 --resnet_shortcut A

The solution is based on the 3D-Resnets-PyTorch implementation by Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh.

The number of frames per clip is written to the n_frames files generated using utils/n_frames_kinetics.py. After this, an annotation file is generated in JavaScript* Object Notation (JSON) format using utils/kinetics_json.py. The opts.py file contains the train and test dataset paths and the default values for fine-tuning parameters, which can be changed to suit the use case. Fine-tuning is done on the conv5_x and fc layers of the pretrained model. The checkpoints are saved as .pth files every 10th epoch. The system was trained for 850 epochs, and the loss converged to approximately 0.22.
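
A sketch of how such selective fine-tuning is typically expressed in PyTorch is shown below; the attribute names (layer4 for conv5_x, fc for the classifier) follow the usual ResNet layout and are assumptions here, as is the optimizer setup.

from torch import nn, optim

# `model` is the 3D ResNet-34 with the Kinetics weights already loaded.
# Freeze everything, then unfreeze the last residual stage (conv5_x) and the classifier.
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():   # conv5_x
    param.requires_grad = True

model.fc = nn.Linear(model.fc.in_features, 2)   # binary head: fighting vs. non-fighting

optimizer = optim.SGD((p for p in model.parameters() if p.requires_grad),
                      lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()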

Inference

Inference is run with the trained model on YouTube videos downloaded from the AVA test dataset. The video clips are broken down into frames, which are converted to Torch* tensors and passed to the classifier. The frames obtained per video clip are divided into segments, and a classification score is obtained for each segment. The classification results are written onto the video frames, which are then stitched back into a video. Inferencing is done using the code in this GitHub link.
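
A sketch of the per-segment scoring described above follows; the segment length and the index of the fighting class are assumptions chosen for illustration.

import torch

def classify_clip(model, clip, segment_len=16):
    """Split a [c, l, h, w] clip into fixed-length segments and score each one."""
    model.eval()
    scores = []
    with torch.no_grad():
        for start in range(0, clip.shape[1] - segment_len + 1, segment_len):
            segment = clip[:, start:start + segment_len].unsqueeze(0)  # [1, c, segment_len, h, w]
            probs = torch.softmax(model(segment), dim=1)
            scores.append(probs[0, 1].item())   # probability of the fighting class
    return scores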

Given that input videos are located in ./videos, the following command is used for inference:

python main.py --input ./input --video_root ./videos --output ./output.json --model ./resnet-34-kinetics.pth --mode score

Results

The following gif is extracted from the video results obtained by passing a video clip to the trained PyTorch model.

Inferred GIF
Figure 4. Inferred GIF.

Conclusion and Future Work

The results are obtained with a high level of accuracy. AVA contains samples from movies that are at least a decade old. Hence, to test the efficacy of the trained model, inferencing was done on external videos (CCTV footage, recently captured fight sequences, and so on). The results showed that the quality of the video or the period during which the video was captured did not influence the accuracy. As a future work, we could enhance the classification problem with detection. Also, the same experiment can be carried out using recurrent neural network techniques.

About the Authors

Astha Sharma and Sandhiya S are Technical Solution Engineers working with the Intel® AI Academia Program.

References

  1. PyTorch
  2. 3D Convolutional Neural Networks with ResNet-34
  3. AVA Dataset
  4. 3D CNN
  5. Type A and Type B resnet_shortcut

Getting Started with Parallel STL


Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the working draft N4659 for the next version of the C++ standard, commonly called C++17. The implementation also supports the unsequenced execution policy specified in the ISO* C++ working group paper P0076R3.

Parallel STL offers efficient support for both parallel and vectorized execution of algorithms for Intel® processors. For sequential execution, it relies on an available implementation of the C++ standard library.

Parallel STL is available as a part of Intel® Parallel Studio XE and Intel® System Studio.

 

Prerequisites

To use Parallel STL, you must have the following software installed:

  • C++ compiler with:
    • Support for C++11
    • Support for OpenMP* 4.0 SIMD constructs
  • Intel® Threading Building Blocks (Intel® TBB) 2018

The latest version of the Intel® C++ Compiler is recommended for better performance of Parallel STL algorithms compared to previous compiler versions.

To build an application that uses Parallel STL on the command line, you need to set the environment variables for compilation and linkage. You can do this by calling suite-level environment scripts such as compilervars.{sh|csh|bat}, or you can set just the Parallel STL environment variables by running pstlvars.{sh|csh|bat} in <install_dir>/{linux|mac|windows}/pstl/bin.

<install_dir> is the installation directory; by default, it is:

For Linux* and macOS*:

  • For super-users:      /opt/intel/compilers_and_libraries_<version>
  • For ordinary users:  $HOME/intel/compilers_and_libraries_<version>

For Windows*:

  • <Program Files>\IntelSWTools\compilers_and_libraries_<version>

 

Using Parallel STL

Follow these steps to add Parallel STL to your application:

  1. Add the <install_dir>/pstl/include folder to the compiler include paths. You can do this by calling the pstlvars script.

  2. Add #include "pstl/execution" to your code. Then add a subset of the following set of lines, depending on the algorithms you intend to use:

    • #include "pstl/algorithm"
    • #include "pstl/numeric"
    • #include "pstl/memory"
  3. When using algorithms and execution policies, specify the namespace std::execution if there is no vendor implementation of the C++17 standard library, or pstl::execution otherwise. See the 'Examples' section below.
  4. For any of the implemented algorithms, pass one of the values seq, unseq, par or par_unseq as the first parameter in a call to the algorithm to specify the desired execution policy. The policies have the following meaning:

     

    Execution policy   Meaning
    seq                Sequential execution.
    unseq              Try to use SIMD. This policy requires that all functions provided are SIMD-safe.
    par                Use multithreading.
    par_unseq          Combined effect of unseq and par.

     

  5. Compile the code as C++11 (or later) and using compiler options for vectorization:

    • For the Intel® C++ Compiler:
      • For Linux* and macOS*: -qopenmp-simd or -qopenmp
      • For Windows*: /Qopenmp-simd or /Qopenmp
    • For other compilers, find a switch that enables OpenMP* 4.0 SIMD constructs.

    To get good performance, specify the target platform. For the Intel C++ Compiler, some of the relevant options are:

    • For Linux* and macOS*: -xHOST, -xSSE4.1, -xCORE-AVX2, -xMIC-AVX512.
    • For Windows*: /QxHOST, /QxSSE4.1, /QxCORE-AVX2, /QxMIC-AVX512.
    If using a different compiler, see its documentation.

     

  6. Link with the Intel TBB dynamic library for parallelism. For the Intel C++ Compiler, use the options:

    • For Linux* and macOS*: -tbb
    • For Windows*: /Qtbb (optional, this should be handled by #pragma comment(lib, <libname>))

Version Macros

Macros related to versioning are described below. You should not redefine these macros.

PSTL_VERSION

Current Parallel STL version. The value is a decimal numeral of the form xyy where x is the major version number and yy is the minor version number.

PSTL_VERSION_MAJOR

PSTL_VERSION/100; that is, the major version number.

PSTL_VERSION_MINOR

PSTL_VERSION - PSTL_VERSION_MAJOR * 100; that is, the minor version number.

Macros

PSTL_USE_PARALLEL_POLICIES

This macro controls the use of parallel policies.

When set to 0, it disables the par and par_unseq policies, making their use a compilation error. This is recommended for code that only uses vectorization with the unseq policy, in order to avoid a dependency on the Intel® TBB runtime library.

When the macro is not defined (default) or evaluates to a non-zero value, all execution policies are enabled.

PSTL_USE_NONTEMPORAL_STORES

This macro enables the use of #pragma vector nontemporal in the algorithms std::copy, std::copy_n, std::fill, std::fill_n, std::generate, std::generate_n with the unseq policy. For further details about the pragma, see the User and Reference Guide for the Intel® C++ Compiler at https://software.intel.com/en-us/node/524559.

If the macro evaluates to a non-zero value, the use of #pragma vector nontemporal is enabled.

When the macro is not defined (default) or set to 0, the macro does nothing.

 

Examples

Example 1

The following code calls vectorized copy:

#include "pstl/execution"
#include "pstl/algorithm"
void foo(float* a, float* b, int n) {
    std::copy(pstl::execution::unseq, a, a+n, b);
}

Example 2

This example calls the parallelized version of fill_n:

#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

int main()
{
    std::vector<int> data(10000000);
    std::fill_n(pstl::execution::par_unseq, data.begin(), data.size(), -1);  // Fill the vector with -1

    return 0;
}

Implemented Algorithms

Parallel STL supports all of the aforementioned execution policies only for the algorithms listed in the following table. Adding a policy argument to any of the rest of the C++ standard library algorithms will result in sequential execution.

 

adjacent_find: http://en.cppreference.com/w/cpp/algorithm/adjacent_find
all_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
any_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
copy: http://en.cppreference.com/w/cpp/algorithm/copy
copy_if: http://en.cppreference.com/w/cpp/algorithm/copy
copy_n: http://en.cppreference.com/w/cpp/algorithm/copy_n
count: http://en.cppreference.com/w/cpp/algorithm/count
count_if: http://en.cppreference.com/w/cpp/algorithm/count
destroy: http://en.cppreference.com/w/cpp/memory/destroy
destroy_n: http://en.cppreference.com/w/cpp/memory/destroy_n
equal: http://en.cppreference.com/w/cpp/algorithm/equal
exclusive_scan: http://en.cppreference.com/w/cpp/algorithm/exclusive_scan
fill: http://en.cppreference.com/w/cpp/algorithm/fill
fill_n: http://en.cppreference.com/w/cpp/algorithm/fill_n
find: http://en.cppreference.com/w/cpp/algorithm/find
find_end: http://en.cppreference.com/w/cpp/algorithm/find_end
find_first_of: http://en.cppreference.com/w/cpp/algorithm/find_first_of
find_if: http://en.cppreference.com/w/cpp/algorithm/find
find_if_not: http://en.cppreference.com/w/cpp/algorithm/find
for_each: http://en.cppreference.com/w/cpp/algorithm/for_each
for_each_n: http://en.cppreference.com/w/cpp/algorithm/for_each_n
generate: http://en.cppreference.com/w/cpp/algorithm/generate
generate_n: http://en.cppreference.com/w/cpp/algorithm/generate_n
inclusive_scan: http://en.cppreference.com/w/cpp/algorithm/inclusive_scan
is_heap: http://en.cppreference.com/w/cpp/algorithm/is_heap
is_heap_until: http://en.cppreference.com/w/cpp/algorithm/is_heap_until
is_partitioned: http://en.cppreference.com/w/cpp/algorithm/is_partitioned
is_sorted: http://en.cppreference.com/w/cpp/algorithm/is_sorted
is_sorted_until: http://en.cppreference.com/w/cpp/algorithm/is_sorted_until
lexicographical_compare: http://en.cppreference.com/w/cpp/algorithm/lexicographical_compare
max_element: http://en.cppreference.com/w/cpp/algorithm/max_element
merge: http://en.cppreference.com/w/cpp/algorithm/merge
min_element: http://en.cppreference.com/w/cpp/algorithm/min_element
minmax_element: http://en.cppreference.com/w/cpp/algorithm/minmax_element
mismatch: http://en.cppreference.com/w/cpp/algorithm/mismatch
move: http://en.cppreference.com/w/cpp/algorithm/move
none_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
partial_sort: http://en.cppreference.com/w/cpp/algorithm/partial_sort
partition_copy: http://en.cppreference.com/w/cpp/algorithm/partition_copy
reduce: http://en.cppreference.com/w/cpp/algorithm/reduce
remove_copy: http://en.cppreference.com/w/cpp/algorithm/remove_copy
remove_copy_if: http://en.cppreference.com/w/cpp/algorithm/remove_copy
replace: http://en.cppreference.com/w/cpp/algorithm/replace
replace_copy: http://en.cppreference.com/w/cpp/algorithm/replace_copy
replace_copy_if: http://en.cppreference.com/w/cpp/algorithm/replace_copy
replace_if: http://en.cppreference.com/w/cpp/algorithm/replace
search: http://en.cppreference.com/w/cpp/algorithm/search
search_n: http://en.cppreference.com/w/cpp/algorithm/search_n
sort: http://en.cppreference.com/w/cpp/algorithm/sort
stable_sort: http://en.cppreference.com/w/cpp/algorithm/stable_sort
swap_ranges: http://en.cppreference.com/w/cpp/algorithm/swap_ranges
transform: http://en.cppreference.com/w/cpp/algorithm/transform
transform_exclusive_scan: http://en.cppreference.com/w/cpp/algorithm/transform_exclusive_scan
transform_inclusive_scan: http://en.cppreference.com/w/cpp/algorithm/transform_inclusive_scan
transform_reduce: http://en.cppreference.com/w/cpp/algorithm/transform_reduce
uninitialized_copy: http://en.cppreference.com/w/cpp/memory/uninitialized_copy
uninitialized_copy_n: http://en.cppreference.com/w/cpp/memory/uninitialized_copy_n
uninitialized_default_construct: http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct
uninitialized_default_construct_n: http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct_n
uninitialized_fill: http://en.cppreference.com/w/cpp/memory/uninitialized_fill
uninitialized_fill_n: http://en.cppreference.com/w/cpp/memory/uninitialized_fill_n
uninitialized_move: http://en.cppreference.com/w/cpp/memory/uninitialized_move
uninitialized_move_n: http://en.cppreference.com/w/cpp/memory/uninitialized_move_n
uninitialized_value_construct: http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct
uninitialized_value_construct_n: http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct_n
unique_copy: http://en.cppreference.com/w/cpp/algorithm/unique_copy

Known limitations

Parallel and vector execution is supported only for a subset of the aforementioned algorithms, and only if random access iterators are provided; otherwise, execution remains serial.

Legal Information

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© Intel Corporation

Code Sample: Custom Audio Editor Tool with Unreal Engine* for Sound Spatialization in VR


File(s): Download
License: Intel Sample Source Code License Agreement
Optimized for...
OS: Microsoft Windows® 10 (64 bit)
Hardware: GPU required, HTC Vive*
Software (Programming Language, tool, IDE, Framework): Microsoft Visual Studio* 2017, C#; Unreal Engine* 4.18.1 or greater
Prerequisites: Familiarity with Visual Studio, Unreal Engine API, 3D graphics, parallel processing.

Summary

This code sample shows you, step by step, how to build a useful tool for VR developers using Unreal Engine that leverages the power of Intel® CPUs. Unreal Engine has a powerful virtual reality editor option, but one thing it does not include is the ability to edit and place sounds while inside VR. It can be troublesome constantly having to restart the editor after adjusting a sound to test what it sounds like in VR. So we decided to create a sound editor that allows game devs and sound designers alike to quickly place, edit, and test spatialized sound inside VR. This prevents the user from having to constantly enter and exit the editor.

What You Will Learn

  • Motion controller interaction
  • How to create a custom C++ class
  • VR UI
  • Saving editor changes
  • Sound spatialization parameters

Below, we will walk you through step-by-step to demonstrate how we made this custom audio editor tool for Unreal Engine from start to finish.

Instruction

Before we begin, you need to do a couple of things. Download and unzip the project folder. You also need to make sure you have at least version 4.18.1 of Unreal Engine* installed.

When you have downloaded and unzipped the folder, right-click on Intel_VR_Audio_Tools.uproject and select "Generate Visual Studio project files." After that completes, open the project. A popup that says "Missing Intel_VR_Audio_Tools Modules" will appear. Click "Yes" to start the rebuild; this should take less than 20 seconds. This is needed because of how you are dynamically finding .wav files that have been added to the project, which will be explained in the Custom C++ Class section.

Follow the tutorial for the step-by-step to build the custom tool.

Updated Log

Created April 25, 2018

Characterizing DPDK-Enabled Open vSwitch* Using TRex on a Dual-Socket System


Introduction

Measuring the performance of an Open vSwitch* (OvS) or any vSwitch can be difficult without resorting to expensive commercial traffic generators, or at least some extra network nodes on which to run a software traffic generator.

However, on dual-socket or multi-socket systems there is another option. The software traffic generator can be run on one socket while OvS as the System Under Test (SUT) can reside on the other socket. Being on separate sockets, the two will work well without interfering with each other. Each will use memory from its own socket’s RAM and will use its own PCIe bus. The only extra hardware requirement is a second network interface card (NIC) to drive the generated traffic without interfering with the NIC and PCI-bus of the SUT.

This article shows you how to set up this configuration using the TRex Realistic Traffic Generator.

The overall system looks like this:

Overall system configuration
Figure 1. Overall system configuration.

Setting up the Hardware

It is important to choose the correct PCIe* slots for your NICs. For instance, to be able to handle the incoming traffic from even one 10 Gbps NIC you need a PCIe Gen3 NIC on an 8-lane bus. See Understanding Performance of PCI Express Systems.

For this article, I used a system based on an Intel® Server Board S2600WT.

sudo dmidecode -t system
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Intel Corporation
        Product Name: S2600WT2
…

I can use this information to look up the technical specification for the board and see exactly which PCIe slots will give sufficient bandwidth while placing the two NICs on different sockets.

The system will be functional even if the PCIe placement is wrong; however, the direct memory access (DMA) will have to cross the socket interconnect, which will become a bottleneck and the performance observed by TRex will not be accurate.

Host Configuration

You should also be aware of the implications of other host configuration options that affect Data Plane Development Kit (DPDK) applications: hyper-threading and core ID to socket mapping, core isolation, and so on. These are covered in Open vSwitch with DPDK (Ovs-DPDK).

Care has to be taken to ensure that the cores assigned to TRex and the cores assigned to OvS-DPDK are on the correct sockets (i.e., the same socket as their assigned NICs). Be aware that if hyper-threading is enabled, the core (strictly speaking, the “logical CPU”) to socket mapping is more complex. If you are using a program such as htop to verify which cores are running poll-mode drivers (PMDs), also be aware that TRex uses virtually no CPU resources to receive packets (as they actually terminate on the NIC) and only uses a noticeable amount of CPU to transmit packets. So, if TRex is not actively transmitting, htop will not report its CPUs as being under load.

Setting up TRex

The TRex manual has a good sanity-check tutorial to ensure TRex is installed and working correctly in the “First time Running” section. There is also a list of other TRex documentation that should be scanned to get an idea of its other functionality.

Once TRex is running successfully in loop-back mode, connect the OvS and TRex NICs directly.

This article assumes you are already comfortable setting up Open vSwitch with DPDK (Ovs-DPDK). If not, the process is documented within OvS at Open vSwitch with DPDK. Only changes to the standard OvS-DPDK configuration are described below.

Hugepages

TRex uses DPDK, which requires hugepages. If hugepages are not configured on your system TRex will do this configuration for you auto-magically. Unfortunately, it does not account for other users of hugepages such as OvS-DPDK, so we need to set up and mount the hugepages manually beforehand in order to prevent TRex from reconfiguring everything when it starts up.

Create 4 GB of 2 MB hugepages (that is, 2048 x 2 MB hugepages) on each socket:

root# echo 2048 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
root# echo 2048 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

If not already mounted, mount the hugepages:

root# mkdir /dev/hugepages
root# mount -t hugetlbfs nodev /dev/hugepages

As both TRex and OvS-DPDK create hugepage-backed files via DPDK, each needs to be given its own file prefix instead of the default.

For OvS, when ovsdb is running but before vswitchd is started:

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="--file-prefix=ovs"

For TRex, edit trex_cfg.yaml:

- port_limit    : 4
  prefix        : trex

Without this step you will get errors such as "EAL: Can only reserve 1857 pages from 4096 requested. Current CONFIG_RTE_MAX_MEMSEG=256 is not enough" depending on whether you start OvS or TRex first.

You will be able to see the different hugepage-backed memory-mapped files that DPDK created for each application in /dev/hugepages as ovsmap_NNN and trexmap_NNN.

Other TRex Configurations

Although marked as optional in the TRex documentation, when TRex is used with OvS-DPDK you will need to limit its memory use.

limit_memory : 1024 #MB.

When using more than one core per interface to generate traffic I found:

  • Configuration item “c” must be configured.
  • The master and latency thread IDs must be set to ensure they run on the “TRex” socket and not the “OvS-DPDK” socket. I have used socket#1 as the TRex socket.
  • The length of the “threads” lists must be equal to the value of c.
  • Multiple cores are assigned to an interface by using multiple threads lists. The configuration below assigns cores 16 and 20 to the first interface, 17 and 21 to the second, and so on.

Therefore, you will need a configuration somewhat like:

c : 4 
  …
  platform :
         master_thread_id : 14
         latency_thread_id : 15
         dual_if :
         - socket : 1
           threads : [16, 17, 18, 19]
         - socket : 1
           threads : [20, 21, 22, 23]

For the wiring and OvS forwarding configuration listed above I also used the port_bandwidth_gb and explicitly set the src and dest_macs of the ports so that the outgoing src mac address for each port matched the incoming dest mac. These items may or may not be required:

port_bandwidth_gb : 10

  port_info       :  
          - dest_mac   : 00:03:47:00:01:02
            src_mac    : 00:03:47:00:01:01
          - dest_mac   : 00:03:47:00:01:01
            src_mac    : 00:03:47:00:01:02
          - dest_mac   : 00:03:47:00:00:02
            src_mac    : 00:03:47:00:00:01
          - dest_mac   : 00:03:47:00:00:01
            src_mac    : 00:03:47:00:00:02

Running TRex

TRex can now be run in the usual ways, such as:

sudo ./t-rex-64 -f cap2/dns.yaml -m 250kpps

Or the TRex Python* bindings can be used to exert much more fine-grained control and create your own application-specific traffic generator.
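
As a taste of what the bindings look like, the sketch below builds one simple stream and starts it at a fixed rate; it assumes the TRex server is already running on the same host and that the port numbering matches your configuration, so treat it as illustrative rather than definitive.

from trex_stl_lib.api import *

c = STLClient(server='127.0.0.1')
c.connect()
try:
    c.reset(ports=[0, 1])
    # one continuous UDP stream; the packet contents here are placeholders
    pkt = Ether() / IP(src='16.0.0.1', dst='48.0.0.1') / UDP(dport=12, sport=1025)
    c.add_streams(STLStream(packet=STLPktBuilder(pkt=pkt), mode=STLTXCont(pps=1000)),
                  ports=[0])
    c.start(ports=[0], mult='250kpps', duration=10)
    c.wait_on_traffic(ports=[0])
    print(c.get_stats())
finally:
    c.disconnect()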

Creating Your Own Traffic Generator

By using the Python bindings, it becomes very simple to write your own traffic generator command-line interface (CLI) that is tailored to your own use case. This can make tasks that are slow via a traditional GUI easily repeatable and modifiable. For instance, when I needed to change the total load offered to OvS but at the same time maintain a ratio of offered load across several ports it was straightforward to write a CLI on top of TRex that looked like this:

$ ./mytest.py
(Cmd) dist 1 2 3 4         <<< per port load ratio 1:2:3:4
Dist ratio is [1, 2, 3, 4]
     
(Cmd) start 1000               <<< total load 1000 kpps
-f stl/traffic.py -m 100kpps --port 0 --force
-f stl/traffic.py -m 200kpps --port 1 --force
-f stl/traffic.py -m 300kpps --port 2 --force
-f stl/traffic.py -m 400kpps --port 3 --force

(Cmd) stats               <<< check loss & latency
Stats from last 4.5s
0->1 offered 98 dropped 0 rxd 98 (kpps) => 0% loss
1->0 offered 196 dropped 0 rxd 196 (kpps) => 0% loss
2->3 offered 295 dropped 0 rxd 295 (kpps) => 0% loss
3->2 offered 393 dropped 0 rxd 393 (kpps) => 0% loss
0->1 average 8 jitter 0 total_max 13 (us)
1->0 average 13 jitter 1 total_max 20 (us)
2->3 average 15 jitter 3 total_max 46 (us)
3->2 average 12 jitter 3 total_max 88 (us)
18 3 25 29 7 48 931 4 984 948 5 985   <<< excel pastable format!

(Cmd) start 10000  <<< increase total load while keeping distribution.
-f stl/traffic.py -m 1000kpps --port 0 --force
-f stl/traffic.py -m 2000kpps --port 1 --force
-f stl/traffic.py -m 3000kpps --port 2 --force
-f stl/traffic.py -m 4000kpps --port 3 --force

(Cmd) stats
Stats from last 2.3s
0->1 offered 987 dropped 0 rxd 988 (kpps) => 0% loss
1->0 offered 1964 dropped 0 rxd 1964 (kpps) => 0% loss
2->3 offered 2927 dropped 886 rxd 2040 (kpps) => 30% loss
3->2 offered 3879 dropped 1847 rxd 2031 (kpps) => 47% loss
0->1 average 18 jitter 3 total_max 25 (us)
1->0 average 29 jitter 7 total_max 48 (us)
2->3 average 931 jitter 4 total_max 984 (us)
3->2 average 948 jitter 5 total_max 985 (us)

Something like this was very slow and error-prone to do on a commercial traffic generator, but simple to accomplish using TRex and a small amount of Python.

An elided version of the script with additional comments explaining the use of the TRex API and the Python cmd module follows:

import pprint
import cmd  # see https://docs.python.org/2/library/cmd.html

from trex_stl_lib.api import *


def main():
    # connect to the server
    c = STLClient(server='127.0.0.1')
    c.connect()
    # enter the cli loop
    MyCmd(c).cmdloop()
    ...


class MyCmd(cmd.Cmd):
    def __init__(self, client):
        # standard cmd boilerplate
        cmd.Cmd.__init__(self)
        self.client = client

    # cmd invokes this when you type 'start ...' on the cli
    def do_start(self, line):
        """start <total offered rate kpps>
        Start traffic. Total kpps across all ports. e.g. 'start 10'
        """
        # the doc string above is also the help string for the 'start' command
        (argc, argv) = self._parse(line)
        if argc != 1:
            self.do_help('start')
            return 0  # 1 halts the cmd loop; 0 gets another command

        ...  # port to rate mapping set up here

        for port, rate in enumerate(rates):
            # traffic.py is based on the TRex sample stl/cap.py
            start_line = "-f stl/traffic.py -m %skpps --port %d --force" % (rate, port)
            # tell the TRex server to start the stream on this port at this rate
            rc = self.client.start_line(start_line)

    # cmd invokes this when you type 'dist ...' on the cli
    def do_dist(self, line):
        """Set traffic dist across 4 ports e.g. 'dist 2 1 1 1'"""  # help string for the command

        ...  # rate distribution is parsed and stored here

    def do_stop(self, line):
        """Stop all traffic"""
        rc = self.client.stop(rx_delay_ms=100)  # returns None

    def do_stats(self, line):
        """Display pertinent stats"""
        # using pretty print is a fast way to see the stats' complicated internals
        pp = pprint.PrettyPrinter(indent=4)
        stats = self.client.get_stats()
        pp.pprint(stats)

        ...  # skip the gory details of parsing stats

    def do_quit(self, line):
        return 1  # returning 1 tells the cmd base class to exit

Summary

A software-based traffic generator such as TRex can be used to test vSwitches with very little extra hardware – just a DPDK-compatible NIC. Also, by using the TRex Python bindings it is simple to write a traffic generator CLI that is tailored to your own use case. This can make test scenarios that are slow and error-prone to do via a traffic generator GUI easily repeatable and modifiable, enabling fast exploratory performance testing.

About the Author

Billy O’Mahony is a network software engineer with Intel. He has worked on the Open Platform for Network Functions Virtualization (OPNFV) project and accelerated software switching solutions in the user space running on Intel® architecture. His contributions to Open vSwitch with DPDK include Ingress Scheduling and RXQ/PMD Assignment.

References

Understanding Performance of PCI Express Systems

Intel® Server Board S2600WT - Technical Product Specification

Open vSwitch with DPDK

TRex Manual

TRex Documentation

Getting Started with the New Unity* Entity Component, C# Job System, and Burst Compiler

By Cristiano Ferreira and Mike Geig


Low, medium, and high. Standard fare for GPU settings, but why not CPU settings, too? Today the potential power of the CPU on your end users' machines can vary wildly. Typically, developers will define a CPU min-spec, implement the simulation and gameplay systems using that performance target, and then call it a day. This leaves the many potentially available cores and features built into modern mainstream CPUs sitting idle on the sideline. The new C# job system and entity component system from Unity* don't just allow you to easily leverage previously unused CPU resources, they will also help run all your game code more efficiently in general. Then you can use those extra CPU resources to add more scene dynamism and immersion. In this article, you'll see how to quickly get started learning these new features.

Unity is attacking two important performance problems for computing in game engines. The first problem under assault is inefficient data layout. Unity's Entity Component System (ECS) improves management of data storage for high-performance operations on those structures. The second problem is the lack of a high-performance job language and SIMD vectorization that can operate on that well-organized data. Unity's new C# job system, entity component system and Burst compiler technology leave those shortcomings in the dust.

The Unity entity component system and C# job system are two different things, but they go hand-in-hand. To get to know them, let's look at the current Unity workflow for creating an object in your scene, and then differentiate from there.

In the current Unity workflow, you:

  • Create a GameObject.
  • Add components to your game object that give your object desired properties:
    • Rendering
    • Collision
    • Rigidbody physics
  • Create and add MonoBehaviour scripts to your object to command and alter the states of these components at runtime.

Let's call this the Classic Unity workflow. There are some inherent drawbacks and performance considerations for this way of doing things. For one, data and processing are tightly coupled. This means that code reuse can happen less frequently as processing is tied to a very specific set of data. On top of this, the classic system is very dependent on reference types.

In the Classic GameObject and Components example shown below, the Bullet GameObject is dependent on the Transform, Renderer, Rigidbody, and Collider references. Objects being referenced in these performance-critical scripts exist scattered in heap memory. As a result of this, data is not transformed into a form that can be operated on by the faster SIMD vector units.

Classic gameobject and components lists
Figure 1. Classic gameobject and components lists.

Gaining Speed with Cache Prefetching

Accessing data from system memory is far slower than pulling data from a nearby cache. That is where prefetching comes in. Cache prefetching is when computer hardware predicts what data will be accessed next, and then preemptively pulls it from the original, slower memory into faster memory so that it is warmed and ready when it's needed. Using this, hardware gets a nice performance boost on predictive computations. If you are iterating over an array, the hardware prefetch unit can learn to pull swaths of data from system memory into the cache. When it comes time for the processor to operate on the next part of the array, the necessary data is sitting close by in the cache and ready to go. For tightly packed contiguous data, like you'd have in an array, it's easy for the hardware prefetcher to predict and get the right objects. When many different game objects are sparsely allocated in heap memory, it becomes impossible for the prefetcher to do its thing, forcing it to fetch useless data.

Scattered memory references between gameobjects
Figure 2. Scattered memory references between gameobjects, their behaviors, and their components.

The illustration above shows the random sporadic nature of this data storage method. With the scenario shown above, every single reference (arrows)—even if cached as a member variable—could potentially pull all the way from system memory. The Classic Unity GameObject scenario can get your game prototyped and running in a very short timeline, but it's hardly ideal for performance-critical simulations and games. To deepen the issue, each of those reference types contain a lot of extra data that might not need to be accessed. These unused members also take up valuable space in processor caches. If only a select few member variables of an existing component are needed, the rest can be considered wasted space, as shown in the Wasted Space illustration below:

The few items used for the movement operation
Figure 3. The items in bold indicate the members that are actually used for the movement operation; the rest is wasted space.

To move your GameObject, the script needs to access the position and rotation data members from the Transform component. When your hardware is fetching data from memory, the cache line is filled with much potentially useless data. Wouldn't it be nice if you could simply have an array of only position and rotation members for all of the GameObjects that are supposed to move? This will enable you to perform the generic operation in a fraction of the time.

Enter the Entity Component System

Unity's new entity component system helps eliminate inefficient object referencing. Instead of GameObjects with their own collection of components, let's consider an entity that only contains the data it needs to exist.

In the Entity Component System with Jobs Diagram below, notice that the Bullet entity has no Transform or Rigidbody component attached to it. The Bullet entity is just the raw data needed explicitly for your update routine to operate on. With this new system, you can decouple the processing completely from individual object types.

Entity component system with jobs diagram
Figure 4. Entity component system with jobs diagram.

Of course, it's not just movement systems that benefit from this. Another common component in many games are more complex health systems set up across a wide variety of enemies and allies. These systems typically have little to no variation between object types, so they are another great candidate to leverage the new system. An entity is a handle used to index a collection of different data types that represent it (archetypes for ComponentDataGroups). Systems can filter and operate on all components with the required data without any help from the programmer; more on this later. The data is all efficiently organized in tightly packed contiguous arrays and filtered behind the scenes without the need to explicitly couple systems with entity types. The benefits of this system are immense. Not only does it improve access times with cache efficiency; it also allows advanced technologies (auto-vectorization / SIMD) available in modern CPUs that require this kind of data alignment to be used. This gives you performance by default with your games. You can do much more every frame or do the same thing in a much shorter amount of time. You'll also get a huge performance gain from the upcoming Burst compiler feature for free.

Wasted space generated by the classic system
Figure 5. Note the fragmentation in cache line storage and wasted space generated by the classic system. See image below for data comparison.

Comparison between Transform and DataNeededToMove
Figure 6. Compare the memory footprint associated with a single move operation with both accomplishing the same goal.

The Burst Compiler

The Burst compiler is the behind-the-scenes performance gain that results from the entity component system having organized your data more efficiently. Essentially, the burst compiler will optimize operations on code depending on the processor capabilities on your player's machine. For instance, instead of doing just 1 float operation at a time, maybe you can do 16, 32, or 64 by filling unused registers. The new compiler technology is employed on Unity's new math namespace and code within the C# job system (described below), relying on the fact that the system knows data has been set up the proper way with the entity component system. The current version for Intel CPUs supports Intel® Streaming SIMD Extensions 4 (Intel® SSE4), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) for float and integer. The system also supports different accuracy per method, applied transitively. For example, if you are using a cosine function inside your top-level method with a low accuracy, the whole method will use a low accuracy version of cosine as well. The system also provides for AOT (Ahead-of-Time) compilation with dynamic selection of proper optimized function based on the feature support of the processor currently running the game. Another benefit to this method of compilation is the future-proofing of your game. If a brand-new processor line comes out to market with some amazing new features to be leveraged, Unity can do all of the hard work for you behind the scenes. All it takes is an upgrade to the compiler to reap the benefits. The compiler is package-based and can be upgraded without requiring a Unity editor update. Since the Burst package will be updated at its own cadence, you will be able to take advantage of the latest hardware architectural improvements and features without having to wait for the code to be rolled into the next editor release.

The C# Job System

Most people who have worked with multi-threaded code and generic tasking systems know that writing thread-safe code is difficult. Race conditions can rear their ugly heads in extremely rare cases. If the programmer hasn't thought of them, the result can be potentially critical bugs. On top of that, context-switching is expensive, so learning how to balance workloads to function as efficiently as possible across cores is difficult. Finally, writing SIMD optimized code or SIMD intrinsics is an esoteric skill, sometimes best left to a compiler. The new Unity C# job system takes care of all of these hard problems for you so that you can use all of the available cores and SIMD vectorization in modern CPUs without the headache.

C# job system diagram
Figure 7. C# job system diagram.

Let's look at a simple bullet movement system, for example. Most game programmers have written a manager for some type of GameObject as shown above in the Bullet Manager. Typically, these managers pool a list of GameObjects and update the positions of all active bullets in the scene every frame. This is a good use for the C# job system. Because movement can be treated in isolation, it is well suited to be parallelized. With the C# job system, you can easily pull this functionality out and operate on different chunks of data on different cores in parallel. As the developer, you don't have to worry about managing this work distribution; you only need to focus entirely on your game-specific code. You'll see how to easily do this in a bit.

Combining These Two New Systems

Combining the entity component system and the C# job system gives you a force more powerful than the sum of its parts. Since the entity component system sets up your data in an efficient, tightly packed manner, the job system can split up the data arrays so that they can be efficiently operated on in parallel. Also, you get some major performance benefits from cache locality and coherency. The thin, as-needed allocation and arrangement of data increases the chance that the data your job needs will be in shared memory before it's needed. The layout and job system combination begets predictable access patterns that give the hardware cues to make smart decisions behind the scenes, giving you great performance.

"OK!" You are saying, "This is absolutely amazing, but how do I use this new system?"

To help get your feet wet, let's compare and contrast the code involved in a very simple game that uses the following programming systems:

  1. Classic System
  2. Classic System Using Jobs
  3. Entity Component System Using Jobs

Here's how the game works:

  • The player hits the space bar and spawns a certain amount of ships in that frame.
  • Each generated ship is set to a random X coordinate within the bounds of the screen.
  • Each generated ship has a movement function that sends it toward the bottom of the screen.
  • Each generated ship resets its position once the bottom bound is crossed.

Test Configuration:

  • In this article, we will reference the Unity Profiler, a very powerful tool for isolating bottlenecks and viewing work distribution. See the Unity docs to learn more!
    • Screen captures and data were taken using the Intel® Core™ i7-8700K processor and an NVIDIA GeForce* GTX 1080 graphics card.

1. Classic System

The Classic system checks each frame for spacebar input and triggers the AddShips() method. This method finds a random X/Z position between the left and right sides of the screen, sets the rotation of the ship to point downward, and spawns a ship prefab at that location.

void Update()
{
    if (Input.GetKeyDown("space"))
        AddShips(enemyShipIncremement);
}

void AddShips(int amount)
{
    for (int i = 0; i < amount; i++)
    {
        float xVal = Random.Range(leftBound, rightBound);
        float zVal = Random.Range(0f, 10f);

        Vector3 pos = new Vector3(xVal, 0f, zVal + topBound);
        Quaternion rot = Quaternion.Euler(0f, 180f, 0f);

        var obj = Instantiate(enemyShipPrefab, pos, rot) as GameObject;
    }
}

<Classic/ClassicSpawning_Update_AddShips.cs>

Code sample showing how to add ships using the classic system

Classic battleship
Figure 8. Classic ship prefab. (source: Unity.com Asset store battleships package).

The ship object spawned, along with each of its components, is created in heap memory. The movement script attached accesses the transform component every frame and updates the position, making sure to stay between the bottom and top bounds of the screen. Super simple!

using UnityEngine;

namespace Shooter.Classic
{
    public class Movement : MonoBehaviour
    {
        void Update()
        {
            Vector3 pos = transform.position;
            pos += transform.forward * GameManager.GM.enemySpeed * Time.deltaTime;

            if (pos.z < GameManager.GM.bottomBound)
                pos.z = GameManager.GM.topBound;

            transform.position = pos;
        }
    }
}

<Classic/Classic_Movement.cs>

Code sample showing move behavior

The graphic below shows the profiler tracking 16,500 objects on the screen at once. Not bad, but we can do better! Keep on reading.

The profiler tracking 16,500 objects on 30 FPS screen
Figure 9. After some initializations, the profiler is already tracking 16,500 objects on the screen at 30 FPS.

Classic performance visualization
Figure 10. Classic performance visualization.

Looking at the BehaviorUpdate() method, you can see that it takes 8.67 milliseconds to complete the behavior update for all ships. Also note that this is all happening on the main thread.

In the C# job system, that work is split among all available cores.

2. Classic System Using Jobs

using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

namespace Shooter.JobSystem
{
    [ComputeJobOptimization]
    public struct MovementJob : IJobParallelForTransform
    {
        public float moveSpeed;
        public float topBound;
        public float bottomBound;
        public float deltaTime;

        public void Execute(int index, TransformAccess transform)
        {
            Vector3 pos = transform.position;
            pos += moveSpeed * deltaTime * (transform.rotation * new Vector3(0f, 0f, 1f));
            
            if (pos.z < bottomBound)
                pos.z = topBound;
                
            transform.position = pos;
        }
    }
}

<Jobs/Jobs_Movement_Job.cs>

Sample code showing job movement implementation using the C# Job System

Our new MovementJob script is a struct that implements one of the IJob interface variants. This self-contained structure defines a task, or "job", and the data needed to complete that task. It is this structure that we will schedule with the Job System. For each ship's movement and bounds-checking calculations, you know you need the movement speed, the top bound, bottom bound, and the delta time values. The job has no concept of delta time, so that data must be provided explicitly. The calculation logic itself for the new position is the same as the classic system, although assigning that data back to the original transform must be updated via the TransformAccess parameter since reference types (such as Transform) don't work here. The basic requirements to create a job involve implementing one of the IJob interface variants, such as IJobParallelForTransform in the above example, and implementing the Execute method specific to your job. Once created, this job struct can simply be passed into the Job Scheduler. From there, all of the execution and resulting processing will be completed for you.

To learn more about how this job is structured, let's break down the interface it is using: IJob | ParallelFor | Transform. IJob is the basic interface that all IJob variants inherit from. A Parallel For Loop is a parallel pattern that essentially takes a typical single threaded for loop and splits the body of work into chunks based on index ranges to be operated on within different cores. Last but not least, the Transform keyword indicates that the Execute function to implement will contain the TransformAccess parameter to supply movement data to external Transform references. To conceptualize all of these, think of an array of 800 elements that you iterate over in a regular for loop. What if you had an 8-core system and each core could do the work for 100 entities automagically? A-ha! That's exactly what the system will do.

Using Jobs speeds up the iteration task
Figure 11. Using Jobs speeds up the iteration task significantly.

The Transform keyword on the end of the interface name simply gives us the TransformAccess parameter for our Execute method. For now, just know each ship's individual transform data is passed in for each Execute invocation. Now let's look at the AddShips() and Update() method in our game manager to see how this data is set every frame.

using UnityEngine;
using UnityEngine.Jobs;

namespace Shooter.JobSystem
{
    public class GameManager : MonoBehaviour
    {

        // ...
        // GameManager classic members
        // ...

        TransformAccessArray transforms;
        MovementJob moveJob;
        JobHandle moveHandle;

        
        // ...
        // GameManager code
        // ...
    }
}

<Job/Job_GameManagerVars.cs>

Code sample showing required variables to set up and track jobs

Right away, you notice that you have some new variables that you need to keep track of:

  • TransformAccessArray is the data container that will hold a modified reference to each ship's Transform (job-ready TransformAccess). The normal Transform data type isn't thread-safe, so this is a convenient helper type to set movement related data for your GameObjects.
  • MovementJob is an instance of the job struct we just created. This is what we will be using to configure our job in the job system.
  • JobHandle is the unique identifier for your job that you use to reference your job for various operations, such as verifying completion. You'll receive a handle to your job when you schedule it.
void Update()
{
    moveHandle.Complete();

    if (Input.GetKeyDown("space"))
        AddShips(enemyShipIncremement);

    moveJob = new MovementJob()
    {
        moveSpeed = enemySpeed,
        topBound = topBound,
        bottomBound = bottomBound,
        deltaTime = Time.deltaTime
    };

    moveHandle = moveJob.Schedule(transforms);

    JobHandle.ScheduleBatchedJobs();
}

void AddShips(int amount)
{
    moveHandle.Complete();

    transforms.capacity = transforms.length + amount;

    for (int i = 0; i < amount; i++)
    {
        float xVal = Random.Range(leftBound, rightBound);
        float zVal = Random.Range(0f, 10f);

        Vector3 pos = new Vector3(xVal, 0f, zVal + topBound);
        Quaternion rot = Quaternion.Euler(0f, 180f, 0f);

        var obj = Instantiate(enemyShipPrefab, pos, rot) as GameObject;

        transforms.Add(obj.transform);
    }
}

<Jobs/Jobs_GameManager_Update_addShips.cs>

Code sample showing C# Job System + Classic Update() and AddShips() implementations

Now you need to keep track of your job and make sure that it completes and reschedules with fresh data each frame. The moveHandle.Complete() line above guarantees that the main thread doesn't continue execution until the scheduled job is complete. Using this job handle, the job can be prepared and dispatched again. Once moveHandle.Complete() returns, you can proceed to update your MovementJob with fresh data for the current frame and then schedule the job to run again. While this is a blocking operation, it prevents a new job from being scheduled while the old one is still running. It also prevents new ships from being added while the ships collection is still being iterated on. In a system with many jobs, you may not want to use the Complete() method for that reason.

When you schedule MovementJob at the end of Update(), you also pass it the list of all the transforms to be updated from ships, accessed through the TransformAccessArray. When all jobs have completed setup and schedule, you can dispatch all jobs using the JobHandle.ScheduleBatchedJobs() method.

The AddShips() method is similar to the previous implementation with a few small exceptions. It double-checks that the job has completed in the event the method is called from somewhere else. That shouldn't happen, but better safe than sorry! Also, it saves off a reference to the newly spawned transforms in the TransformAccessArray member. Let's see how the work distribution and performance look.

With the C# Job System, the number of objects on screen nearly doubles
Figure 12. Using the C# Job System, we can nearly double the number of objects on the screen from the classic system in the same frame time (~33 ms).

C# job system + classic Profiler View
Figure 13. C# job system + classic Profiler View.

Now you can see that the Movement and UpdateBoundingVolumes jobs are taking about 4 ms per frame. Much better! Also note that there are nearly double the number of ships on the screen as the classic system!

We can still do better, however. This current method is still limited by a few things:

  • GameObject instantiation is a lengthy process involving system calls for memory allocation.
  • Transforms are still allocated in a random location in the heap.
  • Transforms still contain unused data, polluting cache lines and making memory access less efficient.

3. Entity Component System Using Jobs

This is where things get just a little bit more complex, but once you understand it you'll know it forever. Let's tackle this by looking at our new enemy ship prefab first:

Entity Component System ship prefab.
Figure 14. C# job system + Entity Component System ship prefab.

You'll probably notice a few new things. One, there are no built-in Unity components attached, aside from the Transform component (which isn't used). This prefab now represents a template that we will use to generate entities rather than a GameObject with components. The idea of a prefab doesn't exactly apply to the new system in the same way you are used to. You can look at it as a convenient container of data for your entity. This could all be done purely in script as well. You also now have a GameObjectEntity.cs script attached to the prefab. This required component signifies that this GameObject will be treated like an entity and use the new entity component system. You see that the object now also contains a RotationComponent, a PositionComponent, and a MoveSpeedComponent. Standard components such as position and rotation are built-in and don't need to be explicitly created, but MoveSpeed does. On top of that, we have a MeshInstanceRendererComponent, which exposes a public member: a material reference that supports GPU instancing, as required by the new entity component system. Let's see how these tie into the new system.

using System;
using Unity.Entities;

namespace Shooter.ECS
{
    [Serializable]
    public struct MoveSpeed : IComponentData
    {
        public float Value;
    }

    public class MoveSpeedComponent : ComponentDataWrapper<MoveSpeed> { }
}

<ECS/ECS_MoveSpeed.cs>

Code sample showing how to set up MoveSpeed data (IComponentData) for the Entity Component System

When you open one of these data scripts, you see that each structure inherits from IComponentData. This flags the data as a type to be used and tracked by the entity component system and allows the data to be allocated and packed in a smart way behind the scenes while you get to focus purely on your gameplay code. The ComponentDataWrapper class allows you to expose this data to the inspector window of the prefab it's attached to. You can see that the data you've associated with this prefab represents only the parts of the Transform component required for basic movement (position and rotation) and the movement speed. This is a clue that you won't be using Transform components in this new workflow.

Let's now look at the new version of the GameplayManager script:

using Unity.Collections;
using Unity.Entities;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    public class GameManager : MonoBehaviour
    {
        EntityManager manager;
        

        void Start()
        {
            manager = World.Active.GetOrCreateManager<EntityManager>();
            AddShips(enemyShipCount);
        }

        void Update()
        {
            if (Input.GetKeyDown("space"))
                AddShips(enemyShipIncremement);
        }

        void AddShips(int amount)
        {
            NativeArray<Entity> entities = new NativeArray<Entity>(amount, Allocator.Temp);
            manager.Instantiate(enemyShipPrefab, entities);

            for (int i = 0; i < amount; i++)
            {
                float xVal = Random.Range(leftBound, rightBound);
                float zVal = Random.Range(0f, 10f);
                manager.SetComponentData(entities[i], new Position { Value = new float3(xVal, 0f, topBound + zVal) });
                manager.SetComponentData(entities[i], new Rotation { Value = new quaternion(0, 1, 0, 0) });
                manager.SetComponentData(entities[i], new MoveSpeed { Value = enemySpeed });
            }
            entities.Dispose();
        }
    }
}

<ECS/ECS_GameManager.cs>

Code sample showing C# Job System + Entity Component System Update() and AddShips() implementations

We've made a few changes to this script so it works with the entity component system. Notice you now have an EntityManager variable. You can think of this as your conduit for creating, updating, or destroying entities. You'll also notice the NativeArray<Entity> type constructed with the number of ships to spawn. The manager's Instantiate method takes a GameObject parameter and the NativeArray<Entity>, whose length specifies how many entities to instantiate. The GameObject passed in must contain the previously mentioned GameObjectEntity script along with any needed component data. The EntityManager creates entities based on the data components on the prefab while never actually creating or using any GameObjects.

After you create entities, iterate through all of them and set each new instance's starting data. This example sets the starting position, rotation, and movement speed. Once that's done, the new data containers, while secure and powerful, must be freed to prevent memory leaks. The movement system can now take over the show.

using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    public class MovementSystem : JobComponentSystem 
	{
        [ComputeJobOptimization]
        struct MovementJob : IJobProcessComponentData<Position, Rotation, MoveSpeed>
        {
            public float topBound;
            public float bottomBound;
            public float deltaTime;

            public void Execute(ref Position position, [ReadOnly] ref Rotation rotation, [ReadOnly] ref MoveSpeed speed)
            {
                float3 value = position.Value;

                value += deltaTime * speed.Value * math.forward(rotation.Value);

                if (value.z < bottomBound)
                    value.z = topBound;

                position.Value = value;
            }
        }

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            MovementJob moveJob = new MovementJob
            {
                topBound = GameManager.GM.topBound,
                bottomBound = GameManager.GM.bottomBound,
                deltaTime = Time.deltaTime
            };

            JobHandle moveHandle = moveJob.Schedule(this, 64, inputDeps);

            return moveHandle;
        }
    }
}

<ECS/ECS_MovementSystem.cs>

Code sample showing C# Job System + Entity Component MovementSystem implementation

Here's the meat and potatoes of the demo. Once entities are set up, you can isolate all relevant movement work to your new MovementSystem. Let's cover each new concept from the top of the sample code to the bottom. 

The MovementSystem class inherits from JobComponentSystem. This base class gives you the callbacks you need to implement, such as OnUpdate(), to keep all of the system-related code self-contained. Instead of having an uber-GameplayManager.cs, you can perform system-specific updates in this neat package. The idea of JobComponentSystem is to keep all data and lifecycle management contained in one place.

<ECS/ECS_MovementJobStruct.cs>

The MovementJob structure encapsulates all information needed for your job, including the per-instance data, fed in via parameters in the Execute function, and per-job data via member variables that are refreshed via OnUpdate(). Notice that all per-instance data is marked with the [ReadOnly] attribute except the position parameter. That is because in this example we are only updating the position each frame. The rotation and movement speed of each ship entity is fixed for its lifetime. The actual Execute function contains the code that operates on all of the required data.

You may be wondering how all of the position, rotation, and movement speed data is being fed into the Execute function invocations. This happens automatically for you behind the scenes. The entity component system is smart enough to automatically filter and inject data for all entities that contain the IComponentData types specified as template parameters to IJobProcessComponentData.

using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    public class MovementSystem : JobComponentSystem 
	{

        // ...
        // Movement Job
        // ...

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            MovementJob moveJob = new MovementJob
            {
                topBound = GameManager.GM.topBound,
                bottomBound = GameManager.GM.bottomBound,
                deltaTime = Time.deltaTime
            };

            JobHandle moveHandle = moveJob.Schedule(this, 64, inputDeps);

            return moveHandle;
        }
    }
}

<ECS/ECS_MovementOnUpdate.cs>

 Code sample showing C# Job System OnUpdate() method implementation

The OnUpdate() method below MovementJob is also new. This is a virtual function provided by JobComponentSystem so that you can more easily organize per-frame setup and scheduling within the same script. All it's doing here is:

  • Setting up the MovementJob data to use freshly injected ComponentDataArrays (per-entity-instance data)
  • Setting up per-frame data (time and bounds)
  • Scheduling the job

Voila! Our job is set up and completely self-contained. The OnUpdate() function will not be called until you first Instantiate entities containing this specific group of data components. If you decided to add some asteroids with the same movement behavior, all you would need to do is add those three same Component scripts containing the data types on the representative GameObject that you instantiate. The important thing to know here is that the MovementSystem doesn't care what the entity is that it's operating on. It only cares if the entity contains the types of data it cares about. There are also mechanisms available to help control life cycle.
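
As a hedged illustration of that point, the hypothetical spawner below (our own sketch, not part of the demo project; the asteroid prefab and its settings are assumptions) instantiates entities from a prefab that carries the same three data components, so the existing MovementSystem would pick them up without any changes:

using Unity.Collections;
using Unity.Entities;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    // Hypothetical example: MovementSystem only cares that an entity has
    // Position, Rotation, and MoveSpeed, not what the entity represents.
    public class AsteroidSpawner : MonoBehaviour
    {
        public GameObject asteroidPrefab; // assumed to carry GameObjectEntity plus the three data components
        public float asteroidSpeed = 2f;
        public int asteroidCount = 10;

        void Start()
        {
            EntityManager manager = World.Active.GetOrCreateManager<EntityManager>();

            NativeArray<Entity> asteroids = new NativeArray<Entity>(asteroidCount, Allocator.Temp);
            manager.Instantiate(asteroidPrefab, asteroids);

            for (int i = 0; i < asteroids.Length; i++)
            {
                // Set starting data; MovementJob will move these exactly like the ships.
                manager.SetComponentData(asteroids[i], new Position { Value = new float3(i * 2f, 0f, 20f) });
                manager.SetComponentData(asteroids[i], new Rotation { Value = new quaternion(0, 1, 0, 0) });
                manager.SetComponentData(asteroids[i], new MoveSpeed { Value = asteroidSpeed });
            }
            asteroids.Dispose();
        }
    }
}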

With FPS ~33 ms can be used 91,000 objects
Figure 15. Running at the same frame time of ~33 ms, we can now have 91,000 objects on screen at once using the entity component system.

The available CPU time tracks more objects
Figure 16. With no dependencies on classic systems, the entity component system can use the available CPU time to track and update more objects.

As you can see in the profiler window above, you've now lost the transform update method that was taking up quite a bit of time on the main thread in the C# job system and Classic combo section shown above. This is because we are completely bypassing the TransformAccessArray conduit we had previously, directly updating position and rotation information in MovementJob, and then explicitly constructing our own matrix for rendering. This means there is no need to write back to a traditional Transform component. Oh yeah, and we've forgotten about one tiny detail: the Burst compiler.

Burst Compiler

Now we'll take exactly the same scene and do absolutely nothing to the code beyond keeping the [ComputeJobOptimization] attribute above our job structure, which allows the Burst compiler to pick up the job, and we'll get all these benefits. Just make sure that the Use Burst Jobs setting is selected in the Jobs dropdown window shown below.

The dropdown allows the use of Burst Jobs
Figure 17. The dropdown allowing the use of Burst Jobs.

 Burst Jobs allows 150,000 objects at once
Figure 18. By simply allowing Burst Jobs to optimize jobs with the [ComputeJobOptimization] attribute, we go from 91,000 objects on screen at once up to 150,000 with much higher potential.

Time to complete MovementJob from 25 to 2 ms
Figure 19. In this simple example, the total time to complete all MovementJob and UpdateRotTransTransform tasks went from 25 ms down to only 2 ms. We can now see that the bottleneck has shifted from the CPU to the GPU, as the cost of rendering all of these tiny ships on the GPU now outweighs the cost of tracking, updating, and render command generation/dispatch on the CPU side.

As you can see from the screenshot, we've got 59,000 more entities on screen at the same exact frame rate. For FREE. That's because the Burst compiler was able to perform some arcane magic on the code in the Execute() function, leveraging new tightly packed data layout and the latest architecture enhancements available on modern CPUs behind the scenes. As mentioned above, this arcane magic actually takes the form of auto-vectorization, optimized scheduling, and better use of instruction-level parallelism to reduce data dependencies and stalls in the pipeline.

Conclusion

Take a few days to soak in all of these new concepts and they'll pay dividends on subsequent projects. The performance gains reaped from these new systems are a currency that can be spent or saved.

Table 1. Optimizations resulted in significant improvements, such as the number of objects supported on the screen and update costs.

|  | Classic | C# Job System + Classic | C# Job System + Entity Component System (Burst Off) | C# Job System + Entity Component System (Burst On) |
| Total Frame Time | ~33 ms/frame | ~33 ms/frame | ~33 ms/frame | ~33 ms/frame |
| # Objects on Screen | 16,500 | 28,000 | 91,000 | 150,000+ |
| MovementJob Time Cost | ~2.5 ms/frame | ~4 ms/frame | ~4 ms/frame | ~<0.5 ms/frame |
| CPU Rendering Time Cost to Draw All Ships | 10 ms/frame | 18.76 ms/frame | 18.92 ms job to calculate rendering matrices + 3 ms rendering commands = 21.92 ms/frame | ~4.5 ms job to calculate rendering matrices + 4.27 ms rendering commands = 8.77 ms/frame |
| Time GPU Bound | ~0 ms/frame | ~0 ms/frame | ~0 ms/frame | ~15.3 ms/frame |

If you're targeting a mobile platform and want to significantly reduce the battery consumption factor in player retention, just take the gains and save them. If you're making a high-end desktop experience catering to the PC master race, use those gains to do something special with cutting-edge simulation or destruction tech to make your environments more dynamic, interactable, and immersive. Stand on the shoulders of giants and leverage this revolutionary new tech to do something previously claimed impossible in real time, then put it on the asset store so I can use it.

Thanks for reading. Stay tuned for more samples from Unity—watch this space!

Resources

Unity

Unity Entity Component System Documentation

Unity Entity Component System Samples

Unity Entity Component System Forums

Unity Documentation

Learning about efficient memory layout

Ship Asset Used in Demo

Convert SPIR-V to Intel® SPMD Program Compiler (Intel® SPC)


There is a growing trend within the games industry to move compute work to the graphics processing unit (GPU), resulting in engines and/or studios developing large portfolios of GPU compute shaders for many different compute tasks. However, there are times when it would be convenient to run those compute shaders on the CPU without having to re-invest in developing C/C++ variants of them. There are many reasons for doing this, including simple experimentation and debugging, utilizing spare CPU cycles, encouraging CPU-based content scaling, CPU-based interaction with other CPU-side game assets, deterministic consistency of results, and so on.

To help address this opportunity while also utilizing the single instruction, multiple data (SIMD) vector units built into modern CPU cores, we have started developing a prototype translator, based on the open source Khronos* SPIRV-Cross project, that will take Standard Portable Intermediate Representation (SPIR-V1) as input and produce Intel® SPMD Program Compiler (Intel® SPC) kernels as output. ISPC takes C-style kernels and generates highly vectorized CPU object files targeting multiple ISAs such as Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512).

The project should be considered as a starting point for conversion to Intel® SPC rather than a fully featured and performant solution. The project currently supports a subset of the standard SPIR-V intrinsics, built-ins and types, but it was designed to utilize a core performance feature of Intel® SPC, which is the notion of uniform (scalar) or varying (vector) variables. This allows optimizations such as avoiding expensive and divergent vector branches if the test condition is scalar.
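
As a rough illustration of why that matters, here is a small hand-written ISPC kernel (our own sketch, not output of the translator): the uniform values are single scalars shared by the whole gang of program instances, the varying value holds one lane per instance, and branching on the uniform flag compiles to a single scalar test rather than a divergent per-lane mask.

// Illustrative ISPC kernel, not generated by the SPIR-V translator.
export void scale_if_enabled(uniform float vin[], uniform float vout[],
                             uniform int count, uniform bool enabled)
{
    foreach (i = 0 ... count)
    {
        varying float v = vin[i];   // one value per SIMD lane
        if (enabled)                // uniform condition: no lane divergence
            v = v * 2.0;
        vout[i] = v;
    }
}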

The code has been tested on a handful of shaders, such as the compute examples in Sascha Willems’ Vulkan* repository and the particle system compute shaders in Microsoft DirectX* 12 MiniEngine sample.

The code can be downloaded from our GitHub* repository and has currently only been tested on Windows* systems. The GitHub Readme contains more detailed documentation about the implementation and supported features.

Usage

glslangValidator.exe -H -V -o test.spv test.comp

spirv-cross.exe --ispc --output test.ispc test.spv

ispc.exe -O2 test.ispc -o test.ispc.obj -h test.ispc.h --target=avx2 --opt=fast-math

Example API Usage

ispc::raytracing_get_workgroup_size(workgroupSize[0], workgroupSize[1], workgroupSize[2]);

int32_t dispatch[3] = { textureComputeTarget.width / workgroupSize[0], textureComputeTarget.height / workgroupSize[1], 1 };
int32_t dispatch_count = dispatch[0] * dispatch[1];

concurrency::parallel_for<uint32_t>(0, dispatch_count, [&](uint32_t dispatchID)
{
    int32_t workgroupID[3] = { dispatchID % dispatch[0], dispatchID / dispatch[0], 0 };
    ispc::raytracing_dispatch_single(workgroupID, dispatch, planeCount, *pPlanes, sphereCount, *pSpheres, *pUBO, resultImage);
});

This project is open sourced under the original SPIRV-Cross Apache 2.0 license and we welcome any comments and contributions.

Further information on using ISPC in games can be found in the article, Use the Intel® SPMD Program Compiler for CPU Vectorization in Games.

Footnote

1. SPIR-V is the default shader language for Vulkan and can be generated from OpenGL* Shading Language (GLSL) and High-Level Shading Language (HLSL) shaders by tools such as the glslangValidator and shaderc compilers.

 

Code Sample: New Unity* Entity Component, C# Job System, and Burst Compiler


File(s): Download
License: Intel Sample Source Code License Agreement
Optimized for...
OS: Windows® 10 (64 bit)
Hardware: N/A
Software (Programming Language, tool, IDE, Framework): C#, Microsoft Visual Studio* 2015, Unity
Prerequisites: Familiarity with Microsoft Visual Studio, 3D graphics, parallel processing.
Tutorial: Game Dev: Getting Started with the New Unity* Entity Component, C# Job System, and Burst Compiler

Introduction

Low, medium, and high. Standard fare for GPU settings, but why not CPU settings, too? Today the potential power of the CPU on your end users’ machines can vary wildly. Typically, developers will define a CPU min-spec, implement the simulation and gameplay systems using that performance target, and then call it a day. This leaves the many potentially available cores and features built into modern mainstream CPUs sitting idle on the sideline. The new C# job system and entity component system from Unity* don’t just allow you to easily leverage previously unused CPU resources, they will also help run all your game code more efficiently in general. Then you can use those extra CPU resources to add more scene dynamism and immersion. In this article, you’ll see how to quickly get started learning these new features.

Unity is attacking two important performance problems for computing in game engines. The first problem under assault is inefficient data layout. Unity’s Entity Component System (ECS) improves management of data storage for high-performance operations on those structures. The second problem is the lack of a high-performance job language and SIMD vectorization that can operate on that well-organized data. Unity’s new C# job system, entity component system and Burst compiler technology leave those shortcomings in the dust.

The Unity entity component system and C# job system are two different things, but they go hand-in-hand. To get to know them, let’s look at the current Unity workflow for creating an object in your scene, and then differentiate from there.

Get Started

Refer to the tutorial link above.

References

Unity
Unity Entity Component System Documentation
Unity Entity Component System Samples
Unity Entity Component System Forums
Unity Documentation
Learning about efficient memory layout
Ship Asset Used in Demo

Update Log

Created May 17, 2018

Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs


Introduction

The purpose of this document is to help developers speed up the execution of programs that use popular deep learning frameworks in the background. We have observed situations where deep learning code, with default settings, does not take advantage of the full compute capability of the underlying machine on which it runs. This is especially common when the code runs on Intel® Xeon® processors.

Optimization

The primary goal of the performance optimization tips given in this section is to make use of all the cores available in the machine. Intel® DevCloud consists of Intel® Xeon® Gold 6128 processors.

Assume that the number of cores per socket in the machine is denoted as NUM_PARALLEL_EXEC_UNITS. On the Intel DevCloud, assign NUM_PARALLEL_EXEC_UNITS to 6.
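
If you prefer to derive the value rather than hard-code it, the short sketch below is one way to do so (our own example; it assumes the psutil package is installed and, as on the two-socket machine described above, divides the physical core count by the number of sockets):

import psutil

physical_cores = psutil.cpu_count(logical=False)  # e.g., 12 on the machine described above
NUM_PARALLEL_EXEC_UNITS = physical_cores // 2     # assumption: 2 sockets, so 6 cores per socket

print(NUM_PARALLEL_EXEC_UNITS)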

TensorFlow

To get the best performance from a machine, change the parallelism threads and OpenMP* settings as below:

import os

import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

os.environ["OMP_NUM_THREADS"] = str(NUM_PARALLEL_EXEC_UNITS)

os.environ["KMP_BLOCKTIME"] = "30"

os.environ["KMP_SETTINGS"] = "1"

os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

Keras with TensorFlow Backend

To get the best performance from a machine, change the parallelism threads and OpenMP settings as below:

import os

from keras import backend as K

import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

K.set_session(session)

os.environ["OMP_NUM_THREADS"] = str(NUM_PARALLEL_EXEC_UNITS)

os.environ["KMP_BLOCKTIME"] = "30"

os.environ["KMP_SETTINGS"] = "1"

os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

Caffe

To get the best performance from the underlying machine, change the OpenMP settings as below:

export OMP_NUM_THREADS=6    # NUM_PARALLEL_EXEC_UNITS

export KMP_AFFINITY=granularity=fine,verbose,compact,1,0

In general:

export OMP_NUM_THREADS=<number of threads to use>

export KMP_AFFINITY=<your affinity settings of choice>

For example:

KMP_AFFINITY=granularity=fine,balanced

KMP_AFFINITY=granularity=fine,compact

Conclusion

Even though we have observed a speedup in most cases, please note that performance is largely code-dependent and there can be multiple other factors that affect it. A good profiling tool like Intel® VTune™ Amplifier can help you dig deeper and analyze performance problems.

Author

Anju Paul is a Technical Solutions Engineer working on behalf of the Intel® AI Academia program.

 

 


The 5G Network Transformation


Overview

Internet traffic has undergone tremendous growth over the years and shows no signs of slowing down. For example, in Staying Connected in 2017: Our Predictions, AT&T* reports that the traffic on their network has grown 250,000 percent since 2007. People are adding new devices to their homes, and new data-hungry applications are being developed for work, connectivity, entertainment, gaming and more. In addition to the amount of data required, many applications are also latency-sensitive. This means networks have to handle large volumes of data faster than ever, and without added cost to the end user or subscriber.

Enter 5G

With 5G, a user will be able to download a high-definition video in under a second (a task that could take up to 10 minutes on 4G LTE). 5G networks will boost the development of other new technologies, such as autonomous vehicles, virtual reality, smart agriculture, remote emergency and medical services, and the Internet of Things (IoT).

conceptual representation of 5g and data economy interaction
Figure 1. 5G is critical to a new data economy.

In addition to being a dramatically better mobile broadband system, 5G is an innovation platform for services, applications, and connected devices.

Connecting the World

According to Introducing OpenCellular: An open source wireless access platform, at the end of 2015 approximately half the world's population did not have internet access. The OpenCellular Project, founded by Facebook*, is designed to support a range of communication options, from a network in a box to an access point supporting everything from 2G to LTE. Facebook plans to open-source the hardware design, along with necessary firmware and control software, to enable telecom operators, entrepreneurs, OEMs, and researchers to locally build, implement, deploy, and operate wireless infrastructure based on this platform.

This project empowers the developer community to contribute to the goal of getting to 100 percent connectivity in 5G. To be successful, new 5G technology must be designed to connect with legacy 2G, 3G, 4G LTE, Wi-Fi* and wired networks. Implementing the new generation of networks in this way also improves operational efficiency for the whole network, and will help operators bring down costs even for existing users, allowing them to remain competitive and grow.

The Pillars of 5G Wireless Technology

This section draws on Everything You Need to Know About 5G, by Amy Nordrum, Kristen Clark, and IEEE Spectrum* staff.

The five pillars below are the foundation of 5G technology.

Millimeter wave

Current networks use the 3 kHz to 6 GHz spectrum, which is getting crowded due to the explosion of data from smart phones and other connected devices. Next-generation technologies will use the 30-300 GHz spectrum, known as millimeter wave or mmWave, available for mobile broadband communications for the first time. The associated leap in performance can deliver fiber-like speeds, without the wires.

Small cells

Millimeter waves cannot travel through buildings and they can be absorbed by plants and rain. This is why the current technology of big base stations broadcasting their signals over long distances will not work in 5G. To solve this problem, 5G will use thousands of low-powered mini base stations.

Massive MIMO

Advanced multiple input multiple output (MIMO) antenna technology, including adaptive analog beamforming and beam tracking/steering techniques, can increase data rate, coverage, and capacity at base stations and within devices.

Beamforming

It's like a traffic signaling system for cellular signals, allowing data to be sent by the base station to a specific user, instead of broadcasting in all directions, hoping the user will receive it. This precision prevents interference and is much more efficient than the current technology, enabling base stations to handle a larger number of incoming and outgoing data streams simultaneously. The base station uses the direction of the source stream to calculate where the user device is located and determine where to send the stream.

Full duplex

Today's base station transceivers can't simultaneously send and receive signals on the same frequency. 5G transceivers support full duplex transmission, which enables sending and receiving signals on the same frequency at the same time. An alternate solution is to time-division the signal, meaning the incoming and outgoing data are sent on the same frequency, interleaved in a known pattern, so the receiving base station can differentiate between the two. Researchers have designed high-speed silicon transistor switches to halt the backward roll of these signals so both signals can be sent at the same time, improving the spectral efficiency of the signals.

5G and Network Transformation

The previous sections illustrate how 5G is a fundamentally different technology in terms of how the physical layer of the technology works, how it interacts with legacy technologies, and the use cases themselves. Now let us talk about the requirements and challenges of 5G and how they can be addressed by various network elements.

Challenge 1

Network hardware elements (user equipment, modems, antennas, etc.) operating at the physical layer need to work at a much higher speed, support greater bandwidth, and be backward-compatible.

Solution 1

The 5G standard has been defined such that, other than the physical layer, it is backward-compatible. Though the actual bits and bytes travel on different frequency bands and need different modems, the upper-layer protocols do not change much. For this reason, most of the 5G software stack remains unchanged; however, it does need to support higher bandwidth and speeds, a challenge that will be met by a combination of software and hardware architecture designs.

Challenge 2

Network elements must be scalable and agile so that services can be brought up and down fast, as required by user-generated demand.

Solution 2

Software-Defined Networking (SDN) and Network Function Virtualization (NFV) will play a key role, since these technologies will enable network functions to be modularized and to run commodity-based servers. These servers, sitting in service providers' data centers and at the edge, can spin network services up and down on-the-go to meet user demands.

  • Virtual network functions (VNFs) hosted in containers and microservices will help reduce the footprint of network services. Compared to VM-based virtual network functions, they cost less to operate and are faster to bring up and down. Security concerns related to containerization can be addressed by using secure Kata containers as required by the network.
  • Orchestration frameworks will play a critical role, since automation and interworking of network functions across industries and vendors will be the key to scale a 5G network. Various open source projects like OpenStack*, OPNFV*, ONAP*, and Open Baton* are working in this direction.

Challenge 3

Networks need to be flexible without compromising on throughput and bandwidth requirements. Different use cases need to be optimized for different network service level agreements (SLAs). For example, use cases like automated driving and remote medicine are extremely latency-sensitive, while applications like gaming, augmented reality (AR), and virtual reality (VR) demand high bandwidth and low latency. A smart agricultural application has massive bandwidth demand due to scale but is not latency-sensitive. If these different scenarios are to be serviced by the same network, the network must classify the packets as belonging to a particular group, or network slice, and process them according to a set of rules. This requires extensive traffic classification and scheduling to be implemented in all the network nodes through which the traffic passes. This implementation needs to be flexible enough that traffic classes and slices can be defined dynamically and are not tied to what has been preprogrammed in the hardware.

Service assurance: 5G's stringent requirements leave no room for error in terms of how network service classes are handled. If a service class (see Figure 2 for the different service classes) is guaranteed to meet an SLA that requires sub-microsecond latency, the underlying network infrastructure has to reserve resources to make that happen. It is vital to monitor systems for utilization and malfunctions, in order to prevent service disruptions or to facilitate the prompt resumption of normal service. Today, monitoring and management activities throughout the network are supported by discrete systems in fixed service chains with tightly integrated hardware and software products as well as established management frameworks and assurance tools. In a virtualized environment based on NFV, these activities are more challenging as a result of the disaggregation of hardware and software and the ability to deploy services dynamically.

infographic depicts cloud services matched to different needs
Figure 2. Matching cloud services to diverse delivery needs.

Solution 3

Using a well-designed software stack for the core and edge – supported by hardware designed to have the flexibility needed to enable traffic to be classified, sliced, and monitored – is the best way to transform networks and make them ready for 5G. One of the biggest challenges in the networking industry for moving to a more software-based solution has been the fact that these networks have to support legacy devices and be backward-compatible. With 5G, new networks are being deployed, creating the opportunity for a flexible and agile green field deployment.

Intel's 5G Solutions

Intel develops leading network technology and building blocks such as silicon, software, connectivity, memory, and integrated solutions to address the demands of next-generation networks. These solutions provide both the flexibility and scalability needed to build, utilize, and optimize tomorrow's network.

It all starts with Intel® Architecture (IA), which provides the performance and scalability necessary to keep up with today's demanding network requirements. Our roadmap of processors starts with Intel Atom® processors and scales to our leading Intel® Xeon® processor, which is purpose-built for the cloud and offers the most advanced foundation for software-defined infrastructure. With a tool chain that allows seamless migration across the IA roadmap, developers and network administrators can run their software on a single architecture – IA.

SDN/NFV

The network of tomorrow will be deployed using SDN and NFV. Instead of running a separate router, VPN, and firewall on three different pieces of hardware, you can run all of them on the same IA-based infrastructure. SDN provides an intelligent network and orchestration software that enables you to swap out hardware without requiring software reconfiguration. This will provide tremendous value to service providers in terms of driving network scale and agility while reducing capital expenditure (CAPEX) and operating expenditure (OPEX).

Visual cloud

Visual computing is exploding. Studies show that video will account for more than 79 percent of traffic traversing the network by 2020. Use cases include AR and VR, video on demand, live streaming, video surveillance, autonomous driving, medical imaging, 3D modeling, and computer/robotic vision. Intel is democratizing the creation and delivery of these compelling visual experiences by incorporating visual compute IP into our Intel Xeon processors with Intel® Graphics Technology. Intel® Quick Sync Video uses the dedicated media processing capabilities of Intel Graphics Technology to decode and encode quickly, while also enabling the processor to perform other tasks for maximum performance.

Infrastructure

Although industry standards are in the planning stage and actual deployments will not occur until 2019 or 2020, there is a lot of buzz about 5G. Telecom operators and equipment manufacturers are becoming 5G ready now. There will be incremental steps to get there, including the continual expansion of LTE and LTE-A. Intel has an end-to-end story (see Figure 3) for both consumers and businesses from devices to access to the core and cloud.

infographic depicts wide range of use for 5G
Figure 3. 5G end-to-end solution from smart user devices to core and cloud.

Building blocks

For 5G infrastructure, building blocks include FlexRAN, a vRAN software reference platform, and Multi-access Edge Computing (MEC), with products that can be deployed today to provide lower latency and more connectivity. These components will ultimately provide a best-in-class user experience.

FlexRAN

Wireless base stations, like most network nodes, have traditionally been vertically integrated boxes. FlexRAN (see Figure 4) is a reference architecture developed by Intel to implement software-based base stations that can sit on any part of the wireless network, from edge to core.

colorful design for a software base station
Figure 4. FlexRAN: A reference design for a software base station

Multi Access Edge Computing (MEC)

MEC implements software services close to the user to meet the low-latency and high-bandwidth requirements of newer networks (4G, 4G LTE, 5G). It opens the door to new types of applications that can use information such as real-time radio network conditions and location awareness. MEC unlocks the network to a new ecosystem of services at the network edge.

Intel offers the Network Edge Virtualization (NEV) SDK platform with standard APIs and interfaces for developers and content providers. The NEV SDK is part of the open source Akraino project.

infographic depicts Multi Access Edge Computing
Figure 5. Multi Access Edge Computing (MEC)

What's in it for Developers?

Analysts predict a USD 12 trillion opportunity for 5G-related goods and services available globally in 2035. From creating innovative applications and services at the edge, to building a new SDN/NFV infrastructure for the datacenter and cloud, or connecting the world through initiatives like Facebook's OpenCellular Project, developers will be at the heart of the 5G transformation. We at Intel look forward to making the journey with you.

About the Author

Sujata Tibrewala is an Intel community development manager and technology evangelist who defines programs and training events to ensure that the network developer ecosystem works together toward a common goal: to drive SDN/NFV adoption in the industry using open source ingredients such as DPDK, FD.io, Tungsten Fabric, Open vSwitch, OpenStack, ONAP, and more. Sujata has worked at several companies, including Cisco, Agere, Ericsson, Avaya, and Brocade, leading all phases of diverse software technology projects such as an SDN OpenFlow implementation, TCP/IP/Ethernet/VLAN forwarding software development on Cisco switches, and network processor and cloud deployments using virtualization technologies. She has a master's degree from IISc Bangalore and a bachelor's degree from IIT Kharagpur, and has completed an Executive Women Leadership Program at Stanford.

 

How to Build a Custom Audio Editor with Unreal Engine* for Sound Spatialization in VR


woman in a virtual environment

Overview

Unreal Engine* from Epic Games has a powerful virtual reality (VR) editor option, but something they did not include is the ability to edit and place sounds while inside VR. It can be troublesome to have to constantly restart the editor after adjusting a sound to test what it sounds like in VR. So we decided to create a sound editor that allows game developers and sound designers to quickly place, edit, and test spatialized sound inside VR. This will prevent the user from having to constantly enter and exit the editor.

woman in a virtual environment

System requirement

  • Unreal Engine 4.18.1 or greater
  • Visual Studio* 2017
  • HTC Vive*

What you'll learn

  • Motion controller interaction
  • How to create a custom C++ class
  • VR UI
  • Saving editor changes
  • Sound spatialization parameters

Below, we will walk you through step-by-step to demonstrate how we made this custom audio editor tool for Unreal Engine from start to finish:

Project link download

Before we begin, you need to do a couple of things. Download and unzip the project folder. You also need to make sure you have at least version 4.18.1 of Unreal Engine installed.

When you have downloaded and unzipped the folder, right-click on Intel_VR_Audio_Tools.uproject and select "Generate Visual Studio project files." After that completes, open the project. A popup that says "Missing Intel_VR_Audio_Tools Modules" will appear. Click "Yes" to start the rebuild; this should take less than 20 seconds. This is needed because of how you are dynamically finding .wav files that have been added to the project, which will be explained in the Custom C++ Class section.

Setting Up the VR Player

We start with Unreal's VR template and choose the MotionControllerPawn as our pawn, which has motion control set up and allows movement by teleporting.

Motion Controller Interaction

Before the motion controller can interact with 3D widgets, a WidgetInteraction component needs to be added to BP_MotionController, which is located in the VirtualRealityBP folder. We also need to add a scene component for the sound selector widget, called soundScene.

widget, motion controller options screenshot

Press and Release Pointer Key events are attached to the event called when the right trigger is pulled. These need to be added to the MotionControllerPawn, which is also located in the VirtualRealityBP folder.

screenshot of widget interaction with right controller

Custom C++ Class

The reason for rebuilding the project is that, while making this tutorial, knowing the names and locations of the sounds and dynamically updating a widget to match all those files appeared daunting. Luckily, Unreal Engine has some tools to help us out.

The IntelSoundComponent is a C++ class that can be added to any blueprint to dynamically locate and load a .wav file into a USoundWave, which is how Unreal loads a sound file.

First, we right-click in the content browser and create a new C++ class named IntelSoundComponent. This action creates an IntelSoundComponent.cpp file and an IntelSoundComponent.h file.

Next, we add some includes which are needed to locate and manage files.

Includes added in IntelSoundComponent.cpp are Paths.h, FileManager.h, and Runtime/Engine/Classes/Sound/SoundWave.h (which for some reason needs the full path before SoundWave.h).

#include "IntelSoundComponent.h"
#include "Paths.h"
#include "FileManager.h"
#include "Runtime/Engine/Classes/Sound/SoundWave.h"

bool exists;
FString dir, soundDir;
TArray<FString> soundFiles;

// Sets default values for this component's properties
UIntelSoundComponent::UIntelSoundComponent()
{
	// Set this component to be initialized when the game starts, and to be ticked every frame.  You can turn these features
	// off to improve performance if you don't need them.
	PrimaryComponentTick.bCanEverTick = true;
	
	//Empty soundFiles TArray. Easiest way if new wave files are added.
	soundFiles.Empty();
	
	//the way Unreal Engine calls the project's root directory
	dir = FPaths::ProjectDir();

	//Combining Root with the folder location for the sounds. 
	//This could probably be an external folder if needed with the help of ( IPlatformFile& PlatformFile = FPlatformFileManager::Get().GetPlatformFile(); )
	soundDir = dir + "Content/Sounds";

	//UE4 returns a bool if the directory exists or not.
	exists = FPaths::DirectoryExists(soundDir);
	
}

Code block. IntelSoundComponent.cpp

Now we create a bool named exists, two FString variables named dir and soundDir, and a TArray of FStrings named soundFiles. Since soundFiles is a TArray, we are able to call soundFiles.Empty(), which empties the TArray. We believe it's also the fastest way if new wave files are added. Then, we set FString dir to FPaths::ProjectDir() (which gives the root location of the project). Next, we set FString soundDir to dir + "Content/Sounds" because that is the folder we put our .wav files into. FPaths has another method that can check if a directory exists, so we set our bool to exists = FPaths::DirectoryExists(soundDir);.

// Called when the game starts
void UIntelSoundComponent::BeginPlay()
{
	Super::BeginPlay();
	
	//UE4 way of managing files
	IFileManager &fileManager = IFileManager::Get();
	
	//UE_LOG(LogTemp, Warning, TEXT("%s"), &fileManager);

	
	if (exists == true){

		//Extensions to sound files. Was using .wav, but .uasset seems to work when there is and isn't an editor.
		FString ext = "/*.wav";
		FString ext2 = "/*.uasset";
		
		//path = FPaths::ProjectDir() + Content/Sounds + /*.uasset
		FString path = soundDir + ext2;

		//This finds file in the given array, with the given path 
		//the true bool is saying to look for files while false bool is saying to not look for directories
		fileManager.FindFiles(soundFiles, *path, true, false);

Code block. IntelSoundComponent.cpp

On BeginPlay() we start by getting a reference to IFileManager with IFileManager &fileManager = IFileManager::Get();. We use it to check whether the sound files are being found with fileManager.FindFiles, which searches for .uasset files instead of the .wav files we were using before, because .uasset files are more reliable when sharing projects.

//Setting soundFileArray to soundFiles to pass into blueprint.
void UIntelSoundComponent::soundArray(TArray<FString> &soundFileArray) {
	
	soundFileArray = soundFiles;
	
}

//loading a wav file as a USoundWave so Unreal can set the sound chosen with LoadObject<USoundWave> for blueprint
USoundWave* UIntelSoundComponent::setWavToSoundWave(const FString &fileName) {
	
	USoundWave* swRef;
	FString name = fileName;

	swRef = LoadObject<USoundWave>(nullptr, *name);

	return swRef;

}

Code block. IntelSoundComponent.cpp

Lastly, in the .cpp, we create two functions that will be exposed as blueprint nodes: soundArray (which passes the soundFiles TArray into blueprints) and setWavToSoundWave (which took a while for us to figure out because we had to find a way to dynamically reference a .wav file in a way that Unreal could understand, which is a USoundWave). For this problem we discovered LoadObject, a function that loads an object at runtime into any type that we set, if possible. For us, it was LoadObject<USoundWave>(nullptr, *name), with *name being the sound chosen by the VR player.

In the IntelSoundComponent.h we create two UFUNCTIONS as a way to make the two functions in the .cpp blueprint callable.

#include "CoreMinimal.h"
#include "Components/SceneComponent.h"
#include "Runtime/Engine/Classes/Sound/SoundWave.h"
#include "IntelSoundComponent.generated.h"



UCLASS( ClassGroup=(Custom), meta=(BlueprintSpawnableComponent) )
class INTEL_VR_AUDIO_TOOLS_API UIntelSoundComponent : public USceneComponent
{
	GENERATED_BODY()


public:	
	// Sets default values for this component's properties
	UIntelSoundComponent();
	
	//Blueprint function to expose soundFiles into blueprint.
	UFUNCTION(BlueprintCallable, Category = IntelAudio)
		void soundArray(TArray<FString> &soundFileArray);

	//Blueprint function passing a wav converted in USoundwave into blueprint.
	UFUNCTION(Category = IntelAudio, BlueprintCallable)
		USoundWave* setWavToSoundWave(const FString &fileName);

Code block. IntelSoundComponent.h

Blueprint function to expose sound files into blueprint.

screenshot of sound array widget

Blueprint function passing a .wav file converted in USoundWave into blueprint.

screenshot of sound widget

Setting Up the UI

We need to set up three UMG widgets.

screenshot of multiple widget set up

We create the blueprints needed to manage those UMG widgets.

screenshot of multiple blue prints to manage umg widgets

We have a couple of widgets for this project. AudioParamsSliderWidget is the widget that pops up when a sound is selected. soundButtonWidgetBP is just a button widget for the sounds in the Content/Sounds folder. We place a widget called soundSelectorWidgetBP in the level by having an actor blueprint we created, IntelSoundWidgetBP, get the sounds from the soundArray C++ node and populate soundSelectorWidgetBP with soundButtonWidgetBPs. (We could do this dynamically, but then we would have to get a reference to the newly spawned actor every time we began play.) All this happens in the IntelSoundManagerBP, which we also placed in the level from the start.

screenshot of sound widget
IntelSoundManagerBP

In the image above, we get the soundFiles TArray of FStrings and split each (name of sound).wav string at the period. We send that string into an array of strings in IntelSoundWidget to name the buttons being dynamically populated.

screenshot of sound widget
IntelSoundWidgetBP

In the IntelSoundWidgetBP we spawn the soundUI,

screenshot of sound widget

add sounds,

screenshot of sound widget

and if we don't use the Set Widget node the widget would spawn but not be visible in game.

screenshot of sound widget spawned in VR environment

Sound Parameters

Once the player selects a sound from the widget, an IntelSoundAudioActorBP actor will spawn. In this actor we see the AudioParamsSliderWidgetBP, and if Spatialize? is clicked, three attenuation settings are exposed that can be changed through the widget.

screenshot of sound widget - volume settings

screenshot of sound widget - volume settings

Sound attenuation is essentially the ability of a sound to lower in volume as the player moves away from it.


The three settings exposed are Attenuation Function, Attenuation Shape, and the Falloff Distance.

There are plenty more settings that could be exposed with more time. Here are images of the Attenuation Setting struct in Unreal.

Unreal* attenuation settings

We believe the three settings we chose are the most basic and fundamentally needed settings. Showing debug lines when changing settings is something we are working on. We are looking for a way to use, in the game, the attenuation debug lines that Unreal shows in the editor, but we have not found that answer yet. So, we might get the shape extents of the chosen attenuation shape and function and use Unreal's built-in draw debug line nodes.

Saving On Exit

When we exit the game and have spawned sounds, moved them around, and played with the audio parameters, we save all the variables that are important using IntelSaveGameBP through IntelSoundAudioActorBP.

variables save UI
IntelSaveGameBP

screenshot of sound manager blueprint
IntelSoundManagerBP

screenshot of audio actor blueprint
IntelSoundAudioActorBP

Now if everything worked correctly, we should be able to edit any sounds in the folder inside VR.

Intel® Parallel Computing Center at LAMDA Group, Nanjing University


nanjing university logo

Principal Investigators:

Prof. Zhi-Hua Zhou portrait
Prof. Zhi-Hua Zhou is currently Professor and Standing Deputy Director of the National Key Laboratory for Novel Software Technology; he is also the Founding Director of the LAMDA group. His research interests are mainly in artificial intelligence, machine learning and data mining. He has authored the books Ensemble Methods: Foundations and Algorithms and Machine Learning (in Chinese), and published more than 150 papers in top-tier international journals or conference proceedings.

He has received various awards/honors including the National Natural Science Award of China, the PAKDD Distinguished Contribution Award, the IEEE ICDM Outstanding Service Award, the Microsoft Professorship Award, etc. He also holds 22 patents.

He is an Executive Editor-in-Chief of the Frontiers of Computer Science, Associate Editor-in-Chief of the Science China Information Sciences, Action or Associate Editor of Machine Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, ACM Transactions on Knowledge Discovery from Data, etc. He served as Associate Editor-in-Chief for Chinese Science Bulletin (2008-2014), Associate Editor for IEEE Transactions on Knowledge and Data Engineering (2008-2012), IEEE Transactions on Neural Networks and Learning Systems (2014-2017), ACM Transactions on Intelligent Systems and Technology (2009-2017), Neural Networks (2014-2016),  Knowledge and Information Systems (2003-2008), etc.

He founded ACML (Asian Conference on Machine Learning), served as Advisory Committee member for IJCAI (2015-2016), Steering Committee member for ICDM, PAKDD and PRICAI, and Chair of various conferences such as General co-chair of PAKDD 2014 and ICDM 2016, Program co-chair of SDM 2013 and IJCAI 2015 Machine Learning Track, and Area chair of NIPS, ICML, AAAI, IJCAI, KDD, etc. He is/was the Chair of the IEEE CIS Data Mining Technical Committee (2015-2016), the Chair of the CCF-AI(2012- ), and the Chair of the Machine Learning Technical Committee of CAAI (2006-2015). He is a foreign member of the Academy of Europe, and a Fellow of the ACM, AAAI, AAAS, IEEE, IAPR, IET/IEE, CCF, and CAAI.

Description:

The major goal of this Intel® Parallel Computing Center (Intel® PCC) is to implement a deep forest framework as an alternative to neural networks on KNL and all IA architectures. The deep forest model uses non-differentiable units (i.e., trees/tree ensembles) instead of neural units to construct a multi-layered structure with highly competitive performance compared to current deep models, without the need for GPUs. Due to the properties of tree ensemble units, such approaches are naturally suited to IA architectures rather than GPU architectures, and can handle discrete or tabular data better than perceptron-based neural networks. There is big potential for optimization on IA, especially by utilizing many-core devices such as Intel® Xeon® and Intel® Xeon Phi™. By doing so, we believe a CPU-centered deep learning system can be achieved using decision trees as building blocks instead of neurons. 

In other words, after a performance profiling on the current deep forest code, optimizations and modifications on the current implementation on Intel Xeon devices will be carried out accordingly. Other variations of deep forest models for specific tasks will also be designed and implemented, with the help of Intel® Many Integrated Core Architecture (Intel® MIC Architecture) and the Intel® AI platform.
 
This Intel Parallel Computing Center will also give students hands-on experience applying AI technology to solve real-world problems with the help of Intel's AI platforms, including hardware and software. First, hardware-oriented AI training: the success of AI applications depends on designing efficient platforms, and knowledge of the hardware is a critical step. Students will have access to the latest models for learning and development purposes. Second, software-oriented AI training: writing efficient implementations of AI programs also requires experience with well-maintained IA libraries, such as implementing AI systems with Intel's AI tools, including Intel® Parallel Studio, Intel® Data Analytics Acceleration Library (Intel® DAAL), Intel® Math Kernel Library (Intel® MKL), etc.

Publications:

Zhi-Hua Zhou's Publications

Related Website:

http://lamda.nju.edu.cn

Intel® Parallel Computing Center at First Institute of Oceanography, SOA, China


first institute of oceanography logo

Principal Investigators:

Project Lead, Prof. Fangli Qiao

Fangli Qiao has been working on the development of new-generation ocean and climate models. Working with his research team, he established the surface wave-induced mixing theory, which dramatically improves the performance of different ocean circulation models and climate models. He revealed the key role of sea spray in air-sea heat flux and greatly reduced the long-standing systematic error in the forecast of typhoon/hurricane intensity. He led a team to design a high-performance parallel scheme and test it with more than 10 million CPU cores. He has served as an editorial board member of Ocean Modelling and the Journal of Marine Systems, among others.

Co-Project Lead, Prof. Zhenya Song

Zhenya Song obtained his Ph.D. in Physical Oceanography from the Ocean University of China in 2011. Currently, Dr. Song is a professor at FIO and has been working on ocean and climate simulation, HPC, and the effects of waves in the climate system since 2004. He was the first to incorporate a surface wave model into global CGCMs and then developed a new-generation coupled model named FIO-ESM.

Co-Project Lead, Associate Prof. Xiaomeng Huang

Xiaomeng Huang obtained his Ph.D. in Computer Science from Tsinghua University in 2007. Currently, he is an associate professor in the Department of Earth System Science at Tsinghua University. He focuses on the field combining ocean modelling and HPC. His research interests include ocean models, parallel computing, and big data.

Description:

Ocean surface waves are crucially important to navigation safety and climate change. A high-resolution global wave model plays a key role in accurate surface wave forecasting and simulation. The MASNUM wave model, developed by FIO and now widely used in several research groups and operational ocean forecasting systems, is one of three state-of-the-art wave models in the world. This work will focus on implementing code on new Intel technologies, like AEP/HBW memory/CLX-AP, and new algorithms, like deep learning, to improve the computing performance and simulation ability of the MASNUM wave model. Moreover, it will deliver an open-source, high-resolution, large-scale new-generation wave model and development experience to the worldwide ocean community, which will expand both FIO's and Intel's influence on HPC and ocean scientific research.

Related Websites:

http://www.fio.org.cn
http://www.qnlm.ac/ronum/index
http://www.fio.org.cn/team/bodao-detail-1122.htm
http://www.fio.org.cn/team/shuodao-detail-1897.htm

Getting Started with Ubuntu* Core on an UP Squared* Board


Introduction

This article demonstrates to new users how to install Ubuntu* Core on an UP Squared* Grove* IoT Development Kit. The UP Squared board is a low-power, high-performance platform ideal for Internet of Things (IoT) applications. The UP Squared board is based on either the Intel® Celeron® processor (N3350) or the Intel® Pentium® processor (N4200). For more information, visit http://www.up-board.org/upsquared. Ubuntu* Core is a lightweight, transactional version of Ubuntu* designed for IoT and cloud usage. Snaps are universal Linux packages that can be installed on Ubuntu* Core to work on IoT devices and more. For more information on Ubuntu* Core, go to https://www.ubuntu.com/core.

Hardware Requirements

The hardware components used in this project are listed below:

  • UP Squared board
  • 2 USB 2.0 or 3.0 flash drives with at least 2GB free space available
  • A monitor with an HDMI interface
  • USB keyboard and mouse
  • A VGA or HDMI cable
  • A network connection with Internet access or Wi-Fi kit for UP Squared
  • An existing Linux* system is required to generate the RSA key (see Figure-1 below) and to login with SSH into the Ubuntu Core (Figure-14 below).

Software Requirements

The software requirements used in this project are listed below:

Steps

Download Images

  • Download Ubuntu Core Image 16.04.4

Set Up the Ubuntu SSO Account

  • Create an Ubuntu Account
  • Generate Keys
  • Import Key

Write the USB Drives

  • Download Ubuntu Core Image 16.04.4

Installation

  • Install Live Flash
  • Install Ubuntu Core

Generate a Host SSH Key

The first step is to create an Ubuntu SSO account from https://login.ubuntu.com
The account is required to create the first user on an Ubuntu Core installation. Click on the Personal details to fill out your information. 
Next, use an existing Linux system to generate the RSA key by running ssh-keygen -t rsa on the Linux shell as follows:
mkdir ~/.ssh
chmod 700 ~/.ssh
ssh-keygen -t rsa

Figure 1: Generate an SSH key on the Linux shell

Your public key is now available as .ssh/id_rsa.pub in your home folder /home/Ubuntu/.ssh/id_rsa.pub.

  • Click on the SSH keys and insert the contents of your public key /home/Ubuntu/.ssh/id_rsa.pub, then click on Import SSH key.

Figure 2: Submitted the SSH keys successfully

Create a Live USB Ubuntu* Flash Drive

Booting from the Live USB Flash Drive

  • Connect the USB hub, keyboard, mouse and the monitor to the Up Squared.

Figure 3: Up Squared board

  • Insert the Live USB Ubuntu Desktop flash drive you created earlier into the Up Squared board.
  • Select "Try Ubuntu without installing".

Figure 4: Try Ubuntu without installing

Install Ubuntu* Core Image on the Up Squared

  • Insert the second USB flash drive containing the Ubuntu Core image file into the Up Squared board.
  • Check for directories mounted on the internal eMMC storage. Unmount any directory mounted on the internal eMMC storage. Open a terminal and type:
mount | grep mmcblk
umount /media/ubuntu/writable
  • Check for the name of the drives of the Up Squared:
sudo fdisk -l
  • Assume /dev/sda is the second USB flash drive containing the Ubuntu Core image, mount it to /media/usb1.
sudo mkdir /media/usb1
sudo mount /dev/sda /media/usb1
  • Decompress the Ubuntu Core image and flash it into the Up Squared internal memory:
xzcat /media/usb1/ubuntu-core-16-amd64.img.xz | sudo dd of=/dev/mmcblk0 bs=32M status=progress; sync

Figure 5: Flash Ubuntu Core

  • Remove the Live USB Ubuntu Desktop flash drive and reboot the Up Squared board. The Up Squared will reboot from the internal memory where the Ubuntu Core has been flashed.

Configure the UP Squared* Board

After the Up Squared has rebooted, you will see a prompt below.

Figure 6: Ubuntu Core Configuration
  • Select Start to configure your network.

Figure 7: Configure wlan0
  • Select wlan0, then select Configure WIFI settings.

Figure 8: Configure WIFI
  • Enter Network name and Password, then select Done.

Figure 9: Network configuration
  • Highlight Done and press Enter
  • Highlight Done and press Enter again.

Figure 10: Network connections configuration complete

  • Now, DHCPv4 is enabled for wlan0, select Done.
  • Enter the Ubuntu One email address that was set up earlier, select Done then Enter.

Figure 11: Profile setup

  • Once the configuration is complete, the Ubuntu SSO username and the Up Squared localhost IP address will be displayed on the screen. Use this Up Squared localhost IP address to log in from a different Ubuntu machine later (see Figure 14).
Figure 12: Configuration complete
  • Select Finish, then press Enter. The Ubuntu Core login prompt appears as follows:
Figure 13: Ubuntu Core login from Up Squared board

First User login on a Different Ubuntu* Machine

  • First, add RSA identities to the authentication agent by running ssh-add on the shell.
  • Next, login with SSH into the Ubuntu Core from a different Ubuntu machine on the same network. The user name is your Ubuntu SSO username and the password is not required.
ssh <your Ubuntu SSO username>@<Ubuntu Core IP address>

Figure 14: ssh into Ubuntu Core from a different Ubuntu machine

  • Set a password in case you want to login from the local console on the Up Squared board. On the different Ubuntu machine console, type:
sudo passwd <your Ubuntu SSO username>

Figure 15: Set localhost password
Now, use your Ubuntu SSO username and the password just set in Figure 15 to log in to the Up Squared board from its local console:
Figure 16: Localhost login

Run Hello World Snap on localhost

Now the Up Squared is ready for snaps. Snaps are self-contained application bundles that contain most of the libraries and runtimes they need. A snap is a SquashFS filesystem containing your app code and a snap.yaml file.
  • Sign in to a Snap store using an Ubuntu SSO account:
Figure 17: Sign in to a snap store from localhost
  • Install the Hello Snap package using the snap name:

Figure 18: Install hello snap

  • Run the Hello Snap:

Figure 19: Run hello snap
  • List all snaps:

Figure 20: List all snaps

Refresh the Hello snap:

Figure 21: Refresh hello snap
Refresh all snaps:

Figure 22: Refresh all snaps
Remove the Hello snap:

Figure 23: Remove Hello snap

Summary

We have described how to install Ubuntu* Core on the Up Squared board and run Hello snap on Ubuntu Core. Visit https://uappexplorer.com/snaps for the list of other available snaps.

Key References

About the Author

Nancy Le is a software engineer at Intel Corporation in the Core & Visual Computing Group working on Intel Atom® processor enabling for Intel® IoT projects.

 

 

Custom Layers Support in Inference Engine


Deep Learning Inference Engine Workflow

The Deep Learning Inference Engine is a part of Intel® Deep Learning Deployment Toolkit (Intel® DL Deployment Toolkit) and OpenVINO™ toolkit. It facilitates deployment of deep learning solutions by delivering a unified, device-agnostic inference API.

For more information, refer to the Inference Engine Developer Guide.

The Deep Learning Inference Engine workflow involves the creation of custom kernels and either custom or existing layers.

A layer is defined as a convolutional neural network (CNN) building block implemented in the training framework (for example, Convolution in Caffe*). A kernel is defined as the corresponding implementation in the Inference Engine. This tutorial is aimed at advanced users of the Inference Engine. It allows users to provide their own kernels for existing or completely new layers.

Network training is typically done in high-end data centers, using popular training frameworks like Caffe or TensorFlow*. Scoring (or inference), on the other hand, can take place on embedded, low-power platforms. The limitations of these target platforms make the deployment of the final solution very challenging, both with respect to the data types and the API support. The Model Optimizer tool enables an automatic and seamless transition from the training environment to the deployment environment.

Below is an example Caffe workflow (TensorFlow steps are the same). The Model Optimizer converts the original Caffe proprietary formats to the Intermediate Representation (IR) file that describes the topology accompanied by a binary file with weights. These files are consumed by the Inference Engine and used for scoring.

Example Caffe* Workflow

Note: To work with Caffe, the Model Optimizer requires Caffe recompilation with the special interface wrappers (see the Model Optimizer Developer Guide for details).

The process of conversion from the supported frameworks to the Inference Engine formats is automatic for topologies with the standard layers that are known to the Model Optimizer tool (see Using the Model Optimizer to Convert TensorFlow* Models or Using the Model Optimizer to Convert Caffe* Models).

This tutorial explains the flow and provides examples for the non-standard (or custom) layers.

Inference Engine and the Model Optimizer are provided as parts of the Intel DL Deployment Toolkit and OpenVINO toolkit. The components are the same in both toolkits, but the paths are slightly different:

  • In the Intel DL Deployment Toolkit:
    • <DL_TOOLKIT_ROOT_DIR>/deployment_tools/model_optimizer
    • <DL_TOOLKIT_ROOT_DIR>/deployment_tools/inference_engine
  • In the OpenVINO toolkit:
    • <OPENVINO_ROOT_DIR>/model_optimizer
    • <OPENVINO_ROOT_DIR>/inference_engine

Custom Layers Workflow

The Inference Engine has a notion of plugins (device-specific libraries to perform hardware-assisted inference acceleration). Before creating any custom layer with the Inference Engine, you need to consider the target device. The Inference Engine supports only CPU and GPU custom kernels. It is usually easier to begin with the CPU extension, debug with the CPU path, and then switch to the GPU.

For performance implications and estimations, see Performance Implications and Estimating Performance Without Implementing or Debugging a Kernel.

When creating new kernels in the Inference Engine, you must connect custom layers to these kernels as follows:

  1. Register the custom layer for the Model Optimizer tool. This step, which is device agnostic, is required to generate correct Intermediate Representation (IR) file with the custom layers.
  2. Implement the kernel in OpenCL™ (if you target GPU acceleration) or C++ (for general CPU codepath) that can be plugged into the Inference Engine.
  3. Register the kernel in the Inference Engine, so that each time it meets the layer of the specific type in the IR, it inserts a call to the custom kernel into the internal graph of computations. The registration process also defines the connection between the Caffe parameters of the layer and the kernel inputs.

The rest of this document explains these steps in detail.

Note: The Inference Engine moved to the concept of core primitives implemented by the plugins and extensions that come in the source code. For the CPU device, it allows re-compilation for the target platform with SSE, AVX2, and similar codepaths. The CPU extensions, which you can modify or use as a starting point, are located in the <INSTALL_DIR>/deployment_tools/samples/extension directory.

Registering Custom Layers for the Model Optimizer

The main purpose of registering a custom layer within the Model Optimizer is to define the shape inference (how the output shape size is calculated from the input size). Once the shape inference is defined, the Model Optimizer does not need to call the specific training framework again.

For information on registering custom layers, see Custom Layers in the Model Optimizer.

Note: For Caffe, there is a legacy option to use the training framework as a fallback for shape inference. Custom layers can be registered in <MODEL_OPTIMIZER_DIR>/bin/CustomLayersMapping.xml, and the tool will call Caffe directly to get information on the output shapes.

Although the legacy option is much simpler than the primary registration process, it has a limitation related to shapes that dynamically depend on the input data. So we strongly encourage you to use the general custom layer registration mechanism via Python* classes for the Model Optimizer.

Performance Considerations for Custom Kernels and Custom Layers

Creating custom kernels and custom layers can cause performance issues under certain conditions, so it is important to keep in mind the implications of specific development decisions and to estimate how they might affect performance.

Performance Implications

  • Overriding Existing Layers.

    Custom kernels are used to quickly implement missing layers for cutting-edge topologies. For that reason, it is not advised to override layers on the critical path (for example, Convolutions). Also, overriding existing layers can disable some existing performance optimizations such as fusing.

  • Post-processing Custom Layers.

    When the custom layers are at the very end of your pipeline, it is easier to implement them as regular post-processing in your application without wrapping them as kernels. This is particularly true for kernels that do not fit the GPU well, for example, sorting the output bounding boxes. In many cases, you can do such post-processing on the CPU (see the sketch after this list).

  • Blocked Layout.

    If the performance of the CPU extension is of concern, consider an implementation that supports the blocking layout (that the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) is using internally), which would eliminate (automatic) Reorders before and after your kernel. For example of the blocked layout support, please refer to the PReLu extension example in the <INSTALL_DIR>/deployment_tools/samples/extension/ext_prelu.cpp.
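As a concrete illustration of the post-processing point above, here is a minimal C++ sketch that sorts detection results by confidence on the CPU instead of wrapping the sort as a custom kernel. The Detection structure and its fields are hypothetical stand-ins, not Inference Engine types:

#include <algorithm>
#include <vector>

// Hypothetical detection record parsed from the network's output blob.
struct Detection {
    float x, y, w, h;     // box coordinates
    float confidence;     // detection score
};

// Plain CPU post-processing: sort boxes by descending confidence.
void SortByConfidence(std::vector<Detection>& boxes) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Detection& a, const Detection& b) {
                  return a.confidence > b.confidence;
              });
}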

Estimating Performance without Implementing or Debugging a Kernel

In most cases, before actually implementing full-blown code for the kernel, you can estimate the performance by creating a stub kernel that does nothing and is "infinitely" fast, so the topology can execute end-to-end. The estimation is valid only if the kernel output does not affect the performance (for example, if its output is not driving any branches or loops).
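For example, a stub version of the execute method could simply return without touching the data. This is only a sketch, reusing the MyCustomLayerImpl class defined in the CPU Kernels section below:

// Stub kernel body used only for performance estimation: it does no work,
// so the rest of the topology can be timed end-to-end (sketch only).
StatusCode MyCustomLayerImpl::execute(std::vector<Blob::Ptr>& inputs,
                                      std::vector<Blob::Ptr>& outputs,
                                      ResponseDesc *resp) noexcept {
    (void)inputs; (void)outputs; (void)resp;  // intentionally left empty
    return OK;
}

The estimate obtained this way is an upper bound on the achievable speedup, since the real kernel will add its own execution time.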

CPU Kernels

Interface

Since the primary vehicle for the performance of the CPU codepath in the Inference Engine is Intel MKL-DNN, new CPU kernels extend the Inference Engine plugin for Intel MKL-DNN. Implementing the InferenceEngine::ILayerImplFactory defines a general "CPU-side" extension, so there are no Intel MKL-DNN specifics in the way you need to implement a kernel.

Let's consider a simple MyCustomLayerFactory class that registers an example kernel which multiplies its input data by two and does not change the dimensions:

  1. Create a constructor, a virtual destructor, and a data member to keep the layer info:
    // my_custom_layer.h
    class MyCustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    public:
    explicit MyCustomLayerFactory(const CNNLayer *layer): cnnLayer(*layer) {}
    private:
    CNNLayer cnnLayer;
    };
    
  2. Overload and implement the abstract methods (getShapes, getImplementations) of the InferenceEngine::ILayerImplFactory class:
    StatusCode MyCustomLayerFactory::getShapes(const std::vector<TensorDesc>& inShapes, std::vector<TensorDesc>& outShapes, ResponseDesc *resp) noexcept override {
        if (cnnLayer == nullptr)
            return GENERAL_ERROR;
        outShapes.clear();
        // the kernel accepts single tensor only
        if (inShapes.size() != 1)
            return GENERAL_ERROR;
        else// the output tensor’s shape is the same (kernel doesn’t change that)
    outShapes.emplace_back(inShapes[0]); 
        return OK;
    }
    StatusCode MyCustomLayerFactory::getImplementations(std::vector<ILayerImpl::Ptr>& impls, ResponseDesc *resp) noexcept override {
        // below the method reports single (CPU) impl of the kernel
        // in theory, here you can report multiple implementations
        // (e.g. depending on the layer parameters, available via the cnnLayer instance
        impls.push_back(ILayerImpl::Ptr(new MyCustomLayerImpl(cnnLayer)));
        return OK;
    }
    
  3. Introduce an actual kernel as MyCustomLayerImpl class, inherited from the abstract InferenceEngine::ILayerExecImpl class:
    // my_custom_layer.h
    class MyCustomLayerImpl: public ILayerExecImpl {
    public:
        explicit MyCustomLayerImpl(const CNNLayer *layer): cnnLayer(*layer) {}
        StatusCode getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept override;
        StatusCode init(LayerConfig& config, ResponseDesc *resp) noexcept override;
        StatusCode execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept override;
    private:
        CNNLayer cnnLayer;
    };
    
  4. Implement the virtual methods for your kernel class. First of all, implement the getSupportedConfigurations, which returns all supported format (input/output tensor layouts) for your implementation:
    // my_custom_layer.cpp
    virtual StatusCode MyCustomLayerImpl::getSupportedConfigurations(std::vector< LayerConfig>& conf, ResponseDesc *resp) noexcept {
        try {
            if (cnnLayer == nullptr)
                THROW_IE_EXCEPTION << "Cannot get cnn layer";
            if (cnnLayer->insData.size() != 1 || cnnLayer->outData.empty())
                THROW_IE_EXCEPTION << "Incorrect number of input/output edges!";
        DataPtr dataPtr = cnnLayer->insData[0].lock();
        if (!dataPtr)
            THROW_IE_EXCEPTION << "Cannot get input data!";
        DataConfig dataConfig;
        // this layer can process data in-place, but it is not constant
        dataConfig.inPlace = -1;
        dataConfig.constant = false;
        SizeVector order;
        //order of dimensions is default (unlike some Permute, etc kernels)
        for (size_t i = 0; i < dataPtr->getTensorDesc().getDims().size(); i++) {
            order.push_back(i);
        }
        // combine info into the TensorDesc
        dataConfig.desc = TensorDesc(
            dataPtr->getTensorDesc().getPrecision(),
            dataPtr->getTensorDesc().getDims(),
            {dataPtr->getTensorDesc().getDims(), order} /*BlockingDesc*/
        );
        // NCHW is default, so this call can be omitted, but see the comment at the end
        dataConfig.desc.SetLayout(MemoryFormat::NCHW);
        LayerConfig config;
        // finally, add the expected input config to the kernel config
        config.inConfs.push_back(dataConfig);
        // pretty much the same for the (single) output that the kernel will produce
        DataConfig outConfig;
        outConfig.constant = false;
        outConfig.inPlace = 0;
        order.clear();
        for (size_t i = 0; i < cnnLayer->outData[0]->getTensorDesc().getDims().size(); i++) {
            order.push_back(i);
        }
        // NCHW is default, so we use the TensorDesc constructor that omits the layout
        outConfig.desc = TensorDesc(
            cnnLayer->outData[0]->getTensorDesc().getPrecision(),
            cnnLayer->outData[0]->getDims(),
            {cnnLayer->outData[0]->getDims(), order}
        );
        // add the output config to the layer/kernel config
        config.outConfs.push_back(outConfig);
        // no dynamic batch support
        config.dynBatchSupport = 0;
        // finally, “publish” the single configuration that we are going to support
        conf.push_back(config);
        return OK;
        } catch (InferenceEngine::details::InferenceEngineException& ex) {
            std::string errorMsg = ex.what();
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return GENERAL_ERROR;
        }
    }

    Note: Carefully select the formats to support, as the framework might insert potentially costly reorders - special calls to reshape the data to meet the kernel's requirements. Many streaming kernels (for example, kernels that apply some arithmetic to every element of the input, like ReLU) are actually agnostic to the layout, so you should specify InferenceEngine::MKLDNNPlugin::MemoryFormat::any for them. Similarly, kernels that do not follow the traditional tensor semantics (of batches or features), but store the values in tensors, can also use MemoryFormat::any.

    Finally, if the performance is of concern, consider an implementation that supports the blocking layout (that the Intel MKL-DNN is using internally), which would eliminate reorders before and after your kernel. For an example of the blocked layout support, please see the PReLu extension example in the following directory: <INSTALL_DIR>/deployment_tools/samples/extension/ext_prelu.cpp.

  5. Implement the init method to get a runtime-selected configuration from the vector that was populated in the previous step and check the parameters:
    // my_custom_layer.cpp
    virtual StatusCode MyCustomLayerImpl::init(LayerConfig& config, ResponseDesc *resp) noexcept {
        StatusCode rc = OK;
        if (config.dynBatchSupport) {
            config.dynBatchSupport = 0;
            rc = NOT_IMPLEMENTED;
        }
        for (auto& input : config.inConfs) {
            if (input.inPlace >= 0) {
                input.inPlace = -1;
                rc = NOT_IMPLEMENTED;
            }
            for (auto& offset : input.desc.getBlockingDesc().getOffsetPaddingToData()){
                if (offset) // our simplified impl doesn’t support data offsets
                    return GENERAL_ERROR;
            }
            if (input.desc.getBlockingDesc().getOffsetPadding())
                return GENERAL_ERROR; // our simplified impl doesn’t support padding
            
            for (size_t i = 0; i < input.desc.getBlockingDesc().getOrder().size(); i++){
                if (input.desc.getBlockingDesc().getOrder()[i] != i) {
            // our simplified impl supports only 4D tensors with the regular dims order
                    if (i != 4 || input.desc.getBlockingDesc().getOrder()[i] != 1)
                        return GENERAL_ERROR;  
                }
            }
        }
     
        // pretty much the same checks for output
        for (auto& output : config.outConfs) {
            if (output.inPlace < 0)
                // no in-place support for the output
                return GENERAL_ERROR;
            for (auto& offset : output.desc.getBlockingDesc().getOffsetPaddingToData()) {
                if (offset)
                    return GENERAL_ERROR;
            }
            if (output.desc.getBlockingDesc().getOffsetPadding())
                return GENERAL_ERROR;
            
            for (size_t i = 0; i < output.desc.getBlockingDesc().getOrder().size(); i++) {
                if (output.desc.getBlockingDesc().getOrder()[i] != i) {
                    if (i != 4 || output.desc.getBlockingDesc().getOrder()[i] != 1)
                        return GENERAL_ERROR;
                }
            }
        }
        return rc;
    }
    
  6. Implement the execute method, which accepts and processes the actual tensors as input/output blobs:
    // my_custom_layer.cpp
    virtual StatusCode MyCustomLayerImpl::execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept {
        if (inputs.size() != 1 || outputs.empty()) {
            std::string errorMsg = "Incorrect number of input or output edges!";
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return GENERAL_ERROR;
        }
        const float* src_data = inputs[0]->buffer();
        float* dst_data = outputs[0]->buffer();
        for (size_t o = 0; o < outputs[0]->size(); o++) {
                dst_data[o] = src_data[o]*2; // the kernel just multiplies the input by two
        }
        return OK;
    }
    

Packing the Kernels into a Shared Library

All the kernels are packed into a single shared library. The library should internally implement the InferenceEngine::IExtension interface, which defines the functions that you need to implement:

// my_custom_extension.h
class MyCustomExtentionLib: public InferenceEngine::IExtension {
private:
    static InferenceEngine::Version ExtensionDescription = {
        {1, 0},             // extension API version
        "1.0",
        "MyCustomExtentionLib"   // extension description message
    };
public:
    // cleanup resources, in our case does nothing
    void Unload() noexcept override {
    }
    //  method called upon destruction, in our case does nothing
    void Release() noexcept override {
        delete this;
    }
    // logging, in our case does nothing
    void SetLogCallback(InferenceEngine::IErrorListener &listener) noexcept override {}
// returns version info
void GetVersion(const InferenceEngine::Version *& versionInfo) const noexcept override {
        versionInfo = &ExtensionDescription;
    }
// returns the list of supported kernels/layers
 StatusCode getPrimitiveTypes(char**& types, unsigned int& size, ResponseDesc* resp) noexcept override {
        std::string type_name = "MyCustomLayer";
        types = new char *[1];
        size = 1;
        types[0] = new char[type_name.size() + 1];
        std::copy(type_name.begin(), type_name.end(), types[0]);
        types[0][type_name.size()] = '\0';
        return OK;
    }
// main function!
    StatusCode getFactoryFor(ILayerImplFactory *&factory, const CNNLayer *cnnLayer, ResponseDesc *resp) noexcept override {
        if (cnnLayer->type != "MyCustomLayer") {
            std::string errorMsg = std::string("Factory for ") + cnnLayer->type + " wasn't found!";
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return NOT_FOUND;
        }
        factory = new MyCustomLayerFactory(cnnLayer);
        return OK;
    }
};
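In addition to the IExtension implementation above, the shared library has to expose an entry point that the Inference Engine resolves when the library is loaded. The snippet below is a hedged sketch that assumes the CreateExtension factory function of this Inference Engine release; check ie_iextension.h in your installation for the exact macro and signature:

// my_custom_extension.cpp
// Sketch (assumed entry point): hands the extension object to the Inference
// Engine when the shared library is loaded via make_so_pointer<IExtension>().
INFERENCE_EXTENSION_API(StatusCode) CreateExtension(IExtension*& ext,
                                                    ResponseDesc* resp) noexcept {
    try {
        ext = new MyCustomExtentionLib();
        return OK;
    } catch (std::exception& ex) {
        if (resp) {
            std::string err = ex.what();
            err.copy(resp->msg, sizeof(resp->msg) - 1);
        }
        return GENERAL_ERROR;
    }
}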

Loading the Shared Library

Before loading a network with the plugin, you must load the library with kernels to avoid errors on the unknown layer types:

// Load the Intel MKL-DNN plugin, refer to the samples for more examples
InferenceEngine::InferenceEnginePluginPtr plugin_ptr(selectPlugin(…, "CPU"));
InferencePlugin plugin(plugin_ptr);
// Load the CPU (MKL-DNN) extension as a shared library
auto extension_ptr =
    make_so_pointer<InferenceEngine::IExtension>("<shared lib path>");
// Add the extension to the plugin list
plugin.AddExtension(extension_ptr);

For code examples, see the Inference Engine Samples. All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option to load custom kernels. Use the following command line to execute the sample with custom CPU kernels:

$ ./classification_sample -i <path_to_image>/inputImage.bmp -m <path_to_model>/CustomAlexNet.xml -d CPU 
-l <absolute_path_to_library>/libmy_sample_extension.so 

GPU Kernels

General Workflow

Unlike CPU custom kernels, the GPU codepath abstracts many details of OpenCL. You do not need to use host-side OpenCL APIs. You only need to provide a configuration file and one or more kernel source files. See Example Configuration for examples of configuration and kernel files.

There are two options for using custom layer configuration file within the Inference Engine:

  • To include a section with your kernels into the global automatically-loaded cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml file (hosted in the <INSTALL_DIR>/deployment_tools/inference_engine/bin/intel64/{Debug/Release} folder)
  • To call the IInferencePlugin::SetConfig() method from the user application with the PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as the value before loading the network that uses custom layers to the plugin:
    // Load the GPU plugin, refer to the samples for more examples
    InferenceEngine::InferenceEnginePluginPtr plugin_ptr(selectPlugin(…, "GPU"));
    InferencePlugin plugin(plugin_ptr);
    // Load the GPU extensions
    plugin.SetConfig({{PluginConfigParams::KEY_CONFIG_FILE, "<path to the xml file>"}});
    …
    

All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option, -c, to load custom kernels, as follows:

$ ./classification_sample -m ./models/alexnet/bvlc_alexnet_fp16.xml -i ./validation_set/daily/227x227/apron.bmp -d GPU
 -c <absolute_path_to_config>/custom_layer_example.xml

Configuration File Format

The configuration file is expected to follow the .xml file structure with a node of type CustomLayer for every custom layer provided by the user.

The following definitions will use the notations:

  • (0/1) Can have 0 or 1 instances of this node/attribute
  • (1) Must have 1 instance of this node/attribute
  • (0+) Can have any number of instances of this node/attribute
  • (1+) Can have 1 or more instances of this node/attribute

CustomLayer Node and Sub-node Structure

CustomLayer node contains the entire configuration for a single custom layer.

Attribute Name | # | Description
name | (1) | The name of the layer type to be used. This name should be identical to the type used in the IR.
type | (1) | Must be SimpleGPU
version | (1) | Must be 1

Sub-nodes: Kernel (1), Buffers (1), CompilerOptions (0+), WorkSizes (0/1)

Kernel Node and Sub-node Structure

The Kernel node contains all kernel source code configuration. It has no attributes of its own.

Sub-nodes: Source (1+), Define (0+)

Source Node and Sub-node Structure

Source node points to a single OpenCL source file.

Attribute Name | # | Description
filename | (1) | Name of the file containing the OpenCL source code. Note that the path is relative to your executable.

Multiple source nodes will have their sources concatenated in order.

Sub-nodes: None

Define Node and Sub-node Structure

Define node configures a single #define instruction to be added to the sources during compilation (JIT).

Attribute Name | # | Description
name | (1) | The name of the defined JIT. For static constants, this can include the value as well (taken as a string).
param | (0/1) | Name of one of the layer parameters in the IR. This parameter value will be used as the value of this JIT definition.
type | (0/1) | The parameter type. Accepted values: int, float, and int[], float[] for arrays.
default | (0/1) | The default value to be used if the specified parameter is missing from the layer in the IR.

Sub-nodes: None

The resulting JIT will be of the form: #define [name] [type] [value/default].

Buffers Node and Sub-node Structure

The Buffers node configures all input/output buffers for the OpenCL entry function. It has no attributes of its own.

Sub-nodes: Data (0+), Tensor (1+)

Data Node and Sub-node Structure

Data node configures a single input with static data (for example, weight or biases).

Attribute Name | # | Description
name | (1) | Name of a blob attached to this layer in the IR
arg-index | (1) | 0-based index in the entry function arguments to be bound to

Sub-nodes: None

Tensor Node and Sub-node Structure

Tensor node configures a single input or output tensor.

Attribute Name | # | Description
arg-index | (1) | 0-based index in the entry function arguments to be bound to
type | (1) | input or output
port-index | (1) | 0-based index in the layer's input/output ports in the IR
format | (0/1) | Data layout declaration for the tensor. Accepted values: BFYX, BYXF, YXFB, FYXB (also in all lowercase). Default value: BFYX

CompilerOptions Node and Sub-node Structure

CompilerOptions node configures the compilation flags for the OpenCL sources.

Attribute Name | # | Description
options | (1) | Options string to be passed to the OpenCL compiler

Sub-nodes: None

WorkSizes Node and Sub-node Structure

WorkSizes node configures the global/local work sizes to be used when queuing the OpenCL program for execution.

Attribute Name | # | Description
global | (0/1) | An array of up to 3 integers (or formulas) for defining the OpenCL work-sizes to be used during execution. The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,*,% (all evaluated in integer arithmetic). Default value: global="B*F*Y*X" local=""
local | (0/1) | Same as global, but for the OpenCL local work-sizes.

Sub-nodes: None

Example Configuration file

The following code sample provides an example configuration file (in .xml format). For information on configuration file structure, see Configuration File Format.

<!-- the config file introduces a custom "ReLU" layer-->
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
  <!-- the corresponding custom kernel is "example_relu_kernel" from the specified .cl file-->
  <Kernel entry="example_relu_kernel">
    <Source filename="custom_layer_kernel.cl"/>
    <!-- the only ReLU specific parameter (for "leaky" one)-->
    <Define name="neg_slope" type="float" param="negative_slope" default="0.0"/>
  </Kernel>
  <!-- inputs and outputs of the kernel-->
  <Buffers>
    <Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
    <Tensor arg-index="1" type="output" port-index="0" format="BFYX"/>
  </Buffers>
  <!-- OpenCL compiler options-->
  <CompilerOptions options="-cl-mad-enable"/>
  <!-- define the global worksize. The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,*,% (all evaluated in integer arithmetic)
Default value: global="B*F*Y*X,1,1"-->
  <WorkSizes global="X,Y,B*F"/>
</CustomLayer>

Built-In Defines for Custom Layers

The following table includes definitions that will be attached before the user sources, where <TENSOR> is the actual input or output (for example, INPUT0 or OUTPUT0).

For an example, see Example Kernel.

Name | Value
NUM_INPUTS | Number of the input tensors bound to this kernel
GLOBAL_WORKSIZE | An array of global work sizes used to execute this kernel
GLOBAL_WORKSIZE_SIZE | The size of the GLOBAL_WORKSIZE array
LOCAL_WORKSIZE | An array of local work sizes used to execute this kernel
LOCAL_WORKSIZE_SIZE | The size of the LOCAL_WORKSIZE array
<TENSOR>_DIMS | An array of the tensor dimension sizes. Always ordered as BFYX
<TENSOR>_DIMS_SIZE | The size of the <TENSOR>_DIMS array
<TENSOR>_TYPE | The data type of the tensor: float, half, or char
<TENSOR>_FORMAT_ | The format of the tensor: BFYX, BYXF, YXFB, FYXB, or ANY. The format is concatenated to the defined name, so you can use the tensor format to select codepaths in your code with #ifdef/#endif
<TENSOR>_LOWER_PADDING | An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX
<TENSOR>_LOWER_PADDING_SIZE | The size of the <TENSOR>_LOWER_PADDING array
<TENSOR>_UPPER_PADDING | An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX
<TENSOR>_UPPER_PADDING_SIZE | The size of the <TENSOR>_UPPER_PADDING array
<TENSOR>_PITCHES | The number of elements between adjacent elements in each dimension. Always ordered as BFYX
<TENSOR>_PITCHES_SIZE | The size of the <TENSOR>_PITCHES array
<TENSOR>_OFFSET | The number of elements from the start of the tensor to the first valid element (bypassing the lower padding)

All <TENSOR> values will be automatically defined for every tensor bound to this layer (INPUT0, INPUT1, OUTPUT0, and so on), as shown in the following example:

#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, } 

Refer to the Appendix A: Resulting OpenCL™ Kernel for more examples.

Example Kernel

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
    const __global INPUT0_TYPE*  input0,
          __global OUTPUT0_TYPE* output)
{
    const uint idx  = get_global_id(0);
    const uint idy  = get_global_id(1);
    const uint idbf = get_global_id(2);//batches*features, as OpenCL supports 3D nd-ranges only
    const uint feature = idbf%OUTPUT0_DIMS[1];
    const uint batch   = idbf/OUTPUT0_DIMS[1];
    //notice that pitches are in elements, not in bytes!
    const uint in_id  = batch*INPUT0_PITCHES[0] + feature*INPUT0_PITCHES[1]   + idy*INPUT0_PITCHES[2]  + idx*INPUT0_PITCHES[3]  + INPUT0_OFFSET;
    const uint out_id = batch*OUTPUT0_PITCHES[0] + feature*OUTPUT0_PITCHES[1]  + idy*OUTPUT0_PITCHES[2]  + idx*OUTPUT0_PITCHES[3]  + OUTPUT0_OFFSET;
   
    INPUT0_TYPE value = input0[in_id];
    //neg_slope (which is non-zero for leaky ReLU) is put automatically as #define, refer to the config xml
    output[out_id] = value < 0 ? value * neg_slope : value;
}

Note: As described in the previous section, identifiers like INPUT0_TYPE are actually defined as OpenCL (pre-)compiler inputs by the Inference Engine for efficiency reasons. See Debugging Tips for information on debugging the results.

Debugging Tips

Dumping the Resulting Kernels

Compared to the CPU-targeted code, debugging the GPU kernels is less trivial.

First of all, it is recommended to get a dump of the kernel with all of the values set by the Inference Engine (all of the tensor sizes and the floating-point and integer kernel parameters). To get the dump, add the following line to your code to configure the GPU plugin to output the custom kernels:

plugin.SetConfig({{ PluginConfigParams::KEY_DUMP_KERNELS, PluginConfigParams::YES }});

When the Inference Engine compiles the kernels for a specific network, it also outputs the resulting code for the custom kernels. In the directory of your executable, you will find files like clDNN_program0.cl and clDNN_program1.cl. There are as many files as there are distinct sets of parameters for your custom kernel (different input tensor sizes and kernel parameters). See Appendix A: Resulting OpenCL™ Kernel for an example of a dumped kernel.

Using printf in the OpenCL™ Kernels

To debug specific values, you can use printf in your kernels. However, you should be careful: for instance, do not print excessively, as it would generate too much data. Since the printf output is buffered, your output can be truncated to fit the buffer. Also, because of buffering, you only get the entire buffer of output when the execution ends.

For more information, refer to printf Function.

Appendix A: Resulting OpenCL™ Kernel

This is an example of the code that the Compute Library for Deep Neural Networks (clDNN) generates internally for a specific layer, when all of the parameters (like the neg_slope value for the ReLU) and tensor dimensions are known.

Essentially, this is the original user code (see Example Kernel) plus a set of define values inserted by clDNN. Notice that the layer name is also reported (relu1):

// Custom Layer Built-ins
#define NUM_INPUTS 1
#define GLOBAL_WORKSIZE_SIZE 3
#define GLOBAL_WORKSIZE (size_t []){ 55,55,96, } 
#define LOCAL_WORKSIZE_SIZE 0
#define LOCAL_WORKSIZE (size_t []){  } 
#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, } 
#define INPUT0_TYPE float
#define INPUT0_FORMAT_BFYX 
#define INPUT0_LOWER_PADDING_SIZE 4
#define INPUT0_LOWER_PADDING (int []){ 0,0,0,0, } 
#define INPUT0_UPPER_PADDING_SIZE 4
#define INPUT0_UPPER_PADDING (int []){ 0,0,0,0, } 
#define INPUT0_PITCHES_SIZE 4
#define INPUT0_PITCHES (int []){ 290400,3025,55,1, } 
#define INPUT0_OFFSET 0
#define OUTPUT0_DIMS_SIZE 4
#define OUTPUT0_DIMS (int []){ 1,96,55,55, } 
#define OUTPUT0_TYPE float
#define OUTPUT0_FORMAT_BFYX 
#define OUTPUT0_LOWER_PADDING_SIZE 4
#define OUTPUT0_LOWER_PADDING (int []){ 0,0,0,0, } 
#define OUTPUT0_UPPER_PADDING_SIZE 4
#define OUTPUT0_UPPER_PADDING (int []){ 0,0,0,0, } 
#define OUTPUT0_PITCHES_SIZE 4
#define OUTPUT0_PITCHES (int []){ 290400,3025,55,1, } 
#define OUTPUT0_OFFSET 0

// Layer relu1 using Custom Layer ReLU
// Custom Layer User Defines
#define neg_slope 0.0
// Custom Layer Kernel custom_layer_kernel.cl
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
    const __global INPUT0_TYPE*  input0,
          __global OUTPUT0_TYPE* output)
{
    const uint idx  = get_global_id(0);
    const uint idy  = get_global_id(1);
    const uint idbf = get_global_id(2);//batches*features, as OpenCL supports 3D nd-ranges only
    const uint feature = idbf%OUTPUT0_DIMS[1];
    const uint batch   = idbf/OUTPUT0_DIMS[1];

    //notice that pitches are in elements, not in bytes!
    const uint in_id  = batch*INPUT0_PITCHES[0]   + feature*INPUT0_PITCHES[1]   + idy*INPUT0_PITCHES[2]  + idx*INPUT0_PITCHES[3]  + INPUT0_OFFSET;
    const uint out_id = batch*OUTPUT0_PITCHES[0]  + feature*OUTPUT0_PITCHES[1]  + idy*OUTPUT0_PITCHES[2]  + idx*OUTPUT0_PITCHES[3]  + OUTPUT0_OFFSET;
   
    INPUT0_TYPE value = input0[in_id];
    //neg_slope (which is non-zero for leaky ReLU) is put automatically as #define by the clDNN, refer to the xml
    output[out_id] = value < 0 ? value * neg_slope : value;
}

The Tools for Production IoT Development


Tools for Production for IoT

There's no way around it: IoT development requires a broad set of skills and expertise to be successful. You need working knowledge of hardware, software, application development, analytics, use cases, vertical markets and, as much as anything, the right tools. The core set of tools you select can mean the difference between a quick, relatively smooth development process that leads to successful commercial production and one that consumes valuable development time because the setup is too challenging. 

 

Optimizing Hardware Selection 

Ideally, tool selection begins with finding the right architecture. That’s the basis for gathering all the necessary components required to successfully prototype and deploy a commercial IoT solution, evolving your solution’s capabilities, and futureproofing your solution with new enhancements.

Consider the importance of your product’s lifespan. Tools must be capable of taking designs and making them extensible—so today’s IoT Product 1.0 can evolve to accommodate next-generation AI, security, connectivity, scalability, and other future attributes. Also, tools must be purpose-built and use-case-driven in order to speed development and minimize tinkering and trial and error. So, for example, if you want to develop an application for traffic management that counts vehicles and analyzes license plates, a 6th Generation Intel® Core™ processor, such as what is found in the iEi* Tank AIoT development kit, is an ideal choice. This kit scales to support complex and parallel video streams through CPU and GPU hardware acceleration. 

Included with the iEi* Tank AIoT development kit is the Intel® OpenVINO™ toolkit, an SDK designed to help developers build high-performance computer vision applications and integrate deep learning inference. For flexibility, developers can go to production and optimize performance using the Intel® System Studio tool suite, or prototype with the cloud-based Arduino* Create IDE. This software is included on a pre-installed custom Ubuntu* Desktop OS configured for computer vision development out of the box. Lastly, the Intel® Media SDK is also part of the package; it exposes the media acceleration capabilities of Intel platforms for decoding, encoding, video and photo processing, and capturing screen content. The Intel® Media SDK's single API enables hardware acceleration for fast video transcoding, image processing, and media workflows. 

All in all, an Intel® Core™ processor is a great choice for applications requiring parallel workloads. A solid choice for early prototyping is the UP* Squared Grove IoT development kit from Aaeon*, which is ideal for quick setup and concept turnaround, especially when used with the cloud-based Arduino Create* tool designed to support Intel-based platforms. The Arduino Create* tool integrates libraries and SDKs, making them available at your fingertips.

IoT Developer Solutions

Intel offers the tools and technology that provide a clear path to commercial production of innovative IoT solutions.

Get Started, Develop Efficiently, Then Scale

The business challenges commercial IoT developers face will always vary, which is why the development tools at their disposal must be flexible and wide-ranging. And that’s why Intel offers end-to-end, prototype-to-production solutions with variable options at the SDK level, IDE level, processor level, and cloud connection options to support this need. Again, it’s all about offering the right options to make the right choices with precision. No matter which tools and technology you select from the vast set of Intel tools, you can be assured you will get the essential building blocks you need to gain a clear path to the finished product you envision. 

Enter the Intel Developer Zone for IoT

Intel is committed to helping commercial IoT developers simplify and accelerate the development process every step of the way. That commitment is evident in the breadth of what we have to offer, the size and diversity of our edge-to-cloud ecosystem, and the production IoT solutions that developers are bringing to market using Intel tools and technology. 

With the Intel® Developer Zone for IoT as your source for tools, code samples, training, and ongoing support, you can access all the necessary resources—from seeking IoT solution inspiration to applying tangible developer insights to help you build and optimize your solution for commercialization. And you gain the flexibility to match the tools you choose with your current skills, while scaling up as you see fit. 
Discover all of the advantages of working with Intel at the Intel® Developer Zone for IoT.


Clone of Installing the OpenVINO™ Toolkit for Windows*


NOTE: The OpenVINO™ toolkit was formerly known as the Intel® Computer Vision SDK.
These steps apply to Windows* 10. For Linux* instructions, see the Linux installation guide.

Introduction

The OpenVINO™ toolkit quickly deploys applications and solutions that emulate human vision. Based on Convolutional Neural Networks (CNN), the toolkit extends computer vision (CV) workloads across Intel® hardware, maximizing performance. The OpenVINO™ toolkit includes the Intel® Deep Learning Deployment Toolkit. For more information, see the OpenVINO Overview information on the Web site.

The OpenVINO™ toolkit for Windows*:

  • Enables CNN-based deep learning inference on the edge
  • Supports heterogeneous execution across a CPU, Intel® Integrated Graphics, and Intel® Movidius™ Neural Compute Stick
  • Speeds time-to-market via an easy-to-use library of computer vision functions and pre-optimized kernels
  • Includes optimized calls for computer vision standards including OpenCV*, OpenCL™, and OpenVX*

Included with Installation

Component | Description
Deep Learning Model Optimizer | Model import tool. Imports trained models and converts them to IR format for use by the Deep Learning Inference Engine. This is part of the Intel® Deep Learning Deployment Toolkit.
Deep Learning Inference Engine | Unified API to integrate the inference with application logic. This is part of the Intel® Deep Learning Deployment Toolkit.
Drivers and runtimes for OpenCL™ version 2.1 | Enables OpenCL 1.2 on the GPU/CPU for Intel® processors
Intel® Media SDK | Offers access to hardware accelerated video codecs and frame processing
OpenCV version 3.4.1 | OpenCV community version compiled for Intel® hardware. Includes PVL libraries for computer vision.
OpenVX* version 1.1 | Intel's implementation of OpenVX* 1.1 optimized for running on Intel® hardware (CPU, GPU, IPU).
Sample Applications | A set of simple console applications demonstrating how to use Intel's Deep Learning Inference Engine in your applications. Additional information about building and running the samples can be found in the Inference Engine Developer Guide.

System Requirements

This guide includes only information related to Microsoft Windows* 10 64-bit. See the Linux installation guide for Linux information and instructions.

NOTE: Only the CPU and Intel® Integrated Graphics processor options are available. Linux is required for using the FPGA or Intel® Movidius™ Myriad™ 2 VPU options.

Development and Target Platforms

The development and target platforms have the same requirements, but you can select different components during the installation, based on your intended use.

Processor

  • 6th-8th Generation Intel® Core™
  • Intel® Xeon® v5 family, Intel® Xeon® v6 family

Processor Notes:

  • Processor graphics are not included in all processors. See https://ark.intel.com/ for information about your processor.
  • A chipset that supports processor graphics is required for Intel® Xeon® processors.

Operating System

Microsoft Windows* 10 64-bit

Installation

The steps in this guide assume you have already downloaded a copy of the OpenVINO™ toolkit for Windows*. If you do not have a copy of the package, you can download the latest version here, then return to this guide to proceed with installation.

Install External Software Dependencies

Install Core Components

  1. Download the OpenVINO toolkit. By default, the file is saved to Downloads as w_openvino_toolkit_p_2018.1.<version>.exe
  2. Go to the Downloads folder.
  3. Double-click w_openvino_toolkit_p_2018.1.<version>.exe. A screen displays with options to choose your installation directory and components:
  4. Click Next.
  5. The next screen warns you about any missing components and the effect the missing component has on installing or using the OpenVINO toolkit:
  6. If you are missing a critical component, click Cancel, resolve the issue, and then restart the installation.
  7. When the installation completes, click Finish to close the wizard and open the Getting Started Guide in a browser.
  8. Make sure the installation directory is populated with sub-folders. The default installation location is C:\Intel\computer_vision_sdk_2018.1.<versions>. 

Next Steps

IMPORTANT: Before using the Model Optimizer to work with your trained model, make sure your Caffe*, TensorFlow*, or MXNet* framework is prepared for any custom layers you have in place. The following information will put you on the way to doing this.

Learn About the OpenVINO™ Toolkit

Before using the OpenVINO™ toolkit, read through the product overview information on the Web site to gain a better understanding of how the product works.

Compile the Extensions Library

Some topology-specific layers, like DetectionOutput used in the SSD*, are delivered in source code that assumes the extensions library is compiled and loaded. The extensions are required for pre-trained models inference. While you can build the library manually, the best way to compile the extensions library is to execute the demo scripts.

Run the Demonstration Applications

To verify the installation, run the demo apps in <INSTALL_FOLDER>\deployment_tools\demo. For demo app documentation, see the README.txt in <INSTALL_FOLDER>\deployment_tools\demo.

The demo apps and their functions are:

  • demo_squeezenet_download_convert_run.bat. This demo illustrates the basic steps used to convert a model and run it. This enables the Intel® Deep Learning Deployment Toolkit to perform a classification task with the SqueezeNet model. This demo:
    • Downloads a public SqueezeNet model.
    • Installs all prerequisites to run the Model Optimizer.
    • Converts the model to an Intermediate Representation.
    • Builds the Inference Engine Image Classification Sample from <INSTALL_FOLDER>\deployment_tools\inference_engine\samples\classification_sample.
    • Runs the sample using cars.png from the demo folder.
    • Shows the label and confidence for the top-10 categories.
  • demo_security_barrier_camera_sample.bat. This demo shows an inference pipeline using three of the pre-trained models included with the OpenVINO. The region found by one model becomes the input to the next. Vehicle regions found by object recognition in the first phase become the input to the vehicle attributes model, which locates the license plate. The region identified in this step becomes the input to a license plate character recognition model. This demo:
    • Builds the Inference Engine Security Barrier Camera Sample from the <INSTALL_FOLDER>\deployment_tools\inference_engine\samples\security_barrier_camera_sample.
    • Runs the sample using car_1.bmp from the demo folder.
    • Displays the resulting frame with detections rendered as bounding boxes and text.

 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

Computer Vision Glossary of Vocabulary and Concepts


Computer Vision is a rapidly evolving area. This guide provides a starting point for understanding some of the terminology used in computer vision and the OpenVINO™ project.

I hope to make this a useful reference for people who are learning to develop, sell, train, or otherwise understand the concepts and vocabulary in this fascinating area of research.

This first article is for people who are in sales, marketing, project management or who would otherwise like to be knowledgeable about OpenVINO™, but are not specialists, researchers or engineers. 

If you have words, abbreviations, or other concepts that you think should be included in this cheat sheet, feel free to email them to me at daniel.w.holmlund@intel.com.

Glossary

Introductory Terms for Non-Developers

This section contains foundational key terms that any non-expert should know to speak knowledgeably on the topic of OpenVINO™.

  • Caffe*
  • Computer Vision
  • Convolutional Neural Network (CNN) - Convolutional Neural Networks are Neural Networks that make the explicit assumption that the inputs are 1d, 2d or multi-dimensional arrays. This assumption allows us to simplify the neural network architecture and make it more efficient for applications in computer vision that use images or video.
  • CPU
    • A central processing unit (CPU) is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logical, control and input/output (I/O) operations specified by the instructions.
    • https://en.wikipedia.org/wiki/Central_processing_unit 
  • Deep Learning Inference Engine
    • An inference engine is a component of a system that applies logical rules to a set of inputs to deduce new information. 
    • The Intel® deep learning inference engine is a piece of software that runs trained neural network models. It receives input, runs it through the trained neural network, and delivers the output.
  • FPGA
    • A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing. It can be specialized for accelerating the highly parallel processing tasks required in computer vision.
    • https://en.wikipedia.org/wiki/Field-programmable_gate_array 
  • FPGA Inference Accelerator
    • An FPGA that is specialized for running the Intel® deep learning inference engine at high speed.
  • GPU
    • A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate data associated with computer vision and computer graphics. 
  • Hardware Heterogeneity
    • Hardware Heterogeneity refers to the idea that computer software should be able to identify and run on a combination of different hardware. For example, if a computer vision program has access to a CPU, GPU and an FPGA then it should be able to use all three in an optimal manner. 
  • Intel® Arria® 10 FPGA GX
  • Intel® Media SDK
  • Intel® Movidius™ brand
    • The trademark name given to products developed by a computer vision company named Movidius™ that was acquired by Intel in September 2016.
  • Intel® Movidius™ Neural Compute Stick
    • The Intel® Movidius™ Neural Compute Stick (NCS) is a tiny, fanless, deep learning device that you can use to learn AI programming at the edge. NCS is powered by the same low power high performance Intel® Movidius™ Vision Processing Unit (VPU) that can be found in millions of smart security cameras, gesture controlled drones, industrial machine vision equipment, and more.
  • Model
    • A model is a trained neural network that specializes in a particular activity. More formally, it is a neural network that has been trained to approximate a particular function. 
  • Model Optimizer
    • The Model Optimizer is a cross-platform command-line tool that takes pre-trained deep learning models from Caffe*, TensorFlow*, and MXNet* and converts them to an Intermediate Representation for use with the Inference Engine. It performs static model analysis and adjusts deep learning models for optimal execution on end-point target devices.
  • MxNet
  • Neural Network Topology
    • The total number of neurons and all of their connections and weights  are referred to as the Neural Network Topology. 
  • Neural Network 
  • Object Detection
  • OpenCL™
    • OpenCL™ (Open Computing Language) is a framework for writing programs that execute across heterogeneous  platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. - Wikipedia 
  • OpenCV
    • OpenCV (Open Source Computer Vision Library) is a software library that specializes in real-time computer vision algorithms. Originally started by Intel, OpenCV is now open source and the most widely used computer vision library in the world. https://opencv.org/
  • OpenVINO™
    • The Open Visual Inference & Neural Network Optimization (OpenVINO™) toolkit provides computer vision, deep neural network, and convolutional neural network (CNN) libraries; the toolkit extends workloads across Intel® hardware and maximizes performance. - https://software.intel.com/en-us/openvino-toolkit 
  • OpenVX*
    • OpenVX* is an open, royalty-free standard for cross platform acceleration of computer vision applications. OpenVX enables performance and power-optimized computer vision processing, especially important in embedded and real-time use cases such as face, body and gesture tracking, smart video surveillance, advanced driver assistance systems (ADAS), object and scene reconstruction, augmented reality, visual inspection, robotics and more. - https://www.khronos.org/openvx/
  • TensorFlow*
    • TensorFlow* is an open source software library for high performance numerical computation. It comes with strong support for machine learning and deep learning, and its flexible numerical computation core is used across many other scientific domains.
  • The Edge (of the Network) 
    • Networks located on the periphery of a centralized network. Devices attached at the edge are often user facing.
  • VPU
    • A vision processing unit (VPU) is dedicated silicon designed for processing computer vision media, including images and video. It’s often used in conjunction with Intel® Movidius™ technology.


Accessing Remote Persistent Memory with Block Semantics Using SPDK and PMDK


Introduction

Persistent memory enables persistence at cache line granularities, compared to block granularity for traditional block storage. But in some cases, legacy software may need to access remote persistent memory using block semantics. This is not a primary use case for persistent memory but may be useful to present a small portion of a larger persistent memory pool over a network fabric. This article describes how the open source Software Performance Development Kit (SPDK) integrates with the Persistent Memory Development Kit (PMDK) to enable low-latency remote access to persistent memory with traditional block semantics using NVM Express* (NVMe*) over Fabrics (NVMe-oF).

Persistent Memory Development Kit Support for Block Storage

A key aspect of block semantics is guaranteeing write atomicity. When a block is written and a power failure occurs, we want to ensure that either all of the data or none of the data is written. This is critical for writing correct storage software such as write-ahead logs. The PMDK libpmemblk library provides such a guarantee for implementing block storage on top of persistent memory. Libpmemblk utilizes a block translation table (BTT), which behaves similarly to a flash translation layer (FTL) found in modern solid-state drives (SSDs). The BTT acts as an indirection table, enabling a separation of copying user data to a block-sized region of persistent memory from the mapping of that region to a logical block address.

Storage Performance Development Kit

Next, how do we present this block storage over a network fabric? NVMe-oF is a popular answer. NVMe-oF is designed for modern multicore architectures, enabling multiple queues for parallel access, and using remote direct memory access (RDMA) protocols to reduce software overhead and minimize latency.

Enter the SPDK. It provides a set of tools, libraries, and applications for creating high performance, scalable, user-mode storage applications. One of SPDK's applications is a poll-mode NVMe-oF target. SPDK provides a block device layer called bdev, which provides a generic interface to a heterogeneous set of block devices that are created by bdev modules. Examples of SPDK bdev modules include:

  • NVMe—for accessing either local or remote attached storage using the NVMe protocol
  • Malloc—for accessing DRAM as a RAM disk
  • Ceph RBD—for accessing Ceph* RADOS block devices
  • PMDK—for accessing PMDK libpmemblk pools

SPDK and PMDK Integration

The SPDK PMEM bdev driver uses the pmemblk pool as the target for block input/output (I/O) operations.

Figure: Schematic of the block device abstraction.

Here we see the block device abstraction, which presents the PMDK pool as a block device that can be served as an NVMe-oF namespace over the network fabric. The client system can then access this storage remotely using the NVMe-oF protocol with any NVMe-oF compliant driver.

Configuration

Let's walk through how to configure an SPDK NVMe-oF target with libpmemblk-based namespaces.

First, we need to configure the target system. These instructions assume that you are already familiar with PMDK and have installed PMDK on the target system. If this is not the case, instructions for installing PMDK can be found on the Intel® Developer Zone at Getting Started with Persistent Memory.  We also assume that you are familiar with RDMA and have RDMA interfaces configured on both the target and client systems.

Start by building SPDK on the target system. See the instructions for downloading the source code, installing prerequisite packages, and building SPDK. To enable PMDK with SPDK, you must pass --with-pmdk to the SPDK configure script.

Now we can start the NVMe-oF target:

cd <spdk root directory>
app/nvmf_tgt/nvmf_tgt

The NVMe-oF target should now be running. In a separate terminal window, we will use the SPDK RPC interface to configure a libpmemblk SPDK block device (bdev). Later, we will attach this bdev to an NVMe-oF namespace.

This example creates the backing storage in /tmp, but this can be changed to any directory in a persistent memory-enabled file system. The bdev will be 8192 blocks, with a block size of 4096, for a total of 32 MiB (mebibytes). The name of the bdev will be pmdk0.

cd <spdk root directory>
scripts/rpc.py create_pmem_pool /tmp/spdk_pool 8192 4096
scripts/rpc.py construct_pmem_bdev /tmp/spdk_pool -n pmdk0

Now we can create the NVMe-oF subsystem and attach the pmdk0 bdev as a namespace:

scripts/rpc.py construct_nvmf_subsystem \
nqn.2016-06.io.spdk:cnode1 '' '' -a -s SPDK0001
scripts/rpc.py nvmf_subsystem_add_listener \
nqn.2016-06.io.spdk:cnode1 -t RDMA \
-a 192.168.10.1 -s 4420
scripts/rpc.py nvmf_subsystem_add_ns \
nqn.2016-06.io.spdk:cnode1 pmdk0

The client system should now be able to connect:

nvme connect -t rdma -n nqn.2016-06.io.spdk:cnode1 \
-a 192.168.10.1 -s 4420

Summary

Using block protocols over a network fabric can quickly enable legacy applications to take advantage of persistent memory. NVMe-oF is the ideal protocol for this type of persistent memory access. The SPDK NVMe-oF target provides an easy-to-use PMDK plugin to enable persistent memory access over NVMe-oF, and its user space poll mode architecture optimizes access latency, compared to traditional interrupt-driven target applications.

About the Author

Jim Harris is a principal software engineer in Intel's Data Center Group. His primary responsibility is architecture and design of the Storage Performance Development Kit (SPDK). Jim has been at Intel since 2001, serving in a wide variety of storage software related roles. Jim has a B.S. and M.S. in Computer Science from Case Western Reserve University.

Code Sample: Panaconda - A Persistent Memory Version of the Game Snake

File(s): Download
License: 3-Clause BSD License
Optimized for...
OS: Linux* kernel version 4.3 or higher
Hardware: Emulated; see How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
Software (programming language, tool, IDE, framework): C++ compiler and Persistent Memory Development Kit (PMDK)
Prerequisites: Familiarity with C++


Introduction


Snake is a beloved game from childhood where you use arrow keys to navigate the board, pick up food to grow your snake, and avoid hitting walls or your own tail. Panaconda is a game of Snake designed to demonstrate persistent memory pools, pointers, and transactions. All objects are stored in persistent memory, which means that in the case of a power failure or application crash, the state of the game will be retained and you can continue playing from the point you were at before the failure. In this example, we demonstrate the details of what makes this game persistent, discuss how you can make your applications persistent by using similar methods, and wrap up with how to play the Panaconda game.

This article assumes you have basic knowledge of persistent memory and the concepts used in Persistent Memory Development Kit (PMDK) libraries. In our article, Introduction to Programming with Persistent Memory from Intel, we provide a great introduction to what persistent memory is and why it is revolutionary. For setting up your development environment, refer to our Getting Started guide. Pmem.io has a great tutorial series describing use of the libpmemobj library for persistent memory programming. It is highly recommended to at least read part 1, which introduces basic concepts demonstrated in this article. 


Game Design

In Panaconda, everything happens within a while loop in main. Until the snake is stopped for any reason, it will loop taking steps, setting food, and checking for collisions.

while (!snake_game->is_stopped()) {
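	// each pass advances the snake one step, sets food, and checks for collisions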
	…
}

Data structures

Figure 1. Data structure for Panaconda.

The game has three main classes: game, game_board, and snake. In the above diagram, you can see the additional classes and how they interact. The game class is the root object. This object is what anchors all the other objects created in the persistent memory pool. Through the game class, all other objects in the pool can be reached. This happens in the init() function of game, as shown:

persistent_ptr<game_state> r = state.get_root();

Game

In addition to being the root object, the game class checks whether the game file specified already exists. In the snippet below, the pool checks LAYOUT_NAME stored in game_state to see if it matches the game file passed in. This code is looking to see if the pool already exists. Whether the pool is being created, or it already exists and is being opened, the result is assigned to the pool object pointer (pop) variable.

if (pool<game_state>::check(params->name, LAYOUT_NAME) == 1)	
    pop = pool<game_state>::open(params->name, LAYOUT_NAME);
else
	pop = pool<game_state>::create(params->name, LAYOUT_NAME,
				           PMEMOBJ_MIN_POOL * 10, 0666);

In game::init we see our first transaction. This transaction wraps the maze setup process. If a power failure happens, the data structure does not get corrupted because all changes are rolled back. More details about transactions and how they are implemented can be found in the C++ bindings for libpmemobj (part 6) – transactions on pmem.io.

transaction::exec_tx(state, [&r, &ret, this]() {
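	// board setup is wrapped in a single transaction; if the game aborts
	// or power fails here, all of these changes are rolled back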
	r->init();
	if (params->use_maze)
		ret = parse_conf_create_dynamic_layout();
	else
		ret = r->get_board()->creat_static_layout();

	r->get_board()->create_new_food();
});

use_maze is set to true if the game was started with the "-m" flag. If a custom maze is passed in, the game creates a dynamic layout; otherwise it creates a static, predefined layout.

In this implementation of snake, the game_player class stores score and play_state. The state can be: STATE_NEW, STATE_PLAY, or STATE_GAMEOVER.

Game_board

Game_board creates persistent pointers to food, layout (the map), and snake. This is also where you would change the game board size if you were to create your own map.

game_board::game_board()
{
	persistent_ptr<element_shape> shape =
		make_persistent<element_shape>(FOOD);
	food = make_persistent<board_element>(0, 0, shape,
					      direction::UNDEFINED);
	layout = make_persistent<list<board_element>>();
	anaconda = make_persistent<snake>();
	size_row = 20;
	size_col = 20;
}

In the above snippet, the following objects are returned as persistent object pointers:

food—a persistent pointer to a board element of shape FOOD, with no point or direction defined

layout—a persistent pointer to a list of board elements

anaconda—a persistent pointer to a snake object, which ultimately is a list of board elements

All of these allocations are part of a transaction, so if the game aborts, the allocations are rolled back, reverting any memory allocation back to its original state. More information about the make_persistent function can be found by reading C++ bindings for libpmemobj (part 5) – make_persistent.

Keep in mind, in the game_board destructor, these pointers are deleted using the following syntax:

game_board::~game_board()
{
	layout->clear();
	delete_persistent<list<board_element>>(layout);
	delete_persistent<snake>(anaconda);
	delete_persistent<board_element>(food);
}

Another function of the game_board class is to keep track of collisions. If the snake's head hits food, the snake gets longer and the game gets harder. Collisions happen between the snake and food, between the snake and a wall, and between the snake and its own body.

bool is_snake_collision(point point);
bool is_wall_collision(point point);
bool is_snake_head_food_hit(void);

Snake

In the snake class, snake_segments is a persistent pointer to a list of board_element objects. Initially, the snake is populated with five segments. More segments are added as the snake hits food.

snake_segments = make_persistent<list<board_element>>();

The move function in snake uses persistent pointers and a for loop to iterate through each element of snake_segments. The loop iterates backwards to assign the previous snake_segments point and location to the next segment. This gives the look of the snake moving. When the loop reaches the first element of snake_segments, it calculates the next position and sets the direction based on the direction that was passed into the function.

void snake::move(const direction dir)
{
	int snake_size = snake_segments->size();
	persistent_ptr<point> new_position_point;

	last_seg_position = *(snake_segments->get(snake_size - 1)->get_position().get());
	last_seg_dir = snake_segments->get(snake_size - 1)->get_direction();

	for (int i = (snake_size - 1); i >= 0; --i) {
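		// walk from tail to head so each segment can take over the position
		// and direction of the segment in front of it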
		if (i == 0) {
			new_position_point =
				snake_segments->get(i)->calc_new_position(dir);
			snake_segments->get(i)->set_direction(dir);
		} else {
			new_position_point =
				snake_segments->get(i)->calc_new_position(
					snake_segments->get(i - 1)->get_direction());
			snake_segments->get(i)->set_direction(
				snake_segments->get(i - 1)->get_direction());
		}
		snake_segments->get(i)->set_position(new_position_point);
	}
}

As you can see in the image below, the snake is basically a moving array of snake_segments. Each element of snake_segments contains the x, y point where it is located and the direction it is going. When snake::move(const direction dir) is called, each element takes the position of the one in front of it. The first element moves based on the direction that was passed into the function.

Figure 2. Image of snake_segments before and after move function.


To Play

First, download and build Panaconda. Installation assistance, including dependencies, can be found in the PMDK readme.

Launch the game

$ ./panaconda /path/game/session/gameFile

The gameFile is where the game session is stored. This is either created the first time you play, or you can open a game file where you previously played. If this is your first time, make up a name for your file and start the game like this:

$ ./panaconda myFirstGameFile

Additionally, you can create your own game maze or use a friend's. "-m" specifies that you want to use a custom maze.

$ ./panaconda /path/game/session/gameFile -m /path/myMapCfg

panaconda/conf.cfg contains an example of a predefined maze. The maze is defined using a bitmap, where "1" is a wall and "0" is an open space. Currently, the map size is limited to 80 x 40 bits, but that is configurable in the code. Try creating your own maze and see if you can beat it; then share your maze with a friend!
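
For illustration only, a hypothetical miniature maze in this format might look like the rows of 0s and 1s below, with "1" cells forming the outer walls and an inner obstacle and "0" cells left open. This is a made-up sketch of the layout described above; the bundled panaconda/conf.cfg is the authoritative example and uses the full board dimensions.

11111111
10000001
10011001
10000001
11111111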

Controls

Panaconda uses the arrow keys to move. "q" quits the game and "n" creates a new game.

To simulate killing your game, press "ctrl+c", "q", or execute "kill -p `pgrep panaconda`" in another terminal. This returns you to the command line and exits you from your game. To resume, simply launch the game again using the same game file you previously specified. Because of the game's persistent aspects, it resumes exactly where you left off.

Objective

The goal of the game is to stay alive as long as possible while growing your snake longer and longer. The snake grows longer when it runs into a food block. But be careful to avoid hitting any obstacles, walls, or other parts of the snake itself.


Summary

In this code sample, we saw examples of transactions and persistent pointers. We examined game, the root object that anchors all the other objects in the persistent memory pool. This is just one example of how persistent memory can be used. Although simple and fun, this sample demonstrates fundamental persistent memory programming concepts. If you're interested in learning more, I encourage you to dig deeper into Panaconda, or explore our other code samples on software.intel.com or in our GitHub* repo.

The PMDK is available on GitHub and on the Persistent Memory Programming home page.


About the Author

Kelly Lyon is a Developer Advocate at Intel Corporation with three years of previous experience as a Software Engineer. Most recently, Kelly was a maintainer of Snap, Intel’s open source telemetry framework. Kelly is dedicated to advocating for users and looks forward to bringing clarity to complex ideas by researching and providing simple, easy to understand guides and tutorials. Follow her journey on Twitter @a_lyons_tale.




Hands-on with the OpenVINO™ Inference Engine Python* API


Introduction

The OpenVINO™ toolkit 2018 R1.2 Release introduces several new preview features for Python* developers:
  • Inference Engine Python API support.
  • Samples to demonstrate Python API usage.
  • Samples to demonstrate Python* API interoperability between AWS Greengrass* and the Inference Engine.
This paper presents a quick hands-on tour of the Inference Engine Python API, using an image classification sample that is included in the OpenVINO™ toolkit 2018 R1.2. This sample uses a public SqueezeNet* model that contains around one thousand object classification labels.
(Important: As stated in the Overview of Inference Engine Python* API documentation, “this is a preview version of the Inference Engine Python* API for evaluation purpose only. Module structure and API itself will be changed in future releases.”)

Prerequisites

Ensure your development system meets the minimum requirements outlined in the Release Notes. The system used in the preparation of this paper was equipped with an Intel® Core™ i7 processor with Intel® Iris® Pro graphics.
The Inference Engine Python API is supported on Ubuntu* 16.04 and Microsoft Windows® 10 64-bit OSes. The hands-on steps provided in this paper are based on development systems running Ubuntu 16.04.

Install the OpenVINO™ Toolkit

If you already have OpenVINO™ toolkit 2018 R1.2 installed on your computer you can skip this section. Otherwise, you can get the free download here: https://software.intel.com/en-us/openvino-toolkit/choose-download/free-download-linux
Next, install the OpenVINO™ toolkit as described here: https://software.intel.com/en-us/articles/OpenVINO-Install-Linux

Optional Installation Steps

After completing the toolkit installation steps, rebooting, and adding symbolic links as directed in the installation procedure, you can optionally add a command to your .bashrc file to permanently set the environment variables required to compile and run OpenVINO™ toolkit applications.

  1. Open .bashrc:
    cd ~
    
    sudo nano .bashrc
  2. Add the following command at the bottom of .bashrc:
    source /opt/intel/computer_vision_sdk_2018.1.265/bin/setupvars.sh
  3. Save and close the file by typing CTRL+X, Y, and then ENTER.

The installation procedure also recommends adding libOpenCL.so.1 to the library search path. One way to do this is to add an export command to the setupvars.sh script:

  1. Open setupvars.sh:
    cd /opt/intel/computer_vision_sdk_2018.1.265/bin/
    
    sudo nano setupvars.sh
  2. Add the following command at the bottom of setupvars.sh:
    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/libOpenCL.so.1:$LD_LIBRARY_PATH
  3. Save and close the file by typing CTRL+X, Y, and then ENTER.
  4. Reboot the system:
    reboot

Run the Python* Classification Sample

Before proceeding with the Python classification sample, run the demo_squeezenet_download_convert_run.sh sample script from the demo folder as shown below:

cd /opt/intel/computer_vision_sdk/deployment_tools/demo

sudo ./demo_squeezenet_download_convert_run.sh

(Important: You must run the demo_squeezenet_download_convert_run.sh script at least once in order to complete the remaining steps in this paper, as it downloads and prepares the deep learning model used later in this section.)

The demo_squeezenet_download_convert_run.sh script accomplishes several tasks:

  • Downloads a public SqueezeNet model, which is used later for the Python classification sample.
  • Installs all prerequisites to run the Model Optimizer.
  • Builds the classification demo app.
  • Runs the classification demo app using the car.png picture from the demo folder. The classification demo app output should be similar to that shown in Figure 1.

Figure 1. Classification demo app output

If you did not modify your .bashrc file to permanently set the required environment variables (as indicated in the Optional Installation Steps section above), you may encounter problems running the demo. If this is the case, run setupvars.sh before running demo_squeezenet_download_convert_run.sh.
The setupvars.sh script also detects the latest installed Python version and configures the required environment. To check this, type the following command:

echo $PYTHONPATH

Python 3.5.2 was installed on the system used in the preparation of this paper, so the path to the preview version of the Inference Engine Python API is:

/opt/intel/computer_vision_sdk/deployment_tools/inference_engine/python_api/ubuntu_1604/python3

Go to the Python samples directory and run the classification sample:

cd /opt/intel/computer_vision_sdk/deployment_tools/inference_engine/samples/python_samples

python3 classification_sample.py -m /opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/ir/squeezenet1.1/squeezenet1.1.xml -i /opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/car.png

In the second command we launch Python 3 to run classification_sample.py, specifying the same model (-m) and image (-i) parameters that were used in the earlier demo (i.e., demo_squeezenet_download_convert_run.sh). (Troubleshooting: if you encounter the error "ImportError: No module named 'cv2'", run the command sudo pip3 install opencv-python to install the missing library.) The Python classification sample output should be similar to that shown in Figure 2.

Figure 2. Python classification sample output

Note that the numbers shown in the second column (#817, #511…) refer to the index numbers in the labels file, which identifies all of the objects that are recognizable by the deep learning model. The labels file is located in:

/opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/ir/squeezenet1.1/squeezenet1.1.labels
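
To get a feel for what classification_sample.py is doing internally, the following is a minimal sketch of the Inference Engine Python API flow for this model. Because the Python API is a preview whose module structure is expected to change, treat the import path and the IENetwork/IEPlugin class and method names shown here as assumptions based on the bundled samples, and compare them against the source of classification_sample.py in your installation. The model, image, and labels paths are the same ones used above, and the labels file is assumed to contain one label per line.

# Hedged sketch only: the import path and the IENetwork/IEPlugin calls below are
# assumptions based on the preview API used by the bundled samples; verify them
# against classification_sample.py in your installation.
import cv2
from openvino.inference_engine import IENetwork, IEPlugin  # assumed module path

model_xml = "/opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/ir/squeezenet1.1/squeezenet1.1.xml"
model_bin = model_xml.replace(".xml", ".bin")
image_path = "/opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/car.png"
labels_path = model_xml.replace(".xml", ".labels")

# Load the Intermediate Representation produced by the Model Optimizer.
# (Some preview releases expose this as IENetwork.from_ir(model=..., weights=...).)
net = IENetwork(model=model_xml, weights=model_bin)
input_blob = next(iter(net.inputs))
output_blob = next(iter(net.outputs))

# Load the network onto the CPU plugin.
plugin = IEPlugin(device="CPU")
exec_net = plugin.load(network=net)

# SqueezeNet 1.1 expects a 227x227 BGR image in NCHW layout.
image = cv2.resize(cv2.imread(image_path), (227, 227))
image = image.transpose((2, 0, 1)).reshape((1, 3, 227, 227))

# Run inference and print the top-10 classes using the labels file
# (assumed to hold one label per line, indexed from zero).
result = exec_net.infer(inputs={input_blob: image})
probs = result[output_blob].flatten()
with open(labels_path) as f:
    labels = [line.strip() for line in f]
for idx in probs.argsort()[::-1][:10]:
    label = labels[idx] if idx < len(labels) else ""
    print("#{} {} {:.4f}".format(idx, label, probs[idx]))

The index numbers printed by a sketch like this correspond to the same #NNN values shown in the second column of the sample output in Figure 2.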

Conclusion

This paper presented an overview of the Inference Engine Python API, which was introduced as a preview in the OpenVINO™ toolkit 2018 R1.2 Release. It is important to remember that as a preview version of the Inference Engine Python API, it is intended for evaluation purposes only.

A hands-on demonstration of Python-based image classification was also presented in this paper, using the classification_sample.py example. This is only one of several Python samples contained in the OpenVINO™ toolkit, so be sure to check out the other Python features contained in this release of the toolkit.
