
Code Sample: Parallel Techniques in Modeling Particle Systems Using Vulkan* API

File(s): Download (Intel Software GitHub)
License: Intel Sample Source Code License Agreement
Optimized for... 
Operating System: Microsoft Windows® 10 (64-bit)
Hardware: GPU required
Software:
(Programming Language, tool, IDE, Framework)
Microsoft Visual Studio* 2017, Unreal Engine* 4, C#
Prerequisites:

Familiarity with Visual Studio, Unreal Engine API, 3D graphics, parallel processing.

Tutorial: Parallel Techniques in Modeling Particle Systems Using Vulkan* API

 

Introduction

Parallel processing has become one of the most important aspects of today's programming techniques. The paradigm shift forced by CPUs hitting the power wall pushed programming toward techniques that spread computation over multiple cores and processors. As it turns out, many of the computational tasks that software developers face are parallelizable. One example is modeling particle systems. Such systems are widely used in many fields, from physical simulation to computer games.

The Vulkan* API is a collaborative effort by the industry to meet the current demands of computer graphics. It is a new approach that emphasizes hiding the CPU bottleneck through parallelism and allows much more flexibility in application structure. Aside from components related only to graphics, the Vulkan API also defines a compute pipeline for numerical computation.

This code and the accompanying article (see References below) discuss and compare aspects of implementing a particle system on the CPU and on the GPU, using a Vulkan-based renderer as an example. We recommend that you read the article while looking at the code. Make sure you have the examples downloaded and use your favorite code-browsing tool.

The code demonstrates the following concepts:

  1. Vulkan renderer for CPU based simulator
  2. CPU-based particle simulator
  3. Multithreaded approach to computing particle system
  4. Vulkan compute in a modeling particle system

The code is organized to emphasize architectural concepts; as such, it makes two important assumptions: 1) the return status from Vulkan function calls is in most cases ignored to keep the code base as simple as possible. In production code, every return status should be checked and acted on; 2) conceptually close topics are kept within a single code flow. This reduces the need to jump through the sources, but sometimes leads to rather long functions.

Get Started

Vulkan* renderer for CPU based simulator

At a high level, when programming with Vulkan, the goal is to construct a virtual device to which drawing commands are submitted. The draw commands are submitted to constructs called “queues”. The number of queues available and their capabilities depend on how they were selected during construction of the virtual device and on the actual capabilities of the hardware. The power of Vulkan lies in the fact that workloads can be assembled and submitted to queues in parallel with already-executing tasks. Vulkan offers functionality to coherently maintain the resources and perform synchronization.

In our application, we took the approach of breaking up the Vulkan renderer setup into five components, with the intent of presenting an easy to understand codebase.

  1. Device
  2. AppInstance
  3. SurfaceWindows
  4. Scene
  5. ParticlesElement

CPU-based particle simulator

The next section describes the particle simulator model implemented for this project. The implementation is based on a real-time infinite loop. The simulation loop contains three main operations:

  1. Obtain the time difference between the current and previous iteration (CPU_TP23)
  2. Compute the next Euler iteration for the particle model (CPU_TP24); an illustrative sketch of this step appears below
  3. Upload and render a new frame using the Vulkan renderer (CPU_TP25)

These are described in detail in the article.
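
To make step 2 concrete, here is a minimal, illustrative Python sketch of an explicit Euler update for a toy particle system under constant gravity. It is not the project's code (the sample implements this step in C++ against the Vulkan renderer); all names here are made up for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class Particle:
    pos: List[float]   # position [x, y, z]
    vel: List[float]   # velocity [vx, vy, vz]

GRAVITY = (0.0, -9.81, 0.0)  # constant acceleration, m/s^2

def euler_step(particles: List[Particle], dt: float) -> None:
    """Advance each particle by one explicit Euler step of size dt."""
    for p in particles:
        # x(t + dt) = x(t) + v(t) * dt
        p.pos = [x + v * dt for x, v in zip(p.pos, p.vel)]
        # v(t + dt) = v(t) + a * dt
        p.vel = [v + a * dt for v, a in zip(p.vel, GRAVITY)]

# In the real loop: measure dt since the previous frame (step 1), run the Euler
# update (step 2), then hand the new positions to the Vulkan renderer (step 3).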

Multithreaded approach to computing particle system

This section covers various approaches to improving performance by parallelizing the workload. The code demonstrates application of the recommended choice.

Vulkan compute in a modeling particle system

Performance can be further improved by shifting the work to the GPU. This section addresses the topic of porting a CPU-based particle simulator onto the GPU using Vulkan compute, which is an integral part of the Vulkan API. Although most of the simulation is moved onto the GPU, the sorting and generation of particles are still kept on the CPU. The goal of this section is to show how to set up the compute pipeline, instantiate more advanced shader input constructs, and send frequently changing, but small, portions of data to the GPU without using memory mapping.

References

Tomasz Chadzynski, Integrated Computing Solutions, Inc., Parallel Techniques in Modeling Particle Systems Using Vulkan API, 2017

Updated Log

Created March 20, 2018


MADRaS: A Multi-Agent DRiving Simulator


simulator screen

Overview

In this article we present MADRaS: Multi-Agent DRiving Simulator. It is a multi-agent version of TORCS, a racing simulator popularly used for autonomous driving research by the reinforcement learning and imitation learning communities.

MADRaS is a multi-agent extension of Gym-TORCS. It is open source, lightweight, easy to install, and exposes the OpenAI Gym API, which makes it ideal for beginners in autonomous driving research. It enables independent control of tens of agents within the same environment, opening up a prolific direction of research in multi-agent reinforcement learning and imitation learning aimed at acquiring human-like negotiation skills in complicated traffic situations, a major challenge in autonomous driving that all major players are racing to solve.
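
Because MADRaS exposes the OpenAI Gym API, an agent interacts with it through the usual reset/step loop. The sketch below is a generic Gym-style control loop for illustration only; the environment id shown is a placeholder, not MADRaS's actual identifier (see the GitHub repository for the real constructor and observation/action spaces).

import gym  # MADRaS follows the standard Gym interface

def run_episode(env, policy, max_steps=1000):
    """Run one episode with the given policy and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)                       # e.g. steering/throttle values
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

# env = gym.make("Madras-v0")   # placeholder id, NOT the actual MADRaS identifier
# score = run_episode(env, policy=lambda obs: env.action_space.sample())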

Most open-source autonomous driving simulators (like CARLA*, DeepDrive, AirSim, and Udacity* SDC) natively support only egocentric control, that is, single-agent behavior, and have preprogrammed behaviors for the other agents. The difficulty of introducing agents with custom behaviors in these simulators restricts the diversity of real-world scenarios that can be simulated. To address this issue, we developed MADRaS, wherein each car on the racing track can be independently controlled, enabling the creation of rich, custom-made traffic scenarios and the learning of control policies for multiple agents simultaneously.

More Detail

The task of negotiation in traffic can be posed as that of finding the winning strategy in a multi-agent game, wherein multiple entities (cars, buses, two-wheelers, and pedestrians) are trying to achieve their objectives of getting from one place to another fast, yet safely and reliably. Imitation learning algorithms like Behavioral Cloning, Active Learning, and Apprenticeship Learning (Inverse Reinforcement Learning followed by Reinforcement Learning) have proved to be effective for learning such sophisticated behaviors, under a multitude of simplifying assumptions and constraining conditions. A major portion of the contemporary literature makes the single-agent assumption; that is, the agent acts in an environment with a plethora of other agents—similar or different—but does not interact with any of them, robbing it of data and information that could potentially be extremely useful in decision making, at both the egocentric and collaborative levels.

What?

Driving, however, is inherently multi-agent, and the following is a partial list of things that become possible once we get rid of the single-agent assumption.

Platooning

One of the earliest instances of multi-agent systems being deployed in vehicles (starting way back in 1993!) was in the use of platooning, wherein vehicles travel at highway speeds with small inter-vehicle spacing to reduce congestion and still achieve high throughput without compromising safety. Now it seems obvious that autonomous cars in the near future will communicate, cooperate, and form platoons over intersecting lengths of their commutes.

virtual vehicles

Source: eDriving*

Pooling knowledge

traffic simulator

Source: phys.org

Apart from transferring information about pile-ups and possible diversions ahead to all the vehicles in the geographical vicinity, this power of reliable communication can be used to pool together the knowledge of multiple learning agents. An intuitive motivation could be to consider a large gridworld. With a single learning agent, one could solve the gridworld in n hours of training. With multiple learning agents pooling their experiences, we could cut down the training time significantly, possibly even linearly!

There’s a host of untapped literature on communication among multiple agents in various environments (not autonomous driving... yet).

Now this raises important questions about the reliability of the communication between vehicles. With the imminent advent of 5G,1 fast and reliable communication between vehicles can help lead to the training and deployment of completely hands-free autonomous cars.

Leveraging intent

Drivers on the road constantly anticipate the potential actions of fellow drivers. As an example, for close maneuvering in car parks and intersections, eye contact is made to ensure a shared understanding. Defense Advanced Research Projects Agency (DARPA) stated that traffic vehicle drivers, unnerved by being unable to make eye contact with the robots, had resorted to watching the front wheels of the robots for an indication of their intent.2

As inter-vehicle communication becomes ubiquitous and reliable (shout out to 5G!), autonomous vehicles will be able to transmit their intent to neighboring vehicles, enabling a level of coordination beyond what human drivers currently achieve using eye contact.

car, biker, and pedestrian

Source: The Star

But…

Multi-agent learning comes with its own share of complications:

  • Nonstationarity: a moving-target problem, since the best policy changes as the other agents’ policies change.
  • Curse of dimensionality: the exponential growth of state and action variables with the number of agents.
  • Specifying a good goal: difficult, since the agents’ returns are correlated and cannot be maximized independently.
  • Exploration: apart from having to explore the environment, the agents also have to obtain information about other agents.
  • Coordination: the effect of an agent’s action on the environment also depends on the actions taken by other agents, hence the need for mutually consistent actions to achieve the intended effect.

Remember Why?

But remember why we started solving fully autonomous driving (FAD) in the first place. Writing for Technology Review, Will Knight outlines the possibilities of our driverless car future:

  • Safer transportation
    The National Highway Traffic Safety Administration estimates that more than 90 percent of road crashes involve human error, a figure that has led some experts to predict that autonomous driving will reduce the number of accidents on the road by a similar percentage. Assuming the technology becomes ubiquitous and does have such an effect, the benefits to society will be huge. Almost 33,000 people die on the roads in the United States each year, at a cost of USD 300 billion, according to the American Automobile Association. The World Health Organization estimates that, worldwide, over 1.2 million people die on roads every year.3 
  • Improved fuel efficiency
    Apart from less traffic congestion due to fewer accidents, it is also expected that the rise of self-driving taxis will help decrease the total number of cars on the road, alleviating overall traffic. And because driverless vehicles will be designed to optimize efficiency in acceleration and braking, the adoption of autonomous cars could reduce CO2 emissions produced by cars by as much as 300 million tons per year. Autonomous vehicles traveling in high-speed platoons that reduce aerodynamic drag could also reduce fuel consumption by 20 percent.4
  • Less traffic?
    Driverless cars communicating with each other and their surroundings can find and exploit the optimal routes more effectively, which will help spread the demand for scarce open road spaces.
  • Improved human productivity
    With cars doing most or all of the driving, we’ll be free to make the most of our time spent in the vehicle—spending that time reading books, watching a game, interacting with family members, and even getting some work done!

The list goes on...

Epilogue

So, today we’re excited to release MADRaS for the community to kickstart research into making FAD a reality. With the ability to introduce multiple learning agents into the environment at the same time, this simulator, built on top of TORCS, can be used to benchmark and try out existing and new multi-agent learning algorithms for self-driving cars, such as Multi-Agent Deep Deterministic Policy Gradient (MADDPG), PSMADDPG, and the like. And since it extends TORCS, it supports the deployment of all the single-agent learning algorithms as well. Scripts for training a DDPG agent are provided as a sample.

Check out the following video for an overview of the features and the general interface.

  • Check out the GitHub* repository and the wiki for the project.
  • Check out videos of a sample Deep Deterministic Policy Gradients (DDPG) agent that has learned to drive in traffic.

This project was developed by Abhishek Naik and Anirban Santara (an Intel® Student Ambassador for AI) during their internship at the Parallel Computing Lab, Intel Labs, Bangalore, India. This project was driven by Intel’s urge to address the absence of an open source multi-agent autonomous driving simulator that can be utilized by machine learning (particularly, reinforcement learning) scientists to rapidly prototype and evaluate their ideas. Although the system was developed and optimized entirely on the Intel® Core™ i7 processor and Intel® Xeon® processors, we believe that it would run smoothly on other x86 platforms, too. Currently, we are working on integrating MADRaS with the Intel® Nervana platform Reinforcement Learning Coach and we invite the community to participate in its development.

Please feel free to report any incompatibility or bug by creating an issue in the GitHub repository. We hope MADRaS enables new and veteran researchers in academia and the industry to make this FAD a reality!

Authors

Abhishek Naik: IIT Madras
Anirban Santara: Intel Student Ambassador for AI, IIT Kharagpur
Balaraman Ravindran: Head, RBC-DSAI, IIT Madras
Bharat Kaul: Parallel Computing Lab, Intel Labs

References

  1. Intel Is Accelerating the 5G Future and How 5G will change the world
  2. DARPA08Report
  3. Driverless Cars Are Further Away Than You Think
  4. Three Amazing Benefits Of Driverless Cars That You May Have Never Imagined

Intel® Developer Zone Feature Requests


Do you have a feature you've been dying to see on the site? Please send your ideas our way. While we can't guarantee we can fold them into our development roadmap, we'd love to hear what would be of value to you as developers.

Credit Risk Classification: Faster Machine Learning with Intel Optimized Packages


Abstract

This paper explains the importance of using Intel® Performance Libraries to solve a machine-learning problem such as credit risk classification. The objective of solving a credit risk classification problem is to minimize loss from the bank’s perspective; for example, when a person applies for a loan, the bank has to decide whom to approve for the loan and whom not to. The case study uses Intel® Distribution for Python* and the Python API for Intel® Data Analytics Acceleration Library (Intel® DAAL), named PyDAAL, to boost machine-learning and data analytics performance. Taking advantage of the optimized scikit-learn* (scikit-learn with Intel DAAL) that comes with Intel® Distribution for Python, we were able to achieve good results for the prediction problem.

Introduction

Credit risk is one of the major financial risks that exists in the banking system. How a bank manages its credit risk is critical for its performance over time; capital depletion through loan losses has been the proximate cause of most institutional failures. Distinguishing and rating credit risk is the primary step in managing it effectively. An applicant is a good credit risk if he or she is likely to repay the loan; conversely, an applicant who is not likely to repay the loan is considered a bad credit risk. The bank has to analyze an applicant’s demographic and socio-economic profile before a decision is made regarding his or her loan application.

With the help of Intel optimized computational packages and an Intel® Xeon® Gold 6128 processor, a faster predictive model is developed on this data. The machine-learning model helps guide the bank manager in deciding whether to approve a loan to a prospective applicant based on his or her profile.

Hardware Details

The hardware configuration of the Intel® Xeon® Gold 6128 processor is shown in Table 1.

Table 1. Hardware Configuration.

Intel® Xeon® Gold processor-based architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: Genuine Intel
CPU family: 6
Model: 85
Model name: Intel® Xeon® Gold 6128 processor 3.40 GHz
Stepping: 4
CPU MHz: 1199.960
BogoMIPS: 6800.00
Virtualization type: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23

Software Configuration

The development of this use case had the following dependencies (table 2).

Table 2. Software Configuration.

Anaconda* with Intel channel: 4.3.21
Intel® optimized for Python*: 3.6.3
Optimized scikit-learn*: 0.19.0
Intel® optimized for NumPy: 1.13.3

Dataset Description

The original dataset contains 1,000 entries with 20 categorical and symbolic attributes. Each entry represents a person who applies for a credit loan with a bank. Each person is classified as a good or bad credit risk according to the set of attributes. Table 3 shows the attributes in the dataset.

Table 3. Dataset Description.

checking_status: Status of existing checking account, in Deutsche Marks (DM)
duration: Duration in months
credit_history: Credit history (credits taken, paid back duly, delays, critical accounts)
purpose: Purpose of the credit (car, television, etc.)
credit_amount: Credit loan amount, in Deutsche Marks (DM)
savings_status: Status of savings account and bonds, in Deutsche Marks
employment: Present employment, in number of years
installment_commitment: Installment rate in percentage of disposable income
personal_status: Personal status (married, single, etc.) and sex
other_parties: Other debtors and guarantors
residence_since: Present residence since X years
property_magnitude: Property (e.g., real estate)
age: Age in years
other_payment_plans: Other installment plans (banks, stores, etc.)
housing: Housing (rent, own)
existing_credits: Number of existing credits at this bank
job: Job
num_dependents: Number of people being liable to provide maintenance for
own_telephone: Telephone (yes and no)
foreign_worker: Foreign worker (yes and no)
class: Good credit or bad credit

Intel® Distribution for Python*, Optimized scikit-learn*, and PyDAAL module

Machine learning and data analysis using Python get their power from Intel® Distribution for Python1. Intel® Distribution for Python is equipped with Intel optimized computational packages2 like NumPy, SciPy, scikit-learn*, and PyDAAL (a free package which implements Python bindings to the Intel® Data Analytics Acceleration Library, or Intel® DAAL)3.

scikit-learn is a commonly used library for machine learning mainly because of its algorithm richness. Intel DAAL is more than just a set of machine-learning algorithms, as it implements and optimizes the full data analysis pipeline, from loading data and transforming and filtering it to analysis and modeling of data with classic statistical and machine-learning techniques, as well as advanced deep-learning methods. scikit-learn algorithms shipped within Intel® Distribution for Python take advantage of PyDAAL, bringing scikit-learn performance to new levels.

Solution Design

The major steps involved in solution design are shown in Figure 1.

Diagram Solution design
Figure 1. Solution design.

Detailed explanation of each step is given below.

Dataset analysis

The primary step in any predictive modeling task is dataset analysis, which is an initial exploration of the data. The goal of dataset analysis is to gain a solid understanding of the data in order to derive a better solution. Gathering good information about the data also helps with feature engineering and feature selection. Data visualization using graphs helps us understand the relationships between attributes and patterns in the data.

The German credit dataset4 has 21 features, of which 14 are categorical variables and the remaining 7 are numerical. The last column is the label, which denotes the credit risk and has only two possible values: good credit risk and bad credit risk. Since both categorical and numerical variables are included in the dataset, appropriate statistical and distribution analyses are provided.

Numerical variable analysis

Table 4 shows the summary of all the numerical variables in the dataset. It includes count, mean, standard deviation (std), min, quartiles, and max in its output.

Table 4. Numerical Variable Analysis.

Here are few inferences that were obtained:

  1. There are no missing values in the dataset. The count is 1,000 for all columns, which means that all values are present for each attribute in the 1,000 rows of the dataset.
  2. Most of the credit loan amount requests are in the range of 3,000 DM to 4,000 DM.
  3. The mean duration is about 21 months and the mean age is around 36 years.
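
For reference, the following is a minimal pandas sketch of how a Table 4-style numerical summary can be produced. The CSV file name is an assumption; the numerical column names are taken from Table 3.

import pandas as pd

# Hypothetical file name; the raw data would need the Table 3 column names as headers.
df = pd.read_csv("german_credit.csv")

numeric_cols = ["duration", "credit_amount", "installment_commitment",
                "residence_since", "age", "existing_credits", "num_dependents"]

# count / mean / std / min / quartiles / max for each numerical attribute (cf. Table 4)
print(df[numeric_cols].describe())

# The count row doubles as a missing-value check (1,000 everywhere means no gaps)
print(df[numeric_cols].isnull().sum())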

Categorical variable analysis

histogram analysis
Figure 2. Credit loan amount histogram analysis.

Credit loan amount
Figure 3. Credit loan amount box plot analysis.

Most of the credit loan amounts are in the range of 2,000DM-4,000DM (Figure 2). The largest amount given is as high as 18,000DM. Box plot analysis shows that higher credit loan amounts are mostly bad credit risks (Figure 3).

Purpose distribution
Figure 4. Purpose distribution analysis.

Television and radio or a new car are the top reasons most applicants seek a credit loan. Very few applicants are seeking a credit loan for education or retraining. This may mean either that education and retraining are not considered worth a credit loan, or that their costs are covered entirely by schools, universities, the government, or in some other way, which seems very unlikely.

Age distribution
Figure 5. Age distribution analysis.

Plotting the frequency of the credit loan against the age groups in Figure 5 shows that credit demands from applicants between the ages of 20 and 39 make up for about two-thirds of the total credit demands.

Data pre-processing

Data pre-processing involves the transformations applied to the data before it is fed to the algorithm. Since most of the variables in the dataset are categorical, various techniques need to be applied to convert the categorical variables to numerical ones.

  1. Converting binary categorical data to numerical: In this step, the yes/no columns own_telephone and foreign_worker (see Table 3) are converted to 1s and 0s.
  2. One-hot encoding: This technique transforms each categorical feature with n possible values into n binary features, with only one active. The pandas package in Python supports this through the get_dummies function, named for the dummy/indicator columns it creates. The following columns in the German credit data are one-hot encoded: checking_status, credit_history, savings_status, employment, personal_status, other_parties, job, housing, other_payment_plans, property_magnitude.
  3. Label encoding: Another approach for converting categorical variables to numerical is label encoding, which converts each value in a column to a numeric value. The purpose column in the data is label-encoded. Optimized scikit-learn* provides pre-built functionality for this in its preprocessing module. A combined sketch of these three steps follows this list.
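
The sketch below combines the three pre-processing steps using pandas and scikit-learn. It assumes the data has been loaded into a DataFrame with the column names from Table 3; the file name is hypothetical.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("german_credit.csv")   # hypothetical file name (see previous sketch)

# 1. Binary categorical to numerical: yes becomes 1, anything else 0
for col in ["own_telephone", "foreign_worker"]:
    df[col] = (df[col] == "yes").astype(int)

# 2. One-hot encoding with pandas get_dummies
one_hot_cols = ["checking_status", "credit_history", "savings_status", "employment",
                "personal_status", "other_parties", "job", "housing",
                "other_payment_plans", "property_magnitude"]
df = pd.get_dummies(df, columns=one_hot_cols)

# 3. Label encoding for the purpose column and the class label
df["purpose"] = LabelEncoder().fit_transform(df["purpose"])
df["class"] = LabelEncoder().fit_transform(df["class"])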

Feature selection

Datasets may contain irrelevant or redundant features that make the machine-learning model more complicated. In this step, we aim to remove irrelevant features, which can increase run time, generate complex patterns, and so on. The resulting subset of features is used for further analysis. Feature selection can be done using either the Random Forest or the XGBoost algorithm. In this experiment, the XGBoost algorithm is used to select the features whose importance score is above a predefined threshold (a sketch of this step follows Table 5). Table 5 shows some of the features sorted by score.

Table 5. Feature Importance.

credit_amount: 0.1724
duration: 0.1122
age: 0.1057
purpose: 0.0634
checking_status_no checking: 0.0423
installment_commitment: 0.0341
plan_none: 0.0309
employment_4<=X<7: 0.0293
residence_since: 0.0260
credit_history_all paid: 0.0244
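
Continuing from the previous sketch, the following shows one way to do threshold-based feature selection with XGBoost feature importances. The threshold value and model parameters are arbitrary examples, not the values used in the article.

from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

X = df.drop(columns=["class"])   # df from the pre-processing sketch above
y = df["class"]

model = XGBClassifier(n_estimators=100, random_state=0)   # example parameters
model.fit(X.values, y)           # plain arrays avoid issues with '<' in column names

# Keep features whose importance exceeds an example threshold of 0.01
selector = SelectFromModel(model, threshold=0.01, prefit=True)
X_selected = selector.transform(X.values)

# Sorted importance scores, similar in spirit to Table 5
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked[:10]:
    print(f"{name}: {score:.4f}")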

Data split

Splitting the train and test data: The data is then split into train and test sets for further analysis. 90% of the data is used for training and 10% is for testing. The train_test_split function in scikit-learn is used for data splitting.
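
A minimal sketch of the 90/10 split with scikit-learn's train_test_split, continuing from the selected features above (the random seed is arbitrary):

from sklearn.model_selection import train_test_split

# 90% of the data for training, 10% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.10, random_state=42)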

Classifier implementation

The classifier is implemented using two packages: scikit-learn with Intel DAAL, and PyDAAL.

scikit-learn with Intel DAAL

Balancing the dataset

The dataset is highly imbalanced, with 70% of the entries being good credit risks. This imbalance is handled by the SMOTE algorithm, which generates a new, balanced dataset that addresses the unbalanced class problem. It artificially generates observations of the minority class using the nearest neighbors of elements of that class to balance the training dataset.
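
The article does not name a specific SMOTE implementation; the sketch below assumes the imbalanced-learn package and applies SMOTE to the training set only.

from imblearn.over_sampling import SMOTE   # imbalanced-learn package (an assumption)

smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)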

Model building and training

There are various machine-learning models (algorithms) that have been created by researchers and data scientists over the years. In this stage, machine-learning models are selected for training. All classifiers in scikit-learn use a fit(X, y) method to fit the model to the given training data X and training labels y. To compare the performance of various models, an ensemble of classifiers is used. Once a model is trained, it can be used for prediction.

Prediction

During this stage, the trained model predicts the output for a given input based on its learning. That is, given an unlabeled observation X, predict (X) returns the predicted label y.

Evaluation

In order to measure the performance of the model, various performance evaluation metrics are available. We used accuracy, precision, and recall as our evaluation metrics to choose the best model for the problem.
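
Putting the previous pieces together, here is a hedged sketch that trains one of the classifiers named in the results (Random Forest), predicts on the held-out test set, and computes the evaluation metrics. The hyperparameters and the weighted averaging are assumptions.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

clf = RandomForestClassifier(n_estimators=100, random_state=42)   # example parameters
clf.fit(X_train_bal, y_train_bal)     # fit(X, y) on the SMOTE-balanced training data

y_pred = clf.predict(X_test)          # predict(X) returns the predicted labels

# Weighted averaging is an assumption; the article does not state the averaging mode
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1-score :", f1_score(y_test, y_pred, average="weighted"))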

Python API for Intel DAAL-PyDAAL

Data representation in PyDAAL

All the pre-processing steps up to step 4 (data split) are the same for the PyDAAL implementation. Every algorithm in DAAL accepts inputs in the form of numeric tables, a generic data type for representing data in memory. Since all the converted features are of the same type, we used HomogenNumericTables for representation. The dataset obtained after feature selection in scikit-learn is a NumPy ndarray, which can be easily converted to a HomogenNumericTable using the built-in function in the DAAL data management module5.

Model building and training

An algorithm object is created to train the classifier in the appropriate processing mode (batch, online, or distributed)5. The two pieces of input (i.e., data and labels) are set using the input.set member methods6 of the algorithm object. The compute() method is then used to update the partial model.

Prediction

After creating the model, a prediction object is defined. The testing data set and the trained model are passed to the algorithm using the input.setTable() and input.setModel() methods, respectively. The predictions are computed using the compute() method.

Evaluation

After finding the predictions using the compute() method, the accuracy, precision, recall, and f1-score of the model are calculated.

Experimental Results

Since the problem is binary classification, apart from the accuracy, the evaluation metrics such as precision, recall, and F1 score are used to identify the best classifier from the ensemble of algorithms. The classifiers are categorized as Best, Good, and Poor based on the performance evaluation metrics. The classifier that gives a stable value for Accuracy, Precision, Recall, and F1 score is categorized as the best classifier, whereas the classifier that gives highly varying values for the evaluation metrics is categorized as a poor classifier.

Using scikit-learn with Intel DAAL

Table 6 depicts the results of using various classifiers on the balanced and unbalanced datasets. A comparison of results for the full and feature-selected datasets is also given. All the features are pre-processed using the encoding techniques discussed in the data pre-processing section. A Receiver Operating Characteristic (ROC) curve is also plotted for each classifier to analyze its performance: the closer the curve follows the left-hand border and then the top border of the ROC space, the better the classifier. Figure 6 shows the ROC curves for the classifiers in scikit-learn with Intel® DAAL. The ROC curves show that the Random Forest and Ada Boost classifiers are the best classifiers.

Table 6. Experiment Results of scikit-learn* with Intel® DAAL.

Each entry lists Accuracy / Precision / Recall / F1-score (%), reported for the unbalanced full training data, the unbalanced feature-selected data, and the balanced feature-selected data, in that order.

Best

Random Forest: 77/76/77/75 | 76/78/76/74 | 82/84/82/81

Ada Boost: 77/77/77/77 | 74/74/74/73 | 81/81/81/81

Good

Decision Tree: 69/71/69/70 | 66/65/66/65 | 68/67/68/67

Support Vector Machine: 79/79/79/77 | 73/71/73/71 | 71/74/71/71

Gaussian Naïve Bayes: 65/67/65/66 | 73/75/73/74 | 75/75/75/75

Multinomial Naïve Bayes: 60/61/60/60 | 62/63/62/63 | 61/61/61/61

K Nearest Neighbors: 66/63/66/64 | 66/63/66/64 | 51/54/51/52

Poor

Stochastic Gradient Descent: 70/49/70/58 | 70/49/70/58 | 46/72/46/38

One Vs Rest: 70/49/70/58 | 70/49/70/58 | 65/78/65/53

ROC curve
Figure 6. ROC curve.

Using PyDAAL

The results of classification using PyDAAL algorithms are given in table 7.

Table 7. Experiment Results of PyDAAL.

Each entry lists Accuracy / Precision / Recall / F1-score (%).

Best

Decision Tree: 74/72/74/72

Support Vector Machine: 65/79/71/75

Good

Multinomial Naïve Bayes: 62/64/62/63

Poor

Ada Boost: 70/49/70/58

Observations:

  1. The Random Forest and Ada Boost classifiers in scikit-learn with Intel DAAL are identified as the best classifiers for the given problem.
  2. There was only a slight improvement in classifier performance when the irrelevant features were removed, but there was a significant improvement in run time.
  3. Balancing the dataset using the SMOTE algorithm helped produce better results for scikit-learn, whereas there was no improvement in PyDAAL.
  4. PyDAAL is versatile enough to accommodate data of different memory layouts. Popular libraries used in the data analysis process, like NumPy, can be easily interfaced with Intel DAAL to create numeric tables.
  5. PyDAAL supports only a few machine-learning algorithms and the highest score is obtained for decision trees.

Conclusion

The scikit-learn algorithms shipped within Intel® Distribution for Python* take advantage of PyDAAL, bringing scikit-learn performance to new levels. scikit-learn is also a richer library in terms of machine-learning algorithms than PyDAAL. Using Intel optimized performance libraries on the Intel® Xeon® Gold 6128 processor helped the machine-learning application make predictions faster. Higher accuracy is obtained with the optimized scikit-learn algorithms than with the PyDAAL algorithms. PyDAAL mainly operates on numeric table data, which allows a reduced memory footprint and efficient processing.

About the Authors

Nikhila Haridas is a Technical Consulting Engineer working with the Intel® AI Academy Program.

References

  1. Intel Distribution for Python
  2. Intel Optimized Packages for the Intel® Distribution for Python*
  3. PyDAAL
  4. German credit dataset
  5. DAAL: Data structures
  6. DAAL programming guide

Inference Engine Developer Guide


Deployment Challenges

Deploying deep learning networks from the training environment to embedded platforms for inference is a complex task that introduces technical challenges, such as:

  • Several deep learning frameworks are widely used in the industry, such as Caffe*, TensorFlow*, MXNet*, among others
  • Training deep learning networks is typically performed in data centers or server farms, while inference often takes place on embedded platforms that are optimized for performance and power consumption.
    These platforms are typically limited from the software perspective:
    • programming languages
    • third party dependencies
    • memory consumption
    • supported operating systems
    and the platforms are limited from the hardware perspective:
    • different data types
    • limited power envelope
    Because of these limitations, it is usually not recommended, and sometimes not possible, to use the original training framework for inference. As an alternative, use dedicated inference APIs that are optimized for specific hardware platforms.

For these reasons, ensuring the accuracy of the transformed networks can be a complex task.

Deployment Workflow

The Inference Engine deployment process assumes you used the Model Optimizer to convert your trained model to an Intermediate Representation. The scheme below illustrates the typical workflow for deploying a trained deep learning model.

Intel Computer Vision Basic Workflow

A summary of the steps for optimizing and deploying a trained model:

  1. Configure the Model Optimizer for your framework.
  2. Convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and biases values.
  3. Test the model in the Intermediate Representation format using the Inference Engine in the target environment via the provided Inference Engine Validation application or the sample applications.
  4. Integrate the Inference Engine in your application to deploy the model in the target environment.

Introduction to the Inference Engine

After you have used the Model Optimizer to create an Intermediate Representation, use the Inference Engine to infer input data.

The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices.

NOTE:

  • This section talks about API information. For more information about APIs, see the offline documentation that was included in your package. To locate the current API:
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ where <INSTALL_DIR> is the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Integrating Inference Engine in Your Application from the contents.
  • This document refers to APIs from previous releases as "legacy" API. It is best to stop using the legacy API since it will be removed in a future product release. To locate the legacy API:
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ under the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Integrating Inference Engine in Your Application (legacy API) from the contents.
  • Complete API documentation is also in the full offline package documentation.
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ under the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Open Data Structures from the menu at the top of the screen.

Modules in the Inference Engine Package

Your application must link to the core Inference Engine library and C++ header files in the include directory.

The library contains the classes for:

  • Linux: libinference_engine.so
  • Windows: inference_engine.dll

Using Plugins, Depending on the Target

Each supported target device has a plugin. The Heterogeneous plugin is also available for distributing a calculation workload across devices. Each plugin is a DLL/shared library. Make sure those libraries are in your computer's path or in the place you pointed to in the plugin loader. Make sure each plugin's related dependencies are in the:

  • Linux: LD_LIBRARY_PATH
  • Windows: PATH

On Linux, use the script bin/setupvars.sh to set the environment variables.

The following list shows the relationship between libraries and targets.

CPU: Linux library libMKLDNNPlugin.so (depends on libmklml_tiny.so, libiomp5md.so); Windows library MKLDNNPlugin.dll (depends on mklml_tiny.dll, libiomp5md.dll)
Intel® Integrated Graphics: Linux library libclDNNPlugin.so (depends on libclDNN64.so); Windows library clDNNPlugin.dll (depends on clDNN64.dll)
FPGA: Linux library libdliaPlugin.so (depends on libdla.so); not supported on Windows
Intel® Movidius™ Myriad™ 2 Vision Processing Unit (VPU): Linux library libmyriadPlugin.so (no dependencies); not supported on Windows
Heterogeneous: Linux library libHeteroPlugin.so (dependencies same as selected plugins); Windows library HeteroPlugin.dll (dependencies same as selected plugins)

When using the Heterogeneous plugin, use the literal strings in the Target column in the getPluginByDevice method. For more information, see the getPluginByDevice API.

Common Workflow for Using the Inference Engine API

  1. Read the Intermediate Representation - Using the InferenceEngine::CNNNetReader class, read an Intermediate Representation file into a CNNNetwork class. This class represents the network in host memory.
  2. Prepare the input and output formats - After loading the network, specify the input and output precision and the layout of the network. For these specifications, use the CNNNetwork::getInputInfo() and CNNNetwork::getOutputInfo() methods.
  3. Select Plugin - Select the plugin on which to load your network. Create the plugin with the InferenceEngine::PluginDispatcher load helper class. Pass per device loading configurations specific to this device, and register extensions to this device.
  4. Compile and Load - Use the plugin interface wrapper class InferenceEngine::InferencePlugin to call the LoadNetwork() API to compile and load the network on the device. Pass in the per-target load configuration for this compilation and load operation.
  5. Set input data - With the network loaded, you have an ExecutableNetwork object. Use this object to create an InferRequest in which you specify the buffers to use for input and output. Specify device-allocated memory and copy it into the device memory directly, or tell the device to use your application memory to save a copy.
  6. Execute - With the input and output memory now defined, choose your execution mode:
    • Synchronously - Infer() method. Blocks until inference finishes.
    • Asynchronously - StartAsync() method. Check status with the wait() method (0 timeout), wait, or specify a completion callback.
  7. Get the output - After inference is completed, get the output memory or read the memory you provided earlier. Do this with the InferRequest GetBlob API.

For more information about integrating the Inference Engine in your application, see How to integrate the Inference Engine in your application.

Using Inference Engine Samples

The Inference Engine sample applications are simple console applications that demonstrate how to use Intel's Deep Learning Inference Engine in your applications.

Samples in the Samples Directory

The following sample applications are available in the samples directory in the Inference Engine installation directory:

CPU Extensions: Library with topology-specific layers, like the DetectionOutput layer used in SSD
Image Classification Sample: Inference of image classification networks like AlexNet and GoogLeNet (the sample supports only images as inputs)
Image Classification Sample, pipelined: Maximizes performance via pipelined execution (the sample supports only images as inputs)
Security Barrier Camera Sample: Vehicle Detection followed by Vehicle Attributes
Object Detection for Faster R-CNN Sample: Inference of object detection networks like Faster R-CNN (the sample supports only images as inputs)
Image Segmentation Sample: Inference of image segmentation networks like FCN8 (the sample supports only images as inputs)
Object Detection for SSD Demonstration, Async API Performance Showcase: Demonstration application for SSD-based object detection networks, new Async API performance showcase, and simple OpenCV interoperability (supports video and camera inputs)
Object Detection for SSD Sample: Inference of object detection networks based on SSD; this sample is a simplified version that supports only images as inputs
Neural Style Transfer Sample: Style transfer sample (the sample supports only images as inputs)
Hello Infer Request Classification Sample: Inference of image classification networks via the Infer Request API (the sample supports only images as inputs)
Interactive Face Detection Sample: Face Detection coupled with Age-Gender and Head-Pose; supports video and camera inputs
Security Barrier Camera Example: Supports images/video and camera inputs
Validation Application: Infers a pack of images, reporting total accuracy (only images as inputs)

Samples That Support Pre-Trained Models Shipped With the Product

Several pre-trained models are provided with the product. The list below shows, for each model, the sample it is used with and the devices that support it. The samples are available in <INSTALL_DIR>/deployment_tools/inference_engine/samples.

face-detection-adas-0001 (Interactive Face Detection Sample): CPU, Intel® Integrated Graphics, Intel® Movidius™ Myriad™ 2 VPU
age-gender-recognition-retail-0013 (Interactive Face Detection Sample): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU, Intel® Movidius™ Myriad™ 2 VPU
head-pose-estimation-adas-0001 (Interactive Face Detection Sample): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU, Intel® Movidius™ Myriad™ 2 VPU
vehicle-license-plate-detection-barrier-0007 (Security Barrier Camera Sample): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU, Intel® Movidius™ Myriad™ 2 VPU
vehicle-attributes-recognition-barrier-0010 (Security Barrier Camera Sample): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU, Intel® Movidius™ Myriad™ 2 VPU
license-plate-recognition-barrier-0001 (Security Barrier Camera Sample): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU, Intel® Movidius™ Myriad™ 2 VPU
person-detection-retail-0001 (Object Detection Sample): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU
person-detection-retail-00012 (any sample that supports SSD-based models): CPU, Intel® Integrated Graphics, Intel® Movidius™ Myriad™ 2 VPU
face-detection-retail-0004 (any sample that supports SSD-based models): CPU, Intel® Integrated Graphics, HETERO:FPGA,CPU, Intel® Movidius™ Myriad™ 2 VPU
person-vehicle-bike-detection-crossroad-0066 (any sample that supports SSD-based models): CPU, Intel® Integrated Graphics, Intel® Movidius™ Myriad™ 2 VPU

Inferring Your Model with the Inference Engine Samples

Building the Sample Applications on Linux

Supported Linux build environment:

  • Ubuntu* 16.04 LTS 64-bit or CentOS* 7.4 64-bit
  • GCC* 5.4.0 (for Ubuntu* 16.04) or GCC* 4.8.5 (for CentOS* 7.4)
  • CMake* version 2.8 or higher.
  • OpenCV* 3.3 or later (required for some samples and demonstrations). Use the Intel® CV SDK installation download and instructions to complete this installation.

Follow these steps to prepare your Linux computer for the samples:

  1. Go to the samples directory: <INSTALL_DIR>/deployment_tools/inference_engine/samples/
  2. Create a directory. This example uses a directory named build
    mkdir build
  3. Go to the new directory:
    cd build
  4. Run CMake to generate the Make files with or without debug information:
    • Without debug information:
      cmake -DCMAKE_BUILD_TYPE=Release <path_to_inference_engine_samples_directory>
    • With debug information:
      cmake -DCMAKE_BUILD_TYPE=Debug <path_to_inference_engine_samples_directory>
  5. Build the application:
    make

The sample application binaries are in <INSTALL_DIR>/deployment_tools/inference_engine/samples/intel64/Release/

Building the Sample Applications on Windows*

Supported Windows build environment:

Follow these steps to prepare your Windows computer for the samples:

  1. Go to the samples directory.
  2. Double-click create_msvc_solution.bat
  3. Open Microsoft Visual Studio* 2015
  4. Build samples\build\Samples.sln

Set Your Environment Variables

Use these steps to make sure your application can find the Inference Engine libraries.

For Linux, execute the following command to set the environment variable:

source <INSTALL_DIR>/deployment_tools/inference_engine/bin/setupvars.sh

where <INSTALL_DIR> is the Intel CV SDK installation directory.

Running the Samples

Image Classification Sample

Description

The Image Classification sample application does inference using image classification networks, like AlexNet* and GoogLeNet*. The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image and outputs data to the standard output stream.

Running the Application

Running the application with the -h option results in the message:

$ ./classification_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
classification_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or path to an image files: a .ubyte file for LeNet*
                            and a .bmp file for the other networks.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics, or MYRIAD is acceptable. Sample will look for a suitable plugin for device specified
    -nt "<integer>"         
                            Number of top results (default 10)
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image using a trained AlexNet network on Intel® Processors:

$ ./classification_sample -i <path_to_image>/cat.bmp -m <path_to_model>/alexnet_fp32.xml

Output Description

By default the application outputs the top-10 inference results. Add the -nt option to the previous command to modify the number of top output results. For example, to get the top-5 results on Intel® HD Graphics, use the command:

$ ./classification_sample -i <path_to_image>/cat.bmp -m <path_to_model>/alexnet_fp32.xml -nt 5 -d GPU

Image Classification - Pipelined

Description

This sample demonstrates how to build and execute inference in pipelined mode, using classification networks as an example.

The pipelined mode can increase picture throughput. The latency of a single inference is the same as for synchronous execution, but throughput increases for the following reasons:

  • Some plugins are heterogeneous internally: data transfer, execution on a remote device, and pre-processing and post-processing on the host
  • Use of the explicit Heterogeneous plugin, with different parts of the network executed on different devices

When two or more devices are involved in the inference of one picture, creating several infer requests and starting asynchronous inference allows the devices to be utilized in the most efficient way. If two devices are involved in execution, the optimal value for -nireq is 2.

To do this efficiently, the Classification Sample Async uses a round-robin algorithm for inference requests. It starts executing the current inference request and switches to waiting for the results of the previous one. After the wait finishes, the application swaps the inference requests and repeats the procedure.

Another required aspect for good throughput is the number of iterations. Only with a large number of iterations can you emulate the application work and see performance results.

Batch mode is an attribute independent of the pipelined mode; the pipelined mode works efficiently with any batch size.

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. The application then creates the number of infer requests specified by the -nireq parameter and loads pictures for inference.

Then, in a loop, it starts inference for the current infer request and switches to waiting for another one. When results are ready, the inference requests are swapped.

When inference is done, the application outputs data to the standard output stream.

Running the Application

Running the application with the -h option results in the message:

./classification_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
classification_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or path to an image files: a .ubyte file for LeNet
                            and a .bmp file for the other networks.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"
                            Optional. Absolute path to library with Intel® MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics or MYRIAD is acceptable. Sample will look for a suitable plugin for device specified
    -nt "<integer>"         
                            Number of top results (default 10)
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image using a trained AlexNet network on FPGA with a fallback to Intel® Processors:

$ ./classification_sample_async -i <path_to_image>/cat.bmp -m <path_to_model>/alexnet_fp32.xml -nt 5 -d HETERO:FPGA,CPU -nireq 2 -ni 200

Output Description

By default the application outputs the top-10 inference results for each infer request. In addition, it reports the throughput value measured in frames per second.


Security Barrier Camera Sample

Description

Showcases Vehicle Detection, followed by Vehicle Attributes and License Plate Recognition applied on top of the Vehicle Detection results. The corresponding models are in the intel_models directory:

  • vehicle-license-plate-detection-barrier-0007: The primary detection network, which finds vehicles and license plates
  • vehicle-attributes-recognition-barrier-0010: Executed on top of the results from vehicle-license-plate-detection-barrier-0007. It reports general vehicle attributes, like vehicle type and color, where type is something like car, van, or bus.
  • license-plate-recognition-barrier-0001: Executed on top of the results from vehicle-license-plate-detection-barrier-0007. It reports a string for each recognized license plate. For topology details, see the descriptions in the intel_models directory.

Other demonstration objectives:

  • Show images/video/camera as inputs, via OpenCV*
  • Show an example of simple network pipelining: Attributes and LPR networks are executed on top of the Vehicle Detection results
  • Show vehicle attributes and license plate information for each detected vehicle

How it Works

The application reads command line parameters and loads the specified networks. The Vehicle/License-Plate Detection network is required, and the other two are optional.

Upon getting a frame from OpenCV's VideoCapture, the app performs inference of vehicles/license plates, then performs two more inferences using the Vehicle Attributes and LPR networks (if they are specified on the command line) and displays the results.

Running the Application

Running the application with the -h option results in the message:

$ ./security_barrier_sample -h 
InferenceEngine:
        API version ............ 1.0
    [ INFO ] Parsing input parameters
    interactive_vehicle_detection [OPTION]
    Options:
        -h                         Print a usage message.
        -i "<path>"                Required. Path to a video or image file. Default value is "cam" to work with camera.
        -m "<path>"                Required. Path to the Vehicle/License-Plate Detection model (.xml) file.
        -m_va "<path>"             Optional. Path to the Vehicle Attributes model (.xml) file.
        -m_lpr "<path>"            Optional. Path to the License-Plate Recognition model (.xml) file.
          -l "<absolute_path>"     For Intel® MKL-DNN (CPU)-targeted custom layers, if any. Absolute path to a shared library with the kernels impl.
              Or
          -c "<absolute_path>"     For Intel® Integrated Graphics-targeted custom kernels, if any. Absolute path to the xml file with the kernels desc.
        -d "<device>"              Specify the target device for Vehicle Detection (CPU, Intel® Integrated Graphics, FPGA, MYRYAD, or HETERO).
        -d_va "<device>"           Specify the target device for Vehicle Attributes (CPU, Intel® Integrated Graphics, FPGA, MYRYAD, or HETERO).
        -d_lpr "<device>"          Specify the target device for License Plate Recognition (CPU, Intel® Integrated Graphics, FPGA, MYRYAD, or HETERO).
        -pc                        Enables per-layer performance statistics.
        -r                         Output Inference results as raw values.
        -t                         Probability threshold for Vehicle/Licence-Plate detections.

Running the application with an empty list of options results in an error message and the usage list above.

Demonstration Output

The demonstration uses OpenCV* to display the resulting frame with detections rendered as bounding boxes and text:

Automobile driving


Object Detection for Faster R-CNN Sample

Description

VGG16-Faster-RCNN is a public CNN that can be easily obtained from GitHub. 

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image and outputs data to the standard output stream.

Downloading and Converting a Caffe* Model

  1. Download test.prototxt from https://raw.githubusercontent.com/rbgirshick/py-faster-rcnn/master/models/pascal_voc/VGG16/faster_rcnn_end2end/test.prototxt
  2. Download the pretrained models from https://dl.dropboxusercontent.com/s/o6ii098bu51d139/faster_rcnn_models.tgz?dl=0
  3. Unpack the archive and make sure you have the file named VGG16_faster_rcnn_final.caffemodel.

For correctly converting the source model, run the Model Optimizer with the extension for the Python proposal layer. To convert the source model:

python3 ${MO_ROOT_PATH}/mo_caffe.py --input_model <path_to_model>/VGG16_faster_rcnn_final.caffemodel --input_proto <path_to_model>/deploy.prototxt --extensions <path_to_object_detection_sample>/fasterrcnn_extensions

Running the Application

Running the application with the -h option results in the message:

$ ./object_detection_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
object_detection_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path>"
                            Required. Path to an image file.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"    
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU or Intel® Integrated Graphics is acceptable. The sample looks for a suitable plugin for the specified device.
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

Use the following command to do inference on Intel® Processors on an image using a trained Faster R-CNN network:

$ ./object_detection_sample -i <path_to_image>/inputImage.bmp -m <path_to_model>/faster-rcnn.xml -d CPU

Output Description

The application outputs an image named out_0.bmp with detected objects enclosed in rectangles. It outputs the list of classes of the detected objects along with the respective confidence values and the coordinates of the rectangles to the standard output stream.

Using this Sample with the Intel Person Detection Model

This model has a non-default (for Faster-RCNN) output layer name. To score it correctly, add the option --bbox_name detector/bbox/ave_pred to the command line.

Usage example:

./object_detection_sample -i /home/user/people.jpg -m <ie_path>/intel_models/person-detection-retail-0001/FP32/person-detection-retail-0001.xml --bbox_name detector/bbox/ave_pred -d CPU

Object Detection SSD, Async API Performance Showcase Sample

Description

This demonstration showcases Object Detection with SSD and the new Async API. The Async API can improve the overall frame rate of the application: rather than waiting for inference to complete, the application continues doing work on the host while the accelerator is busy. Specifically, this demonstration keeps two parallel infer requests; while the current request is processed, the input frame for the next one is captured. This essentially hides the latency of capturing, so the overall frame rate is determined by MAXIMUM(detection time, input capturing time) rather than SUM(detection time, input capturing time).

The technique can be generalized to any available parallel slack, such as doing inference while simultaneously encoding the resulting (previous) frames, or running further inference, like emotion detection on top of the face detection results.

Be aware of performance caveats, though. When running tasks in parallel, avoid over-subscribing shared compute resources. For example, if inference is performed on the FPGA and the CPU is mostly idle, it makes sense to run parallel tasks on the CPU. But when doing inference on Intel® Integrated Graphics, there is little gain from, for example, encoding the resulting video on the same Intel® Integrated Graphics in parallel, because the device is already busy.

For more performance implications and tips for the Async API, see the Optimization Guide.

Other demonstration objectives:

  • Video as input support via OpenCV*
  • Visualization of the resulting bounding boxes and text labels (from the .labels file) or class number (if no file is provided)
  • OpenCV* provides resulting bounding boxes, labels, and other information. You can copy and paste this code without pulling Inference Engine samples helpers into your application.
  • Demonstrate the Async API in action. For this, the demonstration features two modes with a Tab key toggle.
    • Old-style "Sync" way - The frame capturing with OpenCV* executes back-to-back with Detection
    • "Truly Async" way - The Detection is performed on the current frame, while the OpenCV* captures the next frame.

How it Works

The application reads command line parameters and loads a network to the Inference Engine. Upon getting a frame from OpenCV*'s VideoCapture, it performs inference and displays the results.

New "Async API" operates with new notion of the "Infer Request" that encapsulates the inputs/outputs and separates scheduling and waiting for result, next section. And here what makes the performance look different:

  1. In the default ("Sync") mode the frame is captured and then immediately processed, below in pseudo-code:
    while(true) {
        capture frame
        populate CURRENT InferRequest
        start CURRENT InferRequest //this call is async and returns immediately
        wait for the CURRENT InferRequest
        display CURRENT result
    }
    This is a reference implementation in which the new Async API is used in a serialized/synchronous fashion.
  2. In the "true" Async mode, the NEXT frame is captured in the main thread while the CURRENT frame is processed:
    while(true) {
            capture frame
            populate NEXT InferRequest
            start NEXT InferRequest //this call is async and returns immediately
                wait for the CURRENT InferRequest (processed in a dedicated thread)
                display CURRENT result
            swap CURRENT and NEXT InferRequests
        }
    In this case, the NEXT request is populated in the main (application) thread while the CURRENT request is processed; waiting for the CURRENT request happens in a dedicated thread, internal to the Inference Engine runtime.
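The pseudo-code above maps directly onto the Infer Request API. The following is a minimal C++ sketch of the CURRENT/NEXT swap, assuming an ExecutableNetwork has already been created; capture_frame, populate, and show are hypothetical helpers standing in for the OpenCV* capture, preprocessing, and display code:

// Double-buffered infer requests (illustrative sketch, not the demo's exact code).
auto current_request = executable_network.CreateInferRequest();
auto next_request = executable_network.CreateInferRequest();

populate(current_request, capture_frame());   // hypothetical helpers
current_request.StartAsync();
while (true) {
    populate(next_request, capture_frame());  // prepare NEXT while CURRENT runs
    next_request.StartAsync();                // returns immediately
    current_request.Wait(IInferRequest::WaitMode::RESULT_READY);
    show(current_request);                    // display CURRENT result
    std::swap(current_request, next_request); // NEXT becomes CURRENT
}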

Async API

In this release, the Inference Engine offers a new API based on the notion of Infer Requests. With this API, requests encapsulate input and output allocation. You access the blob with the GetBlob method.

You can execute a request asynchronously in the background and wait until you need the result. In the meantime your application can continue:

// load plugin for the device as usual
  auto enginePtr = PluginDispatcher({"../../../lib/intel64", ""}).getSuitablePlugin(
                getDeviceFromStr("GPU"));
// load network
CNNNetReader network_reader;
network_reader.ReadNetwork("Model.xml");
network_reader.ReadWeights("Model.bin");
// populate inputs etc
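// (the network is loaded into the plugin and async_infer_request is created from the resulting executable network; not shown here)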
auto input = async_infer_request.GetBlob(input_name);
...
// start the async infer request (puts the request to the queue and immediately returns)
async_infer_request.StartAsync();
// Continue execution on the host until you need the request results
//...
async_infer_request.Wait(IInferRequest::WaitMode::RESULT_READY);
auto output = async_infer_request.GetBlob(output_name);

There is no direct way to measure the execution time of an infer request that is running asynchronously, unless you measure the Wait executed immediately after the StartAsync. But this essentially means serialized, synchronous execution.

This is what the sample does for the default "SYNC" mode and reports as the Detection time/fps message on the screen. In the truly asynchronous ("ASYNC") mode, the host continues execution in the master thread in parallel with the infer request. If the request completes before Wait is called in the main thread (that is, earlier than OpenCV* has decoded a new frame), reporting the time between StartAsync and Wait would obviously be incorrect. That is why the inference speed is not reported in the "ASYNC" mode.
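As an illustration, here is a minimal sketch of the SYNC-mode measurement described above, assuming async_infer_request was created as in the previous snippet (variable names are illustrative; requires <chrono>):

// StartAsync immediately followed by Wait serializes the request,
// so the elapsed time is the inference latency reported in SYNC mode.
auto t0 = std::chrono::high_resolution_clock::now();
async_infer_request.StartAsync();
async_infer_request.Wait(IInferRequest::WaitMode::RESULT_READY);
auto t1 = std::chrono::high_resolution_clock::now();
double detection_ms =
    std::chrono::duration_cast<std::chrono::duration<double, std::milli>>(t1 - t0).count();
// In ASYNC mode this interval would also include host-side work (e.g. capturing the
// next frame), so the demo does not report it.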

For more information about the new, request-based Inference Engine API, including ASYNC execution, see Integrate with customer application New Request API.

Running the Application

Running the application with the -h option results in the message:

$ ./object_detection_demo_ssd_async -h
InferenceEngine: 
    API version ............ [version]
    Build .................. 
object_detection_demo_ssd_async [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "[path]"
                            Required. Path to a video file. Use "cam" to capture input from the camera.
    -m "[path]"             
                            Required. Path to an .xml file with a trained model.
        -l "[absolute_path]"    
                            Optional. Absolute path to library with Intel® MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "[absolute_path]"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -d "[device]"
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics, FPGA, and Intel® Movidius™ Myriad™ 2 Vision Processing Unit are accepted.
    -pc
                            Enables per-layer performance report.
    -t
                            Probability threshold for detections (default is 0.5).
    -r
                            Output inference results as raw values to the console.

Running the application with an empty list of options results in an error message and the usage list above.

Use the following command to do inference on Intel® Integrated Graphics with an example pre-trained GoogLeNet-based SSD* available at https://software.intel.com/file/609199/download:

Command Description

After reading through this demonstration, use this command to perform inference on Intel® Integrated Graphics with the SSD model you download from https://software.intel.com/file/609199/download:

$ ./object_detection_demo_ssd_async -i <path_to_video>/inputVideo.mp4 -m <path_to_model>/ssd.xml -d GPU

The network must be converted from the Caffe* format (*.prototxt + *.caffemodel) to the Inference Engine format (*.xml + *.bin) before using this command. See the Model Optimizer Developer Guide.

The only GUI knob is the Tab key, which switches between the synchronized execution and the true Async mode.

Output Description

The output uses OpenCV* to display the resulting frame with detections rendered as bounding boxes and labels, if provided. In default mode, the sample reports:

  • OpenCV* time: Frame decoding + time to render the bounding boxes, labels, and display of the results.
  • Detection time: Inference time for the object detection network. This is reported in SYNC mode.
  • Wallclock time: The combined application-level performance.

Object Detection with SSD-VGG Sample

Description

How to run the Object Detection sample application, which does inference using object detection networks like SSD-VGG on Intel® Processors and Intel® HD Graphics.

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image and outputs data to the standard output stream.

Running the Application

Running the application with the -h option results in the message:

$./object_detection_sample_ssd -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
object_detection_sample_ssd [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path>"
                            Required. Path to an image file.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"    
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics or MYRIAD is acceptable. The sample looks for a suitable plugin for the specified device.
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

Use the following command to do inference on Intel® Processors on an image using a trained SSD network:

$ ./object_detection_sample_ssd -i <path_to_image>/inputImage.bmp -m <path_to_model>/VGG_ILSVRC2016_SSD.xml -d CPU

Output Description

The application outputs an image named out_0.bmp with detected objects enclosed in rectangles. It outputs the list of classes of the detected objects along with the respective confidence values and the coordinates of the rectangles to the standard output stream.


Neural Style Transfer Sample

Description

How to build and run the Neural Style Transfer sample (NST sample) application, which does inference using models of style transfer topology.

Running the Application

Running the application with the -h option results in the message:

$ ./style_transfer_sample -h
InferenceEngine:
    API version ............ <version>
    Build .................. <number>
style_transfer_sample [OPTION]
Options:
    -h
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or paths to image files: a .ubyte file for LeNet
                            and a .bmp file for the other networks.
    -m "<path>"
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"
                            Path to a plugin directory.
    -p "<name>"
                            Plugin name. For example, Intel® MKL-DNN. If this parameter is specified, the sample looks for this plugin only
    -d "<device>"
                            Specify the target device to infer on; CPU or Intel® Integrated Graphics is acceptable. The sample looks for a suitable plugin for the specified device.
    -nt "<integer>"
                            Number of top results (default 10)
    -ni "<integer>"
                            Number of iterations (default 1)
    -pc
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image using a trained NST network on Intel® Processors, use the following command:

$ ./style_transfer_sample -i <path_to_image>/cat.bmp -m <path_to_model>/1_decoder_FP32.xml

Output Description

The application outputs one or more styled images, starting with a file named out1.bmp, which are redrawn in the style of the model used for inference. The style of the output images depends on the model used for the sample.


Hello Infer Request Classification

Description

How to run the Hello Infer Classification sample application. The sample is a simplified version of the Image Classification Sample and is intended to demonstrate how to use the new Infer Request API of the Inference Engine in applications. See Integrate with customer application New Request API for details.

Running the Application

To do inference on an image using a trained AlexNet network on Intel® Processors:

$ ./hello_request_classification <path_to_model>/alexnet_fp32.xml <path_to_image>/cat.bmp CPU
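Internally, the sample follows the request-based flow. Below is a minimal sketch of that flow, assuming the model and image from the command above; it is an illustration under those assumptions, not the sample's exact code:

#include <inference_engine.hpp>
using namespace InferenceEngine;

int main() {
    // Load the CPU plugin, read the IR, and run a single synchronous infer request.
    PluginDispatcher dispatcher({"../../../lib/intel64", ""});
    InferencePlugin plugin(dispatcher.getSuitablePlugin(TargetDevice::eCPU));

    CNNNetReader reader;
    reader.ReadNetwork("alexnet_fp32.xml");
    reader.ReadWeights("alexnet_fp32.bin");

    ExecutableNetwork executable_network = plugin.LoadNetwork(reader.getNetwork(), {});
    InferRequest infer_request = executable_network.CreateInferRequest();

    auto input = infer_request.GetBlob(reader.getNetwork().getInputsInfo().begin()->first);
    // ... fill 'input' with the image data (cat.bmp), elided here ...
    infer_request.Infer();   // synchronous inference
    auto output = infer_request.GetBlob(reader.getNetwork().getOutputsInfo().begin()->first);
    // ... sort the output scores and print the top-10 classes ...
    return 0;
}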

Output Description

The top-10 inference results

Interactive Face Detection

Description

Showcases the Object Detection task applied to face recognition using a sequence of neural networks. The Async API can improve the overall frame rate of the application because the application can continue operating while the accelerator is busy. This demonstration maintains two parallel inference requests, for the Age Gender and Head Pose detection networks, that are run simultaneously.

Other demonstration objectives:

  • Video as input support via OpenCV*.
  • Visualization of the resulting face bounding boxes from Face Detection network.
  • Visualization of age gender and head pose information for each detected face.
  • OpenCV* provides resulting bounding boxes, labels, and other information. You can copy and paste this code without pulling Inference Engine sample helpers into your application.

How it Works

  1. The application loads up to three networks, depending on the -d option.
  2. The application gets a frame from OpenCV*'s video capture.
  3. The application performs inference on the frame with the Face Detection network.
  4. The application performs two simultaneous inferences, using the Age Gender and Head Pose detection networks, if these are specified in the command line.
  5. The application displays the results.

The new Async API operates with the new notion of the Infer Request, which encapsulates the inputs/outputs and separates scheduling from waiting for the result. This changes the performance, as follows:

In the default mode (Sync mode), the frame is captured and immediately processed:

while(true) {
    capture frame
    populate FaceDetection InferRequest
    wait for the FaceDetection InferRequest
    populate AgeGender InferRequest using dyn batch technique
    populate HeadPose InferRequest using dyn batch technique
    wait AgeGender
    wait HeadPose
    display detection results
}
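The "two simultaneous inferences" step can be expressed with the Async API roughly as follows. This is a minimal sketch, assuming age_gender_request and head_pose_request were created from their respective loaded networks and populated with the detected face crops (dynamic batching omitted):

// Kick off both secondary networks in parallel, then wait for each result.
age_gender_request.StartAsync();
head_pose_request.StartAsync();

age_gender_request.Wait(IInferRequest::WaitMode::RESULT_READY);
head_pose_request.Wait(IInferRequest::WaitMode::RESULT_READY);

// Both outputs are now ready; read them with GetBlob() and render the results.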

Running the Application

Running the application with the -h option results in the message:

$ ./interactive_face_detection -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
interactive_face_detection [OPTION]
Options:
    -h                         Print a usage message.
    -i "<path>"                Optional. Path to an video file. Default value is "cam" to work with camera.
    -m "<path>"                Required. Path to an .xml file with a trained face detection model.
    -m_ag "<path>"             Optional. Path to an .xml file with a trained age gender model.
    -m_hp "<path>"             Optional. Path to an .xml file with a trained head pose model.
      -l "<absolute_path>"     Required for Intel® MKL-DNN (CPU)-targeted custom layers.Absolute path to a shared library with the kernels impl.
          Or
      -c "<absolute_path>"     Required for Intel® Integrated Graphics-targeted custom kernels.Absolute path to the xml file with the kernels desc.
    -d "<device>"              Specify the target device for Face Detection (CPU, Intel® Integrated Graphics, FPGA, or MYRYAD. The sample looks for a suitable plugin for the specified device.
    -d_ag "<device>"           Specify the target device for Age Gender Detection (CPU, Intel® Integrated Graphics, FPGA, or MYRYAD. The sample looks for a suitable plugin for the specified device.
    -d_hp "<device>"           Specify the target device for Head Pose Detection (CPU, Intel® Integrated Graphics, FPGA, or MYRYAD. The sample looks for a suitable plugin for the specified device.
    -pc                        Enables per-layer performance report.
    -r                         Inference results as raw values.
    -t                         Probability threshold for detections.

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on Intel® Integrated Graphics with the example pre-trained GoogLeNet-based SSD* used in the Object Detection SSD demonstration above:

./object_detection_demo_ssd_async -i <path_to_video>/inputVideo.mp4 -m <path_to_model>/ssd.xml -d GPU

Before using this command, use the Model Optimizer to convert the network from the Caffe* format (*.prototxt + *.caffemodel) to the Inference Engine format (*.xml + *.bin).

Demonstration Output

The demonstration uses OpenCV* to display the resulting frame with detections that are rendered as bounding boxes. Labels are included if available. In default mode, the sample reports:

  • OpenCV* time: frame decoding + time to render the bounding boxes, labels, and displaying the results
  • Face detection time: inference time for the face Detection network
  • Age Gender + Head Pose time: combined inference time of simultaneously executed age gender and head pose networks

Image Segmentation Sample

Description

How to run the Image Segmentation sample application, which does inference using image segmentation networks like FCN8.

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image.

Running the Application

Running the application with the -h option results in the message:

$ ./segmentation_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
segmentation_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or paths to image files: a .ubyte file for LeNet
                            and a .bmp file for the other networks.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"    
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU or Intel® Integrated Graphics is acceptable. The sample looks for a suitable plugin for the specified device.
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image on Intel® Processors using a trained FCN8 network:

$ ./segmentation_sample -i <path_to_image>/inputImage.bmp -m <path_to_model>/fcn8.xml

Output Description

The application outputs a segmented image named out.bmp.

How to Integrate the Inference Engine in Your Application

  • This section describes the API. For more information about the APIs, see the offline documentation included in your package. To locate the current API:
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ where <INSTALL_DIR> is the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Integrating Inference Engine in Your Application from the contents.
  • This document refers to APIs from previous releases as "legacy" API. It is best to stop using the legacy API since it will be removed in a future product release. To locate the legacy API:
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ under the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Integrating Inference Engine in Your Application (legacy API) from the contents.
  • Complete API documentation is also in the full offline package documentation.
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ under the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Open Data Structures from the menu at the top of the screen.

Integration With the API

This section provides a high-level description of the process of integrating the Inference Engine into your application. See Using Inference Engine Samples for examples of using the Inference Engine in applications.

Using the Inference Engine API in Your Code

The core libinference_engine.so library implements loading and parsing a model Intermediate Representation, and triggers inference using a specified plugin. The core library has the following API:

  • InferenceEngine::IInferencePlugin - The main plugin interface. Every Inference Engine plugin implements this interface. Use it through an InferenceEngine::InferenceEnginePluginPtr instance.
  • InferenceEngine::PluginDispatcher - This class finds a suitable plugin for a specified device in the given directories.
  • InferenceEngine::CNNNetReader
  • InferenceEngine::CNNNetwork
  • InferenceEngine::Blob, InferenceEngine::TBlob
  • InferenceEngine::BlobMap
  • InferenceEngine::InputInfo, InferenceEngine::InputsDataMap 

The Integration Process

  1. Load a plugin by creating an instance of InferenceEngine::InferenceEnginePluginPtr. Specify the plugin or let the Inference Engine choose it with InferenceEngine::PluginDispatcher. See the selectPlugin() function in the samples.
    InferenceEngine::PluginDispatcher dispatcher(pluginDirs);
    InferenceEngine::InferenceEnginePluginPtr enginePtr (dispatcher.getSuitablePlugin(TargetDevice::eCPU));
  2. Create an Intermediate Representation reader by creating an instance of InferenceEngine::CNNNetReader and read a model Intermediate Representation:
    auto netBuilder = new InferenceEngine::CNNNetReader();
    netBuilder->ReadNetwork("Model.xml");
    netBuilder->ReadWeights("Model.bin");
  3. Request information about inputs (an image and any other input data required), using the InferenceEngine::CNNNetReader::getNetwork() and InferenceEngine::CNNNetwork::getInputsInfo() methods. Set the input number format (precision) using InferenceEngine::InputInfo::setInputPrecision to match the input data format (precision). Allocate input blobs of the appropriate types and feed an image and the input data to the blobs:
    /** Taking information about all topology inputs **/
    InferenceEngine::InputsDataMap inputInfo(netBuilder.getNetwork().getInputsInfo());
    /** Stores all input blobs data **/
    InferenceEngine::BlobMap inputBlobs;
    /** Iterating over all input blobs **/
    for (auto & item : inputInfo) {
        /** Creating input blob **/
        item.second->setInputPrecision(Precision::U8);
        InferenceEngine::TBlob<unsigned char>::Ptr input;
        input = InferenceEngine::make_shared_blob<unsigned char, InferenceEngine::SizeVector>(Precision::U8, item.second->getDims());
        input->allocate();
        inputBlobs[item.first] = input;
        /** Fill input tensor with planes. First b channel, then g and r channels **/
        ...
    }
  4. Request information about outputs, using the InferenceEngine::CNNNetReader::getNetwork() and InferenceEngine::CNNNetwork::getOutputsInfo() methods. Allocate output blobs of the appropriate types:
    InferenceEngine::OutputsDataMap outputInfo(netBuilder.getNetwork().getOutputsInfo());
    InferenceEngine::BlobMap outputBlobs;
    for (auto & item : outputInfo) {
        InferenceEngine::TBlob<float>::Ptr output;
        output = InferenceEngine::make_shared_blob<float, InferenceEngine::SizeVector>(Precision::FP32, item.second->dims);
        output->allocate();
        outputBlobs[item.first] = output;
    }
  5. Load the model to the plugin using InferenceEngine::IInferencePlugin::LoadNetwork():
    InferenceEngine::StatusCode status = enginePtr->LoadNetwork(netBuilder.getNetwork(), &resp);
    if (status != InferenceEngine::OK) {
        throw std::logic_error(resp.msg);
    }
  6. Do inference by calling the InferenceEngine::IInferencePlugin::Infer method:
    enginePtr->Infer(inputBlobs, outputBlobs, &resp);
    
  7. Go over the output blobs and process the results.
    /** Pointer to the output blob **/
    const TBlob<float>::Ptr fOutput = std::dynamic_pointer_cast<TBlob<float>>(outputBlobs.begin()->second);
    /** fOutput->data()[] - accessing output blob data **/
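    For a classification network, for instance, step 7 could look roughly like the following. This is a minimal sketch under the assumption of a single output blob whose elements are per-class scores (requires <iostream>):
    // Walk the output scores and report the best-scoring class (illustrative only).
    const float* scores = fOutput->data();
    size_t num_classes = fOutput->size();
    size_t best_class = 0;
    for (size_t c = 1; c < num_classes; ++c) {
        if (scores[c] > scores[best_class]) {
            best_class = c;
        }
    }
    std::cout << "Top-1 class id: " << best_class
              << " (score " << scores[best_class] << ")" << std::endl;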

Building Your Application

For details about building your application, see the CMake files for the sample applications. All samples reside in the samples directory in the Inference Engine installation directory.

Running the Application

Before running compiled binary files:

  • Make sure your application can find the Inference Engine libraries. On Linux* operating systems, the LD_LIBRARY_PATH environment variable specifies the library directories. Update LD_LIBRARY_PATH with paths to the directories in the Inference Engine installation directory in which the libraries reside.
  • Add a path to the directory containing the core and plugin libraries:
    • For Inference Engine installed within the Intel® CV SDK package:
      $ export LD_LIBRARY_PATH=/opt/intel/computer_vision_sdk_<version>/inference_engine/lib/<linux_version>/intel64:$LD_LIBRARY_PATH
      
    • For Intel's Deep Learning Deployment Toolkit installation:
      $ export LD_LIBRARY_PATH=/opt/intel/deep_learning_sdk_<version>/deployment_tools/inference_engine/lib/<linux_version>/intel64:$LD_LIBRARY_PATH
      
  • Add paths to the directories containing the required third-party libraries:
    • For Inference Engine installed within the Intel® CV SDK package:
      $ export LD_LIBRARY_PATH=/opt/intel/computer_vision_sdk_<version>/inference_engine/external/mklml_lnx/lib:$LD_LIBRARY_PATH
      $ export LD_LIBRARY_PATH=/opt/intel/computer_vision_sdk_<version>/inference_engine/external/cldnn/lib:$LD_LIBRARY_PATH
      
    • For Intel Deep Learning Deployment Toolkit installation:
      $ export LD_LIBRARY_PATH=/opt/intel/deep_learning_sdk_<version>/deployment_tools/external/mklml_lnx/lib:$LD_LIBRARY_PATH
      $ export LD_LIBRARY_PATH=/opt/intel/deep_learning_sdk_<version>/deployment_tools/external/cldnn/lib:$LD_LIBRARY_PATH

As an alternative, use the following scripts in the Inference Engine directories of the Intel® CV SDK and Deep Learning Deployment Toolkit installations, respectively:

/opt/intel/computer_vision_sdk_<version>/bin/setupvars.sh
/opt/intel/deep_learning_sdk_<version>/deployment_tools/inference_engine/bin/setvars.sh

To run compiled applications on Microsoft* Windows* OS, make sure that the Microsoft* Visual C++ 2015 Redistributable and Intel® C++ Compiler 2017 Redistributable packages are installed and that the <INSTALL_DIR>/bin/intel64/Release/*.dll files are placed in the application directory or accessible via the %PATH% environment variable.


Integration With the Legacy API

NOTE: The subject of this section is Legacy APIs. Legacy APIs are deprecated and will be removed in a future release. It is best to use the current APIs.

This section provides a high-level description of the process of integrating the Inference Engine into your application. See Using Inference Engine Samples for examples of using the Inference Engine in applications.

Using the Inference Engine API in Your Code

The core libinference_engine.so library implements loading and parsing a model Intermediate Representation, and triggers inference using a specified plugin. The core library has the following API:

  • InferenceEngine::IInferencePlugin - The main plugin interface. Every Inference Engine plugin implements this interface. Use it through an InferenceEngine::InferenceEnginePluginPtr instance.
  • InferenceEngine::PluginDispatcher - This class finds the suitable plugin for a specified device in given directories.
  • InferenceEngine::CNNNetReader
  • InferenceEngine::CNNNetwork
  • InferenceEngine::Blob, InferenceEngine::TBlob
  • InferenceEngine::BlobMap
  • InferenceEngine::InputInfo, InferenceEngine::InputsDataMap

The Integration Process

  1. Load a plugin by creating an instance of InferenceEngine::InferenceEnginePluginPtr.
  2. Specify the plugin or let the Inference Engine choose it with InferenceEngine::PluginDispatcher. See the selectPlugin() function in the samples.
    InferenceEngine::PluginDispatcher dispatcher(pluginDirs);
    InferenceEngine::InferenceEnginePluginPtr enginePtr (dispatcher.getSuitablePlugin(TargetDevice::eCPU));
  3. Create an Intermediate Representation reader by creating an instance of InferenceEngine::CNNNetReader and read a model Intermediate Representation:
    auto netBuilder = new InferenceEngine::CNNNetReader();
    netBuilder->ReadNetwork("Model.xml");
    netBuilder->ReadWeights("Model.bin");
  4. Request information about inputs (an image and any other input data required), using the InferenceEngine::CNNNetReader::getNetwork() and InferenceEngine::CNNNetwork::getInputsInfo() methods.
  5. Set the input number format (precision) using InferenceEngine::InputInfo::setInputPrecision to match the input data format (precision). Allocate input blobs of the appropriate types and feed an image and the input data to the blobs:
    /** Taking information about all topology inputs **/
    InferenceEngine::InputsDataMap inputInfo(netBuilder.getNetwork().getInputsInfo());
    /** Stores all input blobs data **/
    InferenceEngine::BlobMap inputBlobs;
    /** Iterating over all input blobs **/
    for (auto & item : inputInfo) {
        /** Creating input blob **/
        item.second->setInputPrecision(Precision::U8);
        InferenceEngine::TBlob<unsigned char>::Ptr input;
        input = InferenceEngine::make_shared_blob<unsigned char, InferenceEngine::SizeVector>(Precision::U8, item.second->getDims());
        input->allocate();
        inputBlobs[item.first] = input;
        /** Fill input tensor with planes. First b channel, then g and r channels **/
        ...
    }
  6. Request information about outputs, using the InferenceEngine::CNNNetReader::getNetwork() and InferenceEngine::CNNNetwork::getOutputsInfo() methods. Allocate output blobs of the appropriate types:
    InferenceEngine::OutputsDataMap outputInfo(netBuilder.getNetwork().getOutputsInfo());
    InferenceEngine::BlobMap outputBlobs;
    for (auto & item : outputInfo) {
        InferenceEngine::TBlob<float>::Ptr output;
        output = InferenceEngine::make_shared_blob<float, InferenceEngine::SizeVector>(Precision::FP32, item.second->dims);
        output->allocate();
        outputBlobs[item.first] = output;
    }
  7. Load the model to the plugin using InferenceEngine::IInferencePlugin::LoadNetwork():
    InferenceEngine::StatusCode status = enginePtr->LoadNetwork(netBuilder.getNetwork(), &resp);
    if (status != InferenceEngine::OK) {
        throw std::logic_error(resp.msg);
    }
  8. Do inference by calling the InferenceEngine::IInferencePlugin::Infer method:
    enginePtr->Infer(inputBlobs, outputBlobs, &resp);
    
  9. Go over the output blobs and process the results.
    /** Pointer to the output blob **/
    const TBlob<float>::Ptr fOutput = std::dynamic_pointer_cast<TBlob<float>>(outputBlobs.begin()->second);
    /** fOutput->data()[] - accessing output blob data **/

Building Your Application

For details about building your application, see the CMake files for the sample applications. All samples reside in the samples directory in the Inference Engine installation directory.

Running the Application

Before running compiled binary files:

Make sure your application can find the Inference Engine libraries. On Linux* operating systems, the LD_LIBRARY_PATH environment variable specifies the library directories.

Update LD_LIBRARY_PATH with directory paths under the Inference Engine installation directory in which the libraries reside.

Add a path to the directory containing the core and plugin libraries:

  • For Inference Engine installed within the Intel® CV SDK package:
    $ export LD_LIBRARY_PATH=/opt/intel/computer_vision_sdk_<version>/inference_engine/lib/<linux_version>/intel64:$LD_LIBRARY_PATH
  • For Intel's Deep Learning Deployment Toolkit installation:
    $ export LD_LIBRARY_PATH=/opt/intel/deep_learning_sdk_<version>/deployment_tools/inference_engine/lib/<linux_version>/intel64:$LD_LIBRARY_PATH

Add paths to the directories containing the required third-party libraries:

  • For Inference Engine installed within the Intel® CV SDK package:
    $ export LD_LIBRARY_PATH=/opt/intel/computer_vision_sdk_<version>/inference_engine/external/mklml_lnx/lib:$LD_LIBRARY_PATH
    $ export LD_LIBRARY_PATH=/opt/intel/computer_vision_sdk_<version>/inference_engine/external/cldnn/lib:$LD_LIBRARY_PATH
    
  • For Intel Deep Learning Deployment Toolkit installation:
    $ export LD_LIBRARY_PATH=/opt/intel/deep_learning_sdk_<version>/deployment_tools/external/mklml_lnx/lib:$LD_LIBRARY_PATH
    $ export LD_LIBRARY_PATH=/opt/intel/deep_learning_sdk_<version>/deployment_tools/external/cldnn/lib:$LD_LIBRARY_PATH

As an alternative, use the scripts under the Inference Engine directories of the Intel® CV SDK and Deep Learning Deployment Toolkit installations, respectively:

/opt/intel/computer_vision_sdk_<version>/bin/setupvars.sh
/opt/intel/deep_learning_sdk_<version>/deployment_tools/inference_engine/bin/setvars.sh

To run compiled applications on Microsoft* Windows* OS, make sure that the Microsoft* Visual C++ 2015 Redistributable and Intel® C++ Compiler 2017 Redistributable packages are installed and that the <INSTALL_DIR>/bin/intel64/Release/*.dll files are in the application directory or accessible through the %PATH% environment variable.

Adding Your Own Kernels in the Inference Engine

A layer is a CNN building block implemented in the training framework, such as "Convolution" in Caffe*. A kernel is the corresponding implementation in the Inference Engine.

Plug your kernel implementations into the Inference Engine and map them to the layers in the original framework. See the Model Optimizer guide for information about how a mapping between framework's layers and Inference Engine kernels is registered.

The rest of the section covers custom kernels and how to integrate them into the Inference Engine.

Example of Custom Kernels Support in the Samples

Every sample uses the Inference Engine API to load custom kernels depending on the device type. Specifically, for the CPU this is a shared library that exports a certain interface that registers the kernels. For Intel® Integrated Graphics, it is an xml file that lists the kernels along with the parameters that the kernels accept and how these map to the specific Intermediate Representation values.

Example Custom Kernels

The "extension" directory in the "samples" dir comes with few real example of CPU-targeted kernels, like DetectionOutput (used in SSD*), etc.

The Intel® Integrated Graphics-targeted kernels are bundled into the binaries when the samples are compiled, so the sample applications can easily load them. See the cldnn_global_custom_kernels directory in the GPU plugin installation directory.

How to Implement Custom Intel® Integrated Graphics Layers

You must provide the kernel code in OpenCL C and a configuration file that connects the kernel and its parameters to the parameters of the layer.

You have two options for using the custom layer configuration file.

  • Include a section with your kernels into the global auto-loading file cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml
  • Provide a separate configuration file and load it using the IInferencePlugin::SetConfig() method with the PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as the value, before loading the network that features the custom layers:
    // Load the Intel® Integrated Graphics plugin
    InferenceEngine::InferenceEnginePluginPtr plugin_ptr(selectPlugin({…, "GPU"}));
    InferencePlugin plugin(plugin_ptr);
    // Load the Intel® Integrated Graphics Extensions
    plugin.SetConfig({{PluginConfigParams::KEY_CONFIG_FILE, "<path to the xml file>"}});

For details about the configuration parameters and OpenCL kernel see the tutorial at https://software.intel.com/en-us/cvsdk-custom-layers-support-in-inference-engine-tutorial-custom-layers-workflow

How to Implement Custom CPU Layers

The instructions below are a brief summary of the Custom Layers tutorial available at https://software.intel.com/en-us/cvsdk-custom-layers-support-in-inference-engine-tutorial-custom-layers-workflow

For more details, see the sample source.

  1. Create a custom layer factory CustomLayerFactory class.
    // custom_layer.h
    // A CustomLayerFactory class is an example layer that raises each input value to the power of 2 and doesn't change the dimensions
    class CustomLayerFactory {
    };
  2. Inherit it from the abstract class InferenceEngine::ILayerImplFactory:
    // custom_layer.h
    class CustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    };
  3. Create constructor and virtual destructor, and a data member to keep the layer info
    // custom_layer.h
    class CustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    public:
        explicit CustomLayerFactory(const CNNLayer *layer): cnnLayer(*layer) {}
    private:
        CNNLayer cnnLayer;
    };
  4. Overload and implement the abstract methods (getShapes, getImplementations) of the InferenceEngine::ILayerImplFactory class
    // custom_layer.h
    class CustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    public:
        // ... constructor and destructor
        StatusCode getShapes(const std::vector<TensorDesc>& inShapes, std::vector<TensorDesc>& outShapes, ResponseDesc *resp) noexcept override {
            if (cnnLayer == nullptr) {
                std::string errorMsg = "Cannot get cnn layer!";
                errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
                return GENERAL_ERROR;
            }
            if (inShapes.size() != 1) {
                std::string errorMsg = "Incorrect input shapes!";
                errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
                return GENERAL_ERROR;
            }
            outShapes.clear();
            outShapes.emplace_back(inShapes[0]);
            return OK;
        }
        StatusCode getImplementations(std::vector<ILayerImpl::Ptr>& impls, ResponseDesc *resp) noexcept override {
            // You can put cnnLayer into the implementation if it is necessary.
            impls.push_back(ILayerImpl::Ptr(new CustomLayerImpl()));
            return OK;
        }
    };
  5. Create your custom layer implementation CustomLayerImpl class:
    // custom_layer.h
    // A CustomLayerImpl class is an example implementation
    class CustomLayerImpl {
    };
  6. Because the layer uses the execute method to change data, inherit it from the abstract class InferenceEngine::ILayerExecImpl, and overload and implement the abstract methods of this class.
    // custom_layer.h
    // A CustomLayerImpl class is an example implementation
    class CustomLayerImpl: public ILayerExecImpl {
    public:
        explicit CustomLayerImpl(const CNNLayer *layer): cnnLayer(*layer) {}
        StatusCode getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept override;
        StatusCode init(LayerConfig& config, ResponseDesc *resp) noexcept override;
        StatusCode execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept override;
    private:
        CNNLayer cnnLayer;
    };
  7. Implement the getSupportedConfigurations to return all supported configurations for this implementation. To specify formats of data use InferenceEngine::TensorDesc:
    // custom_layer.cpp
    StatusCode CustomLayerImpl::getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept {
        try {
            // This layer can be in-place but not constant!!!
            if (cnnLayer == nullptr)
                THROW_IE_EXCEPTION << "Cannot get cnn layer";
            if (cnnLayer->insData.size() != 1 || cnnLayer->outData.empty())
                THROW_IE_EXCEPTION << "Incorrect number of input/output edges!";
            LayerConfig config;
            DataPtr dataPtr = cnnLayer->insData[0].lock();
            if (!dataPtr)
                THROW_IE_EXCEPTION << "Cannot get input data!";
            DataConfig dataConfig;
            dataConfig.inPlace = -1;
            dataConfig.constant = false;
            SizeVector order;
            for (size_t i = 0; i < dataPtr->getTensorDesc().getDims().size(); i++) {
                order.push_back(i);
            }
            // Planar formats for N dims
            dataConfig.desc = TensorDesc(dataPtr->getTensorDesc().getPrecision(),
                                         dataPtr->getTensorDesc().getDims(),
                                         {dataPtr->getTensorDesc().getDims(), order});
            config.inConfs.push_back(dataConfig);
            DataConfig outConfig;
            outConfig.constant = false;
            outConfig.inPlace = 0;
            order.clear();
            for (size_t i = 0; i < cnnLayer->outData[0]->getTensorDesc().getDims().size(); i++) {
                order.push_back(i);
            }
            outConfig.desc = TensorDesc(cnnLayer->outData[0]->getTensorDesc().getPrecision(),
                                        cnnLayer->outData[0]->getDims(),
                                        {cnnLayer->outData[0]->getDims(), order});
            config.outConfs.push_back(outConfig);
            config.dynBatchSupport = 0;
            conf.push_back(config);
            return OK;
        } catch (InferenceEngine::details::InferenceEngineException& ex) {
            std::string errorMsg = ex.what();
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return GENERAL_ERROR;
        }
    }
  8. Implement init and execute methods. init is necessary to get selected configuration and check parameters.
    // custom_layer.cpp
    StatusCode CustomLayerImpl::init(LayerConfig& config, ResponseDesc *resp) noexcept {
        StatusCode rc = OK;
        if (config.dynBatchSupport) {
            config.dynBatchSupport = 0;
            rc = NOT_IMPLEMENTED;
        }
        for (auto& input : config.inConfs) {
            if (input.inPlace >= 0) {
                input.inPlace = -1;
                rc = NOT_IMPLEMENTED;
            }
            for (auto& offset : input.desc.getBlockingDesc().getOffsetPaddingToData()) {
                if (offset) {
                    return GENERAL_ERROR;
                }
            }
            if (input.desc.getBlockingDesc().getOffsetPadding()) {
                return GENERAL_ERROR;
            }
            for (size_t i = 0; i < input.desc.getBlockingDesc().getOrder().size(); i++) {
                if (input.desc.getBlockingDesc().getOrder()[i] != i) {
                    if (i != 4 || input.desc.getBlockingDesc().getOrder()[i] != 1)
                        return GENERAL_ERROR;
                }
            }
        }
        for (auto& output : config.outConfs) {
            if (output.inPlace < 0) {
                // NOT in-place
            }
            for (auto& offset : output.desc.getBlockingDesc().getOffsetPaddingToData()) {
                if (offset) {
                    return GENERAL_ERROR;
                }
            }
            if (output.desc.getBlockingDesc().getOffsetPadding()) {
                return GENERAL_ERROR;
            }
            for (size_t i = 0; i < output.desc.getBlockingDesc().getOrder().size(); i++) {
                if (output.desc.getBlockingDesc().getOrder()[i] != i) {
                    if (i != 4 || output.desc.getBlockingDesc().getOrder()[i] != 1)
                        return GENERAL_ERROR;
                }
            }
        }
        return rc;
    }
    StatusCode CustomLayerImpl::execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept {
        if (inputs.size() != 1 || outputs.empty()) {
            std::string errorMsg = "Incorrect number of input or output edges!";
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return GENERAL_ERROR;
        }
        const float* src_data = inputs[0]->buffer();
        float* dst_data = outputs[0]->buffer();
        for (size_t o = 0; o < outputs[0]->size(); o++) {
            if (dst_data == src_data) {
                dst_data[o] *= dst_data[o];
            } else {
                dst_data[o] = src_data[o]*src_data[o];
            }
        }
        return OK;
    }
  9. Create a factory for your own primitives, inherited from the abstract class InferenceEngine::IExtension
    // custom_extension.h
    class CustomExtention : public InferenceEngine::IExtension {
    }; 
    Implement the utility methods Unload, Release, SetLogCallback:
    // custom_extension.h
    class CustomExtention : public InferenceEngine::IExtension {
    public:
        // could be used to cleanup resources
        void Unload() noexcept override {
        }
        // is used when destruction happens
        void Release() noexcept override {
            delete this;
        }
        // logging is used to track what is going on inside
        void SetLogCallback(InferenceEngine::IErrorListener &listener) noexcept override {}
    };
  10. Implement the utility method GetVersion:
    // custom_extension.h
    class CustomExtention : public InferenceEngine::IExtension {
    private:
        static InferenceEngine::Version ExtensionDescription = {
            {1, 0},             // extension API version
            "1.0",              
            "CustomExtention"   // extension description message
        };
    public:
        // gets extension version information
        void GetVersion(const InferenceEngine::Version *& versionInfo) const noexcept override {
            versionInfo = &ExtensionDescription;
        }
    }; 
    Implement main extension methods:
    // custom_extension.h
    class CustomExtention : public InferenceEngine::IExtension {
    public:
        // ... utility methods
        StatusCode getPrimitiveTypes(char**& types, unsigned int& size, ResponseDesc* resp) noexcept override {
            std::string type_name = "CustomLayer";
            types = new char *[1];
            size = 1;
            types[0] = new char[type_name.size() + 1];
            std::copy(type_name.begin(), type_name.end(), types[0]);
            types[0][type_name.size()] = '\0';
            return OK;
        }
        StatusCode getFactoryFor(ILayerImplFactory *&factory, const CNNLayer *cnnLayer, ResponseDesc *resp) noexcept override {
            if (cnnLayer->type != "CustomLayer") {
                std::string errorMsg = std::string("Factory for ") + cnnLayer->type + " wasn't found!";
                errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
                return NOT_FOUND;
            }
            factory = new CustomLayerFactory(cnnLayer);
            return OK;
        }
    };
  11. To use your custom layers, compile the code as the shared library, and then use the AddExtension method of the general plugin interface to load your primitives:
    auto extension_ptr = make_so_pointer<InferenceEngine::IExtension>("<shared lib path>");
    // Add extension to the plugin's list
    plugin.AddExtension(extension_ptr);
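    The shared library also needs an entry point through which the Inference Engine can create the extension object. The exact macro and function names in the sketch below are assumptions based on the extension examples shipped with the package; check ie_iextension.h in your installation:
    // Hypothetical exported factory function for the extension shared library
    // (macro and function names assumed; verify against the headers in your package).
    INFERENCE_EXTENSION_API(InferenceEngine::StatusCode)
    InferenceEngine::CreateExtension(InferenceEngine::IExtension *&ext,
                                     InferenceEngine::ResponseDesc *resp) noexcept {
        try {
            ext = new CustomExtention();
            return InferenceEngine::OK;
        } catch (std::exception &ex) {
            if (resp) {
                std::string err = std::string("Couldn't create extension: ") + ex.what();
                err.copy(resp->msg, sizeof(resp->msg) - 1);
            }
            return InferenceEngine::GENERAL_ERROR;
        }
    }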

Using the Validation Application to Check Accuracy on a Dataset

The Inference Engine Validation application lets you score common topologies with a standard input and output configuration. These topologies include AlexNet and SSD. The Validation application collects simple validation metrics for the topologies. It supports Top-1/Top-5 counting for classification networks and 11-point mAP calculation for object detection networks.
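As a reference for what Top-1/Top-5 counting means: an image counts as correct for Top-k if its ground-truth class ID is among the k highest-scoring classes. A minimal sketch of that check (not the Validation application's code):

#include <cstddef>
#include <vector>

// Returns true if 'ground_truth' is among the k best-scoring classes (illustrative only).
bool in_top_k(const std::vector<float> &scores, std::size_t ground_truth, std::size_t k) {
    std::size_t better = 0;
    for (std::size_t c = 0; c < scores.size(); ++c) {
        if (scores[c] > scores[ground_truth]) {
            ++better;
        }
    }
    return better < k;   // fewer than k classes scored strictly higher than the true class
}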

Possible Validation application uses:

  • Check if Inference Engine scores the public topologies well
  • Verify whether the user's custom topology is compatible with the default input/output configuration and compare its accuracy with that of the public topologies
  • Use the Validation application as another sample: although the code is much more complex than in the classification and object detection samples, it is still open and can be re-used

The application loads a network to the Inference Engine plugin. Then:

  1. The application reads the validation set (the -i option):
    • If -i specifies a directory, the application tries to load labels first. To do so, the application searches for a file with the same base name as the model, but with a .labels extension. The application then searches the specified directory and adds all images from sub-directories whose names are equal to a known label to the validation set. If there are no sub-directories whose names are equal to known labels, the validation set is considered empty.
    • If -i specifies a .txt file, the application reads it, treating every line as having the format: <relative_path_from_txt_to_img> <ID>, where ID is the class index that the network should assign to the image.
  2. The application reads the number of images specified by -b and loads the images to the plugin. When all images are loaded, the plugin does inference and the Validation application collects the statistics.

NOTE: Image load time is not part of the inference time reported by the application.

As an option, use the --dump option to retrieve the inference results. This option creates an inference report named dumpfileXXXX.csv with the following semicolon-separated values:

  • Image_path
  • Flag representing correctness of prediction
  • ID of the Top-1 class
  • Probability that the image belongs to the Top-1 class
  • ID of the Top-2 class
  • Probability that the image belongs to the Top-x class, where x is an integer

CLI Options

Usage: validation_app [OPTION]
Available options:
    -h                        Print a usage message
    -t                  Type of the network being scored ("C" by default)
      -t "C" for classification
      -t "OD" for object detection
    -i [path]                 Required. A directory with validation images (sub-directories grouped by labels) or a .txt file list for classification networks, or a VOC-formatted dataset for object detection networks
    -m [path]                 Required. Path to an .xml file with a trained model
    -l [absolute_path]        Required for Intel® MKL-DNN (CPU)-targeted custom layers. Absolute path to a shared library with the kernel implementations
    -c [absolute_path]        Required for Intel® Integrated Graphics-targeted custom kernels. Absolute path to the xml file with the kernel descriptions
    -d [device]               Specify the target device to infer on; CPU, Intel® Integrated Graphics, FPGA or MYRIAD is acceptable. The sample looks for a suitable plugin for the specified device. The plugin is CPU by default.
    -b N                      Batch size value. If not specified, the batch size value is determined from IR
    -ppType             Preprocessing type. One of "None", "Resize", "ResizeCrop"
    -ppSize N                 Preprocessing size (used with ppType="ResizeCrop")
    -ppWidth W                Preprocessing width (overrides -ppSize, used with ppType="ResizeCrop")
    -ppHeight H               Preprocessing height (overrides -ppSize, used with ppType="ResizeCrop")
    --dump                    Dump filenames and inference results to a csv file

    Classification-specific options:
      -Czb true               "Zero is a background" flag. Some networks are trained with a modified dataset where the class IDs are enumerated from 1, but 0 is an undefined "background" class (which is never detected)

    Object detection-specific options:
      -ODkind           Kind of an object detection network: SSD
      -ODa [path]             Required for OD networks. Path to the directory containing .xml annotations for images
      -ODc              Required for OD networks. Path to the file containing classes list
      -ODsubdir         Directory between the image path (-i) and image name, specified in the .xml. Use JPEGImages for VOC2007

Option Categories

  • Common options are usually named with a single letter or word, such as -b or --dump. These options have the same meaning in all validation_app modes.
  • Network type-specific options are named as an acronym of the network type (such as C or OD), followed by a letter or a word addendum. These options are specific to the network type. For instance, -ODa makes sense only for an object detection network.

The next section shows how to use the Validation application in classification mode to score a classification CNN on a pack of images.

Running the Application in Classification Mode

This section demonstrates how to run the Validation application in classification mode to score a classification CNN on a pack of images.

To do inference of a chosen pack of images:

$ ./validation_app -t C -i <path to images main directory or .txt file> -m <model to use for classification> -d <CPU|Intel® Integrated Graphics>

Source dataset format: directories as classes

A correct list of files looks similar to:

<path>/dataset
    /apron
        /apron1.bmp
        /apron2.bmp
    /collie
        /a_big_dog.jpg
    /coral reef
        /reef.bmp
    /Siamese
        /cat3.jpg

To score this dataset, put the -i <path>/dataset option in the command line.

Source dataset format: a list of images

This example uses a single list file in the format image_name-tabulation-class_index. The correct list of files:

<path>/dataset
    /apron1.bmp
    /apron2.bmp
    /a_big_dog.jpg
    /reef.bmp
    /cat3.jpg
    /labels.txt

where labels.txt:

apron1.bmp 411
apron2.bmp 411
cat3.jpg 284
reef.bmp 973
a_big_dog.jpg 231

To score this dataset put the -i <path>/dataset/labels.txt option in the command line.

Output Description

A progress bar shows the inference progress. Upon completion, summary information is displayed:

Network load time: time spent on topology load in ms
Model: path to chosen model
Model Precision: precision of a chosen model
Batch size: specified batch size
Validation dataset: path to a validation set
Validation approach: Classification networks
Device: device type

You see statistics such as the average inference time, and top-1 and top-5 accuracy:

Average infer time (ms): 588.977 (16.98 images per second with batch size = 10)

Top1 accuracy: 70.00% (7 of 10 images were detected correctly, top class is correct)
Top5 accuracy: 80.00% (8 of 10 images were detected correctly, top five classes contain required class)

Using Object Detection with the Validation Application

Description

This section describes how to run the Validation application in object detection mode to score an SSD object detection CNN on a pack of images.

Running SSD on the VOC Dataset

Use these steps to score SSD on the original dataset that was used to test it during its training.

  1. Go to the SSD author's github page to select the pre-trained SSD-300.
  2. From the same page, download the VOC2007 test dataset:
    $ wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
    tar -xvf VOCtest_06-Nov-2007.tar
  3. Use the Model Optimizer to convert the model. For help, see https://software.intel.com/en-us/articles/CVSDK-ModelOptimizer
  4. Create a proper class file (made from the original labelmap_voc.prototxt):
    none_of_the_above 0
    aeroplane 1
    bicycle 2
    bird 3
    boat 4
    bottle 5
    bus 6
    car 7
    cat 8
    chair 9
    cow 10
    diningtable 11
    dog 12
    horse 13
    motorbike 14
    person 15
    pottedplant 16
    sheep 17
    sofa 18
    train 19
    tvmonitor 20
  5. Save it as VOC_SSD_Classes.txt
  6. Score the model on the dataset:
    ./validation_app -d CPU -t OD -ODa "<...>/VOCdevkit/VOC2007/Annotations" -i "<...>/VOCdevkit" -m "<...>/vgg_voc0712_ssd_300x300.xml" -ODc "<...>/VOC_SSD_Classes.txt" -ODsubdir JPEGImages
  7. You see a progress bar followed by your data:
    Progress: [....................] 100.00% done    
    [ INFO ] Processing output blobs
    Network load time: 27.70ms
    Model: /home/user/models/ssd/withmean/vgg_voc0712_ssd_300x300/vgg_voc0712_ssd_300x300.xml
    Model Precision: FP32
    Batch size: 1
    Validation dataset: /home/user/Data/SSD-data/testonly/VOCdevkit
    Validation approach: Object detection network
    
    Average infer time (ms): 166.49 (6.01 images per second with batch size = 1)
    Average precision per class table: 
    
    Class   AP
    1   0.796
    2   0.839
    3   0.759
    4   0.695
    5   0.508
    6   0.867
    7   0.861
    8   0.886
    9   0.602
    10  0.822
    11  0.768
    12  0.861
    13  0.874
    14  0.842
    15  0.797
    16  0.526
    17  0.792
    18  0.795
    19  0.873
    20  0.773
    Mean Average Precision (mAP): 0.7767

The Mean Average Precision (mAP) value is also reported in a table on the SSD author's page and in the arXiv paper.

Advanced Topics

Key terms in this section

Acronym/Term    Description
C, CHW, NC      Tensor memory layout. For example, the CHW value at index (c,h,w) is physically located at index (c * H + h) * W + w; other layouts follow by analogy.
DL              Deep Learning
FP16 format     Half-precision floating-point format
FP32 format     Single-precision floating-point format
I16 format      2-byte signed integer format
NCHW, NHWC      Image data layout. Refers to the representation of batches of images.
                  • N - Number of images in a batch
                  • H - Number of pixels in the vertical dimension
                  • W - Number of pixels in the horizontal dimension
                  • C - Channels
U16 format      2-byte unsigned integer format
U8 format       1-byte unsigned integer format

 

Supported Model Formats

Device                                              FP32                      FP16
CPU                                                 Supported and Preferred   Not Supported
Intel® Integrated Graphics                          Supported                 Supported and Preferred
FPGA                                                Supported                 Supported
Intel® Movidius™ Myriad™ 2 Vision Processing Unit   Not Supported             Supported

Understanding Inference Engine Memory Primitives

Blobs

InferenceEngine::Blob is the main class intended for working with memory. This class lets you read and write memory and get information about the memory structure, among other tasks.

To create Blob objects with a specific layout, use constructors with InferenceEngine::TensorDesc.

InferenceEngine::TensorDesc tdesc(InferenceEngine::Precision::FP32, {1, 3, 227, 227}, InferenceEngine::Layout::NCHW);
InferenceEngine::Blob::Ptr blob = InferenceEngine::make_shared_blob(tdesc);
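
To round out the example, here is a minimal sketch (assuming the 2018 Inference Engine API, where make_shared_blob<T> accepts a TensorDesc) of allocating this blob and accessing its memory through a typed pointer:

auto float_blob = InferenceEngine::make_shared_blob<float>(tdesc);
float_blob->allocate();                              // reserve memory for the 1x3x227x227 FP32 tensor
float* data = float_blob->buffer().as<float*>();     // raw pointer to the blob memory
data[0] = 0.0f;                                      // element-wise read/write access
size_t element_count = float_blob->size();           // total number of elements (1*3*227*227)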

Layouts

InferenceEngine::TensorDesc is a special class that provides layout format description.

This class allows you to create planar layouts using the standard formats, such as InferenceEngine::Layout::NCHW, InferenceEngine::Layout::NC, InferenceEngine::Layout::C, and non-planar layouts using InferenceEngine::BlockingDesc.

To create a complex layout, use InferenceEngine::BlockingDesc, which lets you define blocked memory with offsets and strides.

Examples

  • Define a blob with dimensions, {N: 1, C: 25, H: 20, W: 20}, and format, NHWC:
    InferenceEngine::BlockingDesc({1, 20, 20, 25}, {0, 2, 3, 1}); // or
    InferenceEngine::BlockingDesc({1, 20, 20, 25}, InferenceEngine::Layout::NHWC);
  • If you have memory with real dimensions {N: 1, C: 25, H: 20, W: 20}, but with channels that are blocked by 8, define the memory with parameters:
    InferenceEngine::BlockingDesc({1, 4, 20, 20, 8}, {0, 1, 2, 3, 1})
  • Set strides and offsets if the layout contains them. If your blob layout is complex and you don't want to calculate the real offset to data, use InferenceEngine::TensorDesc::offset(size_t l) or InferenceEngine::TensorDesc::offset(SizeVector v).
    For example:
    InferenceEngine::BlockingDesc blk({1, 4, 20, 20, 8}, {0, 1, 2, 3, 1});
    InferenceEngine::TensorDesc tdesc(FP32, {1, 25, 20, 20}, blk);
    tdesc.offset(0); // = 0
    tdesc.offset(1); // = 8
    tdesc.offset({0, 0, 0, 2}); // = 16
    tdesc.offset({0, 1, 0, 2}); // = 17
  • If you want to create a TensorDesc with a planar format for N dimensions (N can be 1, 2, 4, and so on), use:
    InferenceEngine::TensorDesc::getLayoutByDims.
    InferenceEngine::TensorDesc::getLayoutByDims({1}); // InferenceEngine::Layout::C
    InferenceEngine::TensorDesc::getLayoutByDims({1, 2}); // InferenceEngine::Layout::NC
    InferenceEngine::TensorDesc::getLayoutByDims({1, 2, 3, 4}); // InferenceEngine::Layout::NCHW
    InferenceEngine::TensorDesc::getLayoutByDims({1, 2, 3}); // InferenceEngine::Layout::BLOCKED
    InferenceEngine::TensorDesc::getLayoutByDims({1, 2, 3, 4, ...}); // InferenceEngine::Layout::BLOCKED

Supported Devices

The Inference Engine can infer models in different formats with various input and output formats. This section provides supported and optimal configurations per device.

The Inference Engine provides unique capabilities to infer deep learning models on these device types:

  • CPU
  • Intel® Integrated Graphics
  • FPGA
  • Myriad
  • Heterogeneous execution

 

Supported Input Precision

Device                                              FP32        FP16                          U8                        U16             I16
CPU                                                 Supported   Not Supported                 Supported                 Supported       Supported
Intel® Integrated Graphics                          Supported   Supported* - See NOTE below   Supported*                Supported*      Supported*
FPGA                                                Supported   Supported*                    Supported                 Supported       Supported
Intel® Movidius™ Myriad™ 2 Vision Processing Unit   Supported   Supported                     Supported and Preferred   Not Supported   Not Supported

* NOTE: Supported through SetBlob only. GetBlob returns FP32. Supported without a mean image.

 

Supported Output Precision

Plugin                                              FP32        FP16
CPU                                                 Supported   Not Supported
Intel® Integrated Graphics                          Supported   Supported
FPGA                                                Supported   Supported
Intel® Movidius™ Myriad™ 2 Vision Processing Unit   Supported   Supported and Preferred

 

Supported Input Layout

Plugin                                              FP32        FP16
CPU                                                 Supported   Not Supported
Intel® Integrated Graphics                          Supported   Not Supported
FPGA                                                Supported   Not Supported
Intel® Movidius™ Myriad™ 2 Vision Processing Unit   Supported   Supported and Preferred
 

 

Supported Output Layout

Number of Dimensions   4      3     2    1
Layout                 NCHW   CHW   NC   C

Intel CPU Plugin

The Intel CPU plugin provides an opportunity for high-performance scoring of neural networks on the CPU, using the Intel® MKL-DNN library.

The Intel CPU plugin uses OpenMP* to parallelize calculations.

Supported Layers

  • BatchNorm
  • Clamp
  • Concat
  • Convolution
  • Crop
  • Deconvolution
  • Eltwise
  • ELU
  • FullyConnected
  • Logistic
  • LRN
  • Permute
  • Pooling
  • Power
  • ReLU
  • Reshape
  • ROIPooling
  • ScaleShift
  • Softmax
  • Split
  • TanH
  • Tile

The set of supported layers can be expanded with the extensibility library. To add a new layer in this library, use the extensibility mechanism.

Supported Platforms

The Intel® Computer Vision SDK is supported and validated on these platforms:

Host64-bit OS
Development
  • Ubuntu* 16.04
  • CentOS 7.4/MS
  • Windows* 10
Target
  • Ubuntu* 16.04
  • CentOS 7.4/MS
  • Windows* 10

The CPU plugin supports inference on Intel® Xeon® with Intel® AVX2 and AVX512, Intel® Core™ Processors with Intel® AVX2, Intel Atom® Processors with Intel® SSE.

Use the -pc flag with the samples to learn which configuration is used by each layer. The -pc option prints per-layer execution statistics: the layer name, execution status, layer type, execution time, and the type of the execution primitive.

Internal Intel CPU Plugin Optimizations

The Intel CPU plugin supports several graph optimization algorithms:

  • Merging of group convolutions. If the topology contains the pipeline shown in the original article's figure, the Intel® MKL-DNN plugin merges it into one Convolution with the group parameter (the Convolutions must have the same parameters).
  • Fusing Convolution with ReLU or ELU. The Intel CPU plugin fuses a Convolution with a ReLU or ELU layer that directly follows it.
  • Removing the Power layer. The Intel CPU plugin removes a Power layer from the topology if its parameters are: power = 1, scale = 1, offset = 0.
  • Fusing Convolution + Sum or Convolution + Sum + ReLU. To improve performance, the Intel CPU plugin fuses this structure into a single operation, upgrading the graph as shown in the original article's figure.


Supported Configuration Parameters

The plugin supports the configuration parameters listed below. All parameters must be set before calling InferenceEngine::IInferencePlugin::LoadNetwork().

Parameter Name             Parameter Values   Default              Description
KEY_CPU_BIND_THREAD        YES/NO             YES                  Binds OpenMP* threads to hardware cores. When set to YES, the number of OpenMP threads equals the number of hardware cores.
KEY_DYN_BATCH_LIMIT        number             Network batch size   Sets the batch size for all following Infer calls. For example, if the input blob has size 32x3x224x224, then after applying plugin.SetConfig({KEY_DYN_BATCH_LIMIT, 10}) the Inference Engine primitives process only the first sub-blob of size 10x3x224x224. The value can be changed before any Infer call to specify a new limit.
EXCLUSIVE_ASYNC_REQUESTS   YES/NO             NO                   Enables exclusive mode for async requests of different executable networks and the same plugin.
PERF_COUNT                 YES/NO             NO                   Enables the performance counters option.
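
For illustration, a minimal sketch of applying these parameters (assuming an InferencePlugin object named plugin, a CNNNetwork object named network, and the key/value constants from ie_plugin_config.hpp in the InferenceEngine::PluginConfigParams namespace):

using namespace InferenceEngine;
// Configuration must be applied before LoadNetwork() is called
plugin.SetConfig({
    { PluginConfigParams::KEY_CPU_BIND_THREAD, PluginConfigParams::NO },   // do not pin OpenMP threads to cores
    { PluginConfigParams::KEY_PERF_COUNT,      PluginConfigParams::YES }   // collect per-layer performance counters
});
auto executable_network = plugin.LoadNetwork(network, {});
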
CPU Extensions

The CPU extensions library contains code for important layers that do not come with the CPU plugin. Compile this library and use the AddExtension method in your application to load the extensions when you run models featuring layers from this library. See the other samples for AddExtension code examples.

When you compile the entire list of the samples, the cpu_extension library is also compiled.

For performance, the library's cmake script detects your computer configuration and enables platform optimizations. Alternatively, you can explicitly use cmake flags: -DENABLE_AVX2=ON, -DENABLE_AVX512F=ON or -DENABLE_SSE42=ON when cross-compiling this library for another platform.
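
As a rough sketch (assuming the library was built as libcpu_extension.so and the sample-style plugin API; adjust the path to your build output), loading the extension looks like this:

#include <inference_engine.hpp>
using namespace InferenceEngine;

// Create the CPU plugin and register the compiled extensions library with it
InferencePlugin plugin = PluginDispatcher({""}).getPluginByDevice("CPU");
IExtensionPtr cpu_extension = make_so_pointer<IExtension>("libcpu_extension.so");
plugin.AddExtension(cpu_extension);
// Networks containing layers from the extensions library can now be loaded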

List of layers that come in the library:

  • ArgMax
  • CTCGreedyDecoder
  • DetectionOutput
  • GRN
  • Interp
  • MVN
  • Normalize
  • PowerFile
  • PReLU
  • PriorBox
  • PriorBoxClustered
  • Proposal
  • PSROIPooling
  • Resample
  • SimplerNMS
  • SpatialTransformer

Use the extensibility mechanism to add a layer. For information, see Adding Your Own Kernels in the Inference Engine.


Intel® Integrated Graphics Plugin

The Intel® Integrated Graphics plugin uses the Intel® Compute Library for Deep Neural Networks to infer deep neural networks. This is an open source performance library for Deep Learning applications intended for acceleration of deep learning inference on Intel® Processor Graphics, including HD Graphics and Iris® Graphics.

Supported Layers

  • Activation (ReLU, Sigmoid, Logistic, TanH, ELU, Clamp)
  • BatchNormalization
  • Concatenate
  • Convolution
  • Copy
  • Crop
  • Deconvolution
  • DetectionOutput
  • Eltwise
  • Flatten
  • FullyConnected
  • LRN
  • Normalize
  • Permute
  • Pooling
  • Power
  • PReLU
  • PriorBox
  • Proposal
  • PSROIPooling
  • Reshape
  • ROIPooling
  • ScaleShift
  • SimplerNMS
  • SoftMax
  • Split
  • Upsampling

Supported Optimizations

  • Fused layers:
    • Convolution - Activation
    • Deconvolution - Activation
    • Eltwise - Activation
    • Fully Connected - Activation
  • Layers optimized out when conditions allow:
    • Crop
    • Concatenate
    • Reshape
    • Flatten
    • Split
    • Copy
  • Layers executed during load time (not during inference):
    • PriorBox

CPU Executed Layers

The following layers aren't accelerated on the Intel® Integrated Graphics and instead are executed on the host CPU.

  • Proposal
  • SimplerNMS
  • PriorBox
  • DetectionOutput

Supported Configuration Parameters

The plugin supports the configuration parameters listed below. All parameters must be set before calling InferenceEngine::IInferencePlugin::LoadNetwork().

Name                  Value                                                    Default           Description
KEY_PERF_COUNT        YES / NO                                                 NO                Collect performance counters during inference
KEY_CONFIG_FILE       "file1 [file2 ...]"                                      ""                Load custom layer configuration files
KEY_DUMP_KERNELS      YES / NO                                                 NO                Dump the final kernels used for custom layers
KEY_TUNING_MODE       TUNING_DISABLED / TUNING_CREATE / TUNING_USE_EXISTING    TUNING_DISABLED   Disable inference kernel tuning / create a tuning file (expect a much longer runtime) / use an existing tuning file
KEY_TUNING_FILE       "filename"                                               ""                Tuning file to create / use
KEY_PLUGIN_PRIORITY   <0-3>                                                    0                 OpenCL queue priority
KEY_PLUGIN_THROTTLE   <0-3>                                                    0                 OpenCL queue throttling
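
For example, a minimal sketch of enabling kernel tuning (assuming an InferencePlugin object named plugin, a CNNNetwork object named network, and that the tuning keys and values are exposed through InferenceEngine::PluginConfigParams; the tuning file name is a placeholder):

using namespace InferenceEngine;
plugin.SetConfig({
    { PluginConfigParams::KEY_TUNING_MODE, PluginConfigParams::TUNING_CREATE },  // expect a much longer first run
    { PluginConfigParams::KEY_TUNING_FILE, "cldnn_tuning.data" }                 // placeholder file to create and reuse later
});
auto executable_network = plugin.LoadNetwork(network, {});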

Debug Capabilities in the Intel® Integrated Graphics Plugin

The Intel® Integrated Graphics plugin can dump the user custom OpenCL™ kernels to files so you can debug compilation issues in your custom kernels.

To enable dumping, the application calls the SetConfig() function with the key PluginConfigParams::KEY_DUMP_KERNELS and the value PluginConfigParams::YES. Then, during network loading, all custom layers print their OpenCL kernels with the JIT instrumentation added by the plugin. The kernels are stored in the working directory in files named clDNN_program0.cl, clDNN_program1.cl, and so on.

The debug option is disabled by default. To disable it explicitly, the application can call SetConfig() with the key PluginConfigParams::KEY_DUMP_KERNELS and the value PluginConfigParams::NO before network loading.
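
A minimal sketch of toggling the dump (assuming an InferencePlugin object named plugin created for the Intel® Integrated Graphics device):

using namespace InferenceEngine;
// Enable dumping: clDNN_program*.cl files appear in the working directory during network loading
plugin.SetConfig({ { PluginConfigParams::KEY_DUMP_KERNELS, PluginConfigParams::YES } });
// ... load the network and inspect the dumped kernels ...
// Disable dumping again (the default behavior)
plugin.SetConfig({ { PluginConfigParams::KEY_DUMP_KERNELS, PluginConfigParams::NO } });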

How to Verify Debug is Disabled

  1. Delete all clDNN_program*.cl files from the current directory
  2. Run your application to load a network
  3. Examine the working directory for the presence of any kernel file, such as clDNN_program0.cl

FPGA Plugin

The FPGA plugin provides an opportunity for high-performance scoring of neural networks on Intel® FPGA devices.

Supported Layers

  • Batch_norm (being converted by Model Optimizer to ScaleShift layer)
  • Concat
  • Convolution (dilated convolutions are supported, depthwise are not supported)
  • Eltwise (operation sum is supported)
  • Fully Connected
  • LRN Normalization
  • Pooling
  • Power (scale and offset parameters are supported)
  • ReLu (with negative slope)
  • ScaleShift

NOTE: Support is limited to the specific parameters, and depends on the bitstream.

Heterogeneous Execution

If a topology contains layers that aren't supported on the FPGA, use the Heterogeneous plugin with a dedicated fallback device.

If a network has layers that aren't supported by either the FPGA plugin or the fallback plugin, implement a custom layer for the CPU or Intel® Integrated Graphics using the extensibility mechanism described in Inference Engine Kernels Extensibility. In addition to adding custom kernels, point to the CPU or Intel® Integrated Graphics plugin as the fallback device for the Heterogeneous plugin.

Supported Platforms

The Intel® Computer Vision SDK is officially supported and validated on the following FPGA setup:

Host          64-bit OS                     Platform
Development   Ubuntu* 16.04, CentOS* 7.4    6th Generation Intel® Core™ Processors
Target        Ubuntu* 16.04, CentOS* 7.4    Intel® Arria® 10GX/A10PL4 FPGA

How to Interpret Performance Counters

The performance counters collected with InferenceEngine::IInferencePlugin::GetPerformanceCounts include data for execution on the FPGA, for pre- and post-processing, and for transferring data to and from the FPGA card.

If part of your network executes on the CPU, performance data about the Intel® MKL-DNN kernels, their types, and other useful information is also available.
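
As a sketch of reading these counters in application code (assuming an InferRequest object named request that has already run Infer(); the sample applications use the same data for their -pc output):

// Per-layer (and per-FPGA-stage) performance counters
std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> perf_counts =
    request.GetPerformanceCounts();
for (const auto& item : perf_counts) {
    // item.first is the layer or stage name; realTime_uSec is wall-clock time in microseconds
    std::cout << item.first << ": " << item.second.realTime_uSec << " us" << std::endl;
}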

FPGA Support Limitations for CNN

The FPGA Beta release has limitations for the network topologies, kernel parameters, and batch size.

  • Depending on the bitstream loaded on the target device, FPGA actually performs calculations with precision rates ranging from FP11 to FP16. This may have potential accuracy implications. Use the Validation application to verify the network accuracy on validation data set.
  • Networks with many layers that are not supported on the FPGA interleaved between supported layers may be split into many subgraphs, which can lead to a CL_OUT_OF_HOST_MEMORY error. Such topologies are not FPGA-friendly in this release.
  • When using the Heterogeneous plugin, the affinity and distribution of nodes across devices depends on the bitstream. Some layers, or some layer parameters, might not be supported by the loaded bitstream.
  • Any fully-connected layer can only be followed by another fully-connected (possibly with the ReLU) layer. No convolution layer can follow a fully-connected layer, otherwise the graph verification fails and returns an error message.
  • Single output from a fully-connected layer (potentially coupled with ReLU) is supported.
  • Several outputs from Convolution (and other layers except fully-connected) layer are supported, but this output cannot be passed to the other layers on the FPGA.
  • When executing on the FPGA, the first iteration is much slower than the subsequent iterations. Perform several iterations when assessing inference performance (see the sketch after this list).
  • Consider batching for performance conclusions. Depending on the bitstream loaded on the FPGA, the batch size is typically limited to 96.
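
A minimal timing sketch for the warm-up recommendation above (assuming an ExecutableNetwork object named executable_network; the iteration count is arbitrary):

// Warm up: the first FPGA iteration is much slower and should not be measured
InferenceEngine::InferRequest request = executable_network.CreateInferRequest();
request.Infer();

const int iterations = 100;
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i) {
    request.Infer();
}
auto end = std::chrono::high_resolution_clock::now();
double avg_ms = std::chrono::duration<double, std::milli>(end - start).count() / iterations;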

Bitstream Availability

Various FPGA bitstreams that support CNN are available in Intel® CV SDK package for FPGA.


Intel® Movidius™ Myriad™ 2 Vision Processing Unit Stick Plugin

The Intel® Movidius™ Myriad™ 2 Vision Processing Unit plugin provides high-performance scoring of neural networks on the Intel® Movidius™ Myriad™ 2 VPU.

Supported Layers

  • BatchNormalization
  • Bias
  • Concatenate
  • Convolution
  • Copy
  • Crop
  • CTCDecoder
  • Deconvolution
  • DepthwiseConvolution
  • DetectionOutput
  • Eltwise (SUM, MAX, MUL)
  • ELU
  • Flatten
  • FullyConnected
  • Leaky ReLU
  • LRN
  • Normalize
  • Permute
  • Pooling (MAX, AVG)
  • Power
  • PReLU
  • PriorBox
  • PriorBoxClustered
  • ReLU
  • Reshape
  • Scale
  • ScaleShift
  • Sigmoid
  • Slice
  • SoftMax
  • Split
  • TanH
  • Tile

Installing USB Rules

To do inference on the Intel® Movidius™ Myriad™ 2 Vision Processing Unit, install the USB rules by running these commands:

cat <<EOF > 97-usbboot.rules
SUBSYSTEM=="usb", ATTRS{idProduct}=="2150", ATTRS{idVendor}=="03e7", GROUP="users", MODE="0666", ENV{ID_MM_DEVICE_IGNORE}="1"
SUBSYSTEM=="usb", ATTRS{idProduct}=="f63b", ATTRS{idVendor}=="03e7", GROUP="users", MODE="0666", ENV{ID_MM_DEVICE_IGNORE}="1"
EOF
sudo cp 97-usbboot.rules /etc/udev/rules.d/
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo ldconfig
rm 97-usbboot.rules

Supported Configuration Parameters

Name                                Values                             Default    Description
KEY_VPU_LOG_LEVEL                   LOG_WARNING, LOG_INFO, LOG_DEBUG   LOG_NONE   Set the log level for devices
KEY_VPU_INPUT_NORM                  real number                        1.0        Normalization coefficient for the network input
KEY_VPU_INPUT_BIAS                  real number                        0.0        Bias value added to each element of the network input
KEY_VPU_PRINT_RECEIVE_TENSOR_TIME   YES/NO                             NO         Add device-side time spent receiving the input to PerformanceCounts

Heterogeneous Plugin

The Heterogeneous plugin enables inference of one network on several devices. The main reasons to execute a network in heterogeneous mode are:

  • To utilize the power of accelerators: compute the heaviest parts of the network on the accelerator and execute unsupported layers on fallback devices such as the CPU
  • To utilize all available hardware more efficiently during one inference

The execution through the Heterogeneous plugin can be divided into two steps:

  • Setting affinity to layers (binding them to devices in InferenceEngine::ICNNNetwork)
  • Loading the network to the Heterogeneous plugin, splitting the network into parts, and executing them through the dedicated plugins

These steps are decoupled. The affinity can be set automatically using the fallback policy, or manually.

The automatic fallback policy is greedy: it assigns each layer that can be executed on a certain device to that device, following the device priorities.

Some topologies are not friendly to heterogeneous execution on some devices, or cannot be executed that way at all. For example, such networks might have activation layers that aren't supported on the primary device. If transmitting data from one part of the network to another in heterogeneous mode is time-consuming, executing those parts heterogeneously on those devices does not make sense. Instead, define the heaviest part manually and set the affinity so that data is not sent back and forth several times during one inference.

Annotation of Layers per Device and Default Fallback Policy

The default fallback policy decides automatically which layer goes to which device, according to layer support in the dedicated plugins (FPGA, Intel® Integrated Graphics, CPU).

An alternative way to annotate a network is to set the affinity manually using the CNNLayer::affinity field. This field accepts string values of devices, such as "CPU" or "FPGA".

The fallback policy is not applied if even one layer has an initialized affinity. The recommended sequence is to apply the automatic affinity settings first, and then adjust the affinities manually.

// This example demonstrates how to do default affinity initialization and then
// correct affinity manually for some layers
InferenceEngine::PluginDispatcher dispatcher({ FLAGS_pp, archPath , "" });
InferenceEngine::InferenceEnginePluginPtr enginePtr;
enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
HeteroPluginPtr hetero(enginePtr);
hetero->SetAffinity(network, { }, &resp);
network.getLayerByName("qqq")->affinity = "CPU";
InferencePlugin plugin(enginePtr);
auto executable_network = plugin.LoadNetwork(network, {});

If you rely on the default affinity distribution, you can avoid calling IHeteroInferencePlugin::SetAffinity and load the network directly:

InferenceEngine::PluginDispatcher dispatcher({ FLAGS_pp, archPath , "" });
InferenceEngine::InferenceEnginePluginPtr enginePtr;
enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
InferencePlugin plugin(enginePtr);
CNNNetReader reader;
reader.ReadNetwork("Model.xml");
reader.ReadWeights("Model.bin");
CNNNetwork network = reader.getNetwork();
auto executable_network = plugin.LoadNetwork(network, {});

Splitting the Network and Execution

During loading of the network to the Heterogeneous plugin, the network is divided into separate parts and loaded to the dedicated plugins. Intermediate blobs between these subgraphs are allocated automatically in the most efficient way.

Execution Precision

Precision for inference in Heterogeneous plugin is defined by:

  • Precision of the Intermediate Representation
  • Ability of final plugins to execute in precision defined in the Intermediate Representation

Examples:

  • To execute Intel® Integrated Graphics with a CPU fallback with the FP16 on Intel® Integrated Graphics, use only FP16 for the Intermediate Representation. The Heterogeneous plugin converts the weight from FP16 to FP32 for execution on the CPU.
  • To execute on FPGA with a CPU fallback, use any precision for the Intermediate Representation. The execution on FPGA is defined by bitstream, the execution on CPU happens in FP32.

Use these samples with the command:

 ./object_detection_sample_ssd -m <path_to_model>/ModelSSD.xml -i <path_to_pictures>/picture.jpg -d HETERO:FPGA,CPU

where:

  • HETERO is the Heterogeneous plugin
  • FPGA,CPU is the fallback policy with priority on FPGA and fallback to the CPU

To point to more than two devices, use -d HETERO:FPGA,GPU,CPU

Analyzing the Heterogeneous Execution

After enabling the KEY_HETERO_DUMP_GRAPH_DOT config key, the Heterogeneous plugin dumps GraphViz* .dot files with per-layer annotations of the assigned devices.

The Heterogeneous plugin can generate two files:

  • hetero_affinity.dot - annotation of affinities per layer. This file is written to the disk only if the default fallback policy is executed.
  • hetero_subgraphs.dot - annotation of affinities per subgraph. This file is written to the disk during the execution of ICNNNetwork::LoadNetwork() for the hetero plugin.
    #include "ie_plugin_config.hpp"
    #include "hetero/hetero_plugin_config.hpp"
    using namespace InferenceEngine::PluginConfigParams;
    using namespace InferenceEngine::HeteroConfigParams;
    ...
    enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
    InferencePlugin plugin(enginePtr);
    plugin.SetConfig({ {KEY_HETERO_DUMP_GRAPH_DOT, YES} });

Use the graphviz utility or a converter to view the files as .png images. For example, on Ubuntu*:

  • sudo apt-get install xdot
  • xdot hetero_subgraphs.dot

Use the -pc option with the samples to get performance data for each subgraph.

Output example for Googlenet v1 running on FPGA with a fallback to the CPU:

subgraph1: 1. input preprocessing (mean data/FPGA):EXECUTED       layerType:                    realTime: 129        cpu: 129            execType:
subgraph1: 2. input transfer to DDR:EXECUTED       layerType:                    realTime: 201        cpu: 0              execType:
subgraph1: 3. FPGA execute time:EXECUTED       layerType:                    realTime: 3808       cpu: 0              execType:
subgraph1: 4. output transfer from DDR:EXECUTED       layerType:                    realTime: 55         cpu: 0              execType:
subgraph1: 5. FPGA output postprocessing:EXECUTED       layerType:                    realTime: 7          cpu: 7              execType:
subgraph1: 6. softmax/copy:   EXECUTED       layerType:                    realTime: 2          cpu: 2              execType:
subgraph2: out_prob:          NOT_RUN        layerType: Output             realTime: 0          cpu: 0              execType: unknown
subgraph2: prob:              EXECUTED       layerType: SoftMax            realTime: 10         cpu: 10             execType: ref
Total time: 4212     microseconds

Known Issues

Multiple OpenMP Loadings

If the application uses the Inference Engine with third-party components that depend on Intel® OpenMP, multiple loadings of the libiomp library may occur and cause OpenMP runtime initialization conflicts. This might happen if the application uses the Intel® Math Kernel Library (Intel® MKL) through the “Single Dynamic Library” (libmkl_rt.so) mechanism and calls Intel® MKL after loading the Inference Engine plugin.

Error log report:

OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, see http://www.intel.com/software/products/support/.

Possible workarounds:

  • Preload the OpenMP runtime using the LD_PRELOAD variable:
    LD_PRELOAD=<path_to_libiomp5.so> <path_to_your_executable>
    This eliminates multiple loadings of libiomp and makes all components use this specific version of OpenMP.
  • Set KMP_DUPLICATE_LIB_OK=TRUE. This option might result in performance degradation or incorrect results.

 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

 

Install the Intel® Computer Vision SDK 2018 (Intel® CV SDK) on Windows* 10


Introduction

NOTE: These steps apply only to Windows* 10. For Linux* instructions, see the Linux installation guide.

The Intel® Computer Vision SDK (Intel® CV SDK) is a toolkit to accelerate development of computer vision solutions for smart cameras, robotics, autonomous navigation, and retail/industrial automation. It simplifies access to the performance benefits of a range of Intel hardware: CPU, Intel® Integrated Graphics, FPGA, and the Intel® Movidius™ Myriad™ 2 Vision Processing Unit (VPU).

For more information, see the Intel® Computer Vision SDK 2018 Overview.

These instructions describe:

  • What is included in the free download
  • System requirements
  • Software dependencies
  • Installing the Intel® CV SDK on Microsoft Windows* 10
  • Next steps

What's Included

Component                        Description
Deep Learning Model Optimizer    Model import tool. Imports trained models and converts them to IR format for use by the Deep Learning Inference Engine. This is part of the Intel® Deep Learning Deployment Toolkit.
Deep Learning Inference Engine   Unified API to integrate the inference with application logic. This is part of the Intel® Deep Learning Deployment Toolkit.
OpenCV* version 3.4.1            OpenCV* community version compiled for Intel® hardware. Includes PVL libraries for computer vision.
OpenVX* version 1.1              Intel's implementation of OpenVX* 1.1 optimized for running on Intel® hardware (CPU, GPU, IPU).
Documents and tutorials          https://software.intel.com/en-us/computer-vision-sdk/documentation/view-all

Where to Download This Release

https://software.intel.com/en-us/computer-vision-sdk/computer-vision-sdk/choose-download/free-download-windows

System Requirements

This guide includes only information related to Microsoft Windows* 10 64-bit. See the Linux installation guide for Linux information and instructions.

Note: Only the CPU and Intel® Integrated Graphics processor options are available. Linux is required for using the FPGA or Intel® Movidius™ Myriad™ 2 VPU options.

Development and Target Platforms

The development and target platforms have the same requirements, but you can select different components during the installation, based on your intended use.

Processor

  • 6th-8th Generation Intel® Core™
  • Intel® Xeon® v5 family, Intel® Xeon® v6 family

Processor Notes:

  • Processor graphics are not included in all processors. See https://ark.intel.com/ for information about your processor.
  • A chipset that supports processor graphics is required for Intel® Xeon® processors.

Operating System

Microsoft Windows* 10 64-bit

External Software Dependencies

Install these dependencies before installing the Intel® CV SDK:

Installation Steps

The Windows version of the Intel® Computer Vision SDK (Intel® CV SDK) installation package comes as an easy-to-install executable file.

  1. Download the Intel® CV SDK. By default, the file is saved to Downloads as w_intel_cv_sdk_2018.0.<version>.exe
  2. Go to the Downloads folder.
  3. Double-click w_intel_cv_sdk_2018.0.<version>.exe. A screen displays options for choosing your installation directory and components:
    Components to install
  4. Click Next.
  5. The next screen warns you about any missing components and the effect the missing component has on installing or using the Intel® CV SDK.
    Windows warning messages
  6. If you are missing a critical component, click Cancel, resolve the issue, and then restart the installation.
  7. When the installation completes, click Finish to close the wizard and open the Getting Started Guide in a browser.
  8. Make sure the installation directory is populated with sub-folders. The default installation location is C:\Intel\computer_vision_sdk_2018.0.<version>. 

Next Steps

IMPORTANT: Before using the Model Optimizer to work with your trained model, make sure your Caffe*, TensorFlow*, or MXNet* framework is prepared for any custom layers you have in place. The following information will get you started.

Learn About the Intel® CV SDK

Before using the Intel® CV SDK, read through the product overview to gain a better understanding of how the product works.

Compile the Extensions Library

Some topology-specific layers, like DetectionOutput used in the SSD*, are delivered in source code, which assumes the extensions library is compiled and loaded. The extensions are required to run inference on pre-trained models. While you can build the library manually, the best way to compile the extensions library is to execute the demo scripts.

Run the Demonstration Applications

To verify the installation, run the demo apps in <INSTALL_FOLDER>\deployment_tools\demo. For demo app documentation, see README.txt in <INSTALL_FOLDER>\deployment_tools\demo.

The demo apps and their functions are:

  • demo_squeezenet_download_convert_run.bat. This demo illustrates the basic steps used to convert a model and run it. It enables the Intel® Deep Learning Deployment Toolkit to perform a classification task with the SqueezeNet model. This demo:
    • Downloads a public SqueezeNet model.
    • Installs all prerequisites to run the Model Optimizer.
    • Converts the model to an Intermediate Representation.
    • Builds the Inference Engine Image Classification Sample from the <INSTALL_FOLDER>\deployment_tools\inference_engine\samples\classification_sample
    • Runs the sample using cars.png from the demo folder.
    • Shows the label and confidence for the top-10 categories.
  • demo_security_barrier_camera_sample.bat. This demo shows an inference pipeline using three of the pre-trained models included with the Intel CV SDK. The region found by one model becomes the input to the next. Vehicle regions found by object recognition in the first phase become the input to the vehicle attributes model, which locates the license plate. The region identified in this step becomes the input to a license plate character recognition model. This demo:
    • Builds the Inference Engine Security Barrier Camera Sample from the <INSTALL_FOLDER>\deployment_tools\inference_engine\samples\security_barrier_camera_sample.
    • Runs the sample using car_1.bmp from the demo folder.
    • Displays the resulting frame with detections rendered as bounding boxes and text.

Other Important Information

  • See the <INSTALL_FOLDER>/deployment_tools/inference_engine/samples/ folder and the samples overview documentation to learn about the range of samples available for the Inference Engine.
  • For developer guides, API references, tutorials, and other online documentation, see the Intel® CV SDK documentation.

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

Model Optimizer Developer Guide


Introduction

The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environment, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices.

The Model Optimizer process assumes you have a network model trained using one of the supported frameworks. The diagram below illustrates the typical workflow for deploying a trained deep learning model:

Intel Computer Vision Basic Workflow

A summary of the steps for optimizing and deploying a trained model:

  1. Configure the Model Optimizer for your framework.
  2. Convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
  3. Test the model in the Intermediate Representation format using the Inference Engine in the target environment via provided Inference Engine validation application or sample applications.
  4. Integrate the Inference Engine into your application to deploy the model in the target environment. See the Inference Engine Guide.

Model Optimizer Workflow

The Model Optimizer process assumes you have a network model that was trained with a supported framework. The workflow is:

  1. Configure the Model Optimizer for the framework that was used to train the network. To perform this configuration, use the configuration bash script for Linux* OS, or the batch file for Windows* OS. The script and batch file are in: <INSTALL_DIR>/deployment_tools/model_optimizer/install_prerequisites
    • For Linux* OS:
      install_prerequisites.sh
    • For Windows* OS:
      install_prerequisites.bat
    For more information about configuring the Model Optimizer, see Configuring the Model Optimizer.
  2. Provide as input a trained model that contains a specific topology and the adjusted weights and biases described in the framework-specific files.
  3. Convert the trained model to an optimized Intermediate Representation.

The Model Optimizer produces an Intermediate Representation (IR) of the network as output. The Inference Engine reads, loads, and infers the Intermediate Representation. The Inference Engine API offers a unified API across supported Intel® platforms. The Intermediate Representation is a pair of files that describe the whole model:

  • .xml: Describes the network topology
  • .bin: Contains the weights and biases binary data
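
As a sketch of how an application consumes this pair (using the CNNNetReader calls shown in the Inference Engine examples earlier in this document; the file names are placeholders):

InferenceEngine::CNNNetReader reader;
reader.ReadNetwork("model.xml");    // network topology
reader.ReadWeights("model.bin");    // weights and biases
InferenceEngine::CNNNetwork network = reader.getNetwork();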

Configuring the Model Optimizer

You must configure the Model Optimizer for the framework that was used to train the model. This section tells you how to configure the Model Optimizer either through scripts or by using a manual process.

Using Configuration Scripts

You can either configure all three frameworks at the same time, or install an individual framework. The scripts install all required dependencies and provide the fastest and easiest way to configure the Model Optimizer.

To configure all three frameworks: Go to the <INSTALL_DIR>/deployment_tools/model_optimizer/install_prerequisites directory and run:

  • For Linux*:
    install_prerequisites.sh
  • For Windows*:
    install_prerequisites.bat

To configure a specific framework: Go to the <INSTALL_DIR>/model_optimizer/install_prerequisites directory and run:

  • For Caffe on Linux:
    install_prerequisites_caffe.sh
  • For Caffe on Windows:
    install_prerequisites_caffe.bat
  • For TensorFlow on Linux:
    install_prerequisites_tf.sh
  • For TensorFlow on Windows:
    install_prerequisites_tf.bat
  • For MXNet on Linux:
    install_prerequisites_mxnet.sh
  • For MXNet on Windows:
    install_prerequisites_mxnet.bat

CAFFE* NOTE: By default, you do not need to install Caffe to create an Intermediate Representation for a Caffe model unless you use Caffe for custom layer shape inference and do not write Model Optimizer extensions. To learn more about implementing Model Optimizer custom operations and the limitations of using Caffe for shape inference, see Caffe Models with Custom Layers.

TENSORFLOW* NOTE: To offload part of the inference to the TensorFlow framework, additional configuration steps are required.

Using a Manual Configuration Process

If you prefer, you can manually configure the Model Optimizer for one framework at a time.

  1. Go to the Model Optimizer directory:
    cd <INSTALL_DIR>/deployment_tools/model_optimizer/
  2. Strongly recommended for all global Model Optimizer dependency installations: Create and activate a virtual environment. While not required, this option is strongly recommended since the virtual environment creates a Python* sandbox, and dependencies for the Model Optimizer do not influence the global Python configuration, installed libraries, or other components. In addition, a flag ensures that system-wide Python libraries are available in this sandbox:
    • Create a virtual environment:
      virtualenv -p /usr/bin/python3.5 .env3 --system-site-packages
    • Activate the virtual environment:
      source .env3/bin/activate
  3. Install all dependencies or only the dependencies for a specific framework:
    • To install dependencies for all frameworks:
      pip3 install -r requirements.txt
    • To install dependencies only for Caffe:
      pip3 install -r requirements_caffe.txt
    • To install dependencies only for TensorFlow:
      pip3 install -r requirements_tensorflow.txt
    • To install dependencies only for MXNet:
      pip3 install -r requirements_mxnet.txt

Using the protobuf Library in the Model Optimizer, for Caffe* on Windows*

These procedures require:

  • Access to github and the ability to use git commands
  • Microsoft Visual Studio* 2013 for Win64*
  • C/C++

The Model Optimizer uses the protobuf library to load trained Caffe models. By default, the library executes the pure Python* language implementation, which is slow. These steps implement the faster C++ implementation of the protobuf library on the Windows* OS or Linux* OS.

Building the protobuf Library on Windows* OS

  1. Clone protobuf:
    git clone https://github.com/google/protobuf.git
    cd protobuf
  2. Create a Visual Studio solution file. Run these commands:
    cd C:\Path\to\protobuf\cmake\build
    mkdir solution
    cd solution
    cmake -G "Visual Studio 12 2013 Win64" ../..
  3. Change the runtime library option for libprotobuf and libprotobuf-lite:
    • Open the project's Property Pages dialog box.
    • Expand the C/C++ tab.
    • Select the Code Generation property page.
    • Change the Runtime Library property to Multi-thread DLL (/MD).
  4. Build the libprotoc, protoc, libprotobuf, and libprotobuf-lite projects in the Release configuration.
  5. Add a path to the build directory to the PATH environment variable:
    set PATH=%PATH%;C:\Path\to\protobuf\cmake\build\solution\Release
  6. Go to the python directory:
    cd C:\Path\to\protobuf\python
  7. Use a text editor to open and change these setup.py options:
    • Change from ​libraries = ['protobuf']
      to libraries = ['libprotobuf', 'libprotobuf-lite']
    • Change from extra_objects = ['../src/.libs/libprotobuf.a', '../src/.libs/libprotobuf-lite.a']
      to extra_objects = ['../cmake/build/solution/Release/libprotobuf.lib', '../cmake/build/solution/Release/libprotobuf-lite.lib']
  8. Build the Python package with the CPP implementation:
    python setup.py build --cpp_implementation
  9. Install the Python package with the CPP implementation:
    python -m easy_install dist/protobuf-3.5.1-py3.5-win-amd64.egg
  10. Set an environment variable to boost the protobuf performance:
    set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp

How the Model Optimizer Works

The Model Optimizer loads a model into memory, reads it, builds the internal representation of the model, optimizes it and produces the Intermediate Representation. The Intermediate Representation is the only format the Inference Engine accepts.

NOTE: The Model Optimizer does not infer models. The Model Optimizer is an offline tool that runs before the inference takes place.

The Model Optimizer has two main purposes:

  • Produce a valid Intermediate Representation. If this main conversion artifact is not valid, the Inference Engine can not run. The primary responsibility of the Model Optimizer is to produce the two files that form the Intermediate Representation.
  • Produce an optimized Intermediate Representation. Pretrained models contain layers that are important for training, such as the dropout layer. These layers are useless during inference and might increase the inference time.
    In many cases, these layers can be automatically removed from the resulting Intermediate Representation. However, if a group of layers can be represented as one mathematical operation, and thus as a single layer, the Model Optimizer recognizes such patterns and replaces these layers with one. The result is an Intermediate Representation that has fewer layers than the original model. This decreases the inference time.

To produce a valid Intermediate Representation, the Model Optimizer must be able to read the original model layers and to handle their properties and represent them in Intermediate Representation format, while maintaining validity of the resulting Intermediate Representation.

For example, according to the catalog of Intermediate Representation layers, every layer must have an output. The layer output is represented in the Intermediate Representation by the output blob dimensions.

What You Need to Know About Your Model

Many common layers exist across known frameworks and neural network topologies. Examples of these layers are Convolution, Pooling, and Activation. To read the original model and produce the Intermediate Representation of a model, the Model Optimizer must be able to work with these layers.

The layer list varies by framework. See the Caffe*, TensorFlow* and MXNet* documentation for the topologies supported by each of these frameworks. If your topology contains only layers from the list of layers, as is the case for the topologies used by most users, the Model Optimizer easily creates the Intermediate Representation, after which you can proceed to working with the Inference Engine.

However, your topology may contain layers that the Model Optimizer does not recognize out of the box. See Custom Layers in the Model Optimizer to learn how to work with custom layers.

Model Optimizer Directory Structure

The Model Optimizer directory has the following structure:

|-- model_optimizer
    |-- extensions
        |-- front/caffe
            |-- CustomLayersMapping.xml.example - example of file for registering custom Caffe layers (compatible with the 2017R3 release)
    |-- mo
        |-- back - Back-End logic: contains IR emitting logic
        |-- front - Front-End logic: contains matching between framework-specific layers and IR-specific layers, and calculation of output shapes for each registered layer
        |-- graph - Graph utilities to work with internal IR representation
        |-- middle - Graph transformations - optimizations of the model
        |-- pipeline - Sequence of steps required to create IR for each framework
        |-- utils - Utility functions
    |-- tf_call_ie_layer - Source code that enables TensorFlow fallback in Inference Engine during model inference
    |-- mo.py - Centralized entry point that can be used for any supported framework
    |-- mo_caffe.py - Entry point particularly for Caffe
    |-- mo_mxnet.py - Entry point particularly for MXNet
    |-- mo_tf.py - Entry point particularly for TensorFlow
    |-- ModelOptimizer - Entry point particularly for Caffe that contains same CLI as 2017R3 publicly released Model Optimizer

Custom Layers in the Model Optimizer

The Model Optimizer searches for each layer of the input model in the list of known layers before building the model's internal representation, optimizing the model, and producing the Intermediate Representation.

The list of known layers differs for each supported framework. To see the layers supported by your framework, see the Caffe*, TensorFlow*, or MXNet* documentation.

Custom layers are layers that are not included in the list of known layers. If your topology contains any layers that are not in the list of known layers, the Model Optimizer classifies them as custom.

Caffe Models with Custom Layers

You have two options if your Caffe model has custom layers:

  • Register the custom layers as extensions to the Model Optimizer. For instructions, see Extending Model Optimizer with New Primitives. When your custom layers are registered as extensions, the Model Optimizer generates a valid and optimized Intermediate Representation. You only need to write a small chunk of Python* code that lets the Model Optimizer:
    • Generate a valid Intermediate Representation according to the rules you specified
    • Be independent from the availability of Caffe* on your computer
  • Register the custom layers as Custom and use the system Caffe to calculate the output shape of each Custom Layer, which is required by the Intermediate Representation format. For this method, the Model Optimizer requires the Caffe* Python* interface on your system. When registering the custom layer in the CustomLayersMapping.xml file, you can specify whether the layer parameters should appear in the Intermediate Representation or be skipped. To read more about the expected format and general structure of this file, see Legacy Mode for Caffe Custom Layers. This approach has several limitations:
    • If your layer output shape depends on dynamic parameters, input data, or previous layer parameters, the output shape calculated via Caffe can be incorrect. In this case, you need to patch Caffe on your own.
    • If the output shape calculation via Caffe fails inside the framework, the Model Optimizer is unable to produce any correct Intermediate Representation, and you need to investigate and patch the layer implementation in Caffe*.
    • You are not able to produce an Intermediate Representation on any machine that does not have Caffe installed. If you want to use the Model Optimizer on multiple machines, your topology contains custom layers, and you use CustomLayersMapping.xml to fall back on Caffe, you need to configure Caffe on each new machine.
    For these reasons, using Model Optimizer extensions for custom layers is the preferable option: you do not depend on the framework and you fully control the workflow.

If your model contains custom layers, it is important to understand the internal workflow of the Model Optimizer. Consider the following example.

Example:

The example network has:

  • One input layer (#1)
  • One output Layer (#5)
  • Three internal layers (#2, 3, 4)

The custom and standard layer types are:

  • Layers 2 and 5 are implemented as Model Optimizer extensions
  • Layers 1 and 4 are supported in Model Optimizer out-of-the box
  • Layer 3 is neither in the list of supported layers nor in extensions, but is specified in CustomLayersMapping.xml

NOTE: If any of the layers are not in one of three categories described above, the Model Optimizer fails with an appropriate message and a link to the corresponding question in Model Optimizer FAQ.

The general process is as shown:

Example custom layer network

  1. The example model is fed to the Model Optimizer, which loads the model with the special parser built on top of the caffe.proto file. In case of failure, the Model Optimizer asks you to prepare the parser that can read the model. For more information, refer to Model Optimizer FAQ #1.
  2. The Model Optimizer extracts the attributes of all layers. In particular, it goes through the list of layers and attempts to find the appropriate extractor. In order of priority, the Model Optimizer checks whether the layer is:
    • Registered in CustomLayersMapping.xml
    • Registered as a Model Optimizer extension
    • Registered as a standard Model Optimizer layer.

    When the Model Optimizer finds a satisfied condition from the list above, it extracts the attributes according to the following rules:

    • For bullet #1 - either takes all parameters or no parameters, according to the content of CustomLayersMapping.xml
    • For bullet #2 - takes only those parameters specified in the extension
    • For bullet #3 - takes only those parameters specified in the standard extractor
  3. The Model Optimizer calculates the output shape of all layers. The logic is the same as it is for the priorities. Important: the Model Optimizer always takes the first available option.
  4. The Model Optimizer optimizes the original model and produces the Intermediate Representation.

Extending the Model Optimizer with New Primitives

This section explains how to register a custom layer in the Model Optimizer, including how to register Proposal as a custom layer. This section also demonstrates how Proposal works as a custom layer.

The Model Optimizer loads the model, goes through the topology, and tries to find each layer type in the list of known layers. If the Model Optimizer does not find a layer in that list, it looks for the layer in the list of custom layers. If the Model Optimizer fails to find the layer among the defined custom layers, it registers a Caffe* fallback for the output shape inference. If the Model Optimizer does not find Caffe* and cannot infer shapes, the Model Optimizer fails with an appropriate message.

You must know two things about custom layers with the Model Optimizer:

  • How to map a subgraph in a FW model to a subgraph consisting of Inference Engine layers. For Caffe*, the subgraph is a 1-to-1 mapping of a Caffe layer to an Inference Engine layer.
  • How to infer shapes for unknown subgraphs. This can be either for a step in which the internal representation consists of framework-specific layers, or for a step in which the internal representation consists of Inference Engine layers.

You also have the option of a framework fallback for unknown subgraphs, for when the original framework is used for inference of output shapes of operations. The example below demonstrates the case in which the framework is not available or should not be used.

Preparing an Example Topology

NOTE: Skip this section if you have a topology with a layer that is not known to the Model Optimizer.

This section prepares a Caffe model, with the provided deployment-ready prototxt, for the well-known Faster-R-CNN topology to demonstrate the workflow. To use this example, you must have weights and biases for inference.

  1. Download the .caffemodel file.
  2. Run the Model Optimizer on the .caffemodel file:
    python mo.py --input_model ZF_faster_rcnn_final.caffemodel --input_proto test.prototxt
    You will likely see the error message:
    Error parsing text-format caffe.NetParameter: 196:16: Message type "caffe.DropoutParameter" has no field named "scale_train".
    Whether you see the error depends on your Caffe version. For example, BVLC Caffe does not support the boolean parameter scale_train for the dropout layer. The error message does not matter because the dropout layer is needed only for training, and the Model Optimizer removes it.
  3. Comment out these lines in test.prototxt:
    ...
    layer {
      name: "drop6"
      type: "Dropout"
      bottom: "fc6"
      top: "fc6"
      dropout_param {
        dropout_ratio: 0.5
        # scale_train: false # <-------------- comment out this line
      }
    }
    ...
    layer {
      name: "drop7"
      type: "Dropout"
      bottom: "fc7"
      top: "fc7"
      dropout_param {
        dropout_ratio: 0.5
        # scale_train: false # <-------------- comment out this line
      }
    }
    ...
  4. Run the Model Optimizer on this model again:
    python mo.py --input_model ZF_faster_rcnn_final.caffemodel --input_proto test.prototxt
    

You will see the message:

[ ERROR ]  Found custom layer proposal. Model Optimizer does not support this layer. 
Please, register it in CustomLayersMapping.xml or implement extension. 
For more information please refer to Model Optimizer FAQ, question #45.

This message means the Model Optimizer can load the model, but is unable to infer the shape and handle the custom layer properties.

Registering a Custom Layer as a Model Optimizer Extension

In the following sections, you will learn how to make the Model Optimizer independent from Caffe* when processing a model that has a custom layer. In this example, the custom layer is referred to as the Proposal layer.

Use this section to implement the mapping rules for the Proposal layer attributes and the output shape calculation. As part of these steps, you must first create a class for the Proposal layer and inherit it from general-purpose Op that defines the interface of every new custom layer.

In this section, it is important to understand the Op class and its function. The implementation of this class shows that it expects a graph and attributes to be passed when initializing. The Op class is defined in PATH_TO_MO/mo/ops/op.py.

Op keeps the attributes for each operation and contains the logic for handling node creation for the internal model representation. Op is also responsible for dumping each particular operation to the XML format of the Intermediate Representation. By inheriting from it, the technical details are already handled, so you can concentrate on what is specific to this layer: the attributes it supports and the rules for computing its output shape.

Follow these steps:

  1. Create the file python_proposal.py in the directory extensions/ops:
    from mo.ops.op import Op
    class PythonProposalOp(Op):
        pass
  2. Define the name of the operation and make a stub constructor:
    from mo.ops.op import Op
    class PythonProposalOp(Op):
        op = 'Python'
        def __init__(self, graph, attrs):
            super().__init__(graph)
  3. Every Op must have three specific fields defined: type, op, and infer. In most cases, the type and op names are the same, and infer is defined as a function to compute the output shape. Reflect these fields in your constructor:
    from mo.ops.op import Op
    class PythonProposalOp(Op):
        op = 'Python'
        def __init__(self, graph, attrs):
            mandatory_props = {
                'type': __class__.op,
                'op': __class__.op,
                'infer': None
            }
            super().__init__(graph, mandatory_props, attrs)
    According to the Intermediate Representation catalog, Proposal has the attributes:
    • pre_nms_topn
    • post_nms_topn
    • nms_thresh
    • feat_stride
    • min_size
    • base_size
    • ratio
    • scale
  4. When defining the supported attribute names, it is convenient to use the same names as in the original model. The names are just identifiers and have no fixed connection with the model layer properties; for clarity, you could use the name my_ratio instead of ratio. Besides defining the full list of supported parameters, you can restrict the output to only the parameters that appear in the Intermediate Representation by using the backend_attrs method.
    Define your attributes:
    class PythonProposalOp(Op):
        # ... constructor
         def supported_attrs(self):
                return [
                    'pre_nms_topn',
                    'post_nms_topn',
                    'nms_thresh',
                    'feat_stride',
                    'min_size',
                    'base_size',
                    'ratio',
                    'scale'
                ]
  5. The Model Optimizer now knows how to create the layer called Proposal when it is in the topology and the Model Optimizer knows what attributes this layer has. However, the Model Optimizer does not know how to calculate the output shape of this operation. Define a rule to calculate the output shape:
    import numpy as np
    from mo.graph.graph import Node
    from mo.ops.op import Op
    class PythonProposalOp(Op):
       def __init__(self, graph, attrs):
           mandatory_props = {
               'type': __class__.op,
               'op': __class__.op,
               'infer': PythonProposalOp.calculate_output_shape
           }
           super().__init__(graph, mandatory_props, attrs)
        # ... supported attrs
        @staticmethod
        def calculate_output_shape(node: Node):
            node.out_node().shape = (1, 1, 1, 1)  # for now, every Proposal always has the same output shape
  6. According to the Intermediate Representation catalog, the output shape of Proposal is calculated by a formula in which the shape dynamically depends on the post_nms_topn parameter.
    Implement the output calculation formula in Python:
    import numpy as np
    class PythonProposalOp(Op):
        # ... static fields
        # ... constructor
        # ... supported attrs
        @staticmethod
        def calculate_output_shape(node: Node):
            input_shape = node.in_node(0).shape
            out_shape = np.array([0, 0], dtype=np.int64)
            # rois blob: holds R regions of interest, each is a 5 - tuple
            # (n, x1, y1, x2, y2) specifying an image batch index n and a
            # rectangle(x1, y1, x2, y2)
            out_shape[0] = input_shape[0] * node.post_nms_topn
            out_shape[1] = 5
            node.out_node(0).shape = out_shape
    The node does not yet contain the post_nms_topn parameter; like the other parameters, it should be initialized with a default value in the constructor. The Inference Engine contains the implementation of a Caffe-like Proposal layer and works well with the default values from caffe.proto:
    // Message that stores parameters used by ProposalLayer
    message ProposalParameter {
      optional uint32 feat_stride = 1 [default = 16];
      optional uint32 base_size = 2 [default = 16];
      optional uint32 min_size = 3 [default = 16];
      repeated float ratio = 4;
      repeated float scale = 5;
      optional uint32 pre_nms_topn = 6 [default = 6000];
      optional uint32 post_nms_topn = 7 [default = 300];
      optional float nms_thresh = 8 [default = 0.7];
    }
  7. Change the constructor as follows:
    class PythonProposalOp(Op):
        # ... static fields
        def __init__(self, graph, attrs):
            mandatory_props = {
                'type': __class__.op,
                'op': __class__.op,
                'feat_stride': 16,
                'base_size': 16,
                'min_size': 16,
                'ratio': [0.5, 1, 2],
                'scale': [8, 16, 32],
                'pre_nms_topn': 6000,
                'post_nms_topn': 300,
                'nms_thresh': 0.7,
                'infer': PythonProposalOp.calculate_output_shape
            }
            super().__init__(graph, mandatory_props, attrs)
        # ... supported attrs
        # ... calculate output shape

Summary

In this section, you implemented support for a custom layer with type 'Python', which corresponds to the 'Proposal' layer in the topology, and learned how to calculate the output shape of this layer.

The values of the attributes are hardcoded. In the next section, you will learn how to extract these values from the original framework model.

Registering Rules to pass Extension Layer Properties from a Caffe* Model to the Intermediate Representation

The Model Optimizer now knows how to set the shape of the PythonProposalOp operation, but it is incorrect to initialize the attributes with the same values for every operation. Instead, the values should be extracted from the original topology. The Model Optimizer does not know how to map the custom layer properties to PythonProposalOp. For this, you must register a FrontExtractorOp instance.

NOTE: This step is required only if the layer requires parameters from the original model.

  1. Create the file python_proposal_ext.py in the folder PATH_TO_MO/extensions/front/caffe:
    from mo.front.extractor import FrontExtractorOp
    class PythonProposalFrontExtractor(FrontExtractorOp):
        pass
  2. Specify the operation that the extractor refers to and a specific flag. The flag represents whether the operation should be used by the Model Optimizer or should be excluded from processing:
    from mo.front.extractor import FrontExtractorOp
    class PythonProposalFrontExtractor(FrontExtractorOp):
        op = 'Python'
        enabled = True
  3. Register a mapping rule between the original model and the PythonProposalOp attributes, by overriding the following function:
    from mo.front.extractor import FrontExtractorOp
    from mo.ops.op import Op
    class PythonProposalFrontExtractor(FrontExtractorOp):
        op = 'Python'
        enabled = True
        @staticmethod
        def extract(node):
            proto_layer = node.pb
            param = proto_layer.python_param # each layer has a specific parameter, take a look at caffe.proto
            python_params = str(param.param_str) # for Python layers, all params are in param_str
            attrs = {
                'feat_stride': int(python_params.split(':')[-1])
            }
            # update the attributes of the node
            Op.get_op_class_by_name(__class__.op).update_node_stat(node, attrs)
            return __class__.enabled
    You have successfully extracted the parameter feat_stride from prototxt, assuming it is the only parameter in this layer.
  4. To increase the implementation's flexibility:
    import ast
    from mo.front.extractor import FrontExtractorOp
    from mo.ops.op import Op
    class PythonProposalFrontExtractor(FrontExtractorOp):
        op = 'Python'
        enabled = True
        @staticmethod
        def extract(node):
            proto_layer = node.pb
            param = proto_layer.python_param
            attrs = PythonProposalFrontExtractor.parse_param_str(str(param.param_str))
            # update the attributes of the node
            Op.get_op_class_by_name(__class__.op).update_node_stat(node, attrs)
            return __class__.enabled
        @staticmethod
        def parse_param_str(param_str: str):
            if param_str[0] != '{' and param_str[-1] != '}':
                param_str = '{' + param_str + '}'
            return ast.literal_eval(param_str)
    You can successfully convert the model. Open the .xml file and view your code:
    ...
    <layer id="42" name="proposal" precision="FP32" type="Python">
        <data base_size="16" feat_stride="16" min_size="16" nms_thresh="0.7" post_nms_topn="300" pre_nms_topn="6000" ratio="[0.5, 1, 2]" scale="[8, 16, 32]"/>
        <input>
            <port id="0">
                <dim>1</dim>
                <dim>18</dim>
                <dim>15</dim>
                <dim>15</dim>
            </port>
            <port id="1">
                <dim>1</dim>
                <dim>36</dim>
                <dim>15</dim>
                <dim>15</dim>
            </port>
            <port id="2">
                <dim>1</dim>
                <dim>3</dim>
            </port>
        </input>
        <output>
            <port id="3">
                <dim>300</dim>
                <dim>5</dim>
            </port>
        </output>
    </layer>
    ...

Look at the output shape of the custom layer you implemented. The shape was calculated according to the rules specified in PythonProposalOp. The ratio and scale properties have the values [0.5, 1, 2] and [8, 16, 32]. They have square brackets because they are originally repeated parameters. You converted each parameter to a list in PythonProposalOp, and the Model Optimizer cast the value to a string. According to Python rules, a list has a string representation of opening and closing square brackets with values joined by commas.

This is not a valid notation for the Intermediate Representation specification, because repeated parameters must be separated by a comma but without the brackets. Therefore, you must override the Model Optimizer default behavior regarding how it handles those parameters during the Intermediate Representation emitting stage, after the optimizations are complete. To do so, implement backend_attrs() in the PythonProposalOp class:

class PythonProposalOp(Op):
    ... other methods
     def backend_attrs(self) -> list:
            """
            Gets list of attributes that should appear in resulting IR
            Returns:
                list of attributes names or list of tuples (name of attribute, pre-processing rule)
            """
            return [
                (  # a tuple per attribute
                    'ratio',  # name of attribute
                    # pre-processing rule in a form of lambda
                    # lambda takes a PythonProposalOp node with all defined properties
                    # it translates [1,2,3] -> "1,2,3"
                    lambda node: ','.join(map(str, node['ratio']))
                ),
                (
                    'scale',
                    lambda node: ','.join(map(str, node['scale']))
                ),
                'feat_stride',
                'base_size',
                'min_size',
                'pre_nms_topn',
                'post_nms_topn',
                'nms_thresh'
            ]
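
As a quick, standalone illustration of what the pre-processing lambdas above produce (this snippet is not part of the Model Optimizer):

ratio = [0.5, 1, 2]
print(str(ratio))                 # "[0.5, 1, 2]" - the default string form, invalid in the IR
print(','.join(map(str, ratio)))  # "0.5,1,2"     - the comma-separated form the lambda produces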

The model can now be successfully converted.

Open the .xml file. ratio and scale have the expected correct values 0.5,1,2 and 8,16,32.

NOTE: The Model Optimizer supports the Faster-R-CNN topology. Run the following command for the same Intermediate Representation:

Summary

In this section you learned how to:

  1. Create a framework-independent extension implementation of the Intermediate Representation custom layer with unified logic for calculating output shapes, specified set of attributes, and so on;
  2. Use the Framework-Specific property extractor to map original model custom layer properties to the expected properties of the Framework-Independent extension;
  3. Manipulate the custom layer properties representation in the resulting Intermediate Representation.

Files used in this section:

  • extensions/ops/python_proposal.py:
    import numpy as np
    from mo.graph.graph import Node
    from mo.ops.op import Op
    class PythonProposalOp(Op):
        op = 'Python'
        def __init__(self, graph, attrs):
            mandatory_props = {
                'type': __class__.op,
                'op': __class__.op,
                'feat_stride': 16,
                'base_size': 16,
                'min_size': 16,
                'ratio': [0.5, 1, 2],
                'scale': [8, 16, 32],
                'pre_nms_topn': 6000,
                'post_nms_topn': 300,
                'nms_thresh': 0.7,
                'infer': PythonProposalOp.calculate_output_shape
            }
            super().__init__(graph, mandatory_props, attrs)
        def supported_attrs(self):
            return [
                'pre_nms_topn',
                'post_nms_topn',
                'nms_thresh',
                'feat_stride',
                'min_size',
                'base_size',
                'ratio',
                'scale'
            ]
        def backend_attrs(self) -> list:
            """
            Gets list of attributes that should appear in resulting IR
            Returns:
                list of attributes names or list of tuples (name of attribute, pre-processing rule)
            """
            return [
                (  # a tuple per attribute
                    'ratio',  # name of attribute
                    # pre-processing rule in a form of lambda
                    # lambda takes a PythonProposalOp node with all defined properties
                    # it translates [1,2,3] -> "1,2,3"
                    lambda node: ','.join(map(str, node['ratio']))
                ),
                (
                    'scale',
                    lambda node: ','.join(map(str, node['scale']))
                ),
                'feat_stride',
                'base_size',
                'min_size',
                'pre_nms_topn',
                'post_nms_topn',
                'nms_thresh'
            ]
        @staticmethod
        def calculate_output_shape(node: Node):
            input_shape = node.in_node(0).shape
            out_shape = np.array([0, 0], dtype=np.int64)
            # rois blob: holds R regions of interest, each is a 5 - tuple
            # (n, x1, y1, x2, y2) specifying an image batch index n and a
            # rectangle(x1, y1, x2, y2)
            out_shape[0] = input_shape[0] * node.post_nms_topn
            out_shape[1] = 5
            node.out_node(0).shape = out_shape
  • extensions/front/caffe/python_proposal_ext.py:
    import ast
    from mo.front.extractor import FrontExtractorOp
    from mo.ops.op import Op
    class PythonProposalFrontExtractor(FrontExtractorOp):
        op = 'Python'
        enabled = True
        @staticmethod
        def extract(node):
            proto_layer = node.pb
            param = proto_layer.python_param
            attrs = PythonProposalFrontExtractor.parse_param_str(str(param.param_str))
            # update the attributes of the node
            Op.get_op_class_by_name(__class__.op).update_node_stat(node, attrs)
            return __class__.enabled
        @staticmethod
        def parse_param_str(param_str: str):
            if param_str[0] != '{' and param_str[-1] != '}':
                param_str = '{' + param_str + '}'
            return ast.literal_eval(param_str)
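
As a quick, standalone illustration of how the parse_param_str logic above behaves, the following snippet re-implements it and applies it to a hypothetical param_str value:

import ast

# Standalone re-implementation of the parse_param_str logic above,
# applied to a hypothetical param_str value:
param_str = "'feat_stride': 16, 'scales': [8, 16, 32], 'ratios': [0.5, 1, 2]"
if param_str[0] != '{' and param_str[-1] != '}':
    param_str = '{' + param_str + '}'
params = ast.literal_eval(param_str)
print(params['feat_stride'])  # 16
print(params['scales'])       # [8, 16, 32]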

Legacy Mode for Caffe* Custom Layers

The Model Optimizer can register custom layers in a way that the output shape is calculated by the Caffe* framework installed on your system. This section covers this option.

NOTE: The Caffe Python API has an issue when the layer name does not correspond to the name of its top. The fix was implemented in BVLC Caffe*. The Caffe framework on your computer must contain this fix. Otherwise, the Caffe framework can unexpectedly fail during the fallback procedure.

NOTE: The Caffe fallback feature was validated against a specific BVLC Caffe* GitHub revision. You may have issues with forks or later Caffe framework versions.

  1. Create a file CustomLayersMapping.xml:
    mv extensions/front/caffe/CustomLayersMapping.xml.example extensions/front/caffe/CustomLayersMapping.xml
  2. Add (register) custom layers to CustomLayersMapping.xml:
    <CustomLayer NativeType="${Type}" hasParam="${has_params}" protoParamName="${layer_param}"/>

Where:

  • ${Type} is the type of the layer in Caffe*.
  • ${has_params} is "true" if the layer has parameters and "false" otherwise.
  • ${layer_param} is the name of the layer parameters message in caffe.proto, if the layer has one.

Example:

  1. The Proposal layer has parameters, and they appear in the Intermediate Representation. The parameters are stored in the proposal_param property of the layer:
    <CustomLayer NativeType="Proposal" hasParam="true" protoParamName="proposal_param"/>
  2. The CustomLayer layer has no parameters:
    <CustomLayer NativeType="CustomLayer" hasParam="false"/>

For this feature, you need an appropriate version of Caffe installed on the computer on which you run the Model Optimizer.

Constraints of Using the Caffe Fallback

Several layers in the Caffe framework can have output shapes that dynamically depend on the input data, not only on the preceding layers and their parameters. For example, SimplerNMS filters out bounding boxes that do not satisfy a condition. Internally, the Caffe* fallback forwards the whole network without any meaningful data, just noise, so it is natural to get only one bounding box (0,0,0,0) instead of the expected number (for example, 15). There is an option to patch Caffe* accordingly, but that ties the success of Intermediate Representation generation to the patched Caffe* installed on a particular machine. To keep the solution independent from Caffe*, we recommend using the extensions mechanism for such layers.

Known cases like Proposal, DetectionOutput, SimplerNMS are implemented as extensions and can be used out of the box.

A detailed description of supported layers is in the Intermediate Representation Layers Notation Reference Catalog.

Building Caffe*

  1. Build Caffe* with Python* 3.5:
    export CAFFE_HOME=PATH_TO_CAFFE
    cd $CAFFE_HOME
    rm -rf  ./build
    mkdir ./build
    cd ./build
    cmake -DCPU_ONLY=ON -DOpenCV_DIR=<your opencv install dir> -DPYTHON_EXECUTABLE=/usr/bin/python3.5 ..
    make all # also builds pycaffe
    make install
    make runtest # optional
  2. Add Caffe* Python directory to PYTHONPATH to let it be imported from the Python program:
    export PYTHONPATH=$CAFFE_HOME/python:$PYTHONPATH
  3. Check the Caffe* installation:
    python3
    import caffe

If Caffe was installed correctly, the Caffe module is imported without errors.

TensorFlow Models With Custom Layers

You have three options for TensorFlow* models with custom layers:

  • Register those layers as extensions to the Model Optimizer. In this case, the Model Optimizer generates a valid and optimized Intermediate Representation.
  • If you have sub-graphs that should not be expressed with the analogous sub-graph in the Intermediate Representation, but another sub-graph should appear in the model, the Model Optimizer provides such an option. This feature is helpful for many TensorFlow models. To read more, see Sub-graph Replacement in the Model Optimizer.
  • Experimental feature of registering definite sub-graphs of the model as those that should be offloaded to TensorFlow during inference. In this case, the Model Optimizer produces an Intermediate Representation that:
    • Can be inferred only on CPU
    • Reflects each sub-graph as a single custom layer in the Intermediate Representation

    For more information, see Offloading Computations to TensorFlow. This feature is for development only. It is expected to be used when you have a model with a complex structure and it is not an easy task to write extensions for its internal subgraphs. In this case, you offload these complex subgraphs to TensorFlow to make sure that the Model Optimizer and the Inference Engine can successfully execute your model. However, for each such subgraph the TensorFlow library is called, which is not optimized for inference. You then replace each subgraph with an extension and remove its offloading to TensorFlow* during inference, until the whole model is converted by the Model Optimizer and inferred by the Inference Engine only, with maximum performance.

Sub-Graph Replacement in the Model Optimizer

Several reasons exist for why the Model Optimizer could not generate an Intermediate Representation for a model. However, in some cases, the Intermediate Representation can be generated after providing certain hints to the tool. The examples of hints below are mostly related to TensorFlow*, but potentially apply to models created in any framework:

  • The topology contains an operation (or a sub-graph of operations) not known to the Model Optimizer, but this operation (sub-graph) can be expressed as a combination of known operations. The hint is then a description of this combination to the tool.
  • A sub-graph of operations in the topology expresses a single layer known to the Inference Engine.
  • TensorFlow* and the Inference Engine use different tensor layouts, NHWC and NCHW respectively. If a tensor in NHWC layout is flattened (for example, all the dimensions are squashed into a single one), it is not possible to convert it to the NCHW layout required for the Inference Engine, so the Model Optimizer cannot produce a correct Intermediate Representation (see the sketch after this list).
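
A small, purely illustrative NumPy snippet (unrelated to the Model Optimizer code) shows why the layout conversion works only while the tensor is still 4D:

import numpy as np

# Hypothetical 4D activation in NHWC layout: batch=1, height=2, width=2, channels=3.
nhwc = np.arange(1 * 2 * 2 * 3).reshape(1, 2, 2, 3)
# While the tensor is still 4D, NHWC -> NCHW is a simple axis permutation:
nchw = nhwc.transpose(0, 3, 1, 2)
print(nchw.shape)   # (1, 3, 2, 2)
# Once the tensor is flattened, the per-axis structure is lost and the
# equivalent permutation can no longer be recovered:
flat = nhwc.reshape(1, -1)
print(flat.shape)   # (1, 12)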

The detailed solutions for the examples above are given later, but for now let's look at what is common to all three examples.

Sub-graph Replacement

The sub-graph (or a single node) of initial graph is replaced with a new sub-graph (single node) in these cases. The sub-graph replacement consists of the following steps:

  1. Identify an existing sub-graph for replacement.
  2. Generate a new sub-graph.
  3. Connect the new sub-graph to the graph (create input edges to the new sub-graph).
  4. Create output edges from the new sub-graph to the graph.
  5. Do something with the original sub-graph (for example, remove it).

Model Optimizer provides several ways to perform most of the sub-graph replacement steps. The next subsections describe these methods.

Replace a Single Operation With a Sub-graph of Operations

For example, TensorFlow* has an operation "SquaredDifference", which calculates (a - b)^2, where a and b are input tensors. The Inference Engine does not support such an operation. However, SquaredDifference can be expressed using two Power operations and one Eltwise Add. The Power operation calculates scale * (a ^ power) + shift, where a is a tensor and scale, power and shift are float values. The first Power operation negates the value of tensor b. The second one squares the result of a + (-b), which is calculated with the elementwise Add operation applied to tensor a and tensor -b.
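
The decomposition can be checked numerically with a small, standalone NumPy snippet (independent of the Model Optimizer):

import numpy as np

# Verify that scale * (x ^ power) + shift, applied as described above,
# reproduces SquaredDifference = (a - b)^2 on sample inputs:
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 4.0, 3.0])
neg_b = -1.0 * (b ** 1.0) + 0.0        # first Power: scale = -1
summed = a + neg_b                     # Eltwise Add
squared = 1.0 * (summed ** 2.0) + 0.0  # second Power: power = 2
assert np.allclose(squared, (a - b) ** 2)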

Given that, we can replace all SquaredDifference operations in the initial model with two Power operations and one Eltwise. Now let's take a look at the implementation of that replacer in the file extensions/front/SquaredDifference.py:

import networkx as nx
from mo.front.common.replacement import FrontReplacementOp
from mo.graph.graph import Node
from mo.ops.eltwise import Eltwise
from mo.ops.power import Power
class SquaredDifference(FrontReplacementOp):
    """
    Example class illustrating how to implement replacement of a single op in the front-end of the MO pipeline.
    This class replaces a single op "SquaredDifference" by a sub-graph consisting of 3 lower-level ops.
    """
    op = "SquaredDifference"
    enabled = True
    def replace_op(self, graph: nx.MultiDiGraph, node: Node):
        negate = Power(graph, dict(scale=-1, name=node.name + '/negate_'))
        add = Eltwise(graph, dict(operation='sum', name=node.name + '/add_'))
        squared = Power(graph, dict(power=2, name=node.name + '/squared_'))
        out_node = squared.create_node([add.create_node([node.in_node(0), negate.create_node([node.in_node(1)])])])
        # Replace edge from out port 0 of the matched node with an edge from node out_node.id with port 0.
        # The "explicit" version of the return value is: [(out_node.id, 0)])
        return [out_node.id]

The Model Optimizer internal representation of the graph uses the networkx module.

Key lines:

  • Line 1: Imports the networkx module.
  • Line 3: Imports class FrontReplacementOp, which is used to replace an operation of a particular type with a new sub-graph. This class performs the first step of the sub-graph replacement (identify an existing sub-graph for replacement). It is important to mention that the replacement happens before shape inference and before the creation of data nodes representing tensors with values. At this stage of the model conversion pipeline, all nodes in the graph are either operation nodes or nodes of type "Const" that produce a tensor with a fixed value embedded into the node.
  • Line 4: Imports class "Node" representing a single node in the computation graph.
  • Lines 5 - 6: Import classes representing operations Power and Eltwise. These classes are inherited from base class "mo.ops.Op" that represents operation and stores its attributes.
  • Line 9: Defines class SquaredDifference, inherited from FrontReplacementOp. This is a replacer class that is automatically registered and executed by the Model Optimizer. Since the class is located in the common (not framework-specific) directory extensions/front, it is used for replacement in all supported frameworks.
  • Line 15: Defines the class variable op that stores the name of the operation to be replaced. In this case, it is "SquaredDifference".
  • Line 16: Defines the class variable enabled that controls whether the replacer is enabled or not. The only function that should be implemented in the class is replace_op. It gets the graph to operate on and an instance of the node of the desired operation ("SquaredDifference" in this case). This function performs steps two and three of the sub-graph replacement (generate a new sub-graph to replace the matched one and connect it to the graph).
  • Lines 19 - 21: Create instances of the operation classes with the required attributes.
  • Line 23: Creates a sub-graph from the operations defined above. The "create_node" method of the "Op" class generates a Node from the Op and takes a single mandatory argument: the list of input nodes (represented as instances of the Node class) used to create input edges to the node being generated. Inputs of the SquaredDifference node are retrieved using the node.in_node(0) and node.in_node(1) method calls. The elementwise Add node gets as its first input the initial first input of the "SquaredDifference" node; the second input of Add is the result of negating the second input of "SquaredDifference": [add.create_node([node.in_node(0), negate.create_node([node.in_node(1)])])]. Then the result of the Add node is squared; the "out_node" node performs this calculation.

The "replace_op" function returns a list of node names used to create output edges of the sub-graph to connect it with the rest of the graph. Each element of the list describes mapping between old output edge of the matched node and new sub-graph node and output edge index. The i-th element of the list corresponds to the i-th output tensor of the matched node. In this case, "SquaredDifference" produces single tensor through output port 0, so the returned list contains single element. In general case each element is a tuple where first element is the name of a new node producing required tensor and the second is the output port for that tensor. If the output port is 0 it is possible to use shortcut - just the name of the node instead of a tuple. Line 26 uses this shortcut. The returned values is used to create the new sub-graph output edges (step 4 of the sub-graph replacement).

Default implementation of the FrontReplacementOp class removes matched node and all its input/output edges (step five of the sub-graph replacement).

Another example of this kind of replacement is in the extensions/front/Sub.py class, where all instances of the Sub operation are replaced with two operations: Power to negate the second argument and Eltwise to perform the elementwise add.

Replace Sub-graph of Operations With a New Sub-graph of Operations

The previous example considered the situation when one single node of a specific type is replaced. When it is necessary to replace a sub-graph of operations, you must tell the Model Optimizer how to identify this sub-graph. There are three options for achieving that:

  1. Use graph isomorphism pattern of the networkx module.
  2. Use nodes name pattern to identify "scope" (according to TensorFlow* terminology) to be replaced.
  3. Use sets of "start" and "end" node names to match all nodes "between" them.

Let's review each option based on real examples.

Replace Sub-graph of Operations Using Graph Isomorphism Pattern

The networkx Python module provides methods to find a graph isomorphic to a given one using node and edge matches: networkx.algorithms.isomorphism.categorical_node_match, networkx.algorithms.isomorphism.categorical_multiedge_match, and so on. The Model Optimizer uses these methods and provides a simple API to use that feature.
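
The following toy snippet, unrelated to any real model and independent of the Model Optimizer API, shows how these networkx primitives match a pattern by the 'op' node attribute and the 'in' edge attribute:

import networkx as nx
from networkx.algorithms import isomorphism

# Toy graph containing a Mean -> StopGradient -> Squeeze chain:
big = nx.MultiDiGraph()
big.add_node('a', op='Mean')
big.add_node('b', op='StopGradient')
big.add_node('c', op='Squeeze')
big.add_edge('a', 'b', **{'in': 0})
big.add_edge('b', 'c', **{'in': 0})

# Pattern: a Mean node feeding input 0 of a StopGradient node.
pattern = nx.MultiDiGraph()
pattern.add_node('x', op='Mean')
pattern.add_node('y', op='StopGradient')
pattern.add_edge('x', 'y', **{'in': 0})

matcher = isomorphism.MultiDiGraphMatcher(
    big, pattern,
    node_match=isomorphism.categorical_node_match('op', None),
    edge_match=isomorphism.categorical_multiedge_match('in', None))
print(matcher.subgraph_is_isomorphic())  # True: the pattern occurs in 'big'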

For example, Caffe* has a layer called Mean-Variance Normalization (MVN), which is also supported by the Inference Engine. This layer is implemented with low-level operations in TensorFlow*: Mean, StopGradient, SquaredDifference, Squeeze and FusedBatchNorm. The Model Optimizer should replace the sub-graph consisting of these operations with a single Inference Engine layer of type "MVN".

The file extensions/front/tf/mvn.py performs such a replacement. The first part of the file is:

class MVN(FrontReplacementSubgraph):
    enabled = True
    def pattern(self):
        log.debug('Enabled MVN replacement')
        return dict(
            nodes=[
                ('mean', dict(op='Mean')),
                ('stop_grad', dict(op='StopGradient')),
                ('sqdiff', dict(op='SquaredDifference')),
                ('variance', dict(op='Mean')),
                ('squeeze_mean', dict(op='Squeeze')),
                ('squeeze_variance', dict(op='Squeeze')),
                ('fbn', dict(op='FusedBatchNorm')),
            ],
            edges=[
                ('mean', 'stop_grad', {'in': 0}),
                ('stop_grad', 'sqdiff', {'in': 1}),
                ('sqdiff', 'variance', {'in': 0}),
                ('mean', 'squeeze_mean', {'in': 0}),
                ('variance', 'squeeze_variance', {'in': 0}),
                ('squeeze_mean', 'fbn', {'in': 3}),
                ('squeeze_variance', 'fbn', {'in': 4}),
            ],
            node_attrs=['op'],
            edge_attrs=['in'])

In this file:

  • Line 1: Defines class "MVN" inherited from class FrontReplacementSubgraph that performs sub-graph replacement using sub-graph isomorphism pattern.
  • Line 3: Sets class variable "enabled" to value True meaning that this replacer is enabled.
  • The function "pattern" defines the sub-graph constraints to be matched. It returns a dictionary with four keys:
    • the "nodes" defines a list of nodes to be matched. Each element in the list is a tuple. The first element is the alias name assigned for the matched node; the second element is a dictionary with desired attributes of the node.
    • the "edges" defines a list of edges to be matched. Each element in the list is a tuple. The first and the second elements are the start and end edge nodes alias names respectively. The third element is a dictionary with desired edge attributes.
    • the "node_attrs" contains the names of nodes attributes to use during sub-graph isomorphism search.
    • the "edge_attrs" contains the names of edges attributes to use during sub-graph isomorphism search.
      The sub-graph is matched if all provided constraints are satisfied. If at least one node with desired attributes is missing or at least one defined edge is absent then the sub-graph is not matched.
  • Line 9: Adds a constraint that the sub-graph should contain a node with attribute "op" equal to "Mean". The matched node gets the alias name "mean". In the same way, line 10 adds a constraint for a "StopGradient" node, whose match gets the alias name "stop_grad", and so on.
  • Now look at how the edge constraints are defined. Line 18 defines an edge from the node with alias name "mean" to the node with alias name "stop_grad" having attribute "in" equal to 0. This means that the output of node "mean" is connected to node "stop_grad" as the first input (the Model Optimizer uses zero-based indexing, which is why "in" is 0). Another example is line 25, where the edge from "squeeze_mean" is connected to the "fbn" node as the fourth input.
  • Lines 26 - 27: Specify the lists of attributes to be checked. In fact, these are just lists of all keys in the dictionaries of node and edge attributes.

Now that the Model Optimizer knows how to find the sub-graph (step one of the sub-graph replacement), it is necessary to implement a function that performs the actual sub-graph replacement (steps two and three). The code for this function is:

def replace_sub_graph(self, graph: nx.MultiDiGraph, match: dict):
    fbn = match['fbn']
    input = fbn.in_node(0)
    log.debug('Found potential MVN pattern after {} with name {}'.format(input.op, input.name))
    if input.id != match['mean'].in_node(0).id or input.id != match['sqdiff'].in_node(0).id:
        return
    log.debug('Confirmed MVN pattern after {} with name {}'.format(input.op, input.name))
    MVN = Op.get_op_class_by_name('MVN')
    mvn = MVN(graph, dict(
        name=fbn.name + '/MVN_',
        eps=fbn.eps,
        required_reduction_indices=[1,2] if fbn.data_format == b'NHWC' else [2,3]
    ))
    mvn.attrs['old_infer'] = mvn.attrs['infer']
    mvn.attrs['infer'] = __class__.infer
    mul = Eltwise(graph, dict(operation='mul', name=fbn.name + '/Mul_'))
    add = Eltwise(graph, dict(operation='sum', name=fbn.name + '/Add_'))
    input_gamma = fbn.in_node(1)
    input_beta = fbn.in_node(2)
    mean_reduction = match['mean'].in_node(1)
    variance_reduction = match['variance'].in_node(1)
    new_subgraph = add.create_node([
        mul.create_node([
            mvn.create_node([input, mean_reduction, variance_reduction]),
            input_gamma
        ]),
        input_beta
    ])
    replace_node(fbn, new_subgraph)

The function accepts two arguments: the graph and the dictionary "match". The keys in the dictionary are the alias names of the matched nodes (defined in the "nodes" list in the function "pattern") and the values are the matched nodes of the graph (instances of the Node class).

The function generates a new sub-graph with a node of type "MVN" and two nodes of type Eltwise calculating the sum and the product. There is nothing special about how the graph is generated or the mathematics behind it, so attention is put on two aspects of this function.

The first one is the call to the function "replace_node" in line 36. The FusedBatchNorm node is replaced with the output node of the generated sub-graph: all input edges of the FusedBatchNorm node are re-connected to the "new_subgraph" node, and all consumers of the FusedBatchNorm node are updated to get their inputs from the "new_subgraph" node. This action connects the newly generated sub-graph with the existing graph (step 4 of the sub-graph replacement).

The second one is that the default implementation of the inference function for the MVN operation is overwritten. In line 16, the default implementation of the inference function for MVN is saved to the attribute "old_infer". In line 17, the new inference function is saved to the instance of the MVN operation class. Let's take a look at the new inference function code:

@staticmethod
def infer(node: Node):
    if not(node.in_node(1).has_valid('value') and node.in_node(2).has_valid('value')):
        log.warning('Reduction indices for mean and variance for MVN node {} are not constants'.format(node.name))
        return
    if not(all(node.in_node(1).value == node.required_reduction_indices) and
        all(node.in_node(2).value == node.required_reduction_indices)):
        log.warning('Reduction indices for mean {} and variance {} do not match required ones {}'.format(
            node.in_node(1).value,
            node.in_node(2).value,
            node.required_reduction_indices
        ))
        return
    node.graph.remove_edge(node.in_node(1).id, node.id)
    node.graph.remove_edge(node.in_node(2).id, node.id)
    node.old_infer(node)

The infer function is needed to infer the value of the node (if possible) and to infer the shapes of the output tensors of the node (mandatory). The custom infer function performs additional checks that describe the limitations of the MVN layer implementation in the Inference Engine. For example, the reduction indices for mean and variance must be constants (line 10), while in TensorFlow* they could be computed during model inference. In addition, the function removes two edges from the graph (lines 17 and 18) because all required information is already stored in the MVN node attributes. This is due to the different MVN layer implementations in the Inference Engine and TensorFlow*: mean and variance are attributes of the node in the Inference Engine, while in TensorFlow* they are input tensors. The edges are not removed in the "replace_sub_graph" function because they are used in the "infer" function (lines 7-12).

The last action in the "infer" method (line 19) is to call the default infer function for the MVN, which is saved in the "old_infer" attribute of the node, to infer the output tensor shapes.

What about step 5 of the sub-graph replacement? What happens to the matched nodes? After the "fbn" node is replaced with the newly created sub-graph node, the remaining six matched nodes are no longer connected to the inputs of the network. They are automatically removed during the dead code elimination pass that is performed after applying the defined custom sub-graph replacements. Since they are not marked as output nodes (using the --output command line parameter), they can be removed.

The replacement works for all sub-graph isomorphism instances found in the network.

Replace Sub-graph of Operations Using Nodes Name Pattern

TensorFlow* uses the mechanism of scopes to group related operation nodes. It is good practice to put nodes performing a particular task into a scope. This approach divides the graph into logical blocks that are easier to review in TensorBoard. The "scope", in fact, just defines a common prefix for the node names in the scope.

For example, Inception topologies contain several types of so-called "Inception blocks". Some of them are exactly equal to each other, but located in different places of the network. For example, Inception V4 from the tensorflow.contrib.slim module has inception blocks "Mixed_5b", "Mixed_5c" and "Mixed_5d" with exactly the same nodes and the same attributes.

Now consider a situation when someone has implemented these Inception blocks extremely efficiently using a single Inference Engine custom layer called "InceptionBlock" and would like to replace these blocks with instances of this layer to decrease inference time. The Model Optimizer provides a mechanism to replace sub-graphs of operations defined by regular expressions for the node name prefixes (also known as scopes). In this particular case, some of the patterns are ".*InceptionV4/Mixed_5b", ".*InceptionV4/Mixed_5c" and ".*InceptionV4/Mixed_5d". Each pattern starts with ".*" because the prefix "InceptionV4" is added to all node names during a model freeze.
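
A quick, standalone check of how such a scope pattern behaves (the full node names below are hypothetical):

import re

pattern = re.compile(r'.*InceptionV4/Mixed_5b')
names = [
    'InceptionV4/InceptionV4/Mixed_5b/Branch_0/Conv2d_0a_1x1/Conv2D',
    'InceptionV4/InceptionV4/Mixed_5c/Branch_0/Conv2d_0a_1x1/Conv2D',
]
print([bool(pattern.match(n)) for n in names])  # [True, False]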

Sub-graph replacement using a node name pattern is a bit trickier than the replacement of a single operation or the networkx isomorphism pattern described above. The following additional steps should be done in comparison with the previously described replacements:

  1. Prepare configuration file template defining node names patterns and information about custom layer attributes.
  2. Run Model Optimizer with command line parameter to add information about input and output nodes of the specified sub-graphs.

Consider the following possible configuration file for the Inception Block replacer:

[
    {
        "custom_attributes": {
            "attr1_key": "attr1_value",
            "attr2_key": 123456
        },
        "id": "InceptionBlockReplacer",
        "op": "InceptionBlock",
        "instances": [
            ".*InceptionV4/Mixed_5b",
            ".*InceptionV4/Mixed_5c",
            ".*InceptionV4/Mixed_5d"
        ],
        "match_kind": "scope"
    }
]

The JSON file contains a list of dictionaries. Each dictionary defines one replacement. Each replacement is defined with several keys:

  • "id" (mandatory) is the unique identifier of the replacer. It is used in the Python code that implements sub-graph replacement to link the class and the replacement description from the configuration file.
  • "match_kind" (mandatory) is the string that specifies what matching algorithm is used. Currently supported "scope" and "points". In this example the first one is considered. The "points" match kind is described below.
  • "instances" (mandatory) specifies instances of the sub-graph to be matched. It contains list of node names prefixes patterns for the match kind "scope".
  • "custom_attributes" (optional) is dictionary with static attributes of the layer to be dumped to Inference Engine Intermediate Representation XML file.
  • "op" (optional) is used only if the sub-graph replacement Python code is not needed because the sub-graph should be replaced with a single node of type "op". If this attribute is not set then it is necessary to implement Python code with sub-graph generation code. Both options are considered in this example.

When the configuration file is ready, run the Model Optimizer with the regular command line parameters pointing to the file with the model and input shapes (if necessary) and the additional parameter "--tensorflow_custom_operations_config_update" pointing to the generated configuration file. If the file is correct, the Model Optimizer adds two keys to the "InceptionBlockReplacer" dictionary, "inputs" and "outputs", with the following content:

[
    {
        "id": "InceptionBlockReplacer",
        ...
        "inputs": [
            [
                {
                    "node": "Branch_2/Conv2d_0a_1x1/Conv2D$",
                    "port": 0
                },
                {
                    "node": "Branch_3/AvgPool_0a_3x3/AvgPool$",
                    "port": 0
                },
                {
                    "node": "Branch_1/Conv2d_0a_1x1/Conv2D$",
                    "port": 0
                },
                {
                    "node": "Branch_0/Conv2d_0a_1x1/Conv2D$",
                    "port": 0
                }
            ]
        ],
        "outputs": [
            {
                "node": "concat$",
                "port": 0
            }
        ]
    }
]

The value for key "inputs" is a list of lists describing input tensors of the sub-graph. Each element of the top-level list corresponds to one unique input tensor of the sub-graph. Each internal list describes list of nodes consuming this tensor and port numbers where the tensor is consumed. Model Optimizer generates regular expressions for the input nodes names to uniquely identify them in each instance of the sub-graph defined by the "instances". Denote these nodes as input nodes of the sub-graph.

In the InceptionV4 topology, the "InceptionV4/Mixed_5b" block has four input tensors from outside of the sub-graph, but all of them are produced by the node "InceptionV4/Mixed_5a/concat". Therefore, the top-level list of "inputs" contains one list corresponding to this tensor. Four input nodes of the sub-graph consume the tensor produced by the "InceptionV4/Mixed_5a/concat" node. In this case, all four input nodes consume the input tensor at port 0.

The order of items in the internal list describing nodes does not matter, but the order of elements in the top-level list is important. This order defines how the Model Optimizer attaches input tensors to the newly generated node if the sub-graph is replaced with a single node. The i-th input node of the sub-graph is obtained using the call "match.single_input_node(i)" in the sub-graph replacer code. More information about the API is given below. The configuration file can be edited in a text editor to change the order of input tensors if necessary.

The value for key "outputs" is a list describing nodes of the sub-graph producing tensor that goes outside of the sub-graph or do not have child nodes. Denote these nodes as output nodes of the sub-graph. The order of elements in the list is important. The i-th element of the list describes the i-th output tensor of the sub-graph which could be obtained using call "match.output_node(i)". The order of elements can be manually changed in the configuration file. Model Optimizer uses this order to connect output edges if the sub-graph is replaced with a single node.

Now when meaning of "inputs" and "outputs" attributes is clean return back to the replacer implementation. The replacer "InceptionBlockReplacer" contains attribute "op" with the value "InceptionBlock" that means that the identified sub-graph should be replaced with a single layer of type "InceptionBlock". Such a layer is not known for Model Optimizer so it is necessary to define it. See Extending the Model Optimizer with New Primitives. You must create file "extension/ops/InceptionBlock.py" with the following content:

import numpy as np
from mo.graph.graph import Node
from mo.ops.op import Op
class InceptionBlock(Op):
    op = "InceptionBlock"
    enabled = True
    def __init__(self, graph, attrs):
        super().__init__(graph, attrs, {
            'type': __class__.op,
            'op': __class__.op,
        })

The shape inference function is not defined. In this case, the Model Optimizer uses the TensorFlow* fallback to calculate the shapes of the sub-graph output tensors.

Run the Model Optimizer with the command line parameter --tensorflow_use_custom_operations_config pointing to the created configuration file. Of course, the regular command line parameters with the path to the model file and the input shape (if necessary) should also be provided. The Model Optimizer generates an Intermediate Representation .xml file with three sequential layers of type "InceptionBlock", like this:

<layer id="1658" name="InceptionBlock1877" precision="FP32" type="InceptionBlock">
    <input>
        <port id="0">
            <dim>1</dim>
            <dim>384</dim>
            <dim>35</dim>
            <dim>35</dim>
        </port>
    </input>
    <output>
        <port id="1">
            <dim>1</dim>
            <dim>384</dim>
            <dim>35</dim>
            <dim>35</dim>
        </port>
    </output>
</layer>

The implementation of the sub-graph replacement by scope with a single layer is complete. Now look at how the Model Optimizer replaces a sub-graph identified by start/end nodes (also known as "points") with another sub-graph.

Replace Sub-graph of Operations Using Points

In this scenario, the user defines the sub-graph for the matching algorithm via a set of "start" and "end" nodes. Given the set, the Model Optimizer performs the following steps:

  1. Starts a graph traversal from every start node, following the direction of the graph edges. The search stops at end nodes or at nodes without further children. All visited nodes are added to the matched sub-graph.
  2. Starts another graph traversal from each non-start node of the sub-graph, that is, every node except the nodes from the "start" set. In this step, the edges are traversed in the opposite direction. All newly visited nodes are added to the matched sub-graph. This step is needed to add the nodes required for calculating the values of internal nodes of the matched sub-graph.
  3. Checks that all "end" nodes were reached from the "start" nodes. If not, it exits with an error.
  4. Checks that there are no "Placeholder" operations among the added nodes. If there are, some side branch of the sub-graph (added in step 2) depends on inputs of the network. Such a configuration is incorrect, so the Model Optimizer exits with an error.

This algorithm finds all nodes "between" the start and end nodes, plus the nodes needed for the calculation of non-input nodes of the matched sub-graph; the latter produce constant values because they do not depend on the input of the network. This sub-graph match has a limitation: each start node must have only one input. Therefore, it is not possible to specify, for example, a convolution node as an input, because it has two inputs: the data tensor and the tensor with weights. A minimal sketch of the matching procedure is given below.
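
The following is a minimal sketch of the two traversal steps, written against networkx and independent of the actual Model Optimizer implementation (the validity checks from steps 3 and 4 are omitted):

import networkx as nx

def match_between_points(graph: nx.DiGraph, start_nodes, end_nodes):
    """Illustrative sketch of the 'points' matching described above;
    this is not the Model Optimizer implementation."""
    matched = set()
    # Step 1: forward traversal from the start nodes, stopping at end nodes or leaves.
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        if node in matched:
            continue
        matched.add(node)
        if node in end_nodes:
            continue
        stack.extend(graph.successors(node))
    # Step 2: backward traversal from every non-start matched node to pick up
    # the constant-producing side branches.
    stack = [n for n in matched if n not in set(start_nodes)]
    while stack:
        node = stack.pop()
        for pred in graph.predecessors(node):
            if pred not in matched:
                matched.add(pred)
                stack.append(pred)
    return matched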

For example of replacement with points, see Case Study: Converting SSD Models Created With a TensorFlow Object Detection API.

Offloading Computations to TensorFlow

The Model Optimizer can't generate an Intermediate Representation from unsupported TensorFlow operations, as is the case with some custom layers. However, you can still successfully create an Intermediate Representation if you offload the unsupported operations to TensorFlow* for computation.

Limitations:

  • You can only offload operations to TensorFlow from a Linux computer.
  • The custom layer supports inference only on a CPU, not on Intel Integrated Graphics or on an FPGA.
  • The Inference Engine uses the NCHW layout for tensors, but TensorFlow usually uses NHWC. The Model Optimizer performs the conversion between these layouts to correctly infer the model.
    The Model Optimizer adds transpose operations to convert sub-graph 4D input tensors from NCHW layout to NHWC and vice versa for the output nodes. These operations are embedded in the protobuf string that describes the TensorFlow sub-graph in the Intermediate Representation .xml file.
    Sometimes, this approach fails. For example, offloading a convolution alone to TensorFlow fails because the layout of convolution weights in TensorFlow does not correspond to the layout of weights in the Inference Engine. However, offloading convolution nodes together with the nodes that hold the weights succeeds, because the weight nodes are part of the offloaded sub-graph, so no transposes are needed for the weights tensor. These weight nodes are usually of type Const.

How to Build a Custom Layer to Offload Computations to TensorFlow

NOTE: You need to perform this step only once.

  1. Clone the TensorFlow r1.4 Git repository.
  2. Set the environment variable TF_ROOT_DIR to point to the cloned directory.
  3. Choose one of these options:
    • Run source <INSTALL_DIR>/bin/setupvars.sh
    • Set the environment variable INTEL_CVSDK_DIR to point to a directory containing the inference_engine/include/ directory.
  4. Build an Inference Engine layer with TensorFlow runtime. This might take about 20 minutes:
    ./tf_call_ie_layer/build.sh
  5. A shared library is generated:
    $TF_ROOT_DIR/bazel-bin/tensorflow/cc/inference_engine_layer/libtensorflow_call_layer.so
    This library is the Inference Engine custom layer, which is used to offload inference to TensorFlow*.

How to Run a Model With Operations Offloaded to TensorFlow*
  1. Compile extensibility_sample
  2. Run extensibility_sample:
    ./extensibility_sample -i <path_to_image_file> -m <path_to_IR.xml> -d CPU -l <path_to_libtensorflow_call_layer.so>

Three command-line options are available to offload part of the inference to TensorFlow.

NOTE: Add the command-line options to the same line as the base command:

python3 mo.py --input_model model-file.pb

For example:

Use node name patterns to offload a sub-graph of operations, using the command-line option:

--tensorflow_subgraph_patterns

This option uses a comma-separated list of regular expressions to match node names. This offload has two primary characteristics:

  • All nodes that match a specific regular expression are merged into a single Inference Engine node that TensorFlow* executes.
  • All patterns are applied independently, which means two nodes that match two different patterns are not merged into one node. For example, with the option --tensorflow_subgraph_patterns "Scope_1/.*,Scope_2.*", all nodes whose names start with Scope_1/ are merged into one new node, and all nodes whose names start with Scope_2 are merged into a different node.

Offload specific types of operations, using the command-line option:

--tensorflow_operation_patterns

This option specifies a comma-separated list of regular expressions to match node types. This offload has this primary characteristic: all nodes that match a specific regular expression are merged into a single Inference Engine node that TensorFlow* executes. For example, the following command offloads all operations of type 'Concat', 'ConcatV2', 'Add', and 'BiasAdd' to TensorFlow*:

--tensorflow_operation_patterns "Concat.*,.*Add"
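
A quick, standalone check of which operation types these patterns would match (the matching here uses plain re.match and is only an approximation of the Model Optimizer behavior):

import re

patterns = [re.compile(p) for p in 'Concat.*,.*Add'.split(',')]
op_types = ['Concat', 'ConcatV2', 'Add', 'BiasAdd', 'Conv2D']
print([t for t in op_types if any(p.match(t) for p in patterns)])
# ['Concat', 'ConcatV2', 'Add', 'BiasAdd']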

Offload all unsupported operations automatically, using the command-line option:

--offload_unsupported_operations_to_tf

With this option, the Model Optimizer analyzes the network graph, finds the connected sub-graphs of unsupported operations, and offloads them to TensorFlow*.

You can use all three options by issuing the commands in this order:

python3 mo.py --input_model model-file.pb --tensorflow_subgraph_patterns <patterns>
python3 mo.py --input_model model-file.pb --tensorflow_operation_patterns <patterns>
python3 mo.py --input_model model-file.pb --offload_unsupported_operations_to_tf

Case Study: Converting SSD Models Created With TensorFlow Object Detection API

As explained in Sub-graph Replacement in Model Optimizer, you have multiple ways to set up the sub-graph matching. In this example, we focus on defining the sub-graph via a set of "start" and "end" nodes. The result of matching is two buckets of nodes:

  • Nodes "between" start and end nodes.
  • Nodes connected to the first list, but just on the constant path (e.g. these nodes are not connected to the inputs of the entire graph). Let's look closer to the SSD models from the TensorFlow* detection model zoo: SSD MobileNet and SSD InceptionV2.

A distinct layer of any SSD topology is the DetectionOutput layer. This layer is implemented with dozens of primitive operations in TensorFlow*, while in the Inference Engine it is a single layer. Thus, to convert an SSD model from TensorFlow, the Model Optimizer should replace the entire sub-graph of operations that implements the DetectionOutput layer with a single well-known DetectionOutput node.

The Inference Engine DetectionOutput layer consumes three tensors in the following order:

  1. Tensor with locations of bounding boxes.
  2. Tensor with confidences for each bounding box.
  3. Tensor with prior boxes (aka anchors in TensorFlow* terminology).

The DetectionOutput layer produces one tensor with 7 numbers for each actual detection. There are more output tensors in the TensorFlow* Object Detection API, but the values in them are consistent with the Inference Engine ones.

The difference with other examples is that here the DetectionOutput sub-graph is replaced with a new sub-graph (not a single layer).

Look at the sub-graph replacement configuration file extensions/front/tf/ssd_support.json that is used to enable the two models listed above:

[
    {
        "custom_attributes": {
            "code_type": "caffe.PriorBoxParameter.CENTER_SIZE",
            "confidence_threshold": 0.01,
            "keep_top_k": 200,
            "nms_threshold": 0.45,
            "pad_mode": "caffe.ResizeParameter.CONSTANT",
            "resize_mode": "caffe.ResizeParameter.WARP"
        },
        "id": "TFObjectDetectionAPIDetectionOutput",
        "include_inputs_to_sub_graph": true,
        "include_outputs_to_sub_graph": true,
        "instances": {
            "end_points": [
                "detection_boxes",
                "detection_scores",
                "num_detections"
            ],
            "start_points": [
                "Postprocessor/Shape",
                "Postprocessor/Slice",
                "Postprocessor/ExpandDims",
                "Postprocessor/Reshape_1"
            ]
        },
        "match_kind": "points"
    }
]

Lines 3-10 define static attributes that are saved as-is to the Intermediate Representation XML file for the DetectionOutput layer.

Lines 12 and 13 define values for attributes that should always be set to true for this release of the Model Optimizer. These two attributes are specific to sub-graph matching by points only.

Lines 14-26 define one instance of the sub-graph to be matched. This is an important difference between sub-graph matching by scope and by points: several instances can be specified for matching by scope, but matching by points allows specifying just one instance. Therefore, the "instances" dictionary contains full node names, not regular expressions as in the case of matching by scope.

Now let's analyze the structure of the topologies generated with the Object Detection API. There are several blocks in the graph, each performing a particular task:

  • The "Preprocessor" block resizes, scales, and subtracts mean values from the input image.
  • The "FeatureExtractor" block is a MobileNet or another backbone that extracts features.
  • The "MultipleGridAnchorGenerator" block creates the initial bounding box locations ("anchors").
  • The "Postprocessor" block acts as a DetectionOutput layer, so we need to replace the "Postprocessor" block with a DetectionOutput layer. It is necessary to add all input nodes of the "Postprocessor" scope to the "start_points" list. Consider the inputs of each of these nodes:
    • "Postprocessor/Shape" consumes the tensor with locations.
    • "Postprocessor/Slice" consumes the tensor with confidences.
    • "Postprocessor/ExpandDims" consumes the tensor with prior boxes.
    • "Postprocessor/Reshape_1" consumes the tensor with locations, similarly to the "Postprocessor/Shape" node. Despite the fact that "Postprocessor/Reshape_1" gets the same tensor as "Postprocessor/Shape", it must be explicitly put into the list.

Object Detection API "Postprocessor" block generates output nodes: "detection_boxes", "detection_scores", "num_detections", "detection_classes".

Now consider the implementation of the sub-graph replacer, available in extensions/front/tf/SSDs.py. The file is rather big, so only some code snippets are shown here:

class PostprocessorReplacement(FrontReplacementFromConfigFileSubGraph):
    replacement_id = 'TFObjectDetectionAPIDetectionOutput'

These lines define the new PostprocessorReplacement class, inherited from FrontReplacementFromConfigFileSubGraph, which is designed to replace a sub-graph of operations described in the configuration file. The following methods can be overridden to implement the custom replacement logic we need:

  • generate_sub_graph performs the new sub-graph generation and returns a dictionary where the key is an alias name for a node and the value is a Node object. The dictionary has the same format as the match parameter in the replace_sub_graph method in the example with the networkx sub-graph isomorphism pattern. This dictionary is passed as an argument to the next three methods, so it should contain entries for the nodes that those methods need.
  • input_edges_match specifies the mapping between input edges to the sub-graph before replacement and after replacement. The key of the dictionary is a tuple specifying an input tensor of the sub-graph before replacement: the sub-graph input node name and the input port number for this node. The value for this key is also a tuple specifying the node where this tensor should be attached during replacement: the node name (or alias name of the node) and the input port of this node. If the port number is zero, the parameter can be omitted, so the key or value is just a node name (alias). The default implementation of the method returns an empty dictionary, so the Model Optimizer does not create new edges.
  • output_edges_match returns the mapping between the old output edges of the matched nodes and the new sub-graph node and output edge index. The format is similar to the dictionary returned by the "input_edges_match" method; the only difference is that output port numbers are specified instead of input port numbers. Of course, this mapping is needed for the output nodes only. The default implementation of the method returns an empty dictionary, so the Model Optimizer does not create new edges.
  • nodes_to_remove specifies the list of nodes that the Model Optimizer should remove after the sub-graph replacement. The default implementation of the method removes all sub-graph nodes. A schematic skeleton of a replacer that overrides these methods is shown below.
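
The following schematic skeleton (not the actual SSDs.py code; the class and replacement_id names are hypothetical, and the import of FrontReplacementFromConfigFileSubGraph is omitted because its module path depends on the Model Optimizer version) shows how these methods fit together:

class MyReplacement(FrontReplacementFromConfigFileSubGraph):
    replacement_id = 'MyCustomReplacementId'   # hypothetical; must match the "id" field in the .json configuration file

    def generate_sub_graph(self, graph, match):
        # build the new nodes here and return {'alias_name': node, ...}
        return {}

    def input_edges_match(self, graph, match, new_sub_graph):
        # {(old_input_node, port): (new_node_or_alias, port)}; an empty dict means no new input edges
        return {}

    def output_edges_match(self, graph, match, new_sub_graph):
        # {(old_output_node, port): (new_node_or_alias, port)}; needed for output nodes only
        return {}

    def nodes_to_remove(self, graph, match):
        # the default implementation removes all matched nodes; override to keep some of them
        return match.matched_nodes_names()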

Let's review the replacer code, considering details of the DetectionOutput layer implementation in the Inference Engine. There are several constraints on the input tensors of the DetectionOutput layer:

  • The tensor with locations must be of shape [#batch, #prior_boxes * 4] or [#batch, #prior_boxes * 5], depending on whether locations are shared between different batches or not.
  • The tensor with confidences must be of shape [#batch, #prior_boxes * #classes], and the confidence values must be in the range [0, 1], that is, passed through a softmax layer.
  • The tensor with prior boxes must be of shape [#batch, 2, #prior_boxes * 4]. The Inference Engine expects it to contain variance values, which the TensorFlow* Object Detection API does not add.

To enable these models, add Reshape operations for the locations and confidences tensors, and update the values of the prior boxes to include the variance constants (they are not present in the TensorFlow* Object Detection API).
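
As an illustration only (the numeric values below are made up, not taken from this document), the prior-boxes tensor layout that the Inference Engine expects can be assembled in NumPy like this: shape [#batch, 2, #prior_boxes * 4], where the first row holds the box coordinates and the second row holds the variance constants:

import numpy as np

num_priors = 1917                              # illustrative number of prior boxes
tf_priors = np.random.rand(num_priors, 4)      # anchors produced by the TensorFlow graph
variances = np.array([0.1, 0.1, 0.2, 0.2])     # example variance constants (assumption, not from this document)

ie_priors = np.empty((1, 2, num_priors * 4), dtype=np.float32)
ie_priors[0, 0, :] = tf_priors.reshape(-1)
ie_priors[0, 1, :] = np.tile(variances, num_priors)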

Look at the generate_sub_graph method:

def generate_sub_graph(self, graph: nx.MultiDiGraph, match: SubgraphMatch):
    log.debug('PostprocessorReplacement.generate_sub_graph')
    log.debug('matched_nodes = {}'.format(match.matched_nodes_names()))

    # softmax to be applied to the confidence
    softmax_conf_op = Softmax(graph, {'axis': 2, 'nchw_layout': True})
    softmax_conf_node = softmax_conf_op.add_node(dict(name='DetectionOutput_SoftMax_conf_'))

    # IE DetectionOutput layer consumes flattened tensors
    # reshape operation to flatten locations tensor
    reshape_loc_op = Reshape(graph, {'dim': np.array([0, -1])})
    reshape_loc_node = reshape_loc_op.add_node(dict(name='DetectionOutput_Reshape_loc_'))

    # IE DetectionOutput layer consumes flattened tensors
    # reshape operation to flatten confidence tensor
    reshape_conf_op = Reshape(graph, {'dim': np.array([0, -1])})
    reshape_conf_node = reshape_conf_op.add_node(dict(name='DetectionOutput_Reshape_conf_'))

    # create Node object from Op class
    detection_output_op = DetectionOutput(graph, match.custom_replacement_desc.custom_attributes)
    detection_output_op.attrs['old_infer'] = detection_output_op.attrs['infer']
    detection_output_op.attrs['infer'] = __class__.do_infer
    detection_output_node = detection_output_op.add_node(dict(name=detection_output_op.attrs['type'] + '_'))

    # create internal edges of the sub-graph. In this case we add edges to connect input port 0 and 1 of the
    # detection output with output of reshape of locations and reshape of confidence
    create_edge(softmax_conf_node, reshape_conf_node, 0, 0)
    create_edge(reshape_loc_node, detection_output_node, 0, 0)
    create_edge(reshape_conf_node, detection_output_node, 0, 1)
    return {'detection_output_node': detection_output_node, 'reshape_conf_node': softmax_conf_node,
            'reshape_loc_node': reshape_loc_node}

The method has two inputs: the graph to operate on and the instance of a SubgraphMatch object, which describes the matched sub-graph. The latter class has several useful methods to get a particular input/output Node of the sub-graph by input/output index or by node name pattern. Examples of using these methods are given below.

Lines 6 and 7 create a new instance of an operation of type Softmax and the graph Node object corresponding to that operation.

Lines 11-12 and 16-17 create new instances of the Reshape operation to reshape the locations and confidences tensors, respectively.

Lines 20-23 create a new instance of the DetectionOutput operation and the graph Node object corresponding to that operation.

Lines 27-29 connect the softmax node with the confidence reshape node and connect the reshaped locations and confidences tensors with the DetectionOutput node.

Lines 30-31 define the dictionary with aliases for the detection output node and the reshape nodes for locations and confidences. These aliases are used in the "input_edges_match" and "output_edges_match" methods. Note that the "reshape_conf_node" alias intentionally refers to the softmax node, so that the external confidences tensor is attached to the softmax first.

The input_edges_match method is the following:

def input_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
    locs_consumer_node, locs_consumer_node_port = match.input_nodes(0)[0]
    conf_consumer_node, conf_consumer_node_port = match.input_nodes(1)[0]
    priors_consumer_node, priors_consumer_node_port = match.input_nodes(2)[0]
    # create matching nodes for locations and confidence tensors using simple scheme "old_node_name: new_node_name"
    # which in fact means "(old_node_name, 0): (new_node_name, 0)", while first '0' means old_port and the second
    # zero defines 'new_port'.
    return {locs_consumer_node.id: new_sub_graph['reshape_loc_node'].id,
            conf_consumer_node.id: new_sub_graph['reshape_conf_node'].id,
            priors_consumer_node.id: (new_sub_graph['detection_output_node'].id, 2),
            }

The method has three parameters: the input graph, the match object describing the matched sub-graph, and the new_sub_graph dictionary with alias names returned from the "generate_sub_graph" method.

Lines 2-4 initialize Node objects and input ports for the nodes where the input tensors of the sub-graph are consumed. The method match.input_nodes(ind) returns a list of tuples where the first element is a Node object and the second is the input port of this node that consumes the ind-th input tensor of the sub-graph. The start_points list in the configuration file defines the order of input tensors to the sub-graph. For example, the "locs_consumer_node" object of type Node is a node that consumes the tensor with locations on the port with number "locs_consumer_node_port".

Lines 8-11 define the dictionary with the mapping of tensors as described above. Note that the "id" attribute of the Node object contains the name of the node in the graph.

The "output_edges_match" method is the following:

def output_edges_match(self, graph: nx.DiGraph, match: SubgraphMatch, new_sub_graph: dict):
    # the DetectionOutput in IE produces single tensor, but in TF it produces two tensors, so we need to create only
    # one output edge match
    return {match.output_node(0)[0].id: new_sub_graph['detection_output_node'].id}

The method has the same three parameters as the "input_edges_match" method. The returned dictionary contains the mapping for just one tensor, initially produced by the first output node of the sub-graph (which is "detection_boxes" according to the configuration file), to the single output tensor of the created DetectionOutput node. In fact, any output node of the initial sub-graph could be used in the mapping, because the sub-graph output nodes are output nodes of the whole graph (their output is not consumed by any other nodes).

Now the Model Optimizer knows how to replace the sub-graph. The last step to enable the model is to cut off some parts of the graph that are not needed during inference.

It is necessary to remove the Preprocessor block, where the image is resized. The Inference Engine does not support dynamic input shapes, so the Model Optimizer must freeze the input image size, which makes resizing of the image unnecessary. This is achieved by specifying the input tensor for the model using the following command-line parameter: "--input=1:Preprocessor/mul". This command-line option is described in Cutting Off Parts of a Model.

There are several "Switch" operations in the Postprocessor block without output edges. For example, "Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/cond/cond/switch_t", "Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/cond/cond/switch_f", "Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/cond_1/cond/switch_t", "Postprocessor/BatchMultiClassNonMaxSuppression/map/while/PadOrClipBoxList/cond_1/cond/switch_f" etc.

The Model Optimizer marks these nodes as output nodes of the topology. Because of that, some parts of the "Postprocessor" block are not removed during the sub-graph replacement. To fix this issue, it is necessary to specify the output nodes of the graph manually using the "--output" command-line parameter.

Example Model Optimizer Command-Line for TensorFlow's SSD

The final command line to convert SSDs from the TensorFlow* Object Detection Zoo is:

./mo_tf.py --input_model=<path_to_frozen.pb> --input=1:Preprocessor/mul --input_shape="(1,300,300,3)" --tensorflow_use_custom_operations_config extensions/front/tf/ssd_support.json --output="detection_boxes,detection_scores,num_detections"

MXNet Models With Custom Layers

MXNet* models with custom layers are not supported: the Model Optimizer refuses to convert them because it has no way to guess how to handle these entities. The recommendation is to manually cut the model and provide the Model Optimizer with the cut model.

Advanced Topics About the Model Optimizer Internals

Cutting Off Parts of a Model

Sometimes some parts of a model must be removed while the Model Optimizer converts the model to the Intermediate Representation. This chapter describes how to cut off parts of a model using Model Optimizer command-line options. Model cutting applies mostly to TensorFlow models, but is also useful for other frameworks. In this chapter, TensorFlow examples are used for illustration.

Purpose of Model Cutting

The following examples are the situations when model cutting is useful or even required:

  • The model has pre- or post-processing parts that cannot be translated to existing Inference Engine layers.
  • The model has a training part that is convenient to keep in the model but is not used during inference.
  • The model is too complex (contains many unsupported operations that cannot easily be implemented as custom layers), so the complete model cannot be converted in one shot.
  • The model is one of the supported SSD models. In this case, you need to cut off the post-processing part.
  • A problem with model conversion in the Model Optimizer or inference in the Inference Engine occurred. To localize the issue, limit the scope for conversion by iteratively searching for problematic places in the model.
  • A single custom layer or a combination of custom layers is isolated for debugging purposes.

Command-Line Options

Model Optimizer provides command line options --input and --output to specify new entry and exit nodes, while ignoring the rest of the model:

  • --input option accepts a comma-separated list of layer names of the input model that should be treated as new entry points to the model;
  • --output option accepts a comma-separated list of layer names of the input model that should be treated as new exit points from the model.

The --input option is also required for cases unrelated to model cutting. For example, when the model contains several inputs and the --input_shape or --mean_values options are used, you should use the --input option to specify the order of input nodes for correct mapping between the multiple items provided in --input_shape and --mean_values and the inputs in the model. That use of --input is out of the scope of this chapter.

Model cutting is illustrated with Inception V1. This model is in the models/research/slim repository. Prepare the model for the Model Optimizer before proceeding with this chapter.

Default Behavior Without --input and --output

The input model is converted as a whole if neither --input nor --output command line options are used. All Placeholder operations in a TensorFlow graph are automatically identified as entry points. The Input layer type is generated for each of them. All nodes that have no consumers are automatically identified as exit points.

For Inception_V1, there is one Placeholder: input. If the model is viewed in TensorBoard, the input operation is easy to find:

InceptionV1 placeholder

There is only one output operation, a Reshape enclosed in the nested name scope InceptionV1/Logits/Predictions; its full name is InceptionV1/Logits/Predictions/Reshape_1. In TensorBoard, it looks the following way together with some predecessors:

TensorBoard with predecessors

Convert this model:

mo.py --input_model=inception_v1.pb -b 1

The output .xml file with an Intermediate Representation contains the Input layer among other layers in the model:

<layer id="286" name="input" precision="FP32" type="Input">
    <output>
        <port id="0">
            <dim>1</dim>
            <dim>3</dim>
            <dim>224</dim>
            <dim>224</dim>
        </port>
    </output>
</layer>

The input layer is converted from the TensorFlow graph Placeholder operation input and has the same name.

The -b option is used here for conversion to override a possible undefined batch size (coded as -1 in TensorFlow models). If a model was frozen with a defined batch size, you may omit this option in all examples here.

The last layer in the model is InceptionV1/Logits/Predictions/Reshape_1, which matches an output operation in the TensorFlow graph:

<layer id="389" name="InceptionV1/Logits/Predictions/Reshape_1" precision="FP32" type="Reshape">
    <data axis="0" dim="1,1001" num_axes="-1"/>
    <input>
        <port id="0">
            <dim>1</dim>
            <dim>1001</dim>
        </port>
    </input>
    <output>
        <port id="1">
            <dim>1</dim>
            <dim>1001</dim>
        </port>
    </output>
</layer>

Due to automatic identification of inputs and outputs, you do not need to provide the --input and --output options to convert the whole model. The following commands are equivalent for the Inception V1 model:

mo.py --input_model=inception_v1.pb -b 1

mo.py --input_model=inception_v1.pb -b 1 --input=input --output=InceptionV1/Logits/Predictions/Reshape_1

The Intermediate Representations are identical for both conversions. The same is true if the model has multiple inputs and/or outputs.

Cut at the End

Now consider how to cut off some parts of the model. For the Inception V1 model, consider the first convolution block InceptionV1/InceptionV1/Conv2d_1a_7x7:

The first convolution block

The following command cuts off the rest of the model just after the InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu node, making this node the last one in the model:

mo.py --input_model=inception_v1.pb -b 1 --output=InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu

The complete converted Intermediate Representation has three layers:

<?xml version="1.0" ?>
<net batch="1" name="model" version="2">
    <layers>
        <layer id="3" name="input" precision="FP32" type="Input">
            <output>
                <port id="0">
                    <dim>1</dim>
                    <dim>3</dim>
                    <dim>224</dim>
                    <dim>224</dim>
                </port>
            </output>
        </layer>
        <layer id="5" name="InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution" precision="FP32" type="Convolution">
            <data dilation-x="1" dilation-y="1" group="1" kernel-x="7" kernel-y="7" output="64" pad-x="2" pad-y="2" stride="1,1,2,2" stride-x="2" stride-y="2"/>
            <input>
                <port id="0">
                    <dim>1</dim>
                    <dim>3</dim>
                    <dim>224</dim>
                    <dim>224</dim>
                </port>
            </input>
            <output>
                <port id="3">
                    <dim>1</dim>
                    <dim>64</dim>
                    <dim>112</dim>
                    <dim>112</dim>
                </port>
            </output>
            <blobs>
                <weights offset="0" size="37632"/>
                <biases offset="37632" size="256"/>
            </blobs>
        </layer>
        <layer id="6" name="InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu" precision="FP32" type="ReLU">
            <input>
                <port id="0">
                    <dim>1</dim>
                    <dim>64</dim>
                    <dim>112</dim>
                    <dim>112</dim>
                </port>
            </input>
            <output>
                <port id="1">
                    <dim>1</dim>
                    <dim>64</dim>
                    <dim>112</dim>
                    <dim>112</dim>
                </port>
            </output>
        </layer>
    </layers>
    <edges>
        <edge from-layer="3" from-port="0" to-layer="5" to-port="0"/>
        <edge from-layer="5" from-port="3" to-layer="6" to-port="0"/>
    </edges>
</net>

The TensorBoard picture illustrates that the original model has more nodes than the converted Intermediate Representation. The Model Optimizer has fused the batch normalization InceptionV1/InceptionV1/Conv2d_1a_7x7/BatchNorm into the convolution InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution, so it is not present in the final Intermediate Representation. This is not an effect of the --output option; it is the usual behavior of the Model Optimizer for batch normalizations and convolutions. The effect of --output is that the ReLU layer becomes the last one in the converted model.

Cut From the Beginning

If you want to go further and cut the beginning of the model and leave only the ReLU layer, you can use the following command line, where --input and --output specify the same node in the graph:

mo.py --input_model=inception_v1.pb -b 1 --output=InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu --input=InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu

The resulting Intermediate Representation looks like this:

<?xml version="1.0" ?>
<net batch="1" name="model" version="2">
    <layers>
        <layer id="0" name="InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu/placeholder_port_0" precision="FP32" type="Input">
            <output>
                <port id="0">
                    <dim>1</dim>
                    <dim>64</dim>
                    <dim>112</dim>
                    <dim>112</dim>
                </port>
            </output>
        </layer>
        <layer id="2" name="InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu" precision="FP32" type="ReLU">
            <input>
                <port id="0">
                    <dim>1</dim>
                    <dim>64</dim>
                    <dim>112</dim>
                    <dim>112</dim>
                </port>
            </input>
            <output>
                <port id="1">
                    <dim>1</dim>
                    <dim>64</dim>
                    <dim>112</dim>
                    <dim>112</dim>
                </port>
            </output>
        </layer>
    </layers>
    <edges>
        <edge from-layer="0" from-port="0" to-layer="2" to-port="0"/>
    </edges>
</net>

The Input layer is automatically created to feed the layer that is converted from the node specified in --input, InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu in this case. The Model Optimizer does not replace the ReLU node by the Input layer; it produces such an Intermediate Representation to make the node the first executable node in the final Intermediate Representation. So the Model Optimizer creates enough Input layers to feed all input ports of the node that is passed in --input.

Even though --input_shape is not specified in the command line, the shapes for layers are inferred from the beginning of the original TensorFlow* model to the point at which the new input is defined. It has the same shape [1,64,112,112] as the model converted as a whole or without cutting off the beginning.

Shape Override for New Inputs

The input shape can be overridden with --input_shape. In this case, the shape is applied to the node referenced in --input, not to the original Placeholder in the model. For example, this command line

mo.py --input_model=inception_v1.pb --input_shape=[1,5,10,20] --output=InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu --input=InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu

gives the following shapes in the Input and ReLU layers:

<layer id="0" name="InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu/placeholder_port_0" precision="FP32" type="Input">
    <output>
        <port id="0">
            <dim>1</dim>
            <dim>20</dim>
            <dim>5</dim>
            <dim>10</dim>
        </port>
    </output>
</layer>
<layer id="3" name="InceptionV1/InceptionV1/Conv2d_1a_7x7/Relu" precision="FP32" type="ReLU">
    <input>
        <port id="0">
            <dim>1</dim>
            <dim>20</dim>
            <dim>5</dim>
            <dim>10</dim>
        </port>
    </input>
    <output>
        <port id="1">
            <dim>1</dim>
            <dim>20</dim>
            <dim>5</dim>
            <dim>10</dim>
        </port>
    </output>
</layer>

The input shape [1,20,5,10] in the final Intermediate Representation differs from the shape [1,5,10,20] specified in the command line because the original TensorFlow* model uses the NHWC layout, while the Intermediate Representation uses the NCHW layout, so the usual NHWC to NCHW layout conversion occurred.
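
A one-line Python sketch of this reordering of the --input_shape value (the index permutation 0,3,1,2 maps NHWC to NCHW):

nhwc_shape = [1, 5, 10, 20]                        # as passed on the command line
nchw_shape = [nhwc_shape[i] for i in (0, 3, 1, 2)]
assert nchw_shape == [1, 20, 5, 10]                # as it appears in the Intermediate Representation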

When --input_shape is specified, shape inference inside the Model Optimizer is not performed for the nodes at the beginning of the model that are not included in the translated region. This differs from the case when --input_shape is not specified, as noted in the previous section, where shape inference is still performed for such nodes to deduce the shapes of the layers that fall into the final Intermediate Representation. Therefore, use --input_shape for a model with a complex graph with loops, which are not supported by the Model Optimizer, to exclude such parts from the shape inference process completely.

Inputs With Multiple Input Ports

There are operations that contain more than one input port. In the example considered here, the convolution InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution is such an operation. When --input_shape is not provided, a new Input layer is created for each dynamic input port of the node. If a port is evaluated to a constant blob, this constant remains in the model and a corresponding Input layer is not created. The TensorFlow* convolution used in this model contains two ports:

  • port 0: input tensor for convolution (dynamic);
  • port 1: convolution weights (constant).

Following this behavior, the Model Optimizer creates an Input layer for port 0 only, leaving port 1 as a constant. So the result of:

mo.py --input_model=inception_v1.pb -b 1 --input=InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution

is identical to the result of conversion of the model as a whole, because this convolution is the first executable operation in Inception V1.

Different behavior occurs when --input_shape is also used as an attempt to override the input shape:

mo.py --input_model=inception_v1.pb --input=InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution --input_shape=[1,224,224,3]

An error occurs:

[ ERROR ]  Node InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution has more than 1 input and input shapes were provided.
Try not to provide input shapes or specify input port with port:node notation, where port is an integer.

For more information see FAQ #30.

In this case, when --input_shape is specified and the node contains multiple input ports, you need to specify an input port index together with an input node name. The input port index is specified in front of the node name with ':' as a separator (PORT:NODE). In the considered case, the port index 0 of the node InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution should be specified as 0:InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution.

Here is a correct command line:

mo.py --input_model=inception_v1.pb --input=0:InceptionV1/InceptionV1/Conv2d_1a_7x7/convolution --input_shape=[1,224,224,3]

Model Optimization Techniques

The Model Optimizer offers optimization methods that accelerate inference with convolutional neural networks (CNN) and do not require model retraining.

Linear Operation Fusing

Many convolutional neural networks include BatchNormalization and ScaleShift layers (for example, ResNet*, Inception*) that can be fused into preceding Convolution or FullyConnected layers.

Usage

In the Model Optimizer, this optimization is turned on by default. To disable it, pass the --disable_fusing parameter to the Model Optimizer.

Optimization Description

This optimization method consists of three stages:

  1. BatchNormalization and ScaleShift decomposition: at this stage, the BatchNormalization layer is decomposed into a Mul → Add → Mul → Add sequence, and the ScaleShift layer is decomposed into a Mul → Add sequence.
  2. Linear operations merge: at this stage, sequences of Mul and Add operations are merged into a single Mul → Add instance.
    For example, if the topology contains a BatchNormalization → ScaleShift sequence, it is replaced with Mul → Add by the first stage. At the next stage, the latter is replaced with a ScaleShift layer if there is no available Convolution or FullyConnected layer to fuse into (see the next stage).
  3. Linear operations fusion: at this stage, the tool fuses Mul and Add operations into Convolution or FullyConnected layers. Note that it searches for Convolution and FullyConnected layers both backward and forward in the graph (except for the Add operation, which cannot be fused into a Convolution layer in the forward direction). A NumPy sketch of this folding is shown below.
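
The following NumPy sketch (illustrative shapes and variable names, not Model Optimizer code) shows the arithmetic behind folding a decomposed BatchNormalization into the preceding convolution weights and biases:

import numpy as np

out_ch, in_ch, k = 64, 3, 7
W = np.random.rand(out_ch, in_ch, k, k)     # convolution weights
b = np.random.rand(out_ch)                  # convolution biases
gamma, beta = np.random.rand(out_ch), np.random.rand(out_ch)
mean, var, eps = np.random.rand(out_ch), np.random.rand(out_ch), 1e-5

scale = gamma / np.sqrt(var + eps)          # the "Mul" part of the decomposed BatchNormalization
shift = beta - mean * scale                 # the "Add" part

W_fused = W * scale.reshape(-1, 1, 1, 1)    # scale folded into the weights
b_fused = b * scale + shift                 # scale and shift folded into the biases
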
Usage Examples

The first picture below shows a part of the Caffe* ResNet269 topology in which the BatchNorm and ScaleShift layers are fused into Convolution layers, as shown in the second picture.

Pic. 1. Caffe* ResNet269 block (from Netscope): part of the Caffe ResNet269 topology.

Pic. 2. Fused Caffe* ResNet269 block (from Netscope): BatchNorm and ScaleShift layers fused into Convolution layers.


Grouped Convolution Fusing

Grouped convolution fusing is a specific optimization that applies to TensorFlow* topologies. The main idea of this optimization is to combine the convolution results computed for the Split outputs and then recombine them with a Concat operation in the same order as they came out of the Split (Pic. 3).

Pic. 3. Split → Convolutions → Concat block from TensorBoard*.

Intermediate Representation Notation Reference Catalog 

Convolution Layer

Name: Convolution

Short description: Reference

Detailed description: Reference

Parameters: Convolution layer parameters should be specified in the convolution_data node, which is a child of the layer node.

  • Parameter name: stride (stride-x, stride-y)
    • Description:stride (stride-x, stride-y) is a distance (in pixels) to slide the filter on the feature map over the (x, y) axis. For example, stride equal 1 (1, 1) means sliding the filter 1 pixel at a time over the (x, y) axis
    • Range of values: integer values starting from 0
  • Parameter name: pad (pad-x, pad-y)
    • Description:pad (pad-x, pad-y) is a number of pixels to add to the left (top) of the input. For example, pad (pad-x, pad-y) equal 1 (1, 1) means adding 1 pixel to the left of the input. Right and bottom padding should be calculated from the expected output width (height)
    • Range of values: integer values starting from 0
  • Parameter name: kernel (kernel-x, kernel-y)
    • Description: kernel (kernel-x, kernel-y) is a width (height) of each filter. For example, kernel (kernel-x, kernel-y) equal 3 (3, 3) means that each filter has width (height) equals 3
    • Range of values: integer values starting from 0
  • Parameter name: output
    • Description:output is a number of output feature maps per whole output (when group > 1, output still matches the number of output features regardless of 'group' value). For example, output equals 1 means that there is 1 output feature map in a layer
    • Range of values: integer values starting from 0
  • Parameter name:group
    • Description: group denotes the number of groups to which output and input should be split. For example, group equal 1 means that all the filters are applied to full input (usual convolution), group equals 2 means that both input and output channels are separated into 2 groups and i-th output group is connected to i-th input group channels. group equals number of output feature maps denotes depth-wise separable convolution (Reference)
    • Range of values: integer values starting from 0
  • Parameter name: dilation (dilation-x, dilation-y)
    • Description: dilation (dilation-x, dilation-y) denotes the distance in width (height) between elements (weights) in the filter. For example, dilation-x and dilation-y equal 1 means that all the elements in the filter are neighbors, so it is the same as for the usual convolution. dilation-x and dilation-y equal 2 means that all the elements in the filter are matched not to adjacent elements in the input matrix, but to those that are adjacent with distance 1
    • Range of values: integer values starting from 0

Weights Layout: Weights layout is GOIYX, which means that X is changing the fastest, then Y, then Input, Output, then Group.

Mathematical Formulation

  • For the convolutional layer, the number of output features in each spatial dimension is calculated (for dilation equal to 1) using the formula below; a small code sketch of this arithmetic follows the formulas:
    \[ S_{o} = \frac{S_{i} + 2 \cdot pad - S_{f}}{stride} + 1 \]
  • The receptive field in each layer is calculated using the formulas:
    • Jump in the output feature map:
      \[ j_{out} = j_{in} * s \]
    • Size of the receptive field of output feature:
      \[ r_{out} = r_{in} + \left ( k - 1 \right ) * j_{in} \]
    • Center position of the receptive field of the first output feature:
      \[ start_{out} = start_{in} + \left ( \frac{k - 1}{2} - p \right ) * j_{in} \]
    • Output is calculated using the following formula:
      \[ out = \sum_{i = 0}^{n}w_{i}x_{i} + b \]
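
A small Python sketch of this arithmetic (standard convolution formulas with illustrative values, not tied to any specific layer in this document):

def conv_output_size(s_in, kernel, stride, pad, dilation=1):
    k_eff = dilation * (kernel - 1) + 1          # effective kernel size with dilation
    return (s_in + 2 * pad - k_eff) // stride + 1

def receptive_field_step(r_in, j_in, kernel, stride):
    j_out = j_in * stride                        # jump in the output feature map
    r_out = r_in + (kernel - 1) * j_in           # receptive field size of one output feature
    return r_out, j_out

print(conv_output_size(224, kernel=7, stride=2, pad=3))   # -> 112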

Example

<layer ... type="Convolution" ... >
        <convolution_data stride-x="4" stride-y="4" pad-x="0" pad-y="0" kernel-x="11" kernel-y="11" output="96" group="1" dilation-x="2" dilation-y="2"/>
        <input> ... </input>
        <output> ... </output>
        <weights ... />
        <biases ... />
    </layer>

Pooling Layer

Name: Pooling

Short description: Reference

Detailed description: Reference

Parameters: Specify pooling layer parameters in the pooling_data node, which is a child of the layer node.

NOTE: A subset of pooling parameters, in particular, pad-x, pad-y, kernel-x, kernel-y, stride-x, stride-y are described in the  Convolution layer.

  • Parameter name:pool-method
    • Description:pool-method is a type of pooling strategy for values
    • Range of values:
      • max - chooses the biggest value in a feature map for each filter position
      • avg - takes the average value in a feature map for each filter position
  • Parameter name:exclude-pad
    • Description:exclude-pad is a type of pooling strategy for values in the padding area. For example, if exclude-pad is True, zero-values in the padding are not used
    • Range of values: True or False

Mathematical Formulation

  • For max pool-method
    \[ output_{j} = MAX\left \{ x_{0}, ... x_{i} \right \} \]
  • For avg pool-method:
    \[ output_{j} = \frac{\sum_{i = 0}^{n}x_{i}}{n} \]

Example

<layer ... type="Pooling" ... >
        <pooling_data kernel-x="3" kernel-y="3" pad-x="0" pad-y="0" stride-x="2" stride-y="2" pool-method="max" exclude-pad="true"/>
        <input> ... </input>
        <output> ... </output>
    </layer>

ROIPooling Layer

Name: ROIPooling

Short description: It is a pooling layer with max pooling strategy (see max option in the Pooling layer parameters description). It is used over feature maps of non-uniform sizes and outputs another feature map of a fixed size.

Detailed description: Reference

Parameters: Specify ROIPooling layer parameters in the data node, which is a child of the layer node.

  • Parameter name:pooled_h (pooled_w)
    • Description: pooled_h (pooled_w) is a height of the ROI output feature map. For example, pooled_h (pooled_w) equal 6 means that the height (width) of the output of ROIpooling is 6
    • Range of values: integer values starting from 0
  • Parameter name:spatial_scale
    • Description: spatial_scale is a ratio of the input feature map over the input image size
    • Range of values: positive floating point value

Mathematical Formulation

\[ output_{j} = MAX\left \{ x_{0}, ... x_{i} \right \} \]

Example

<layer ... type="ROIPooling" ... >
        <data pooled_h="6" pooled_w="6" spatial_scale="0.062500"/>
        <input> ... </input>
        <output> ... </output>
    </layer>

FullyConnected Layer

Name: FullyConnected

Short description: Reference

Detailed description: Reference

Parameters: Specify FullyConnected layer parameters in the fc_data node, which is a child of the layer node.

  • Parameter name: out-size
    • Description: out-size is a length of the output vector. For example, out-size equal 4096 means that the output vector length is 4096
    • Range of values: integer values starting from 0

Mathematical Formulation

  • If previous layer is FullyConnected:
    \[ y_{i} = f\left ( z_{i} \right ) \quad with \quad z_{i} = \sum_{j=1}^{m_{1}^{\left ( l-1 \right )}}w_{i,j}^{\left ( l \right )}y_{i}^{\left ( l -1 \right )} \]
  • Otherwise:
    \[ y_{i} = f\left ( z_{i} \right ) \quad with \quad z_{i}^{\left ( l \right )} = \sum_{j=1}^{m_{1}^{\left ( l-1 \right )}}\sum_{r=1}^{m_{2}^{\left ( l-1 \right )}}\sum_{s=1}^{m_{3}^{\left ( l-1 \right )}}w_{i,j,r,s}^{\left ( l \right )}\left ( Y_{i}^{\left ( l-1 \right )} \right )_{r,s} \]

Example

<layer ... type="FullyConnected" ... >
        <fc_data out-size="4096"/>
        <input> ... </input>
        <output> ... </output>
    </layer>

Weights layout: OI, which means that Input is changing the fastest, then Output.


ReLU Layer

Name: ReLU

Short description: Reference

Detailed description: Reference

Parameters: ReLU layer parameters can be (not mandatory) specified in the data node, which is a child of the layer node.

  • Parameter name: negative_slope
    • Description: negative_slope is a multiplier, which is used if the unit is not active (that is negative). For example, negative_slope equal 0.1 means that an inactive unit value would be multiplied by 0.1 and this is the Leaky ReLU. If negative_slope is equal to 0, this is the usual ReLU
    • Range of values: double values starting from 0
  • Parameter name: engine
    • Description: engine is a parameter that specifies computational engine implementation. For example, engine equal caffe.ReLUParameter.CAFFE means that a Caffe* computational engine is used
    • Range of values:
      • caffe.ReLUParameter.DEFAULT
      • caffe.ReLUParameter.CAFFE
      • caffe.ReLUParameter.CUDNN

Mathematical Formulation

\[ f\left ( x \right ) = \left\{\begin{array}{ll} x \quad \mbox{if } x \geq 0 \\ negative\_slope \cdot x \quad \mbox{if } x < 0 \end{array}\right. \]

Example

<layer ... type="ReLU" ... >
    <data negative_slope="0.100000"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Activation Layer

Name: Activation

Short description: Activation layer represents an activation function of each neuron in a layer, which is used to add non-linearity to the computational flow.

Detailed description: Reference

Parameters: Activation layer parameters should be specified in the data node, which is a child of the layer node.

  • Parameter name: type
    • Description: type represents particular activation function. For example, type equal sigmoid means that neurons of this layer have a sigmoid activation function
    • Range of values:
      • sigmoid - sigmoid activation function. Learn more from the Detailed description section
      • tanh - tanh activation function. Learn more from the Detailed description section

Mathematical Formulation

  • Sigmoid function:
    \[ f\left ( x \right ) = \frac{1}{1+e^{-x}} \]
  • Tanh function:
    \[ f\left ( x \right ) = \frac{2}{1+e^{-2x}} - 1 = 2sigmoid(2x) - 1 \]

Example

<layer ... type="Activation" ... >
    <data type="sigmoid" />
    <input> ... </input>
    <output> ... </output>
</layer>

SoftMax layer

Name: SoftMax

Short description: Reference

Detailed description: Reference

Parameters: SoftMax layer parameters can be (not mandatory) specified in the data node, which is a child of the layer node.

  • Parameter name: axis
    • Description: axis represents the axis along which the SoftMax is calculated. axis equal 1 is the default value
    • Range of values: positive integer values

Mathematical Formulation

\[ y_{c} = \frac{e^{Z_{c}}}{\sum_{d=1}^{C}e^{Z_{d}}} \]

where C is a number of classes

Example

<layer ... type="SoftMax" ... >
    <data axis="1" />
    <input> ... </input>
    <output> ... </output>
</layer>

Deconvolution Layer

Name: Deconvolution

Short description: Deconvolution layer is applied for upsampling the output to the higher image resolution.

Detailed description: Reference

Parameters: Deconvolution layer parameters should be specified in the deconvolution_data node, which is a child of the layer node.

NOTE: The Deconvolution layer parameters are defined in XML in the same way as for a Convolution layer.

Weights layout: Weights layout is the following: GOIYX, which means that X is changing the fastest, then Y, then Input, Output, then Group.

Mathematical formulation:
Deconvolution is also called transpose convolution and performs the operation reverse to convolution.

The number of output features for each dimension is calculated as:
\[ S_{o}=stride\left (S_{i} - 1 \right ) + S_{f} - 2pad \]

where S is the size of the output, input, and filter, respectively.

Output is calculated in the same way as for convolution layer:
\[ out = \sum_{i = 0}^{n}w_{i}x_{i} + b \]

Example

<layer ... type="Deconvolution" ... >
    <deconvolution_data stride-x="2" stride-y="2" pad-x="1" pad-y="1" kernel-x="4" kernel-y="4" output="19" group="1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Local Response Normalization (LRN) layer

Name: Norm

Short description: Reference

Detailed description: Reference

Parameters: Norm layer parameters should be specified in the norm_data node, which is a child of the layer node.

  • Parameter name: alpha
    • Description: alpha represents the scaling parameter for the normalizing sum. For example, alpha equal 0.0001 means that the normalizing sum is multiplied by 0.0001
    • Range of values: floating point positive number
  • Parameter name: beta
    • Description: beta represents the exponent for the normalizing sum. For example, beta equal 0.75 means that the normalizing sum is raised to the power of 0.75
    • Range of values: floating point positive number
  • Parameter name: region
    • Description: region represents strategy of local regions extension. For example, region equal across means that the normalizing sum is performed over adjacent channels
    • Range of values:
      • across - normalizing sum is performed over adjacent channels
      • same - normalizing sum is performed over nearby spatial locations
  • Parameter name: local-size
    • Description: local-size represents the side length of the region to be used for the normalization sum or number of channels depending on the strategy specified in the region parameter. For example, local-size equal 5 for the across strategy means application of sum across 5 adjacent channels
    • Range of values: positive integer bigger than zero

Mathematical Formulation

\[ o_{i} = \left( 1 + \left( \frac{\alpha}{n} \right)\sum_{i}x_{i}^{2} \right)^{\beta} \]

Where n is the size of each local region.

Example

<layer ... type="Norm" ... >
    <norm_data alpha="9.9999997e-05" beta="0.75" local-size="5" region="across"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Concat Layer

Name: Concat

Short description: Reference

Parameters: Concat layer parameters should be specified in the concat_data node, which is a child of the layer node.

  • Parameter name: axis
    • Description: axis is the number of axis over which input blobs are concatenated. For example, axis equal 1 means that input blobs are concatenated over the first axis
    • Range of values: positive number greater or equal to 0

Mathematical Formulation
The axis parameter specifies a blob dimension along which to concatenate values. For example, for two input blobs B1xC1xH1xW1 and B2xC2xH2xW2, if axis equals 1, the output blob is B1x(C1+C2)xH1xW1. This is possible only if B1=B2, H1=H2, W1=W2.
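
A NumPy sketch of this concatenation (illustrative shapes):

import numpy as np

b1 = np.zeros((1, 3, 8, 8))    # B1 x C1 x H1 x W1
b2 = np.ones((1, 5, 8, 8))     # B2 x C2 x H2 x W2, with B1=B2, H1=H2, W1=W2

out = np.concatenate([b1, b2], axis=1)
assert out.shape == (1, 8, 8, 8)   # B x (C1+C2) x H x W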

Example

<layer ... type="Concat" ... >
    <concat_data axis="1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Split Layer

Name: Split

Short description: Split layer splits the input into several output groups. Group sizes are denoted by the number and the size of output ports.

Detailed description: Reference

Parameters: None

Mathematical Formulation

Splits the input blob among children. For example, if the input blob is Bx(C+C)xHxW and there are two children, then each output blob is BxCxHxW.

Example

<layer ... type="Split" ... >
    <input> ... </input>
    <output> ... </output>
</layer>

Reshape Layer

Name: Reshape

Short description: Reshape layer changes dimensions of the input blob according to the specified order. Input blob volume is equal to output blob volume, where volume is the product of dimensions.

Detailed description: Reference

Parameters: Reshape layer parameters should be specified in the data node, which is a child of the layer node.

  • Parameter name: axis
    • Description: axis is the number of the starting axis for reshape. For example, axis equal 1 means that Reshape replaces dimensions starting from the next after the first dimension
    • Range of values: positive number greater or equal to 0
  • Parameter name: dim
    • Description: dim is a set of numbers separated with comma, which denote the dimensions of output blob. For example, dim equal 88,1,71 means that output blob gets following dimensions: first dimension equals 88, second dimension equals 1, third dimension equals 71. For more information, refer to the Description block. If dim is equal to two numbers, it performs flattening
    • Range of values: set of positive integer numbers separated with comma
  • Parameter name: num_axes
    • Description: num_axes is the number of dimensions to be replaced with a reshaped blob starting from the dimension number specified in axis property. For example, num_axes equal 2 means that 2 dimensions are replaced with reshaped blob
    • Range of values:
      • -1 - all dimensions are taken starting from the dimension number specified in axis property
      • positive number greater than the value in the axis parameter

Mathematical Formulation

If you want to reshape an input blob BxCxHxW into Bx1x(C*H)xW, set the dim parameter of the layer accordingly; the shape change itself is illustrated below.
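
The exact dim values are not given in this document; as an illustration of the shape change only, here is the equivalent NumPy reshape (illustrative sizes):

import numpy as np

B, C, H, W = 1, 3, 8, 8
blob = np.random.rand(B, C, H, W)
reshaped = blob.reshape(B, 1, C * H, W)    # BxCxHxW -> Bx1x(C*H)xW
assert reshaped.shape == (1, 1, 24, 8)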

Example

<layer ... type="Reshape" ... >
    <data axis="0" dim="1, 1001" num_axes="-1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Eltwise Layer

Name: Eltwise

Short description: Eltwise layer performs element-wise operation, which is specified in parameters, over given inputs.

Parameters: Eltwise layer parameters should be specified in the elementwise_data node, which is placed as a child of the layer node.

  • Parameter name: operation
    • Description: operation is the simple mathematical operation to be performed over inputs. For example, operation equal mul means that input blobs are multiplied
    • Range of values:
      • sum - summation of given values
      • max - select maximum from given values
      • mul - multiplication of given values

Mathematical Formulation

Eltwise accepts two inputs with any number of dimensions (from 1 to 4); however, both of them must have exactly the same dimensions. The produced blob also has the same dimensions as each of its parents.

Eltwise does the following with the input blobs:

\[ o_{i} = f(b_{i}^{1}, b_{i}^{2}) \]

where $b_{i}^{1}$ is the i-th element of the first blob, $b_{i}^{2}$ is the i-th element of the second blob, $o_{i}$ is the i-th element of the output blob, and $f(a,b)$ is a function that performs an operation over its two arguments $a, b$.

  • For sum operation, $f(a,b)$ is defined as
    \[ f(a,b) = a + b \]
  • For mul operation, $f(a,b)$ is defined as
    \[ f(a,b) = a * b \]
  • For max operation, $f(a,b)$ is defined as
    \[ f(a,b) = \left\{\begin{array}{ll} a \quad \mbox{if } a \geq b \\ b \quad \mbox{if } b > a \end{array}\right. \]

Example

<layer ... type="Eltwise" ... >
    <elementwise_data operation="sum"/>
    <input> ... </input>
    <output> ... </output>
</layer>

ScaleShift Layer

Name: ScaleShift

Short description: ScaleShift layer performs a linear transformation of the input blobs. Weights denote the scaling parameter; biases denote the shift.

Parameters: ScaleShift layer does not have additional parameters.

Mathematical Formulation

\[ o_{i} =\gamma b_{i} + \beta \]

Example

<layer ... type="ScaleShift" ... >
    <input> ... </input>
    <output> ... </output>
</layer>

Crop Layer

Name: Crop

Short description: Crop layer changes selected dimensions of the input blob according to the specified parameters.

Parameters: Crop layer parameters should be specified as child crop nodes of the crop-data node, which is placed as a child of the layer node.

  • Parameter name: axis
    • Description: axis is the number of the dimension to be used for crop. For example, axis equal 1 means that crop is performed over the first dimension
    • Range of values: positive number greater or equal to 0
  • Parameter name: offset
    • Description: offset denotes the starting point for crop in the input blob. For example, offset equal 2 means that crop is starting from the second value in the given axis
    • Range of values: positive integer number
  • Parameter name: dim
    • Description: dim is the result size of the output blob for the given axis. For example, dim equal 88 means that output blob gets the dimension equals 88 for the given axis
    • Range of values: positive integer number

Mathematical Formulation

Crop changes dimensions of the input blob. Only dimensions of axes from attributes axis are changed. Dimensions of the output blob are computed based on offset and dims.

Example

<layer ... type="Crop" ... >
    <crop-data axis="2,3" offset="0,0" dim="34,34"/>
    </crop-data>
    <input> ... </input>
    <output> ... </output>
</layer>

Batch Normalization Layer

Name: BatchNormalization

Short description: Reference

Detailed description: Reference

Parameters: BatchNormalization layer parameters should be specified as the batch_norm_data node, which is a child of the layer node.

  • Parameter name: epsilon
    • Description: epsilon is the number to be added to the variance to avoid division by zero when normalizing the value. For example, epsilon equal 0.001 means that 0.001 is added to the variance
    • Range of values: positive floating point number

Mathematical Formulation

BatchNormalization is the normalization of the output in each hidden layer (a step-by-step NumPy sketch follows the formulas below).

Input: Values of x over a mini-batch: $ \beta = \left \{ x_{1...m} \right \} $

Parameters to learn:$ \gamma, \beta$

  • Output:
    $ \left \{ o_{i} = BN_{\gamma, \beta} \left ( b_{i} \right ) \right \} $
  • Mini-batch mean:
    $ \mu_{\beta} \leftarrow \frac{1}{m}\sum_{i=1}^{m}b_{i} $
  • Mini-batch variance:
    $ \sigma_{\beta }^{2}\leftarrow \frac{1}{m}\sum_{i=1}^{m}\left ( b_{i} - \mu_{\beta} \right )^{2} $
  • Normalize:
    $ \hat{b_{i}} \leftarrow \frac{b_{i} - \mu_{\beta}}{\sqrt{\sigma_{\beta }^{2} + \epsilon }} $
  • Scale and shift:
    $ o_{i} \leftarrow \gamma\hat{b_{i}} + \beta = BN_{\gamma ,\beta }\left ( b_{i} \right ) $
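
A step-by-step NumPy sketch of these formulas for a mini-batch of scalar activations (illustrative values):

import numpy as np

b = np.random.rand(32)                    # mini-batch values b_i
gamma, beta, eps = 1.5, 0.1, 1e-3         # learned scale/shift and epsilon (example values)

mu = b.mean()                             # mini-batch mean
var = ((b - mu) ** 2).mean()              # mini-batch variance
b_hat = (b - mu) / np.sqrt(var + eps)     # normalize
o = gamma * b_hat + beta                  # scale and shift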

Example

<layer ... type="BatchNormalization" ... >
    <batch_norm_data epsilon="9.99e-06" />
    <input> ... </input>
    <output> ... </output>
</layer>

Normalize Layer

Name: Normalize

Short description: Normalize layer performs l-p normalization of the input blob.

Parameters: Normalize layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: across_spatial
    • Description: across_spatial is a flag that denotes if normalization is performed over CHW or HW. For example, across_spatial equals 0 means that normalization is not shared across channels
    • Range of values:
      • 0
      • 1 - not supported
  • Parameter name: channel_shared
    • Description: channel_shared is a flag that denotes if scale parameters are shared across channels. For example, channel_shared equal 0 means that scale parameters are not shared across channels
    • Range of values:
      • 0 - scale parameters are not shared across channels
      • 1 - not supported
  • Parameter name: eps
    • Description: eps is the epsilon used to avoid division by zero when normalizing the value. For example, eps equals 0.001 means that 0.001 is used if all the values in normalization are equal to zero
    • Range of values: positive floating point number

Mathematical Formulation

\[ o_{i} = \sum_{i}^{H*W}\frac{\left ( n*C*H*W \right )* scale}{\sqrt{\sum_{i=0}^{C*H*W}\left ( n*C*H*W \right )^{2}}} \]

Example

<layer ... type="Normalize" ... >
    <data across_spatial="0" channel_shared="0" eps="0.000000"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Tile Layer

Name: Tile

Short description: Tile layer extends input blob with copies of data along specific axis.

Detailed description: Reference

Parameters: Tile layer parameters should be specified as the tile_data node, which is a child of the layer node.

  • Parameter name: axis
    • Description: axis is the index of the axis to tile. For example, axis equals 3 means that fourth axis is used for tiling
    • Range of values: positive integer number
  • Parameter name: tiles
    • Description: tiles is a size of the specified axis in the output blob. For example, tiles equal 88 means that output blob gets 88 copies of data from specified axis
    • Range of values: positive integer number

Mathematical Formulation

Tile extends the input blob and fills the output blob according to the following rules:

\[ out_i=input_i[inner\_dim*t] \]

\[ t \in \left ( 0, \quad tiles \right ) \]

Example

<layer ... type="Tile" ... >
    <tile_data axis="3" tiles="88"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Permute Layer

Name: Permute

Short description: Permute layer performs reordering of input blob dimensions.

Detailed description: Reference

Parameters: Permute layer parameters should be specified as the data node, which is a child of the layer node.

NOTE: Model Optimizer (Beta 2) does not use the data node for retrieving parameters and currently supports only the following order for permutation: 0,2,3,1.

  • Parameter name: order
    • Description: order is the set of dimensions indexes for output blob. For example, order equal 0,2,3,1 means that the output blob has following dimensions: first dimension from the input blob, third dimension from the input blob, fourth dimension from the input blob, second dimension from the input blob
    • Range of values: set of positive integer numbers separated by comma

Mathematical Formulation

Permute layer performs reordering of the input blob. Source and destination indexes are bound by the formula:

\[ src\_ind_{offset} = n * ordered[1] * ordered[2] * ordered[3] + (h * ordered[3] + w) \]

\[ n \in \left ( 0, order[0] \right ) \]

\[ w \in \left ( 0, order[3] \right ) \]
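
In NumPy terms, the order 0,2,3,1 is equivalent to a transpose from NCHW to NHWC; a minimal sketch (names are illustrative):

import numpy as np

def permute(x, order):
    # Reorder blob dimensions; order=(0, 2, 3, 1) maps an NCHW input to NHWC
    return np.transpose(x, axes=order)

x = np.random.rand(1, 3, 5, 7)           # N, C, H, W
print(permute(x, (0, 2, 3, 1)).shape)    # (1, 5, 7, 3)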

Example

<layer ... type="Permute" ... >
    <data order="0,2,3,1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

PriorBox Layer

Name: PriorBox

Short description: PriorBox layer generates prior boxes of specified sizes and aspect ratios across all dimensions.

Parameters: PriorBox layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: min_size (max_size)
    • Description: min_size (max_size) is the minimum (maximum) box size (in pixels). For example, min_size (max_size) equal 15 means that the minimum (maximum) box size is 15
    • Range of values: positive integer number
  • Parameter name: aspect_ratio
    • Description: aspect_ratio is a list of aspect ratios. Duplicate values are ignored. For example, aspect_ratio equal 2.000000,3.000000 means that for the first box aspect_ratio is equal to 2 and for the second box - 3
    • Range of values: set of positive floating point numbers
  • Parameter name: flip
    • Description: flip is a flag that denotes whether each aspect_ratio is duplicated and flipped. For example, flip equal 1 and aspect_ratio equal 3 mean that the aspect ratio 1/3 is also generated
    • Range of values:
      • 0 - each aspect_ratio is not flipped
      • 1 - each aspect_ratio is also flipped
  • Parameter name: clip
    • Description: clip is a flag that denotes if each value in the output blob is within [0,1]. For example, clip equal 1 means that each value in the output blob is within [0,1]
    • Range of values:
      • 0 - clipping is not performed
      • 1 - each value in the output blob is within [0,1]
  • Parameter name: step
    • Description: step is a distance between box centers. For example, step equal 85 means that the distance between neighborhood prior boxes centers is 85
    • Range of values: floating point positive number
  • Parameter name: offset
    • Description: offset is a shift of box respectively to top left corner. For example, offset equal 85 means that the shift of neighborhood prior boxes centers is 85
    • Range of values: floating point positive number
  • Parameter name: variance
    • Description: variance denotes the variances used when adjusting (encoding) the prior bounding boxes, one value per box coordinate. For example, variance equal 0.1,0.1,0.2,0.2 sets the variances for the box center and box size coordinates
    • Range of values: set of positive floating point numbers

Mathematical formulation:
PriorBox computes coordinates of prior boxes by following:

  1. First calculates center_x and center_y of prior box:

    \[ W \equiv Width \quad Of \quad Image \]

    \[ H \equiv Height \quad Of \quad Image \]

    • If step equals 0:
      \[ center_x=(w+0.5) \]
      \[ center_y=(h+0.5) \]
    • else:
      \[ center_x=(w+offset)*step \]
      \[ center_y=(h+offset)*step \]
      \[ w \subset \left( 0, W \right ) \]
      \[ h \subset \left( 0, H \right ) \]
  2. Then, for each $ s \subset \left( 0, min_sizes \right )$ calculates coordinates of prior boxes:
    \[ xmin = \frac{center_x - \frac{s}{2}}{W}; \]
    \[ ymin = \frac{center_y - \frac{s}{2}}{H}; \]
    \[ xmax = \frac{center_x + \frac{s}{2}}{W}; \]
    \[ ymax = \frac{center_y + \frac{s}{2}}{H}; \]
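
A minimal sketch of the center and corner computation above for square boxes of one min_size; the feature-map and image sizes, as well as the function names, are assumptions made for the example:

import numpy as np

def prior_box_centers(fm_h, fm_w, step, offset):
    # Centers of prior boxes for every cell (w, h) of the feature map
    ws, hs = np.meshgrid(np.arange(fm_w), np.arange(fm_h))
    if step == 0:
        return ws + 0.5, hs + 0.5
    return (ws + offset) * step, (hs + offset) * step

def prior_box_corners(cx, cy, size, img_h, img_w):
    # Normalized [xmin, ymin, xmax, ymax] for a square box of side `size`
    return ((cx - size / 2) / img_w, (cy - size / 2) / img_h,
            (cx + size / 2) / img_w, (cy + size / 2) / img_h)

cx, cy = prior_box_centers(fm_h=5, fm_w=5, step=64.0, offset=0.5)
xmin, ymin, xmax, ymax = prior_box_corners(cx, cy, size=162.0, img_h=320, img_w=320)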

Example

<layer ... type="PriorBox" ... >
    <data step="64.000000" min_size="162.000000" max_size="213.000000" offset="0.500000" flip="1" clip="0" aspect_ratio="2.000000,3.000000" variance="0.100000,0.100000,0.200000,0.200000" />
    <input> ... </input>
    <output> ... </output>
</layer>

SimplerNMS layer

Name: SimplerNMS

Short description: SimplerNMS layer performs filtering of bounding boxes and outputs only those with the highest confidence of prediction.

Parameters: SimplerNMS layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: pre_nms_topn (post_nms_topn)
    • Description: pre_nms_topn (post_nms_topn) is the quantity of bounding boxes before (after) applying NMS operation. For example, pre_nms_topn (post_nms_topn) equal 15 means that 15 bounding boxes are considered before (kept after) the NMS operation
    • Range of values: positive integer number
  • Parameter name: cls_threshold
    • Description: cls_threshold is the minimum value of the proposal to be taken into consideration. For example, cls_threshold equal 0.5 means that all boxes with prediction probability less than 0.5 are filtered out
    • Range of values: positive floating point number
  • Parameter name: iou_threshold
    • Description: iou_threshold is the minimum ratio of boxes overlapping to be taken into consideration. For example, iou_threshold equal 0.7 means that all boxes with overlapping ratio less than 0.7 are filtered out
    • Range of values: positive floating point number
  • Parameter name: feat_stride
    • Description: feat_stride is the step size to slide over boxes (in pixels). For example, feat_stride equal 16 means that all boxes are analyzed with the slide 16
    • Range of values: positive integer number
  • Parameter name: min_bbox_size
    • Description: min_bbox_size is the minimum size of box to be taken into consideration. For example, min_bbox_size equal 35 means that all boxes with box size less than 35 are filtered out
    • Range of values: positive integer number
  • Parameter name: scale
    • Description: scale is array of scales for anchor boxes generating
    • Range of values: positive integer number

Mathematical Formulation

SimplerNMS accepts three inputs with four dimensions. Produced blob has two dimensions, the first one equals post_nms_topn.

SimplerNMS does the following with the input blob:

  1. Generates initial anchor boxes. Left top corner of all boxes is (0, 0). Width and height of boxes are calculated based on scaled (according to the scale parameter) default widths and heights
  2. For each point in the first input blob:
    • pins anchor boxes to picture according to the second input blob, which contains four deltas for each box: for x and y of center, for width, and for height
    • finds out score in the first input blob
  3. Filters out boxes with size less than min_bbox_size.
  4. Sorts all proposals (box, score) by score from highest to lowest
  5. Takes top pre_nms_topn proposals
  6. Calculates intersections for boxes and filters out all with $intersection/union > iou_threshold$
  7. Takes top post_nms_topn proposals
  8. Returns top proposals
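
The following is a minimal NumPy sketch of steps 4-7 (greedy non-maximum suppression); it is not the Inference Engine implementation, and the box format [xmin, ymin, xmax, ymax] and the function names are assumptions:

import numpy as np

def iou(box, boxes):
    # Intersection over union of one box against an array of boxes, both as [xmin, ymin, xmax, ymax]
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, pre_nms_topn, post_nms_topn, iou_threshold):
    order = np.argsort(scores)[::-1][:pre_nms_topn]        # steps 4-5: sort by score, keep top pre_nms_topn
    keep = []
    while order.size and len(keep) < post_nms_topn:        # step 7: stop after post_nms_topn proposals
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]  # step 6: drop overlapping boxes
    return keep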

Example

<layer ... type="SimplerNMS" ... >
    <data cls_threshold="0.500000" iou_threshold="0.700000" min_bbox_size="16" feat_stride="16" pre_nms_topn="6000" post_nms_topn="150"/>
    <input> ... </input>
    <output> ... </output>
</layer>

DetectionOutput Layer

Name: DetectionOutput

Short description: DetectionOutput layer performs non-maximum suppression to generate the detection output using information on location and confidence predictions.

Detailed description: Reference

Parameters: DetectionOutput layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: num_classes
    • Description: number of classes to be predicted
    • Range of values: positive integer values
  • Parameter name: background_label_id
    • Description: background label id. If there is no background class, set it to -1
    • Range of values: integer values
  • Parameter name: top_k
    • Description: maximum number of results to be kept on NMS stage
    • Range of values: integer values
  • Parameter name: variance_encoded_in_target
    • Description: if True, variance is encoded in target; otherwise, we need to adjust the predicted offset accordingly
    • Range of values: logical values
  • Parameter name: keep_top_k
    • Description: number of total bboxes to be kept per image after NMS step. -1 means keeping all bboxes after NMS step
    • Range of values: integer values
  • Parameter name: num_orient_classes
    • Range of values: integer values
  • Parameter name: code_type
    • Description: type of coding method for bounding boxes
    • Range of values: caffe.PriorBoxParameter.CENTER_SIZE and others
  • Parameter name: share_location
    • Description: bounding boxes are shared among different classes
    • Range of values: logical values
  • Parameter name: interpolate_orientation
    • Range of values: integer values
  • Parameter name: nms_threshold
    • Description: threshold to be used in NMS stage
    • Range of values: floating point values
  • Parameter name: confidence_threshold
    • Description: only consider detections whose confidences are larger than a threshold. If not provided, consider all boxes
    • Range of values: floating point values

Mathematical Formulation

At each feature map cell, DetectionOutput predicts the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, DetectionOutput computes class scores and the four offsets relative to the original default box shape. This results in a total of $(c + 4)k$ filters that are applied around each location in the feature map, yielding $(c + 4)kmn$ outputs for an m × n feature map.

Example

<layer ... type="DetectionOutput" ... >
    <data num_classes="21" share_location="1" background_label_id="0" nms_threshold="0.450000" top_k="400" eta="1.000000" output_directory="" output_name_prefix="" output_format="" label_map_file="" name_size_file="" num_test_image="0" prob="1.000000" resize_mode="caffe.ResizeParameter.WARP" height="0" width="0" height_scale="0" width_scale="0" pad_mode="caffe.ResizeParameter.CONSTANT" pad_value="#" interp_mode="#" code_type="caffe.PriorBoxParameter.CENTER_SIZE" variance_encoded_in_target="0" keep_top_k="200" confidence_threshold="0.010000" visualize="0" visualize_threshold="0.000000" save_file=""/>
    <input> ... </input>
    <output> ... </output>
</layer>

Memory / Delay Object layer

Name: Memory

Short description: Memory layer represents delay layer in terms of LSTM terminology.

Detailed description: Memory layer saves state between two infer requests. In the topology, it is a single layer; however, in the Intermediate Representation, it is always represented as a pair of Memory layers. One of these layers does not have outputs and the other does not have inputs (in terms of the Intermediate Representation).

Parameters: Memory layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: id
    • Description: id is the id of the pair of Memory layers. For example, id equals r_27-28 means that layers with id 27 and 28 are in one pair
    • Range of values: positive integer number
  • Parameter name: index
    • Description: index represents if the given layer is input or output. For example, index equal 0 means this layer is output one
    • Range of values:
      • 0 - current layer is output one
      • 1 - current layer is input one
  • Parameter name: size
    • Description: size represents the size of the group. For example, size equals 2 means this group is a pair
    • Range of values: only 2 is supported

Mathematical Formulation

Memory saves data from the input blob.

Example

<layer ... type="Memory" ... >
    <data id="r_27-28" index="0" size="2" />
    <input> ... </input>
    <output> ... </output>
</layer>

Clamp Layer

Name: Clamp

Short description: Clamp layer represents clipping activation operation.

Detailed description: Reference

Parameters: Clamp layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: min
    • Description: min is the lower bound of values in the output shape. Any value in the input shape that is smaller than the bound, is replaced by the min value. For example, min equal 10 means that any value in the input shape that is smaller than the bound, is replaced by 10
    • Range of values: positive integer number
  • Parameter name: max
    • Description: max is the upper bound of values in the output shape. Any value in the input shape that is greater than the bound, is replaced by the max value. For example, max equals 50 means that any value in the input shape that is greater than the bound, is replaced by 50
    • Range of values: positive integer number

Mathematical Formulation

Clamp generally does the following with the input blobs:

\[ out_i=\left\{\begin{array}{ll} max\_value \quad if \quad input_i>max\_value \\ min\_value \quad if \quad input_i<min\_value \\ input_i \quad otherwise \end{array}\right. \]
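
Equivalently, in NumPy terms (a sketch; the function name is illustrative):

import numpy as np

def clamp(x, min_value=10, max_value=50):
    # Values below min_value become min_value, values above max_value become max_value
    return np.clip(x, min_value, max_value)

print(clamp(np.array([0, 25, 100])))   # [10 25 50]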

Example

<layer ... type="Clamp" ... >
    <data min="10" max="50" />
    <input> ... </input>
    <output> ... </output>
</layer>

ArgMax Layer

Name: ArgMax

Short description: ArgMax layer computes the indexes of the K maximum values for each datum across all dimensions CxHxW.

Detailed description: Intended for use after a classification layer to produce a prediction. If parameter out_max_val is set to True, output is a vector of pairs (max_ind, max_val) for each image. The axis parameter specifies an axis along which to maximize.

Parameters: ArgMax layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: top_k
    • Description: top_k is the number K of maximum items to output
    • Range of values: positive integer number
  • Parameter name: out_max_val
    • Description: if out_max_val equals 1, output is a vector of pairs (max_ind, max_val), unless axis is set. Then output is max_val along the specified axis
    • Range of values: 0 or 1
  • Parameter name: axis
    • Description: if set, maximizes along the specified axis, else maximizes the flattened trailing dimensions for each index of the first / num dimension
    • Range of values: integer values

Mathematical Formulation

ArgMax generally does the following with the input blobs:

\[ \arg\max_{x} f(x) = \left \{ x \mid \forall y : f(y) \leq f(x) \right\} \]
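
A minimal NumPy sketch of picking the top_k indices (or values, when out_max_val is set together with axis); the names and defaults are assumptions, not the Inference Engine implementation:

import numpy as np

def argmax_top_k(x, top_k=1, axis=-1, out_max_val=False):
    # Indices of the top_k largest values along `axis`, largest value first
    idx = np.argsort(-x, axis=axis)
    idx = np.take(idx, np.arange(top_k), axis=axis)
    if out_max_val:
        return np.take_along_axis(x, idx, axis=axis)   # maximum values along the axis
    return idx

x = np.random.rand(2, 5)
print(argmax_top_k(x, top_k=3))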

Example

<layer ... type="ArgMax" ... >
    <data top_k="10" out_max_val="1" axis="-1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

PSROIPooling Layer

Name: PSROIPooling

Short description: PSROIPooling layer computes position-sensitive max pooling on regions of interest specified by input. It takes as input N position-sensitive score maps and a list of R regions of interest.

Detailed description: Reference

Parameters: PSROIPooling layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: output_dim
    • Description: output_dim is the pooled output channel number
    • Range of values: positive integer number
  • Parameter name: group_size
    • Description: group_size is the number of groups to encode position-sensitive score maps
    • Range of values: positive integer number
  • Parameter name: spatial_scale
    • Description: spatial_scale is multiplicative spatial scale factor to translate ROI coordinates from their input scale to the scale used when pooling
    • Range of values: positive floating point value

Mathematical Formulation

The output value for the $(i, j)$-th bin is obtained by summation from one score map $x_{i,j}$ corresponding to that bin. In short, the difference from RoIPooling is that a general feature map $x$ is replaced by a specific position-sensitive score map $x_{i,j}$.

Example

<layer ... type="PSROIPooling" ... >
    <data output_dim="10" out_max_val="1" spatial_scale="0.1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

GRN Layer

Name: GRN

Short description: GRN is Global Response Normalization with L2 norm (across channels only).

Parameters: GRN layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: bias
    • Description: bias is added to the variance
    • Range of values: floating point value

Mathematical Formulation

GRN computes L2 norm by channels for input blob. GRN generally does the following with the input blob:

\[ output_{i} = \frac{input_{i}}{\sqrt{\sum_{i}^{C} input_{i}^{2} + bias}} \]
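
A sketch of the per-position channel normalization with the bias term added under the root (the NCHW layout and the function name are assumptions):

import numpy as np

def grn(x, bias=1.0):
    # Global Response Normalization: divide by the L2 norm over channels at every position
    norm = np.sqrt((x * x).sum(axis=1, keepdims=True) + bias)
    return x / norm

out = grn(np.random.rand(1, 8, 5, 5), bias=1.0)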

Example

<layer ... type="GRN" ... >
    <data bias="1.0"/>
    <input> ... </input>
    <output> ... </output>
</layer>

PReLU Layer

Name: PReLU

Short description: PReLU is the Parametric Rectified Linear Unit. The difference from ReLU is that negative slopes can vary across channels.

Parameters: PReLU layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: channel_shared
    • Description: channel_shared shows whether the negative slope is shared across channels or not
    • Range of values: 0 or 1
  • Parameter name: filler_type
    • Description: filler_type defines initialization type for negative slope
    • Range of values: string
  • Parameter name: filler_value
    • Description: filler_value defines the value in constant filler
    • Range of values: integer
  • Parameter name: min(max)
    • Description: min(max) defines the minimal(maximal) value in uniform filler
    • Range of values: integer
  • Parameter name: mean
    • Description: mean defines the mean value in Gaussian filler
    • Range of values: integer

Mathematical Formulation

PReLU accepts one input with four dimensions. The produced blob has the same dimensions as input.

PReLU does the following with the input blob:
\[ o_{i} = max(0, x_{i}) + w_{i} * min(0,x_{i}) \]

where $w_{i}$ is from weights blob.
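
A minimal NumPy sketch of the formula above with per-channel negative slopes (the NCHW layout and names are assumptions):

import numpy as np

def prelu(x, slopes):
    # x: blob of shape (N, C, H, W); slopes: per-channel negative slopes w_i of shape (C,)
    w = slopes[None, :, None, None]
    return np.maximum(0, x) + w * np.minimum(0, x)

out = prelu(np.random.randn(1, 4, 3, 3), slopes=np.full(4, 0.25))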

Example

<layer ... type="PReLU" ... >
    <data bias="1.0"/>
    <input> ... </input>
    <output> ... </output>
</layer>

PriorBoxClustered Layer

Name: PriorBoxClustered

Short description: PriorBoxClustered layer generates prior boxes of specified sizes.

Parameters: PriorBoxClustered layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: width (height)
    • Description: width (height) is a parameter that specifies desired boxes widths (heights) in pixels
    • Range of values: floating point positive number
  • Parameter name: clip
    • Description: clip is a flag that denotes if each value in the output blob is within [0,1]. For example, clip equal 1 means that each value in the output blob is within [0,1]
    • Range of values:
      • 0 - clipping is not performed
      • 1 - each value in the output blob is within [0,1]
  • Parameter name: flip
    • Description: flip is a flag that denotes whether the list of boxes is augmented with the flipped ones
    • Range of values:
      • 0 - list of boxes is not augmented with the flipped ones
      • 1 - list of boxes is augmented with the flipped ones
  • Parameter name: step (step_w, step_h)
    • Description: step (step_w, step_h) is a distance between box centers. For example, step equal 85 means that the distance between neighborhood prior boxes centers is 85
    • Range of values: floating point positive number
  • Parameter name: offset
    • Description: offset is a shift of box respectively to top left corner. For example, offset equal 85 means that the shift of neighborhood prior boxes centers is 85
    • Range of values: floating point positive number
  • Parameter name: variance
    • Description: variance denotes the variances used when adjusting (encoding) the prior bounding boxes, one value per box coordinate. For example, variance equal 0.1,0.1,0.2,0.2 sets the variances for the box center and box size coordinates
    • Range of values: set of positive floating point numbers
  • Parameter name: img_h (img_w)
    • Description: img_h (img_w) specifies height (width) of input image. These parameters are calculated unless provided explicitly
    • Range of values: floating point positive number

Mathematical Formulation

PriorBoxClustered computes coordinates of prior boxes by following:

  1. Calculates the center_x and center_y of prior box:
    \[ W \equiv Width \quad Of \quad Image \]
    \[ H \equiv Height \quad Of \quad Image \]
    \[ center_x=(w+offset)*step \]
    \[ center_y=(h+offset)*step \]
    \[ w \subset \left( 0, W \right ) \]
    \[ h \subset \left( 0, H \right ) \]
  2. For each specified pair $ (width_s, height_s) $ calculates the prior boxes coordinates:
    \[ xmin = \frac{center_x - \frac{width_s}{2}}{W} \]
    \[ ymin = \frac{center_y - \frac{height_s}{2}}{H} \]
    \[ xmax = \frac{center_x + \frac{width_s}{2}}{W} \]
    \[ ymax = \frac{center_y + \frac{height_s}{2}}{H} \]

If clip is defined, the coordinates of prior boxes are recalculated with the formula:
\[ coordinate = \min(\max(coordinate,0), 1) \]

Example

<layer ... type="PriorBoxClustered">
    <data clip="0" flip="0" height="44.0,10.0,30.0,19.0,94.0,32.0,61.0,53.0,17.0" offset="0.5" step="16.0" variance="0.1,0.1,0.2,0.2"
     width="86.0,13.0,57.0,39.0,68.0,34.0,142.0,50.0,23.0"/>
    <input>
        ...
    </input>
    <output>
        ...
    </output>
</layer>

MVN Layer

Name: MVN

Short description: Reference

Parameters: MVN layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: across_channels
    • Description: across_channels is a flag that denotes if mean values are shared across channels. For example, across_channels equal 0 means that mean values are not shared across channels
    • Range of values:
      • 0 - mean values are not shared across channels
      • 1 - mean values are shared across channels
  • Parameter name: normalize_variance
    • Description: normalize_variance is a flag that denotes whether to perform variance normalization
    • Range of values:
      • 0 - variance normalization is not performed
      • 1 - variance normalization is performed
  • Parameter name: eps
    • Description: eps is the number to be added to the variance to avoid division by zero when normalizing the value. For example, epsilon equal 0.001 means that 0.001 is added to the variance
    • Range of values: positive floating point number

Mathematical Formulation

MVN subtracts mean from the input blob:

\[ o_{i}=i_{i} - \frac{\sum {i_{k}}}{C*H*W} \]

If normalize_variance is set to 1, the output blob is divided by variance:

\[ o_{i}=\frac{o_{i}}{\sqrt{\sum o_{k}^{2}}+\epsilon} \]
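
A sketch that follows the formulas above literally (the NCHW layout and names are assumptions; production implementations may divide by the standard deviation rather than the plain L2 norm):

import numpy as np

def mvn(x, across_channels=True, normalize_variance=True, eps=1e-9):
    # x: blob of shape (N, C, H, W)
    axes = (1, 2, 3) if across_channels else (2, 3)
    out = x - x.mean(axis=axes, keepdims=True)     # subtract the mean
    if normalize_variance:
        # divide by the root of the summed squares, as in the formula above
        out = out / (np.sqrt((out * out).sum(axis=axes, keepdims=True)) + eps)
    return out

out = mvn(np.random.rand(1, 3, 4, 4))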

Example

<layer ... type="MVN">
    <data across_channels="1" eps="9.999999717180685e-10" normalize_variance="1"/>
    <input>
        ...
    </input>
    <output>
        ...
    </output>
</layer>

CTCGreadyDecoder Layer

Name: CTCGreadyDecoder

Short description: CTCGreadyDecoder performs greedy decoding on the logits given in input (best path).

Detailed description: Reference

Parameters: CTCGreadyDecoder layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: ctc_merge_repeated
    • Description: ctc_merge_repeated is a flag for collapsing repeated labels during the CTC calculation
    • Range of values: 0 or 1

Mathematical formulation

Given an input sequence X of length T, CTCGreadyDecoder assumes the probability of a length T character sequence C is given by,

\[ p(C|X) = \prod_{t=1}^{T} p(c_{t}|X) \]
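
A minimal sketch of best-path (greedy) decoding: take the argmax at each time step, optionally collapse repeated labels, and drop the blank label. The blank index of 0 and the (T, num_classes) logits layout are assumptions made for the example:

import numpy as np

def ctc_greedy_decode(logits, blank=0, merge_repeated=True):
    # logits: (T, num_classes) class scores per time step; the best path is the per-step argmax
    best_path = logits.argmax(axis=1)
    decoded, prev = [], None
    for label in best_path:
        if merge_repeated and label == prev:   # collapse repeated labels
            continue
        prev = label
        if label != blank:                     # drop blank labels
            decoded.append(int(label))
    return decoded

print(ctc_greedy_decode(np.random.rand(10, 5)))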

Example

<layer ... type="CTCGreadyDecoder" ... >
    <data stride="1"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Proposal Layer

Name: Proposal

Short description: Proposal layer filters bounding boxes and outputs only those with the highest confidence of prediction.

Parameters: Proposal layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: pre_nms_topn (post_nms_topn)
    • Description: pre_nms_topn (post_nms_topn) is the quantity of bounding boxes before (after) applying NMS operation. For example, pre_nms_topn (post_nms_topn) equal 15 means that 15 bounding boxes are considered before (kept after) the NMS operation
    • Range of values: positive integer number
  • Parameter name: nms_thresh
    • Description: nms_thresh is the minimum value of the proposal to be taken into consideration. For example, nms_thresh equal 0.5 means that all boxes with prediction probability less than 0.5 are filtered out
    • Range of values: positive floating point number
  • Parameter name: feat_stride
    • Description: feat_stride is the step size to slide over boxes (in pixels). For example, feat_stride equal 16 means that all boxes are analyzed with the slide 16
    • Range of values: positive integer number
  • Parameter name: min_size
    • Description: min_size is the minimum size of box to be taken into consideration. For example, min_size equal 35 means that all boxes with box size less than 35 are filtered out
    • Range of values: positive integer number
  • Parameter name: base_size
    • Description: base_size is the base size for anchor generation
    • Range of values: positive integer number
  • Parameter name: ratio
    • Description: ratio is the ratios for anchor generation
    • Range of values: array of float numbers
  • Parameter name: scale
    • Description: scale is the scales for anchor generation
    • Range of values: array of float numbers

Mathematical formulation

Proposal layer accepts three inputs with four dimensions. The produced blob has two dimensions: the first one equals batch_size * post_nms_topn.

Proposal does the following with the input blob:

  1. Generates initial anchor boxes. The left top corner of all boxes is (0, 0). Width and height of boxes are calculated from base_size with scale and ratio parameters
  2. For each point in the first input blob:
    • pins anchor boxes to the image according to the second input blob that contains four deltas for each box: for x and y of center, for width and for height
    • finds out score in the first input blob
  3. Filters out boxes with size less than min_size
  4. Sorts all proposals (box, score) by score from highest to lowest
  5. Takes top pre_nms_topn proposals
  6. Calculates intersections for boxes and filter out all with $intersection/union > nms_thresh$
  7. Takes top post_nms_topn proposals
  8. Returns top proposals
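
As a complement to the NMS sketch given for the SimplerNMS layer, the following illustrates step 1 only: generating one anchor per (ratio, scale) pair with its top left corner at (0, 0). The exact width/height convention of the layer is not reproduced; the area/aspect-ratio formula used here is a common Faster R-CNN style assumption:

import numpy as np

def generate_anchors(base_size=16, ratios=(2.67,), scales=(4.0, 6.0, 9.0, 16.0, 24.0, 32.0)):
    # One [xmin, ymin, xmax, ymax] anchor per (ratio, scale) pair, anchored at (0, 0)
    anchors = []
    for ratio in ratios:
        w = base_size / np.sqrt(ratio)   # keep the area, change the aspect ratio
        h = base_size * np.sqrt(ratio)
        for scale in scales:
            anchors.append([0.0, 0.0, w * scale, h * scale])
    return np.array(anchors)

print(generate_anchors().shape)   # (6, 4)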

Example

<layer ... type="Proposal" ... >
    <data base_size="16" feat_stride="16" min_size="16" nms_thresh="0.6" post_nms_topn="200" pre_nms_topn="6000" 
     ratio="2.67" scale="4.0,6.0,9.0,16.0,24.0,32.0"/>
    <input> ... </input>
    <output> ... </output>
</layer>

Resample Layer

Name: Resample

Short description: Resample layer scales the input blob by the specified parameters.

Parameters: Resample layer parameters should be specified as the data node, which is a child of the layer node.

  • Parameter name: type
    • Description: type parameter specifies type of blob interpolation
    • Range of values:
      • LINEAR - linear blob interpolation
      • CUBIC - cubic blob interpolation
      • NEAREST - nearest-neighbor blob interpolation
  • Parameter name: antialias
    • Description: antialias is a flag that denotes whether to perform anti-aliasing
    • Range of values:
      • 0 - anti-aliasing is not performed
      • 1 - anti-aliasing is performed

Mathematical formulation

Resample layer scales the input blob. Depending on the type parameter, Resample applies different blob interpolation algorithms and performs anti-aliasing if the antialias parameter is specified.

Example

<layer type="Resample"> 
  <data antialias="0" factor="1.0" height="227" type="caffe.ResampleParameter.LINEAR" width="227"/> 
      <input> 
      ... 
      </input> 
      <output> 
      ... 
      </output> 
​</layer>

Frequently Asked Questions

If your question is not covered by the topics below, use the CV SDK Support page, where you can participate on a free forum.

1. Current caffe.proto does not contain field

Internally, the Model Optimizer uses a protobuf library to parse and load Caffe* models. This library requires a file grammar and a generated parser. For a Caffe fallback, the Model Optimizer uses a Caffe-generated parser for a Caffe-specific .proto file (which is usually located in the src/caffe/proto directory). So, if you have Caffe installed on your machine with Python* interface available, make sure that this is exactly the version of Caffe that was used to create the model.

If you just want to experiment with the Model Optimizer and test a Python extension for working with your custom layers without building Caffe, add the layer description to the caffe.proto file and generate a parser for it.

For example, to add the description of the CustomReshape layer, which is an artificial layer not present in any caffe.proto files:

  1. Add the following lines to the caffe.proto file:
    package mo_caffe; // to avoid conflict with system Caffe* it is highly recommended to specify different package name
    ...
    message LayerParameter {
      // other layers parameters description
      ...
      optional CustomReshapeParameter custom_reshape_param = 546; // 546 - ID is any number not present in caffe.proto
    }
    // these lines to end of the file - describing contents of this parameter
    message CustomReshapeParameter {
      optional BlobShape shape = 1; // we just use the same parameter type as some other Caffe layers
    }
  2. Generate a new parser:
    cd <INSTALL_DIR>/deployment_tools/model_optimizer/mo/front/caffe/proto
    python3 generate_caffe_pb2.py --input_proto <PATH_TO_CUSTOM_CAFFE>/src/caffe/proto/caffe.proto

    where PATH_TO_CUSTOM_CAFFE is the path to the root directory of custom Caffe*.

  3. Now, the Model Optimizer is able to load the model into memory and start working with your extensions if there are any.

However, because your model has custom layers, you must register your custom layers as custom.

2. How do I create a bare caffemodel, if I have only prototxt?

You need the Caffe* Python* interface. In this case, do the following:

python3
import caffe
net = caffe.Net('<PATH_TO_PROTOTXT>/my_net.prototxt', caffe.TEST)
net.save('<PATH_TO_PROTOTXT>/my_net.caffemodel')

3. Unable to create ports for node with id

Most likely, the Model Optimizer does not know how to infer output shapes of some layers in the given topology. To lessen the scope, compile the list of layers that are custom for the Model Optimizer: layers present in the topology but absent from the list of supported layers for the target framework: Caffe*, TensorFlow*, MXNet*. Then refer to available options in the corresponding section: Caffe* Models with Custom Layers, TensorFlow* Models with Custom Layers, MXNet* Models with Custom Layers.

4. Input image of shape is larger than mean image from file

Your model input shapes must be smaller than or equal to the shapes of the mean image file you provide. The idea behind the mean file is to subtract its values from the input image in an element-wise manner. When the mean file is smaller than the input image, there are not enough values to perform element-wise subtraction. Also, make sure that you use the mean file that was used during the network training phase. Note that the mean file is dataset dependent.

5. Mean file is empty

Most likely, the mean file that you have specified with the --mean_file flag while launching the Model Optimizer is empty. Make sure that this is exactly the required mean file and try to regenerate it from the given dataset if possible.

6. Probably mean file has incorrect format

The mean file that you provide for the Model Optimizer must be in a .binaryproto format. You can try to check the content using recommendations from the BVLC Caffe* (#290).

7. Invalid .proto file: there is neither "layer" nor "layers" top-level messages

The structure of any Caffe* topology is described in the caffe.proto file of any Caffe version. For example, in the Model Optimizer, you can find the following proto file, used by default: <INSTALL_DIR>/deployment_tools/model_optimizer/mo/front/caffe/proto/my_caffe.proto. There you can find the structure:

message NetParameter {
  // ... some other parameters
  // The layers that make up the net.  Each of their configurations, including
  // connectivity and behavior, is specified as a LayerParameter.
  repeated LayerParameter layer = 100;  // ID 100 so layers are printed last.
  // DEPRECATED: use 'layer' instead.
  repeated V1LayerParameter layers = 2;
}

This means that any topology should contain layers as top-level structures in prototxt. For example, see the LeNet topology.

8. Old-style inputs (via input_dims) are not supported. Please, specify inputs via input_shape

The structure of any Caffe* topology is described in the caffe.proto file for any Caffe version. For example, in the Model Optimizer you can find the following .proto file, used by default: <INSTALL_DIR>/deployment_tools/model_optimizer/mo/front/caffe/proto/my_caffe.proto. There you can find the structure:

message NetParameter {

 optional string name = 1; // consider giving the network a name
  // DEPRECATED. See InputParameter. The input blobs to the network.
  repeated string input = 3;
  // DEPRECATED. See InputParameter. The shape of the input blobs.
  repeated BlobShape input_shape = 8;
  // 4D input dimensions -- deprecated.  Use "input_shape" instead.
  // If specified, for each input blob there should be four
  // values specifying the num, channels, height and width of the input blob.
  // Thus, there should be a total of (4 * #input) numbers.
  repeated int32 input_dim = 4;
  // ... other parameters
}

So, the input layer of the provided model must be specified in one of the following styles:

  • input: "data"
    input_shape
    {
        dim: 1
        dim: 3
        dim: 227
        dim: 227
    }
  • input: "data"
    input_shape
    {
        dim: 1
        dim: 3
        dim: 600
        dim: 1000
    }
    input: "im_info"
    input_shape
    {
         dim: 1
         dim: 3
    }
  • layer
    {
        name: "data"
        type: "Input"
        top: "data"
        input_param {shape: {dim: 1 dim: 3 dim: 600 dim: 1000}}
    }
    layer
    {
        name: "im_info"
        type: "Input"
        top: "im_info"
        input_param {shape: {dim: 1 dim: 3}}
    }
  • input: "data"
    input_dim: 1
    input_dim: 3
    input_dim: 500

However, if your model contains more than one input, the Model Optimizer is able to convert the model with inputs specified in one of the first three forms of the list above. The last form is not supported for multi-input topologies.

9. Mean file for topologies with multiple inputs is not supported

Model Optimizer does not support mean file processing for topologies with more than one input. In this case, you need to perform preprocessing of the inputs for a generated Intermediate Representation in the Inference Engine to perform subtraction for every input of your multi-input model.

10. Cannot load or process mean file: value error

There are multiple reasons why the Model Optimizer does not accept the mean file. See FAQ #4, #5, and #6.

11. Invalid prototxt file: value error

There are multiple reasons why the Model Optimizer does not accept a Caffe* topology. See FAQs #7 and #20.

12. Error happened while constructing caffe.Net in the Caffe* fallback function

Model Optimizer tried to infer a specified layer via the Caffe* framework, however it cannot construct a net using the Caffe Python* interface. Make sure that your caffemodel and prototxt files are correct. To prove that the problem is not in the prototxt file, see FAQ #2.

13. Cannot infer shapes due to exception in Caffe*

Model Optimizer tried to infer a custom layer via the Caffe* framework, however an error occurred, meaning that the model could not be inferred using Caffe. It might happen if you try to convert a model with some noise weights and biases, resulting in problems with layers that have dynamic shapes. You should write your own extension for every custom layer your topology might have. For more details, refer to: Extending Model Optimizer with New Primitives.

14. Cannot infer shape for node {} because there is no Caffe* available. Please, register Python* inference function for op or use Caffe for shape inference

Your model contains a custom layer and you have correctly registered it with the CustomLayersMapping.xml file. These steps are required to offload shape inference of the custom layer with the help of the system Caffe*. However, the Model Optimizer could not import a Caffe package. Make sure that you have built Caffe with a pycaffe target and added it to the PYTHONPATH environment variable. For more information, refer to Configuring the Model Optimizer. At the same time, it is highly recommended to avoid dependency on Caffe and write your own Model Optimizer extension for your custom layer. For more information, refer to FAQ #45.

15. Framework name can not be deduced from the given options. Use --framework to choose one of Caffe*, TensorFlow*, MXNet*

You have run the Model Optimizer without a flag --framework caffe|tf|mxnet. Model Optimizer tries to deduce the framework by the input model file extension (.pb for TensorFlow*, .caffemodel for Caffe*, .params for MXNet*). Your input model might have a different extension and you need to explicitly set the source framework. For example, use --framework caffe.

16. Input shape is required to convert MXNet* model. Please, provide it with --input_shape

Input shape was not provided. That is mandatory for converting an MXNet* model to the Intermediate Representation, because MXNet models do not contain information about input shapes. Please, use the --input_shape flag to specify it. For more information about using the --input_shape, refer to the FAQ #57.

17. Both --mean_file and --mean_values are specified. Specify either mean file or mean values

--mean_file and --mean_values are two ways of specifying preprocessing for the input. However, they cannot be used together, as it would mean double subtraction and lead to ambiguity. Choose one of these options and pass it using the corresponding CLI option.

18. Negative value specified for --mean_file_offsets option. Please, specify positive integer values in format '(x,y)'

You might have specified negative values with --mean_file_offsets. Only positive integer values in format '(x,y)' must be used.

19. Both --scale and --scale_values are defined. Specify either scale factor or scale values per input channels

--scale sets a scaling factor for all channels. --scale_values sets a scaling factor per each channel. Using both of them simultaneously produces ambiguity, so you must use only one of them. For more information, refer to the Using Framework-Agnostic Conversion Parameters: for Converting a Caffe* Model, Converting a TensorFlow* Model, Converting an MXNet* Model.

20. Cannot find .prototxt file: for Caffe* please specify --input_proto - a protobuf file that stores topology and --input_model that stores pretrained weights

Model Optimizer cannot find a .prototxt file for a specified model. By default, it must be located in the same directory as the input model with the same name (except extension). If any of these conditions is not satisfied, use --input_proto to specify the path to the .prototxt file.

21. Specified input model does not exist

You probably specified an incorrect path to a model. Make sure that the path is correct and the file exists.

22. Failed to create directory .. . Permission denied!

Model Optimizer cannot create a directory specified via --output_dir. Make sure that you have enough permissions to create the specified directory.

23. Discovered data node without inputs and value

One of the layers in the specified topology might not have inputs or values. Please make sure that the provided caffemodel and protobuf files are correct.

24. Part of the nodes was not translated to IR. Stopped

Some of the layers are not supported by the Model Optimizer and cannot be translated to an Intermediate Representation. You can extend the Model Optimizer by adding new primitives. For more information, refer to Extending the Model Optimizer with New Primitives page.

25. While creating an edge from .. to .. : node name is undefined in the graph. Check correctness of the input model

Model Optimizer cannot build a graph based on a specified model. Most likely, it is incorrect.

26. Node does not exist in the graph

You might have specified an output node via the --output flag that does not exist in a provided model. Make sure that the specified output is correct and this node exists in the current model.

27. --input parameter was provided. Other inputs are needed for output computation. Provide more inputs or choose another place to cut the net

Most likely, the Model Optimizer tried to cut the model by a specified input. However, other inputs are needed.

28. Placeholder node does not have input port, but input port was provided

You might have specified a placeholder node with an input node, while the placeholder node does not have it in the model.

29. Port index is out of number of available input ports for node

This error occurs when an incorrect input port is specified with the --input command line argument. When using --input, you can optionally specify an input port in the form: X:node_name, where X is an integer index of the input port starting from 0 and node_name is the name of a node in the model. This error occurs when the specified input port X is not in the range 0..(n-1), where n is the number of input ports for the node. Please, specify a correct port index, or do not use it if it is not needed.

30. Node has more than 1 input and input shapes were provided. Try not to provide input shapes or specify input port with PORT:NODE notation, where PORT is an integer

This error occurs when an incorrect combination of the --input and --input_shape command line options is used. Using both --input and --input_shape is valid only if --input points to the Placeholder node, a node with one input port or --input has the form PORT:NODE, where PORT is an integer port index of input for node NODE. Otherwise, the combination of --input and --input_shape is incorrect.

31. Input port > 0 in --input is not supported if --input_shape is not provided. Node: NAME_OF_THE_NODE. Omitted port index and all input ports will be replaced by placeholders. Or provide --input_shape

When using the PORT:NODE notation for the --input command line argument and PORT > 0, you should specify --input_shape for this input. This is a limitation of the current Model Optimizer implementation.

32. No or multiple placeholders are in the model, but only one shape is provided, cannot set it

Looks like you have provided only one shape for the placeholder, however there are no or multiple inputs in the model. Please, make sure that you have provided correct data for placeholder nodes.

33. The amount of input nodes for port is not equal to 1

This error occurs when the SubgraphMatch.single_input_node function is used for an input port that supplies more than one node in a sub-graph. The single_input_node function can be used only for ports that have a single consumer inside the matching sub-graph. When multiple nodes are connected to the port, use the input_nodes or node_by_pattern function instead of single_input_node. Please, refer to Sub-Graph Replacement in the Model Optimizer for more details.

34. Output node for port has already been specified

This error occurs when the SubgraphMatch._add_output_node function is called manually from user's extension code. This is an internal function, and you should not call it directly.

35. Unsupported match kind .. . Match kinds points or scope are supported only

While using configuration file to implement a TensorFlow* front replacement extension, an incorrect match kind was used. Only points or scope match kinds are supported. Please, refer to Sub-Graph Replacement in the Model Optimizer for more details.

36. Cannot write an event file for the TensorBoard to directory

Model Optimizer tried to write an event file in the specified directory but failed to do that. That could happen because the specified directory does not exist or you do not have enough permissions to write in it.

37. There is no registered infer function for node with op = .. . Please, implement this function in the extensions

Most likely, you tried to extend Model Optimizer with a new primitive, but did not specify an infer function. For more information on extensions, see Extending the Model Optimizer with New Primitives.

38. Stopped shape/value propagation at node ..

Model Optimizer cannot infer shapes or values for the specified node. It can happen because of a bug in the custom shape infer function, because the node inputs have incorrect values/shapes, or because the input shapes are incorrect.

39. The input with shape .. does not have the batch dimension

Batch dimension is the first dimension in the shape and it should be equal to 1 or undefined. In your case, it is not equal to either 1 or undefined, which is why the -b shortcut produces undefined and unspecified behavior. To resolve the issue, specify full shapes for each input with the --input_shape option. Run Model Optimizer with the --help option to learn more about the notation for input shapes.

40. Not all output shapes were inferred or fully defined for node

Most likely, the shape is not defined (partially or fully) for the specified node. You can use --input_shape with positive integers to override model input shapes.

41. Shape for tensor is not defined. Cannot proceed

This error occurs when the --input command line option is used to cut a model and --input_shape is not used to override shapes for a node and a shape for the node cannot be inferred by Model Optimizer. You need to help Model Optimizer and specify shapes with --input_shape for each node that is specified with the --input command line option.

42. Module tensorflow was not found. Please install TensorFlow* 1.2 or higher

To convert TensorFlow* models with Model Optimizer, TensorFlow* 1.2 or newer must be installed. For more information on prerequisites, see Configuring the Model Optimizer.

43. Cannot read the model file: it is incorrect TensorFlow* model file or missing

The model file should contain a frozen TensorFlow* graph in the text or binary format. Make sure that --input_model_is_text is provided for a model in the text format. By default, a model is interpreted as binary file.

44. Cannot preprocess TensorFlow* graph after reading from model file. File is corrupted or has unsupported format

Most likely, there is a problem with the specified file for model. The file exists, but it has bad formatting or is corrupted.

45. Found custom layer. Model Optimizer does not support this layer. Please, register it in CustomLayersMapping.xml or implement extension

This means that the layer {layer_name} is not supported in the Model Optimizer. You can find a list of all unsupported layers in the corresponding section. You should add this layer to CustomLayersMapping.xml (Legacy Mode for Caffe* Custom Layers) or implement the extensions for this layer (Extending Model Optimizer with New Primitives).

46. Custom replacement configuration file does not exist

Path to the custom replacement configuration file was provided with the --tensorflow_use_custom_operations_config flag, but the file could not be found. Please, make sure that the specified path is correct and the file exists.

47. Extractors collection have case insensitive duplicates

When extending Model Optimizer with new primitives keep in mind that their names are case insensitive. Most likely, another operation with the same name is already defined. For more information, see Extending the Model Optimizer with New Primitives.

48. Input model name is not in an expected format, cannot extract iteration number

Model Optimizer can not load an MXNet* model in the specified file format. Please, use the .json or .param format.

49. Cannot convert type of placeholder because not all of its outputs are Cast to float operations

There are models where Placeholder has the UINT8 type and the first operation after it is 'Cast', which casts the input to FP32. Model Optimizer detected that the Placeholder has the UINT8 type, but the next operation is not 'Cast' to float. Model Optimizer does not support such a case. Please, change the model to have placeholder FP32 data type.

50. Data type is unsupported

Model Optimizer cannot convert the model to the specified data type. Currently, FP16 and FP32 are supported. Please, specify the data type with the --data_type flag. The available values are: FP16, FP32, half, float.

51. No node with name ..

Model Optimizer tried to access a node that does not exist. This could happen if you have incorrectly specified placeholder, input or output node name.

52. Module mxnet was not found. Please, install MXNet* 1.0.0

To convert MXNet* models with Model Optimizer, MXNet 1.0.0 must be installed. For more information about prerequisites, see Configuring the Model Optimizer.

53. The following error happened while loading MXNet* model ..

Most likely, there is a problem with loading of the MXNet* model. Please, make sure that the specified path is correct, the model exists, it is not corrupted, and you have sufficient permissions to work with it.

54. The following error happened while processing input shapes: ..

Please, make sure that inputs are defined and have correct shapes. You can use --input_shape with positive integers to override model input shapes.

55. Attempt to register of custom name for the second time as class. Note that custom names are case-insensitive

When extending Model Optimizer with new primitives keep in mind that their names are case insensitive. Most likely, another operation with the same name is already defined. For more information, see Extending the Model Optimizer with New Primitives.

56. Both --input_shape and --batch were provided. Please, provide only one of them

You cannot specify the batch and the input shape at the same time. You should specify a desired batch as the first value of the input shape.

57. Input shape ... cannot be parsed

The specified input shape cannot be parsed. Please, define it in one of the following ways:

  • python3 mo.py --input_model <INPUT_MODEL>.caffemodel --input_shape (1,3,227,227)
  • python3 mo.py --input_model <INPUT_MODEL>.caffemodel --input_shape [1,3,227,227]
  • In case of multi input topology you should also specify inputs:
    python3 mo.py --input_model /path-to/your-model.caffemodel --input data,rois --input_shape (1,3,227,227),(1,6,1,1)

Keep in mind that there must be no spaces between or inside the brackets for input shapes.

58. Please, provide input layer names for input layer shapes

When specifying input shapes for several layers, you must provide names for inputs, whose shapes will be overwritten. For usage examples, see Converting a Caffe* Model. Additional information for --input_shape is in FAQ #57.

59. Values cannot be parsed

Mean values for the given parameter cannot be parsed. It should be a string with a list of mean values. For example, in '(1,2,3)', 1 stands for the RED channel, 2 for the GREEN channel, 3 for the BLUE channel.

60. .. channels are expected for given values

The number of channels and the number of given values for mean values do not match. The shape should be defined as '(R,G,B)' or '[R,G,B]'. The shape should not contain undefined dimensions (? or -1). The order of values is as follows: (value for a RED channel, value for a GREEN channel, value for a BLUE channel).

61. You should specify input for each mean value

Most likely, you have not specified inputs using --mean_values. Please, specify inputs with the --input flag. For usage examples, please, refer to FAQ #63.

62. You should specify input for each scale value

Most likely, you have not specified inputs using --scale_values. Please, specify inputs with the --input flag. For usage examples, please, refer to FAQ #64.

63. Number of inputs and mean values do not match

The number of specified mean values and the number of inputs must be equal. Please, refer to Converting a Caffe* Model for a usage example.

64. Number of inputs and scale values do not match

The number of specified scale values and the number of inputs must be equal. Please, refer to Converting a Caffe* Model for a usage example.

65. No class registered for match kind .. . Supported match kinds are ..

A replacement defined in the configuration file for sub-graph replacement using node names patterns or start/end nodes has the match_kind attribute. The attribute may have only one of the values: scope or points. If a different value is provided, this error is displayed.

66. No instance(s) is(are) defined for the custom replacement

A replacement defined in the configuration file for sub-graph replacement using node names patterns or start/end nodes has the instances attribute. This attribute is mandatory, and it causes this error if it is missing. Refer to documentation with a description of the sub-graph replacement feature.

67. The instance must be a single dictionary for the custom replacement with id ..

A replacement defined in the configuration file for sub-graph replacement using start/end nodes has the instances attribute. For this type of replacement, the instance must be defined with a dictionary with two keys start_points and end_points. Values for these keys are lists with the start and end node names, respectively. Refer to documentation with a description of the sub-graph replacement feature.

68. No instances are defined for replacement with id ..

A replacement for the specified id is not defined in the configuration file. Please, refer to FAQ #66 for more information.

69. Custom replacements configuration file ... does not exist

Path to a custom replacement configuration file was provided with the --tensorflow_use_custom_operations_config flag, but it cannot be found. Please, make sure that the specified path is correct and the file exists.

70. Failed to parse custom replacements configuration file ...

The file for custom replacement configuration provided with the --tensorflow_use_custom_operations_config flag cannot be parsed. In particular, it should have a valid JSON structure. For more details, refer to JSON Schema Reference.

71. One of the custom replacements in the configuration file .. does not contain attribute id

Every custom replacement should declare a set of mandatory attributes and their values. For more details, refer to FAQ #72.

72. File .. validation failed

The file for custom replacement configuration provided with the --tensorflow_use_custom_operations_config flag cannot pass validation. Make sure that you have specified id, instances and match_kind for all the patterns.

73. Cannot update the file .. because it is broken

The custom replacement configuration file provided with the --tensorflow_custom_operations_config_update cannot be parsed. Please, make sure that the file is correct and refer to FAQs #69, #70, #71, and #72.

74. End node .. is not reachable from start nodes: ..

This error occurs when you try to make a sub-graph match. It is detected that between the start and end nodes that were specified as inputs/outputs of the subgraph to find, there are nodes that are marked as outputs but there is no path from them to the input nodes. Make sure that the subgraph you want to match does actually contain all the specified output nodes.

75. Sub-graph contains network input node ..

Start or end node for the sub-graph replacement using start/end nodes is specified incorrectly. Model Optimizer finds internal nodes of the sub-graph strictly "between" the start and end nodes. Then it adds all input nodes to the sub-graph (and inputs of their inputs and so on) for these "internal" nodes. The error reports, that the Model Optimizer reached input node during this phase. This means that the start/end points are specified incorrectly in the configuration file. Refer to documentation with a description of the sub-graph replacement feature.

76. ... elements of ... were clipped to infinity while converting a blob for node [...] to ...

This message may appear when the --data_type=FP16 command line option is used. This option implies conversion of all the blobs in the node to FP16. If a value in a blob is out of the range of valid FP16 values, the value is converted to positive or negative infinity. It may lead to incorrect results of inference or may not be a problem, depending on the model. The number of such elements and the total number of elements in the blob is printed out together with the name of the node, where this blob is used.

77. ... elements of ... were clipped to zero while converting a blob for node [...] to ...

This message may appear when the --data_type=FP16 command line option is used. This option implies conversion of all blobs in the model to FP16. If a value in a blob is so close to zero that it cannot be represented as a valid FP16 value, it is converted to a true zero FP16 value. Depending on the model, this may lead to incorrect inference results or may not be a problem at all. The number of such elements and the total number of elements in the blob are printed out together with the name of the node where this blob is used.
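
The short numpy sketch below only mimics the conversion to illustrate both messages (items 76 and 77); it is not the Model Optimizer code itself, and the values are arbitrary.

# A small numpy sketch: FP32 values outside the FP16 range become +/-infinity,
# and values too close to zero become zero. This is an illustration only.
import numpy as np

blob = np.array([1.0e5, -6.0e4, 1.0e-9, 3.0, 1.0e8], dtype=np.float32)
converted = blob.astype(np.float16)

clipped_to_inf = np.count_nonzero(np.isinf(converted) & ~np.isinf(blob))
clipped_to_zero = np.count_nonzero((converted == 0) & (blob != 0))

print(converted)                 # approximately [inf, -6.0e+04, 0., 3., inf]
print(clipped_to_inf, "of", blob.size, "elements were clipped to infinity")
print(clipped_to_zero, "of", blob.size, "elements were clipped to zero")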

78. The amount of nodes matched pattern ... is not equal to 1

This error occurs when the SubgraphMatch.node_by_pattern function is used with a pattern that does not uniquely identify a single node in a sub-graph. Try to extend the pattern string to make an unambiguous match to a single sub-graph node. For more details, refer to Sub-graph Replacement in the Model Optimizer.

79. The topology contains no input layers

Your Caffe* topology .prototxt file is intended for training. Model Optimizer expects a deployment-ready .prototxt file. To fix the problem, prepare a deployment-ready .prototxt file. Usually, preparation of a deploy-ready topology results in removing data layer(s), adding input layer(s), and removing loss layer(s).

80. Warning: please expect that Model Optimizer conversion might be slow

You are using an unsupported Python* version. Use only versions 3.4 - 3.6 with the C++ protobuf implementation that is supplied with the CV SDK. You can still boost conversion speed by building the protobuf library from sources; refer to the Model Optimizer documentation for complete build instructions.

Known Issues

Old proto Compiler Breaks protobuf Library

An incompatibility is possible with Python protobuf library version 3.5.1. This is a known issue for CentOS 7.4.

Error log report:

File "../lib64/python3.5/site-packages/google/protobuf/descriptor.py", line 829, in _new_ 
return _message.default_pool.AddSerializedFile(serialized_pb) 
TypeError: expected bytes, str found

A possible workaround is to upgrade the default protobuf compiler (libprotoc 2.5.0) to a newer version, such as libprotoc 2.6.1.

 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Pentium, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used with permission by Khronos.

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

 

Intel® Computer Vision SDK Release Notes


Introduction

The Intel® Computer Vision SDK 2018 (Intel® CV SDK) is a comprehensive toolkit for quickly developing applications and solutions that emulate human vision. Based on Convolutional Neural Networks (CNNs), the SDK extends CV workloads across Intel® hardware, maximizing performance.

The Intel® CV SDK:

  • Enables the CNN-based deep learning inference on the edge.
  • Supports heterogeneous execution across Intel CV accelerators, using a common API for the CPU, Intel® Integrated Graphics, Intel® Movidius™ Neural Compute Stick, and FPGA.
  • Speeds time-to-market through an easy-to-use library of CV functions and pre-optimized kernels.
  • Includes optimized calls for CV standards, including OpenCV*, OpenCL™, and OpenVX*

New and Changed in This Release

Model Optimizer Changes

The Model Optimizer component has been replaced by a Python*-based application. It has a consistent design across the supported frameworks. Key features are listed below. See the Model Optimizer Developer Guide for more information.

  • General changes:
    • Several CLI options have been deprecated since the last release. See the Model Optimizer Developer Guide for more information.
    • More optimization techniques were added.
    • Usability, stability, and diagnostics capabilities were improved.
    • Microsoft* Windows* 10 support was added.
    • A total of more than 100 public models are now supported for Caffe*, MXNet*, and TensorFlow* frameworks. 
    • A fallback to the original framework is available for unsupported layers; the framework itself is required only in that case.
  • Caffe* changes:
    • The workflow was simplified, and you are no longer required to install Caffe.
    • Caffe is no longer required to generate the Intermediate Representation for models that consist of standard layers and/or user-provided custom layers. User-provided custom layers must be properly registered for the Model Optimizer and the Inference Engine. See Using the Model Optimizer to Convert Caffe* Models for details and a list of standard layers.
    • Caffe is now only required for unsupported layers that are not registered as extensions in the Model Optimizer.
  • TensorFlow* support is significantly improved, and now offers a preview of the Object Detection API support for SSD*-based topologies.

Inference Engine

  • Added Heterogeneity support:
    • Device affinities via API are now available for fine-grained, per-layer control.
    • You can now specify a CPU fallback for layers that the FPGA does not support. For example, you can specify HETERO:FPGA,CPU as a device option for Inference Engine samples.
    • You can use the CPU + Intel® Integrated Graphics fallback if you have custom layers implemented only on the CPU and want to execute the rest of the topology on Intel® Integrated Graphics without rewriting the custom layers for Intel® Integrated Graphics.
  • Asynchronous execution: The Asynchronous API improves the overall application frame rate, allowing you to perform secondary tasks, like next frame decoding, while the accelerator is busy with current frame inference.
  • New customization features include easy-to-create Inference Engine operations. You can:
    • Express the new operation as a composition of existing Inference Engine operations or register the operation in the Model Optimizer.
    • Connect the operation to the new Inference Engine layer in C++ or OpenCL™. The existing layers are reorganized to “core” (general primitives) and “extensions” (topology-specific, such as DetectionOutput for SSD). These extensions now come as source code. You must build and load them into your application. After the Inference Engine samples are compiled, this library is built automatically, and every sample explicitly loads the library upon execution. The extensions are also required for the pre-trained models inference.
  • Plugin support added for the Intel® Movidius™ Neural Compute Stick hardware (Myriad2).
  • Samples are provided for an increased understanding of the Inference Engine, APIs, and features:
    • All samples automatically support heterogeneous execution.
    • Async API showcase in Object Detection via the SSD sample.
    • A minimalistic Hello Classification sample that demonstrates Inference Engine API usage.

OpenCV*

  • Updated to version 3.4.1 with minor patches. See the change log for details. Notable changes:
    • Implementation of on-disk caching of precompiled OpenCL kernels. This feature reduces initialization time for applications that use several kernels.
    • Improved C++ 11 compatibility on source and binary levels.
  • Added subset of OpenCV samples from the community version to showcase the SDK capabilities:
    • bgfg_segm.cpp - background segmentation
    • colorization.cpp - performs image colorization using DNN module (download the network from a third-party site)
    • dense_optical_flow.cpp - dense optical flow using T-API (Farneback, TVL1)
    • opencl_custom_kernel.cpp - running custom OpenCL™ kernel via T-API
    • opencv_version.cpp - the simplest OpenCV application - prints library version and build configuration
    • peopledetect.cpp - pedestrian detector using built-in HOGDescriptor

OpenVX*

  • A new memory management scheme with the Imaging and Analytics Pipeline (IAP) framework drastically reduces memory consumption.
    • Introduces intermediate image buffers that result in a significant memory footprint reduction for complex Printing and Imaging (PI) pipelines operating with extremely large images.
    • Deprecated tile pool memory consumption reduction feature. Removed from the Copy Pipeline sample.
  • The OpenVX* CNN path is not recommended for CNN-based applications and is partially deprecated:
    • CNN AlexNet* sample is removed.
    • CNN Custom Layer (FCN8) and Custom Layers library are removed.
    • The OpenVX* SSD-based Object Detection web article is removed.
    • OpenVX* FPGA plugin is deprecated. This is part of the CNN OVX deprecation.
  • The VAD tool for creating OpenVX* applications is deprecated and removed.
  • New recommendation: Use Deep Learning Inference Engine capabilities for CNN-based applications.

Examples and Tutorials

  • Model downloader for Intel® CV SDK public models in Caffe format:
  • Cross-check tool: To debug the model inference both in whole and layer-by-layer, comparing accuracy and performance between CPU, Intel® Integrated Graphics, and the Intel® Movidius™ Neural Compute Stick.
  • CNN pre-trained models (prototxt) + pre-generated Intermediate Representations (.xml + .bin):
    • age-gender-recognition-retail: Age and gender classification.
    • face-detection-retail: Face Detection.
    • person-detection-retail: Person detection.
    • license-plate-recognition-barrier: Chinese license plate recognition.
    • face-detection-adas: Face Detection.
    • person-detection-retail: Person Detection.
    • head-pose-estimation-adas: Head pose estimation (yaw, pitch, and roll).
    • vehicle-attributes-recognition-barrier: Vehicle attributes (type/color) recognition.
    • person-vehicle-bike-detection-crossroad: Multiclass (person, vehicle, non-vehicle) detector.
    • vehicle-license-plate-detection-barrier: Multiclass (vehicle, license plates) detector.

Known Issues

ID | Description | Component | Workaround
1 | Releasing a non-virtual vx_array object after it has been used as a parameter in a graph and before graph execution may result in slow vxProcessGraph and data corruption. | OpenVX* | N/A
2 | When a graph is abandoned due to a failure in a user node, the callbacks that are attached to skipped nodes are called. | OpenVX | N/A
3 | The OpenVX* volatile kernel extensions API is subject to change. | OpenVX | N/A
4 | Multiple user nodes accessing the same array cause an application crash. | OpenVX | N/A
5 | The Intel® Integrated Graphics equalize histogram node partially runs on the CPU. | OpenVX | N/A
6 | A user node hangs when calling Intel® Integrated Performance Primitives if the node is linked to IAP.so. | OpenVX | N/A
7 | The Edge Tracing part of IPU Canny Edge detection runs on the CPU. | OpenVX | N/A
8 | The Harris Corners* Kernel Extension produces inaccurate results when the sensitivity parameter is set outside the range of [0.04; 0.15]. | OpenVX | N/A
9 | The API vxQueryNode() returns zero for custom Intel® Integrated Graphics nodes when queried for the attribute VX_NODE_ATTRIBUTE_PERFORMANCE. | OpenVX | N/A
10 | Node creation methods do not allow using the NULL pointer for non-optional parameters. | OpenVX | N/A
11 | The vx_delay object does not support the vx_tensor and vx_object_array types. | OpenVX | N/A
12 | The vx_delay object is not supported as a user node input parameter. | OpenVX | N/A
13 | Scalar arguments do not change dynamically at runtime in several nodes on Intel® Integrated Graphics (Harris Corners*, ColorConvert*, Convolve*). | OpenVX | N/A
14 | The OpenCL™ out-of-order queue feature might slow down a single-node graph. | OpenVX | N/A
15 | On the CPU, the rounding_policy parameter of vxConvolutionLayer is ignored; TO_ZERO rounding is used in any case. | OpenVX | N/A
16 | On the CPU, the rounding_policy parameter of vxFullyConnectedLayer is ignored; TO_ZERO rounding is used in any case. | OpenVX | N/A
17 | On the CPU, the rounding_policy parameter of vxTensorMultiplyNode is ignored; the TO_ZERO policy is used in any case. | OpenVX | N/A
18 | Dynamic shapes for Caffe* layers are not supported. | Model Optimizer | N/A
19 | Some TensorFlow operations are not supported; only a limited set of different operations can be successfully converted. | Model Optimizer | Enable unsupported ops through Model Optimizer extensions and Inference Engine custom layers.
20 | Only TensorFlow models with FP32 placeholders are supported. If there is a non-FP32 placeholder, the next immediate operation after this placeholder should be a Cast operation that converts to FP32. | Model Optimizer | Rebuild your model to include an FP32 placeholder only, or add cast operations.
21 | Only TensorFlow models with FP32 weights are supported. | Model Optimizer | Rebuild your model to have FP32 weights only.
22 | The recent version of the TensorFlow Detection API is not supported. Only SSD models frozen with versions prior to r1.6.0 of the Detection API can be converted. | Model Optimizer | N/A
23 | The pre-built protobuf binary distributed as an egg-file with the Model Optimizer breaks the Python 3.5.2 installation. It should not be used with Python 3.5.2. | Model Optimizer | Build the protobuf binary yourself (recommended), or use the Python version of protobuf (slow).
24 | TensorFlow models with trainable layers such as Conv2D or MatMul that re-use weights from the same Const operations cannot be successfully converted. | Model Optimizer | Rebuild the model with duplicated Const operations to avoid weights sharing.
25 | Embedded preprocessing in Caffe models is not supported and is ignored. | Model Optimizer | Pass preprocessing parameters through Model Optimizer CLI parameters.
26 | A shape-infer function implemented via Model Optimizer extensions for new TensorFlow operations is provided. Fallback to TensorFlow does not work correctly for unknown operations that are not part of a constant sub-graph. | Model Optimizer | 
27 | Releasing the plugin's pointer before inference completion might cause a crash. | Inference Engine | Release the plugin pointer at the end of the application, when inference is done.
28 | Altera* OpenCL* 17.1 might not be installed properly. Follow the installation guide. | Inference Engine | Use the instructions in the FPGA installation guide.
29 | FP11 bitstreams can be programmed to the boards using the flash approach only. | Inference Engine | Use the instructions in the FPGA installation guide.
30 | If Intel OpenMP was initialized before OpenCL, OpenCL will hang. This means initialization or execution on the FPGA will hang, too. | Inference Engine | Initialize the FPGA or Heterogeneous plugin with the FPGA plugin priority before the CPU plugin.
31 | The performance of the first iteration of the samples for networks executing on FPGA is much lower than the performance of the next iterations. | Inference Engine | Use the -ni <number> and -pc options to get the real performance of inference on FPGA.
32 | To select the best bitstream for a custom network, evaluate all available bitstreams and choose the bitstream with the best performance and accuracy. Use validation_app to collect accuracy and performance data for the validation dataset. | Inference Engine | 
33 | The Intel® Movidius™ Myriad™ Vision Processing Unit plugin supports batch=1 only. | Inference Engine | Infer the batch of images one at a time, or use multiple Intel® Movidius™ Myriad™ Vision Processing Units.
34 | The Myriad plugin may fail to open a device when several processes try to do inference at the same time and several NCS devices are available. | Inference Engine | Use threads within the same process to utilize multiple devices.
35 | The setBatch method works only for topologies that have batch as the first dimension of all tensors. | Inference Engine | 
36 | Multiple OpenMP runtime initialization is possible if you are using MKL and the Inference Engine simultaneously. | Inference Engine | Use a preloaded iomp5 dynamic library.
37 | The completion callback is called only in case of successful execution of an infer request. | Inference Engine | Use Wait to get notified about errors in an infer request.

Included in This Release

The Intel® CV SDK is available in three versions:

  • Intel® Computer Vision SDK for Windows*
  • Intel® Computer Vision SDK for Linux*
  • Intel® Computer Vision SDK for Linux with FPGA Beta Support
Install Location/File Name | Description
Deep Learning Model Optimizer | Model optimization tool for your trained models.
Deep Learning Inference Engine | Unified API to integrate the inference with your application logic.
OpenCV* 3.4.1 library | OpenCV* community version compiled for Intel hardware. Includes PVL libraries for computer vision.
Intel® Media SDK libraries (open source version) | Eases the integration between the Intel® CV SDK and the Intel® Media SDK.
Intel OpenVX* 1.1 runtime | Directory containing the bundled OpenVX* runtime that supports CPU, Intel® Integrated Graphics, and IPU devices.
OpenCL NEO driver | Improves usability.
Intel® FPGA Deep Learning Acceleration Suite, including pre-compiled bitstreams | Implementations of the most common CNN topologies to enable image classification and ease the adoption of FPGAs for AI developers. Includes pre-compiled bitstream samples for the Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA and the Intel® Arria® 10 GX FPGA Development Kit.
Intel® FPGA SDK for OpenCL™ software technology | The Intel® FPGA RTE for OpenCL™ provides utilities, host runtime libraries, drivers, and RTE-specific libraries and files.
Intel® CV SDK documentation | SDK developer guides and other documentation. Available from the Intel® CV SDK product site.
Pre-trained Deep Learning Models | Ten pre-trained models (prototxt) and pre-generated Intermediate Representation files. You can use these for demonstrations, to help you learn the product, or for product development.
Computer Vision Samples | Samples that illustrate CV application creation for these SDK components: Inference Engine, OpenCV, and OpenVX.

Where to Download This Release

https://software.intel.com/en-us/cv-sdk/computer-vision-sdk/choose-download

System Requirements

Development Platform

Processors

6th-8th Generation Intel® Core™ & Xeon™ processor

Operating Systems:

  • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
  • CentOS* 7.4, 64-bit
  • Windows* 10, 64-bit

Target Platform (choose one processor with one corresponding operating system)

Your requirements may vary, depending on which product version you use.

CPU processors with corresponding operating systems

  • 6th-8th Generation Intel® Core™ & Xeon™ processor with operating system options:
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit
    • Windows* 10, 64-bit
  • Intel® Pentium® processor N4200/5, N3350/5, N3450/5 with Intel® HD Graphics
    • Yocto Project* Poky Jethro* v2.0.3, 64-bit

Intel® Integrated Graphics processors with corresponding operating systems (GEN Graphics)

NOTE: This installation requires drivers that are not included in the Intel® CV SDK package

  • 6th Generation Intel® Core™ processor with Intel® Iris® Pro graphics and Intel® HD Graphics
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit
  • 6th Generation Intel® Xeon® processor with Intel® Iris® Pro graphics and Intel® HD Graphics (excluding the E5 family, which does not include integrated graphics)
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit

FPGA processors with corresponding operating systems

NOTES:
  • Only for the Intel® Computer Vision SDK for Linux with FPGA Beta Support download.
  • OpenCV* and OpenVX* functions must be run against the CPU or Intel® Integrated Graphics to get all required drivers and tools.

  • Intel® Arria® 10 GX FPGA development kit
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit

Intel® Movidius™ Neural Compute Stick processor with corresponding operating systems

  • Intel® Movidius™ Neural Compute Stick
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit

Helpful Links

Note: Links open in a new window.

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/featured



Intel® Computer Vision SDK 2018 Overview


About the Intel® Computer Vision SDK  2018

This document describes the Intel® Computer Vision SDK, its key components, and its support on a CPU, FPGA, and Intel® Integrated Graphics. At the end of this document is an in-depth glossary of the terms and concepts used by the Intel® CV SDK. This document does not give you instructions about how to install or use the Intel® CV SDK.

The Intel® Computer Vision SDK (Intel® CV SDK) is a comprehensive toolkit that you can use to develop and deploy vision-oriented solutions on Intel platforms. Vision-oriented means the solutions use images or videos to perform specific tasks. A few of the solutions include autonomous vehicles, digital surveillance cameras, robotics, and mixed-reality headsets.

The Intel® CV SDK:

  • Enables CNN-based deep learning inference on the edge
  • Supports heterogeneous execution across Intel computer vision accelerators—CPU, Intel® Integrated Graphics, Intel® Movidius™ Myriad™ 2 Vision Processing Unit, and FPGA using a common API
  • Speeds time-to-market via an easy-to-use library of CV functions and pre-optimized kernels
  • Includes optimized calls for computer vision standards including OpenCV*, OpenCL™, and OpenVX*

The Intel® CV SDK includes the Deep Learning Deployment Toolkit, a product that includes the Model Optimizer and the Inference Engine. In addition to the Deep Learning Deployment Toolkit, the Intel® CV SDK adds components such as OpenCV*, OpenVX*, the Intel® Media SDK (open source version), pre-trained deep learning models, and computer vision samples.

 

Deep Learning Workflow

A simple deep learning workflow looks like this:

(Figure: Intel Computer Vision basic workflow)

A summary of the steps for optimizing and deploying a trained model:

  1. Configure the Model Optimizer for your framework.
  2. Convert a trained model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
  3. Test the model in the Intermediate Representation format using the Inference Engine in the target environment via provided Inference Engine validation application or sample applications.
  4. Integrate the Inference Engine into your application to deploy the model in the target environment. 

The next diagram shows how the Deep Learning Deployment Toolkit and the Intel® CV SDK fit into an end-to-end computer vision workflow. The dark blue boxes indicate parts of the Intel® CV SDK, including the Deep Learning Deployment Toolkit. The light blue text indicates Intel tools that you can use for this process. The Intel tools are not the only tools you can use.

(Figure: Computer vision and deep learning toolkit)

The next sections go into more detail about the Intel® CV SDK and Deep Learning Deployment Toolkit.

Supported Frameworks

Three frameworks are supported. 

Caffe*

Caffe is a popular open-source framework that was developed at UC Berkeley. It can be used on both Linux and Windows. This framework provides a way to switch from a CPU to Intel® Integrated Graphics by setting a flag on a device that includes Intel® Integrated Graphics. 

For more information about Caffe, see http://caffe.berkeleyvision.org/ 

To work with your Caffe model, see https://software.intel.com/en-us/articles/CVSDK-Using-Caffe

TensorFlow*

TensorFlow is a popular open-source framework that was developed by Google. It can be used on both Linux and Windows. 

For more information, see https://www.tensorflow.org/

The Model Optimizer accepts only a frozen TensorFlow* model as input. In a frozen model, all variables are converted to constants and training-related operations are removed. A detailed explanation of this topic is provided in the TensorFlow* documentation.
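
As a rough illustration of what "frozen" means, the following sketch uses the TensorFlow 1.x API to convert variables to constants and write the resulting .pb file. The checkpoint paths and the output node name are hypothetical placeholders; consult the TensorFlow documentation for the exact procedure for your model.

# A minimal sketch (TensorFlow* 1.x API assumed) of freezing a trained graph so
# that variables become constants. Paths and node names are placeholders.
import tensorflow as tf

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")

    # Convert every variable reachable from the listed output nodes into a constant.
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["output_node_name"])

    # Serialize the frozen GraphDef; a file like this is what the Model Optimizer expects.
    tf.train.write_graph(frozen_graph_def, ".", "frozen_model.pb", as_text=False)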

To work with your TensorFlow model, see https://software.intel.com/en-us/articles/CVSDK-Using-TensorFlow

MXNet*

MXNet is a popular open-source framework that was developed by Apache. It can be used on both Linux and Windows. This framework provides a way to switch from CPU to Intel® Integrated Graphics. 

For more information, see https://mxnet.apache.org/

To work with your MXNet model, see https://software.intel.com/en-us/articles/CVSDK-Using-MXNet

Model Optimizer

The Model Optimizer is the first of two key components of the Intel® CV SDK and the Deep Learning Deployment Toolkit. The Model Optimizer is a command-line tool for Windows* and Linux* that converts trained models into Intermediate Representation (IR) files that are required by the Inference Engine. In the optimization process, the Model Optimizer: 

  • Performs horizontal fusion of the network layers
  • Merges the network layers
  • Prunes unused branches in the network
  • Applies weight compression methods

The Model Optimizer has two main purposes:

  1. To produce a valid Intermediate Representation that the Inference Engine can use. The Model Optimizer's main responsibility is to produce two files that form the Intermediate Representation: an .xml that describes the network topology, and a .bin file that contains the weights and biases binary data.
  2. To produce an optimized Intermediate Representation. Pre-trained models can contain layers that are important for training but serve no purpose during inference and only increase inference time. Such layers can be removed from the Intermediate Representation, and several layers can be merged into a single mathematical operation performed by one layer. The Model Optimizer tries to recognize these patterns and merges layers, creating an Intermediate Representation with fewer layers than the original model and reducing the inference time (a small inspection sketch of the resulting files follows this list).
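
As a small illustration of the two files, the sketch below lists the layers recorded in the .xml topology file and reports the size of the .bin weights file. It assumes the .xml file contains <layer> elements with name and type attributes (verify against your own files); the file names are placeholders.

# A small, unofficial sketch of inspecting the Intermediate Representation files.
# Assumption: the .xml topology file uses <layer name="..." type="..."> elements.
import os
import xml.etree.ElementTree as ET

topology = ET.parse("model.xml").getroot()
for layer in topology.iter("layer"):
    print(layer.get("name"), "->", layer.get("type"))

# The .bin file is just the packed weights and biases referenced by the .xml file.
print("weights file size:", os.path.getsize("model.bin"), "bytes")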

How the Model Optimizer Works

NOTE: The Intel® CV SDK documentation discusses the Caffe, TensorFlow, and MXNet frameworks.

The Model Optimizer:

  • Loads a trained Caffe, TensorFlow, or MXNet model into memory. 
  • Reads the model
  • Builds an internal representation of the model
  • Optimizes the model
  • Produces Intermediate Representation files. The Intermediate Representation is the only format that is accepted by Inference Engine.

The Model Optimizer uses three stages to process a model:

Stage 1: Learning

  • Iteratively runs the networks on a set of input samples and collects network statistics on each layer. This allows the Model Optimizer to estimate the dynamic range of all layer activations, weights and biases. This is required only if the target data type differs from the original data type with which the network was trained.
  • Reports collected statistics for offline analysis. Statistics contain these metrics: min, max, standard deviation, mean, and percentiles (99%, 99.5%, 99.95%), as sketched after this list.
  • Builds an optimal configuration for the target precision network, and creates an inference network converted to run in the target data type.
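
A minimal numpy sketch of the per-layer statistics listed above is shown below. It only illustrates the metrics themselves; it is not the Model Optimizer's internal implementation, and the activation tensor is a random placeholder.

# Illustration only: computing the statistics named above for one layer's output.
import numpy as np

activations = np.random.randn(1, 64, 56, 56).astype(np.float32)  # placeholder layer output

stats = {
    "min": float(activations.min()),
    "max": float(activations.max()),
    "std": float(activations.std()),
    "mean": float(activations.mean()),
    "p99": float(np.percentile(activations, 99.0)),
    "p99.5": float(np.percentile(activations, 99.5)),
    "p99.95": float(np.percentile(activations, 99.95)),
}
print(stats)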

Stage 2: Feedback

  • Simulates the inference process to estimate potential accuracy loss. Each layer in the produced network has a bit accurate implementation for a specific Intel® platform, which simulates the mode of operation of the hardware and the required data type precision.
  • Reports the network performance in terms of accuracy and loss. These metrics are identical to those that would have been reported using a dedicated scoring API, such as OpenVX*, the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), and others.

Stage 3: Deployment: Outputs an Intermediate Representation of the network. The Intermediate Representation is a required  input to the Inference Engine. The Intermediate Representation consists of two files:

  • A topology file: a .xml file that describes the network topology.
  • A trained data file: a .bin file that contains the weights and biases binary data.

Layered Models

Frameworks and neural networks topologies include known layers, such as convolution, pooling, and activation. When the Model Optimizer loads a trained model, it looks through the topology for the framework that was used to train the model, trying to find each layer type in the list of known layers. This list is different for each framework. 

If the Model Optimizer does not find the layer, it looks for the layer in any custom layers that you provided. If the layer is still not found, you receive a failure message, the search stops, and the conversion to the Intermediate Representation fails.

To read the model and produce the Intermediate Representation, the Model Optimizer must be able to identify and use topology layers. If, like most users, your topology contains only supported layers, you do not need to perform any extra steps. However, if you use a topology that has layers that are not included in the list of supported layers, you need to take further action, depending on which framework you are using:

MXNet models with custom layers: The Model Optimizer fails. You have no options to work with the custom layers.

TensorFlow models with custom layers: You have three options:

  • Register the layers as Model Optimizer extensions, allowing the Model Optimizer to create a valid and optimized Intermediate Representation.
  • Use sub-graph replacement if you have some sub-graphs that should be expressed in the Intermediate Representation, and sub-models that should not be expressed.
  • Register model sub-graphs that can be offloaded to TensorFlow during inference. In this case the Intermediate Representation cannot be inferred with Intel® Integrated Graphics or an FPGA, and the Model Optimizer reflects each such sub-graph as a single custom layer in the Intermediate Representation. This is an experimental feature, intended only for development.

Caffe models with custom layers: You have three options:

  • Do nothing. If you have the Caffe Python interface installed, the Model Optimizer uses Caffe to generate the Intermediate Representation, using Caffe to calculate the shapes of the custom layers. These layers will not contain the original layer parameters in the Intermediate Representation file.
  • Register the layers to pass the original layer parameters to the Intermediate Representation by using CustomLayersMapping. This option also requires the Caffe Python interface.
  • Register the layers as Model Optimizer extensions. The Model Optimizer generates a valid and optimized Intermediate Representation.

To be successful with custom layers, it is important to understand how to do two things:

  1. How to map a sub-graph in a framework model to a sub-graph that consists of Inference Engine layers. For Caffe, the mapping is 1-to-1 between the Caffe layer and the Inference Engine layer.
  2. How to infer shapes for unknown sub-graphs. This inference can apply at the step when the internal representation still consists of framework-specific layers, or at the step when the internal representation already consists of Inference Engine layers.

You can use a framework fallback for unknown sub-graphs, in which case the original framework is used to infer the output shapes of the operations. This fallback does not apply when the framework is not available or should not be used.
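
As a generic illustration of what "inferring shapes" means (this is not the Model Optimizer extension API), the sketch below computes the output shape of one common layer type, a 2D convolution, from its parameters.

# Generic shape inference for a 2D convolution, for illustration only:
# output spatial size = floor((dimension + 2*padding - kernel) / stride) + 1.
def conv2d_output_shape(input_shape, out_channels, kernel, stride=1, padding=0):
    """input_shape is (N, C, H, W); returns the (N, C_out, H_out, W_out) shape."""
    n, _, h, w = input_shape
    h_out = (h + 2 * padding - kernel) // stride + 1
    w_out = (w + 2 * padding - kernel) // stride + 1
    return (n, out_channels, h_out, w_out)

# Example: a 3x3 convolution with stride 2 and padding 1 on a 224x224 input.
print(conv2d_output_shape((1, 3, 224, 224), out_channels=64, kernel=3, stride=2, padding=1))
# -> (1, 64, 112, 112)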

To see which topologies are supported with each framework, and to configure your framework to work with the Model Optimizer, see the Caffe*, TensorFlow*, and MXNet* guides.

Model Optimizer Directory Structure

|-- model_optimizer
    |-- extensions
        |-- front/caffe
            |-- CustomLayersMapping.xml.example
    |-- mo
        |-- back - Back-End logic: contains IR emitting logic
        |-- front - Front-End logic: contains matching between Framework-specific layers and IR specific, calculation
        of output shapes for each registered layer
        |-- graph - Graph utilities to work with internal IR representation
        |-- middle - Graph transformations - optimizations of the model
        |-- pipeline - Sequence of steps required to create IR for each framework
        |-- utils - Utility functions
    |-- tf_call_ie_layer - Sources for TensorFlow fallback in Inference Engine during model inference
    |-- mo.py - Centralized entry point that can be used for any supported framework
    |-- mo_caffe.py - Entry point particularly for Caffe
    |-- mo_mxnet.py - Entry point particularly for MXNet
    |-- mo_tf.py - Entry point particularly for TensorFlow
    |-- ModelOptimizer - Entry point

Inference Engine

The Inference Engine is the second of the two key components of the Intel® CV SDK.

The Inference Engine consumes the Intermediate Representation files produced by the Model Optimizer and provides an optimized C++ API for integrating inference into your application. The Inference Engine helps application execution with computational graph analysis, scheduling, and model compression.

The Inference Engine has:

  • A core library
  • Four hardware-specific libraries in addition to several third-party libraries
  • A plugin for Intel® Xeon® and Intel® Core™ processors with Intel® AVX2
  • A plugin for Intel Atom® processors ("CPU Plugin")
  • A plugin for Intel® Integrated Graphics
  • A plugin for the Intel® Arria® A10 GX Development Kit ("FPGA Plugin")
  • A plugin for the Intel® Movidius™ Myriad™ 2 Vision Processing Unit ("Myriad Plugin")

Inference Engine Workflow

A brief description of the Inference Engine workflow is:

  1. Use the model as input. The model is in the form of Intermediate Representation (IR) that was produced by Model Optimizer.
  2. Optimize the inference execution for target hardware.
  3. Deliver the inference solution to one or more embedded inference platforms.

To work with the Inference Engine or work with Inference Engine samples that are provided with the Intel® CV SDK, see the Inference Engine Developer Guide.

Supported Processor Types

CPU

You have three options for using the Intel® CV SDK: CPU, FPGA, and Intel® Integrated Graphics. The CPU when used alone is the slowest of these options

Intel® Integrated Graphics

Intel® Integrated Graphics is an electronic circuit that accelerates graphics processing. Intel® Integrated Graphics offloads compute-intensive parts of applications to lessen the work required by the CPU. This makes your applications run faster. Intel® Integrated Graphics is used with a CPU, not instead of the CPU.

You have three options for using the Intel® CV SDK: CPU, FPGA, and Intel® Integrated Graphics. Intel® Integrated Graphics is the second fastest of these options

FPGA

FPGA is an acronym for field programmable gate array. It is an integrated circuit that greatly increases processing speed. You have three options for using the Intel® CV SDK: CPU, Intel® Integrated Graphics, and FPGA. FPGA is the fastest of these options.

 

Terminology

To help you understand the components and concepts that the Intel® CV SDK uses, take a brief glance at this terminology:

Note: Links open in a new window.

Term | Description
API | Application programming interface. You use an API to tell your program how to communicate with an image or video device and which actions to perform based on what the device sees in the image or video.
Caffe*

Caffe is a popular open-source framework that was developed at UC Berkeley. It can be used on both Linux and Windows. This framework provides a way to switch from a CPU to Intel® Integrated Graphics by setting a flag on a device that includes Intel® Integrated Graphics. 

For more information, see http://caffe.berkeleyvision.org/ 

CNN | CNN is an acronym for convolutional neural network. CNNs are successful in identifying specific objects in images or videos.
CHW, NC, C

Tensor memory layout.

Example: the CHW value at index (c, h, w) is physically located at flat index (c * H + h) * W + w; the other layouts follow by analogy (see the short sketch after this table).

Computer vision | Uses computers to get information from digital images or videos to automate tasks, such as counting vehicles driving past an intersection.
CPU

Central processor unit. 

You have three options for using the Intel® CV SDK: CPU, FPGA, and Intel® Integrated Graphics. The CPU when used alone is the slowest of these options.

Deep learning, sometimes called DL | A type of machine learning that uses multi-layered neural networks, such as CNNs, to learn from data.
Framework

A framework is composed of libraries and software that you use to create, train, and deploy your computer vision model. For the Intel® CV SDK, a framework gives you a way to work with the images or videos from your digital camera or recorder.

Companies or individuals can provide custom frameworks to perform specific tasks by supporting specific programs, compilers, libraries, tools, and APIs. The Intel® CV SDK supports the Caffe*, TensorFlow*, and MXNet* frameworks.

FP16 format | Half-precision floating-point format
FP32 format | Single-precision floating-point format
FPGA

FPGA is an acronym for field programmable gate array. It is an integrated circuit that greatly increases processing speed. You have three options for using the Intel® CV SDK: CPU, Intel® Integrated Graphics, and FPGA. FPGA is the fastest of these options. 

For more information about the Intel® FPGA that is supported by the Intel® CV SDK, see https://www.altera.com/products/fpga/arria-series.html

Intel® Integrated Graphics

Graphics processing unit. Intel® Integrated Graphics is an electronic circuit that accelerates graphics processing. It offloads compute-intensive parts of applications to lessen the work required by the CPU, which makes your applications run faster. Intel® Integrated Graphics is used with a CPU, not instead of the CPU.

You have three options for using the Intel® CV SDK: CPU, FPGA, and Intel® Integrated Graphics. Intel® Integrated Graphics is the second fastest of these options.

I16 | 2-byte signed integer format
Inference Engine | The second of two key components of the Intel® CV SDK.
Intel® Movidius™ Myriad 2 Vision Processing Unit

The Intel® Movidius™ Myriad 2 Vision Processing Unit gives you immediate access to its advanced vision processing core, while allowing you to develop proprietary capabilities that provide true differentiation

For more information,see https://www.movidius.com/solutions/vision-processing-unit/

Intermediate Representation, sometimes referred to as IR | The Intermediate Representation consists of two files created by the Model Optimizer and used by the Inference Engine.
Model | A model uses data, equations, and instructions to make predictions. 
Model Optimizer | The first of two key components of the Intel® CV SDK. This is a command-line tool that converts trained models into Intermediate Representation (IR) files.
MXNet*

MXNet is a popular open-source framework that was developed by Apache. It can be used on both Linux and Windows. This framework provides a way to switch from CPU to Intel® Integrated Graphics

For more information, see https://mxnet.apache.org/

NCHW or NHWC

Image data layout. This refers to the representation of batches of images where:

  • N is the number of images in a batch
  • C is the channels
  • H is the number of pixels in the vertical dimension
  • W is the number of pixels in the horizontal dimension
OpenCL™

OpenCL is an acronym for Open Computing Language. Intel provides an implementation of OpenCL. OpenCL is useful in writing programs that run across different platforms, like CPUs and Intel® Integrated Graphics.  

For more information, see https://www.khronos.org/opencl/ and https://software.intel.com/en-us/intel-opencl

OpenCV*

OpenCV is an acronym for Open Source Computer Vision. OpenCV is an open-source library of programming functions that let you work with models that were created with specific frameworks, including Caffe. OpenCV works on both Windows and Linux.

For more information, see https://opencv.org/

OpenVX*

OpenVX is an open, royalty-free standard used to improve the performance of computer vision applications, especially for tasks such as video surveillance, facial recognition, and tracking bodies and gestures, among others. OpenVX creates nodes that are optimized for your hardware. It handles memory management and determines the best way to process your images.

For more information, see https://www.khronos.org/openvx/

Proto file | A proto file contains data and services and is compiled with protoc. The proto file is created in the format defined by the associated protocol buffer.
Protobuf

A protobuf is a library for protocol buffers. 

Protoc | A compiler that is used to generate code from proto files.
Protocol buffer | Data structures are saved in and communicated through protocol buffers. The primary purpose of protocol buffers is network communication. Protocol buffers are used because they are simple and fast.
TensorFlow*

TensorFlow is a popular open-source framework that was developed by Google. It can be used on both Linux and Windows. 

For more information, see https://www.tensorflow.org/

Training

Training means teaching your computer software to correctly identify what you told it to look at in the model.

Training takes place before you use the Intel® CV SDK. If you are using the Intel® CV SDK and decide you need to make changes and re-train your model, you will need to return to the application that you originally used to train your model, and then return to the Intel® CV SDK after you are done training.

U16 format | 2-byte unsigned integer format
U8 format | 1-byte unsigned integer format
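
A short numpy illustration of the layout terms in this table (see the CHW and NCHW/NHWC entries) is given below; the shapes and indices are arbitrary examples.

# Illustration only: the CHW flat-index formula and an NCHW -> NHWC transpose.
import numpy as np

N, C, H, W = 1, 3, 4, 5
nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

c, h, w = 2, 1, 3
flat_index = (c * H + h) * W + w                    # the CHW formula from the table
assert nchw[0].ravel()[flat_index] == nchw[0, c, h, w]

nhwc = nchw.transpose(0, 2, 3, 1)                   # NCHW -> NHWC
print(nchw.shape, "->", nhwc.shape)                 # (1, 3, 4, 5) -> (1, 4, 5, 3)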

Note: The * next to some terms indicates trademarks that belong to someone else. OpenCL and the OpenCL logo are trademarks of Apple Inc. used with permission from Khronos.

 

Helpful Links

Note: Links open in a new window.

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/featured

Model Optimizer Developer Guide: https://software.intel.com/en-us/articles/CVSDK-ModelOptimizer

Inference Engine Developer Guide: https://software.intel.com/en-us/articles/CVSDK-InferEngine

 


Intel® Computer Vision SDK API Guide


APIs are available in the offline release package that you installed: 

  • For the newest API:
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ where <INSTALL_DIR> is the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Integrating Inference Engine in Your Application from the contents.
  • The Intel CV SDK documents refer to APIs from previous releases as "legacy" API. It is best to stop using the legacy API since it will be removed in a future product release. To locate the legacy API:
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ under the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Integrating Inference Engine in Your Application (legacy API) from the contents.
  • The complete API documentation is also in the offline package documentation.
    1. Go to <INSTALL_DIR>/deployment_tools/documentation/ under the directory in which the Intel® CV SDK is installed.
    2. Open index.html in an Internet browser.
    3. Select Open Data Structures from the menu at the top of the screen.

Helpful Links

Note: Links open in a new window.

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/featured

 


Intel® Computer Vision SDK Command-Line Guide


For help with commands, type the command name followed by -h

 

 

Helpful Links

Note: Links open in a new window.

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/featured

 


Inference Engine Samples


Using Inference Engine Samples

The Inference Engine sample applications are simple console applications that demonstrate how to use Intel's Deep Learning Inference Engine in your applications.

Samples in the Samples Directory

The following sample applications are available in the samples directory in the Inference Engine installation directory:

Sample | Description
Image Classification Sample | Inference of image classification networks like AlexNet and GoogLeNet (the sample supports only images as inputs).
Image Classification Sample, pipelined | Maximizes performance via pipelined execution (the sample supports only images as inputs).
Security Barrier Camera Sample | Vehicle detection followed by vehicle attributes recognition.
Object Detection for Faster R-CNN Sample | Inference of object detection networks like Faster R-CNN (the sample supports only images as inputs).
Image Segmentation Sample | Inference of image segmentation networks like FCN8 (the sample supports only images as inputs).
Object Detection for SSD Demonstration, Async API Performance Showcase | Demonstration application for SSD-based object detection networks, new Async API performance showcase, and simple OpenCV interoperability (supports video and camera inputs).
Object Detection for SSD Sample | Inference of object detection networks based on SSD; this sample is a simplified version that supports only images as inputs.
Neural Style Transfer Sample | Style transfer sample (the sample supports only images as inputs).
Hello Infer Request Classification Sample | Inference of image classification networks via the Infer Request API (the sample supports only images as inputs).
Interactive Face Detection Sample | Face detection coupled with age-gender and head-pose estimation; supports video and camera inputs.
Security Barrier Camera Example | Supports images/video and camera inputs.
Validation Application | Infers a pack of images and reports the total accuracy (only images as inputs).

Samples That Support Pre-Trained Models Shipped With the Product

You are provided several pre-trained models. The table below shows the correlation between models and samples/devices.  The samples are available in <INSTALL_DIR>/deployment_tools/inference_engine/samples

Model | Supported Sample | CPU | Intel® Integrated Graphics | HETERO:FPGA,CPU | Intel® Movidius™ Myriad™ 2 VPU
face-detection-adas-0001 | Interactive Face Detection Sample | x | x |   | x
age-gender-recognition-retail-0013 | Interactive Face Detection Sample | x | x | x | x
head-pose-estimation-adas-0001 | Interactive Face Detection Sample | x | x | x | x
vehicle-license-plate-detection-barrier-0007 | Security Barrier Camera Sample | x | x | x | x
vehicle-attributes-recognition-barrier-0010 | Security Barrier Camera Sample | x | x | x | x
license-plate-recognition-barrier-0001 | Security Barrier Camera Sample | x | x | x | x
person-detection-retail-0001 | Object Detection Sample | x | x | x |  
person-detection-retail-00012 | Any sample that supports SSD-based models | x | x |   | x
face-detection-retail-0004 | Any sample that supports SSD-based models | x | x | x | x
person-vehicle-bike-detection-crossroad-0066 | Any sample that supports SSD-based models | x | x |   | x

Inferring Your Model with the Inference Engine Samples

Building the Sample Applications on Linux

Supported Linux build environment:

  • Ubuntu* 16.04 LTS 64-bit or CentOS* 7.4 64-bit
  • GCC* 5.4.0 (for Ubuntu* 16.04) or GCC* 4.8.5 (for CentOS* 7.4)
  • CMake* version 2.8 or higher.
  • OpenCV* 3.3 or later (required for some samples and demonstrations). Use the Intel® CV SDK installation download and instructions to complete this installation.

Follow these steps to prepare your Linux computer for the samples:

  1. Go to the samples directory: <INSTALL_DIR>/deployment_tools/inference_engine/samples/
  2. Create a directory. This example uses a directory named build
    mkdir build
  3. Go to the new directory:
    cd build
  4. Run CMake to generate the Make files with or without debug information:
    • Without debug information:
      cmake -DCMAKE_BUILD_TYPE=Release <path_to_inference_engine_samples_directory>
    • With debug information:
      cmake -DCMAKE_BUILD_TYPE=Debug <path_to_inference_engine_samples_directory>
  5. Build the application:
    make

The sample application binaries are in <INSTALL_DIR>/deployment_tools/inference_engine/samples/intel64/Release/

Building the Sample Applications on Windows*

Supported Windows build environment:

Follow these steps to prepare your Windows computer for the samples:

  1. Go to the samples directory.
  2. Double-click create_msvc_solution.bat
  3. Open Microsoft Visual Studio* 2015
  4. Build samples\build\Samples.sln

Set Your Environment Variables

Use these steps to make sure your application can find the Inference Engine libraries.

For Linux, execute the following command to set the environment variable:

source <INSTALL_DIR>/deployment_tools/inference_engine/bin/setupvars.sh

where <INSTALL_DIR> is the Intel CV SDK installation directory.

Running the Samples

Image Classification Sample

Description

The Image Classification sample application does inference using image classification networks, like AlexNet* and GoogLeNet*. The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image and outputs data to the standard output stream.

Running the Application

Running the application with the -h option results in the message:

$ ./classification_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
classification_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or path to an image files: a .ubyte file for LeNet*
                            and a .bmp file for the other networks.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics, or MYRIAD is acceptable. Sample will look for a suitable plugin for device specified
    -nt "<integer>"         
                            Number of top results (default 10)
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image using a trained AlexNet network on Intel® Processors:

$ ./classification_sample -i <path_to_image>/cat.bmp -m <path_to_model>/alexnet_fp32.xml

Output Description

By default, the application outputs the top-10 inference results. Add the -nt option to the previous command to modify the number of top output results. For example, to get the top-5 results on Intel® HD Graphics, use the command:

$ ./classification_sample -i <path_to_image>/cat.bmp -m <path_to_model>/alexnet_fp32.xml -nt 5 -d GPU

Image Classification - Pipelined

Description

This sample demonstrates how to build and execute inference in pipelined mode, using classification networks as an example.

The pipelined mode can increase picture throughput. The latency of a single inference is the same as for synchronous execution, but throughput increases for the following reasons:

  • Some plugins are heterogeneous internally: data transfer, execution on a remote device, and pre- and post-processing on the host can overlap.
  • An explicit heterogeneous plugin can execute different parts of the network on different devices.

When two or more devices are involved in the inference of one picture, creating several infer requests and starting asynchronous inference allows the devices to be utilized in the most efficient way. If two devices are involved in the execution, the most optimal value for -nireq is 2.

To do this efficiently, the Classification Sample Async uses a round-robin algorithm for inference requests: it starts the current inference request and then waits for the results of the previous one. After the wait finishes, the application switches inference requests and repeats the procedure.

Another aspect required for good throughput is the number of iterations. Only with a large number of iterations can you emulate real application work and see meaningful performance results.

Batch mode is independent of the pipelined mode. The pipelined mode works efficiently with any batch size.

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. It then creates the number of infer requests specified by the -nireq parameter and loads pictures for inference.

In a loop, the application starts inference for the current infer request and switches to waiting for another one. When the results are ready, the infer requests are swapped.

When inference is done, the application outputs data to the standard output stream.
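
As a rough illustration (not the sample's source), the round-robin scheme described above can be sketched with the request API shown later in this guide; the function and variable names here are placeholders:

#include <inference_engine.hpp>
#include <vector>

using namespace InferenceEngine;

// N infer requests (the -nireq value) are started asynchronously; while request k
// runs, the host waits on the request started on the previous iteration.
void run_pipelined(ExecutableNetwork &executable, int nireq, int iterations) {
    std::vector<InferRequest> requests;
    for (int i = 0; i < nireq; ++i)
        requests.push_back(executable.CreateInferRequest());

    int current = 0;                        // request started this iteration
    int previous = -1;                      // request started on the previous iteration
    for (int it = 0; it < iterations; ++it) {
        // ... populate the input blob of requests[current] with the next picture ...
        requests[current].StartAsync();     // returns immediately
        if (previous >= 0)
            requests[previous].Wait(IInferRequest::WaitMode::RESULT_READY);
        previous = current;
        current = (current + 1) % nireq;    // switch to the next request (round robin)
    }
    if (previous >= 0)                      // drain the last in-flight request
        requests[previous].Wait(IInferRequest::WaitMode::RESULT_READY);
}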

Running the Application

Running the application with the -h option results in the message:

./classification_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
classification_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or path to an image files: a .ubyte file for LeNet
                            and a .bmp file for the other networks.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"
                            Optional. Absolute path to library with Intel® MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics, or MYRIAD is acceptable. The sample looks for a suitable plugin for the specified device.
    -nt "<integer>"         
                            Number of top results (default 10)
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image using a trained AlexNet network on FPGA with a fallback to Intel® Processors:

$ ./classification_sample_async -i <path_to_image>/cat.bmp -m <path_to_model>/alexnet_fp32.xml -nt 5 -d HETERO:FPGA,CPU -nireq 2 -ni 200

Output Description

By default, the application outputs the top-10 inference results for each infer request. In addition, it reports a throughput value measured in frames per second.


Security Barrier Camera Sample

Description

Showcases Vehicle Detection, followed by Vehicle Attributes recognition and License Plate Recognition applied on top of the Vehicle Detection results. The corresponding pre-trained models are in the intel_models directory:

  • vehicle-license-plate-detection-barrier-0007: The primary detection network to find the vehicles and licence-plate
  • vehicle-attributes-recognition-barrier-0010: Executed on top of the results from vehicle-license-plate-detection-barrier-0007. The vehicle attributes execution barrier reports the general vehicle attributes, like the vehicle type and color, where type is something like car, van, or bus.
  • license-plate-recognition-barrier-0001: Executed on top of the results from vehicle-license-plate-detection-barrier-0007. The license plate recognition barrier network reports a string for each recognized license plate. For topology details, see the descriptions in the intel_models directory.

Other demonstration objectives:

  • Show images/video/camera as inputs, via OpenCV*
  • Show an example of simple network pipelining: Attributes and LPR networks are executed on top of the Vehicle Detection results
  • Show vehicle attributes and licence plate information for each detected vehicle

How it Works

The application reads command line parameters and loads the specified networks. The Vehicle/License-Plate Detection network is required, and the other two are optional.

Upon getting a frame from OpenCV's VideoCapture, the application performs inference with the Vehicle/License-Plate Detection network, then performs two more inferences using the Vehicle Attributes and LPR networks (if they were specified on the command line), and displays the results.
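
A simplified sketch of this pipelining (placeholders for blob population and output parsing; not the demonstration's actual code):

#include <inference_engine.hpp>
#include <opencv2/opencv.hpp>
#include <vector>

using namespace InferenceEngine;

void process_frame(const cv::Mat &frame,
                   InferRequest &vehicleDetection,
                   InferRequest *attributes,   // nullptr if -m_va was not given
                   InferRequest *lpr) {        // nullptr if -m_lpr was not given
    // 1. Run the Vehicle/License-Plate Detection network on the whole frame
    // ... copy 'frame' into the detection input blob ...
    vehicleDetection.Infer();
    // ... parse the detection output into vehicle / plate rectangles ...
    std::vector<cv::Rect> vehicles;           // filled from the detection output

    for (const cv::Rect &roi : vehicles) {
        cv::Mat crop = frame(roi);            // region fed to the secondary networks
        if (attributes) {
            // 2. Vehicle Attributes on top of the detection result
            // ... copy 'crop' into the attributes input blob ...
            attributes->Infer();
        }
        if (lpr) {
            // 3. License Plate Recognition on top of the detection result
            // ... copy the plate region into the LPR input blob ...
            lpr->Infer();
        }
    }
    // 4. Render rectangles, attributes, and plate strings on the frame and display it
}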

Running the Application

Running the application with the -h option results in the message:

$ ./security_barrier_sample -h 
InferenceEngine:
        API version ............ 1.0
    [ INFO ] Parsing input parameters
    interactive_vehicle_detection [OPTION]
    Options:
        -h                         Print a usage message.
        -i "<path>"                Required. Path to a video or image file. Default value is "cam" to work with camera.
        -m "<path>"                Required. Path to the Vehicle/License-Plate Detection model (.xml) file.
        -m_va "<path>"             Optional. Path to the Vehicle Attributes model (.xml) file.
        -m_lpr "<path>"            Optional. Path to the License-Plate Recognition model (.xml) file.
          -l "<absolute_path>"     For Intel® MKL-DNN (CPU)-targeted custom layers, if any. Absolute path to a shared library with the kernels impl.
              Or
          -c "<absolute_path>"     For Intel® Integrated Graphics-targeted custom kernels, if any. Absolute path to the xml file with the kernels desc.
        -d "<device>"              Specify the target device for Vehicle Detection (CPU, Intel® Integrated Graphics, FPGA, MYRYAD, or HETERO).
        -d_va "<device>"           Specify the target device for Vehicle Attributes (CPU, Intel® Integrated Graphics, FPGA, MYRYAD, or HETERO).
        -d_lpr "<device>"          Specify the target device for License Plate Recognition (CPU, Intel® Integrated Graphics, FPGA, MYRYAD, or HETERO).
        -pc                        Enables per-layer performance statistics.
        -r                         Output Inference results as raw values.
        -t                         Probability threshold for Vehicle/Licence-Plate detections.

Running the application with an empty list of options results in an error message and the usage list above.

Demonstration Output

The demonstration uses OpenCV* to display the resulting frame with detections rendered as bounding boxes and text:

(Image: a driving automobile with detections rendered as bounding boxes and text.)


Object Detection for Faster R-CNN Sample

Description

VGG16-Faster-RCNN is a public CNN that can be easily obtained from GitHub. 

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image and outputs data to the standard output stream.

Downloading and Converting a Caffe* Model

  1. Download test.prototxt from https://raw.githubusercontent.com/rbgirshick/py-faster-rcnn/master/models/pascal_voc/VGG16/faster_rcnn_end2end/test.prototxt
  2. Download the pretrained models from https://dl.dropboxusercontent.com/s/o6ii098bu51d139/faster_rcnn_models.tgz?dl=0
  3. Unpack the archive and make sure you have the file named VGG16_faster_rcnn_final.caffemodel.

To convert the source model correctly, run the Model Optimizer with the extension for the Python proposal layer:

python3 ${MO_ROOT_PATH}/mo_caffe.py --input_model <path_to_model>/VGG16_faster_rcnn_final.caffemodel --input_proto <path_to_model>/deploy.prototxt --extensions <path_to_object_detection_sample>/fasterrcnn_extensions

Running the Application

Running the application with the -h option results in the message:

$ ./object_detection_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
object_detection_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path>"
                            Required. Path to an image file.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"    
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU or Intel® Integrated Graphics is acceptable. The sample looks for a suitable plugin for the device specified
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

Use the following command to do inference on Intel® Processors on an image using a trained Faster R-CNN network:

$ ./object_detection_sample -i <path_to_image>/inputImage.bmp -m <path_to_model>/faster-rcnn.xml -d CPU

Output Description

The application outputs an image named out_0.bmp with detected objects enclosed in rectangles. It outputs the list of classes of the detected objects along with the respective confidence values and the coordinates of the rectangles to the standard output stream.

Using this Sample with the Intel Person Detection Model

This model has a non-default (for Faster-RCNN) output layer name. To score it correctly, add the option --bbox_name detector/bbox/ave_pred to the command line.

Usage example:

./object_detection_sample -i /home/user/people.jpg -m <ie_path>/intel_models/person-detection-retail-0001/FP32/person-detection-retail-0001.xml --bbox_name detector/bbox/ave_pred -d CPU

Object Detection SSD, Async API Performance Showcase Sample

Description

This demonstration showcases Object Detection with SSD and the new Async API. Using the Async API can improve the overall frame rate of the application: rather than waiting for inference to complete, the application can continue working on the host while the accelerator is busy. Specifically, this demonstration keeps two parallel infer requests; while the current one is processed, the input frame for the next one is being captured. This essentially hides the latency of capturing, so the overall frame rate is determined by MAX(detection time, input capturing time) rather than by SUM(detection time, input capturing time).

The technique can be generalized to any available parallel slack, such as doing inference while simultaneously encoding the resulting (previous) frames, or running further inference, like emotion detection on top of the face detection results.

Be aware of performance caveats, though. When running tasks in parallel, avoid over-using shared compute resources. For example, if you perform inference on the FPGA with a mostly idle CPU, run parallel tasks on the CPU. When doing inference on Intel® Integrated Graphics, there is little to gain from, for example, encoding the resulting video on the same device in parallel, because the device is already busy.

For more performance implications and tips for the Async API, see the Optimization Guide

Other demonstration objectives:

  • Video as input support via OpenCV*
  • Visualization of the resulting bounding boxes and text labels (from the .labels file) or class number (if no file is provided)
  • OpenCV* provides resulting bounding boxes, labels, and other information. You can copy and paste this code without pulling Inference Engine samples helpers into your application.
  • Demonstrate the Async API in action. For this, the demonstration features two modes with a Tab key toggle.
    • Old-style "Sync" way - The frame capturing with OpenCV* executes back-to-back with Detection
    • "Truly Async" way - The Detection is performed on the current frame, while the OpenCV* captures the next frame.

How it Works

The application reads command line parameters and loads a network to the Inference Engine. Upon getting a frame from OpenCV's VideoCapture, it performs inference and displays the results.

New "Async API" operates with new notion of the "Infer Request" that encapsulates the inputs/outputs and separates scheduling and waiting for result, next section. And here what makes the performance look different:

  1. In the default ("Sync") mode the frame is captured and then immediately processed, below in pseudo-code:
    while(true) {
        capture frame
        populate CURRENT InferRequest
        start CURRENT InferRequest //this call is async and returns immediately
        wait for the CURRENT InferRequest
        display CURRENT result
    }
    This is a reference implementation in which the new Async API is used in a serialized/synchronous fashion.
  2. In "true" ASync mode, the frame is captured and then immediately processed:
    while(true) {
            capture frame
            populate NEXT InferRequest
            start NEXT InferRequest //this call is async and returns immediately
                wait for the CURRENT InferRequest (processed in a dedicated thread)
                display CURRENT result
            swap CURRENT and NEXT InferRequests
        }
    In this case, the NEXT request is populated in the main (app) thread while the CURRENT request is processed in a dedicated thread internal to the Inference Engine runtime.
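
A hedged sketch of this loop, using OpenCV* capture and two infer requests (names and setup are illustrative, not the demonstration's actual code):

#include <inference_engine.hpp>
#include <opencv2/opencv.hpp>
#include <utility>

using namespace InferenceEngine;

void run_truly_async(ExecutableNetwork &executable, cv::VideoCapture &cap) {
    InferRequest current = executable.CreateInferRequest();
    InferRequest next    = executable.CreateInferRequest();

    cv::Mat frame;
    cap.read(frame);
    // ... populate CURRENT with the first frame ...
    current.StartAsync();

    while (cap.read(frame)) {               // capture the NEXT frame on the host
        // ... populate NEXT with 'frame' ...
        next.StartAsync();                  // returns immediately
        current.Wait(IInferRequest::WaitMode::RESULT_READY);
        // ... display the CURRENT result ...
        std::swap(current, next);           // swap CURRENT and NEXT requests
    }
    current.Wait(IInferRequest::WaitMode::RESULT_READY);
}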

Async API

In this release, the Inference Engine offers a new API based on the notion of Infer Requests. With this API, requests encapsulate input and output allocation. You access the blob with the GetBlob method.

You can execute a request asynchronously in the background and wait until you need the result. In the meantime your application can continue:

// load plugin for the device as usual
auto enginePtr = PluginDispatcher({"../../../lib/intel64", ""}).getSuitablePlugin(
              getDeviceFromStr("GPU"));
// load network
CNNNetReader network_reader;
network_reader.ReadNetwork("Model.xml");
network_reader.ReadWeights("Model.bin");
// create an executable network and an infer request for it (shown here for completeness)
InferencePlugin plugin(enginePtr);
auto executable_network = plugin.LoadNetwork(network_reader.getNetwork(), {});
auto async_infer_request = executable_network.CreateInferRequest();
// populate inputs etc
auto input = async_infer_request.GetBlob(input_name);
...
// start the async infer request (puts the request into the queue and returns immediately)
async_infer_request.StartAsync();
// Continue execution on the host until you need the request results
//...
async_infer_request.Wait(IInferRequest::WaitMode::RESULT_READY);
auto output = async_infer_request.GetBlob(output_name);

There is no direct way to measure the execution time of an infer request that is running asynchronously, unless you measure the Wait executed immediately after the StartAsync, but that essentially means serialization and synchronous execution.

This is what the sample does in the default "SYNC" mode and reports as the Detection time/fps message on the screen. In the truly asynchronous ("ASYNC") mode, the host continues execution in the master thread in parallel to the infer request. If the request completes before Wait is called in the main thread (i.e., earlier than OpenCV* decoded a new frame), reporting the time between StartAsync and Wait would obviously be incorrect. That is why the inference speed is not reported in the "ASYNC" mode.
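
For illustration, the serialized measurement described above amounts to timing a StartAsync/Wait pair; the helper below is a sketch, not part of the sample:

#include <chrono>
#include <inference_engine.hpp>

using namespace InferenceEngine;

// Times one serialized StartAsync + Wait pair, as in the sample's "SYNC" report.
double measure_detection_ms(InferRequest &request) {
    auto t0 = std::chrono::high_resolution_clock::now();
    request.StartAsync();
    request.Wait(IInferRequest::WaitMode::RESULT_READY);   // blocks until the result is ready
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
// In the true "ASYNC" mode the host does other work between StartAsync and Wait,
// so this interval no longer reflects the inference time and is not reported.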

For more information about the new, request-based Inference Engine API, including ASYNC execution, see the documentation on integrating the new request API into a customer application.

Running the Application

Running the application with the -h option results in the message:

$ ./object_detection_demo_ssd_async -h
InferenceEngine: 
    API version ............ [version]
    Build .................. 
object_detection_demo_ssd_async [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "[path]"
                            Required. Path to a video file. Use "cam" to capture input from the camera.
    -m "[path]"             
                            Required. Path to an .xml file with a trained model.
        -l "[absolute_path]"    
                            Optional. Absolute path to library with Intel® MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "[absolute_path]"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -d "[device]"
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics, FPGA, and Intel® Movidius™ Myriad™ 2 Vision Processing Unit are accepted.
    -pc
                            Enables per-layer performance report.
    -t
                            Probability threshold for detections (default is 0.5).
    -r
                            Output inference results as raw values to the console.

Running the application with an empty list of options results in an error message and the usage list above.

Use the following command to do inference on Intel® Integrated Graphics with an example pre-trained GoogLeNet-based SSD* available at https://software.intel.com/file/609199/download

Command Description

After reading through this demonstration, use this command to perform inference on Intel® Integrated Graphics with the SSD* you downloaded from https://software.intel.com/file/609199/download

$ ./object_detection_demo_ssd_async -i <path_to_video>/inputVideo.mp4 -m <path_to_model>/ssd.xml -d GPU

The network must be converted from the Caffe* format (*.prototxt + *.caffemodel) to the Inference Engine format (*.xml + *.bin) before using this command. See the Model Optimizer Developer Guide.

The only GUI knob is using 'Tab' to switch between the synchronized execution and the true Async mode.

Output Description

The output uses OpenCV* to display the resulting frame with detections rendered as bounding boxes and labels, if provided. In default mode, the sample reports:

  • OpenCV* time: Frame decoding + time to render the bounding boxes, labels, and display of the results.
  • Detection time: Inference time for the object detection network. This is reported in SYNC mode.
  • Wallclock time: The combined application-level performance.

Object Detection with SSD-VGG Sample

Description

How to run the Object Detection sample application, which does inference using object detection networks like SSD-VGG on Intel® Processors and Intel® HD Graphics.

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image and outputs data to the standard output stream.

Running the Application

Running the application with the -h option results in the message:

$./object_detection_sample_ssd -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
object_detection_sample_ssd [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path>"
                            Required. Path to an image file.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"    
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU, Intel® Integrated Graphics or MYRIAD is acceptable. The sample looks for a suitable plugin for the specified device.
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

Use the following command to do inference on Intel® Processors on an image using a trained SSD network:

$ ./object_detection_sample_ssd -i <path_to_image>/inputImage.bmp -m <path_to_model>/VGG_ILSVRC2016_SSD.xml -d CPU

Output Description

The application outputs an image named out_0.bmp with detected objects enclosed in rectangles. It outputs the list of classes of the detected objects along with the respective confidence values and the coordinates of the rectangles to the standard output stream.


Neural Style Transfer Sample

Description

How to build and run the Neural Style Transfer sample (NST sample) application, which does inference using models of style transfer topology.

Running the Application

Running the application with the -h option results in the message:

$ ./style_transfer_sample --h
InferenceEngine:
    API version ............ <version>
    Build .................. <number>
style_transfer_sample [OPTION]
Options:
    -h
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or path to an image files: a .ubyte file for LeNet
                            and a .bmp file for the other networks.
    -m "<path>"
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"
                            Path to a plugin directory.
    -p "<name>"
                            Plugin name. For example, Intel® MKL-DNN. If this parameter is specified, the sample looks for this plugin only.
    -d "<device>"
                            Specify the target device to infer on; CPU or Intel® Integrated Graphics is acceptable. The sample looks for a suitable plugin for the specified device.
    -nt "<integer>"
                            Number of top results (default 10)
    -ni "<integer>"
                            Number of iterations (default 1)
    -pc
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image using a trained NST network model on Intel® Processors, use the following command:

$ ./style_transfer_sample -i <path_to_image>/cat.bmp -m <path_to_model>/1_decoder_FP32.xml

Output Description

The application outputs one or more styled images, starting with one named out1.bmp, redrawn in the style of the model used for inference. The style of the output images depends on the model used for the sample.


Hello Infer Request Classification

Description

How to run the Hello Infer Request Classification sample application. The sample is a simplified version of the Image Classification sample and demonstrates how to use the new Infer Request API of the Inference Engine in applications. See Integrate with customer application New Request API for details.

Running the Application

To do inference on an image using a trained AlexNet network on Intel® Processors:

$ ./hello_request_classification <path_to_model>/alexnet_fp32.xml <path_to_image>/cat.bmp CPU

Output Description

The top-10 inference results

Interactive Face Detection

Description

Showcases the Object Detection task applied to face recognition using a sequence of neural networks. The Async API can improve the overall frame rate of the application because the application can continue operating while the accelerator is busy. This demonstration maintains two parallel inference requests for the Age Gender and Head Pose detection networks, which run simultaneously.

Other demonstration objectives:

  • Video as input support via OpenCV*.
  • Visualization of the resulting face bounding boxes from Face Detection network.
  • Visualization of age gender and head pose information for each detected face.
  • The OpenCV* provides resulting bounding boxes, labels, and other information. You can copy and paste this code without pulling Inference Engine sample helpers into your application.

How it Works

  1. The application loads up to three networks, depending on the -d option.
  2. The application gets a frame from OpenCV's video capture.
  3. The application performs inference with the Face Detection network.
  4. The application performs two simultaneous inferences, using the Age Gender and Head Pose detection networks, if these are specified on the command line.
  5. The application displays the results.

The new Async API operates with the new notion of the Infer Request, which encapsulates the inputs/outputs and separates scheduling from waiting for the result. This changes the performance as follows:

In the default mode (Sync mode), the frame is captured and immediately processed:

while(true) {
    capture frame
    populate FaceDetection InferRequest
    wait for the FaceDetection InferRequest
    populate AgeGender InferRequest using dyn batch technique
    populate HeadPose InferRequest using dyn batch technique
    wait AgeGender
    wait HeadPose
    display detection results
}
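
A minimal sketch of the parallel step mentioned above, assuming two already-created infer requests (the request objects and blob population are placeholders):

#include <inference_engine.hpp>

using namespace InferenceEngine;

// The Age Gender and Head Pose requests are started asynchronously and then both
// are waited on, so the two networks run simultaneously.
void infer_face_attributes(InferRequest &ageGender, InferRequest &headPose,
                           bool haveAgeGender, bool haveHeadPose) {
    // ... populate both requests with the detected face crops ...
    if (haveAgeGender) ageGender.StartAsync();   // both requests run in parallel
    if (haveHeadPose)  headPose.StartAsync();
    if (haveAgeGender) ageGender.Wait(IInferRequest::WaitMode::RESULT_READY);
    if (haveHeadPose)  headPose.Wait(IInferRequest::WaitMode::RESULT_READY);
    // ... read the age/gender and head-pose outputs and attach them to each face ...
}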

Running the Application

Running the application with the -h option results in the message:

$ ./interactive_face_detection -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
interactive_face_detection [OPTION]
Options:
    -h                         Print a usage message.
    -i "<path>"                Optional. Path to an video file. Default value is "cam" to work with camera.
    -m "<path>"                Required. Path to an .xml file with a trained face detection model.
    -m_ag "<path>"             Optional. Path to an .xml file with a trained age gender model.
    -m_hp "<path>"             Optional. Path to an .xml file with a trained head pose model.
      -l "<absolute_path>"     Required for Intel® MKL-DNN (CPU)-targeted custom layers.Absolute path to a shared library with the kernels impl.
          Or
      -c "<absolute_path>"     Required for Intel® Integrated Graphics-targeted custom kernels.Absolute path to the xml file with the kernels desc.
    -d "<device>"              Specify the target device for Face Detection (CPU, Intel® Integrated Graphics, FPGA, or MYRYAD. The sample looks for a suitable plugin for the specified device.
    -d_ag "<device>"           Specify the target device for Age Gender Detection (CPU, Intel® Integrated Graphics, FPGA, or MYRYAD. The sample looks for a suitable plugin for the specified device.
    -d_hp "<device>"           Specify the target device for Head Pose Detection (CPU, Intel® Integrated Graphics, FPGA, or MYRYAD. The sample looks for a suitable plugin for the specified device.
    -pc                        Enables per-layer performance report.
    -r                         Inference results as raw values.
    -t                         Probability threshold for detections.

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on Intel® Integrated Graphics with an example pre-trained GoogLeNet-based SSD*:

./object_detection_demo_ssd_async -i <path_to_video>/inputVideo.mp4 -m <path_to_model>/ssd.xml -d GPU

Before using this command, use the Model Optimizer to convert the network from the Caffe* format (*.prototxt + *.caffemodel) to the Inference Engine format (*.xml + *.bin).

Demonstration Output

The demonstration uses OpenCV* to display the resulting frame with detections that are rendered as bounding boxes. Labels are included if available. In default mode, the sample reports:

  • OpenCV* time: frame decoding + time to render the bounding boxes, labels, and displaying the results
  • Face detection time: inference time for the face Detection network
  • Age Gender + Head Pose time: combined inference time of simultaneously executed age gender and head pose networks

Image Segmentation Sample

Description

How to run the Image Segmentation sample application, which does inference using image segmentation networks like FCN8.

The sample application reads command line parameters and loads a network and an image to the Inference Engine plugin. When inference is done, the application creates an output image.

Running the Application

Running the application with the -h option results in the message:

$ ./segmentation_sample -h
InferenceEngine: 
    API version ............ <version>
    Build .................. <number>
segmentation_sample [OPTION]
Options:
    -h                      
                            Print a usage message.
    -i "<path1>""<path3>"
                            Required. Path to a directory with images or path to an image files: a .ubyte file for LeNet
                            and a .bmp file for the other networks.
    -m "<path>"             
                            Required. Path to an .xml file with a trained model.
        -l "<absolute_path>"    
                            Optional. Absolute path to library with MKL-DNN (CPU) custom layers (*.so).
        Or
        -c "<absolute_path>"
                            Optional. Absolute path to Intel® Integrated Graphics custom layers config (*.xml).
    -pp "<path>"            
                            Path to a plugin directory.
    -d "<device>"           
                            Specify the target device to infer on; CPU or Intel® Integrated Graphics is acceptable. The sample looks for a suitable plugin for the specified device.
    -ni "<integer>"         
                            Number of iterations (default 1)
    -pc                     
                            Enables per-layer performance report

Running the application with an empty list of options results in an error message and the usage list above.

To do inference on an image on Intel® Processors using a trained FCN8 network:

$ ./segmentation_sample -i <path_to_image>/inputImage.bmp -m <path_to_model>/fcn8.xml

Output Description

The application outputs a segmented image named out.bmp.

 

Using the Validation Application to Check Accuracy on a Dataset

The Inference Engine Validation application lets you score common topologies with standard inputs and outputs configuration. These topologies include AlexNet and SSD. The Validation application allows the user to collect simple validation metrics for the topologies. It supports Top-1/Top-5 counting for classification networks and 11-points mAP calculation for object detection networks.

Possible Validation application uses:

  • Check if Inference Engine scores the public topologies well
  • Verify whether the user's custom topology is compatible with the default input/output configuration and compare its accuracy with that of the public topologies
  • Using Validation application as another sample: although the code is much more complex than in classification and object detection samples, it's still open and could be re-used

The application loads a network to the Inference Engine plugin. Then:

  1. The application reads the validation set (the -i option):
    • If -i specifies a directory. The application tries to load labels first. To do so, the application searches for a file with the same base name as the model, but with a .labels extension. The application then searches the specified directory and adds all images from sub-directories whose names are equal to a known label to the validation set. If there are no sub-directories whose names are equal to known labels, the validation set is considered empty.
    • If -i specifies a .txt file. The application reads the .txt file, considering every line that has the format: <relative_path_from_txt_to_img> <ID>, where ID is the image number that the network should classify.
  2. The application reads the number of images specified by -b and loads the images to the plugin. When all images are loaded, the plugin does inference and the Validation application collects the statistics.

NOTE: Image load time is not part of the inference time reported by the application.

Optionally, use the --dump option to retrieve the inference results. This option creates an inference report named dumpfileXXXX.csv with the following semicolon-separated values:

  • Image_path
  • Flag representing correctness of prediction
  • ID of the Top-1 class
  • Probability that the image belongs to the Top-1 class
  • ID of the Top-2 class
  • Probability that the image belongs to the Top-2 class, and so on for the Top-x class, where x is an integer

CLI Options

Usage: validation_app [OPTION]
Available options:
    -h                        Print a usage message
    -t [type]                 Type of the network being scored ("C" by default)
      -t "C" for classification
      -t "OD" for object detection
    -i [path]                 Required. Directory with validation images (directories grouped by labels) or a .txt file list for classification networks, or a VOC-formatted dataset for object detection networks
    -m [path]                 Required. Path to an .xml file with a trained model
    -l [absolute_path]        Required for Intel® MKL-DNN (CPU)-targeted custom layers.Absolute path to a shared library with the kernel implementations
    -c [absolute_path]        Required for Intel® Integrated Graphics-targeted custom kernels.Absolute path to the xml file with the kernel descriptions
    -d [device]               Specify the target device to infer on; CPU, Intel® Integrated Graphics, FPGA or MYRIAD is acceptable. The sample looks for a suitable plugin for the specified device. The plugin is CPU by default.
    -b N                      Batch size value. If not specified, the batch size value is determined from IR
    -ppType [type]            Preprocessing type. One of "None", "Resize", "ResizeCrop"
    -ppSize N                 Preprocessing size (used with ppType="ResizeCrop")
    -ppWidth W                Preprocessing width (overrides -ppSize, used with ppType="ResizeCrop")
    -ppHeight H               Preprocessing height (overrides -ppSize, used with ppType="ResizeCrop")
    --dump                    Dump filenames and inference results to a csv file

    Classification-specific options:
      -Czb true               "Zero is a background" flag. Some networks are trained with a modified dataset where the class IDs are enumerated from 1, but 0 is an undefined "background" class (which is never detected)

    Object detection-specific options:
      -ODkind [kind]          Kind of an object detection network: SSD
      -ODa [path]             Required for OD networks. Path to the directory containing .xml annotations for images
      -ODc [path]             Required for OD networks. Path to the file containing the classes list
      -ODsubdir [name]        Directory between the image path (-i) and the image name specified in the .xml annotations. Use JPEGImages for VOC2007

Option Categories

  • Common options are usually named with a single letter or word, such as -b or --dump. These options have the same meaning in all validation_app modes.
  • Network type-specific options are named as an acronym of the network type (such as C or OD), followed by a letter or word addendum. These options are specific to the network type. For instance, ODa makes sense only for an object detection network.

The next section shows how to use the Validation application in classification mode to score a classification CNN on a pack of images.

Running the Application in Classification Mode

This section demonstrates how to run the Validation application in classification mode to score a classification CNN on a pack of images.

To do inference of a chosen pack of images:

$ ./validation_app -t C -i <path to images main directory or .txt file> -m <model to use for classification> -d <CPU|GPU>

Source dataset format: directories as classes

A correct list of files looks similar to:

<path>/dataset
    /apron
        /apron1.bmp
        /apron2.bmp
    /collie
        /a_big_dog.jpg
    /coral reef
        /reef.bmp
    /Siamese
        /cat3.jpg

To score this dataset, pass the -i <path>/dataset option on the command line.

Source dataset format: a list of images

This example uses a single list file in the format image_name-tabulation-class_index. The correct list of files:

<path>/dataset
    /apron1.bmp
    /apron2.bmp
    /a_big_dog.jpg
    /reef.bmp
    /cat3.jpg
    /labels.txt

where labels.txt:

apron1.bmp 411
apron2.bmp 411
cat3.jpg 284
reef.bmp 973
a_big_dog.jpg 231

To score this dataset put the -i <path>/dataset/labels.txt option in the command line.

Output Description

A progress bar shows the inference progress. Upon completion, the common information is displayed.

Network load time: time spent on topology load in ms
Model: path to chosen model
Model Precision: precision of a chosen model
Batch size: specified batch size
Validation dataset: path to a validation set
Validation approach: Classification networks
Device: device type

You see statistics such as the average inference time, and top-1 and top-5 accuracy:

Average infer time (ms): 588.977 (16.98 images per second with batch size = 10)

Top1 accuracy: 70.00% (7 of 10 images were detected correctly, top class is correct)
Top5 accuracy: 80.00% (8 of 10 images were detected correctly, top five classes contain required class)

Using Object Detection with the Validation Application

Description

Running the Validation application in object detection mode to score an SSD CNN on a pack of images.

Running SSD on the VOC Dataset

Use these steps to score SSD on the original dataset that was used to test it during its training.

  1. Go to the SSD author's GitHub page to select the pre-trained SSD-300.
  2. From the same page, download the VOC2007 test dataset:
    $ wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
    tar -xvf VOCtest_06-Nov-2007.tar
  3. Use the Model Optimizer to convert the model. For help, see https://software.intel.com/en-us/articles/CVSDK-ModelOptimizer
  4. Create a proper class file (made from the original labelmap_voc.prototxt):
    none_of_the_above 0
    aeroplane 1
    bicycle 2
    bird 3
    boat 4
    bottle 5
    bus 6
    car 7
    cat 8
    chair 9
    cow 10
    diningtable 11
    dog 12
    horse 13
    motorbike 14
    person 15
    pottedplant 16
    sheep 17
    sofa 18
    train 19
    tvmonitor 20
  5. Save it as VOC_SSD_Classes.txt
  6. Score the model on the dataset:
    ./validation_app -d CPU -t OD -ODa "<...>/VOCdevkit/VOC2007/Annotations" -i "<...>/VOCdevkit" -m "<...>/vgg_voc0712_ssd_300x300.xml" -ODc "<...>/VOC_SSD_Classes.txt" -ODsubdir JPEGImages
  7. You see a progress bar followed by your data:
    Progress: [....................] 100.00% done    
    [ INFO ] Processing output blobs
    Network load time: 27.70ms
    Model: /home/user/models/ssd/withmean/vgg_voc0712_ssd_300x300/vgg_voc0712_ssd_300x300.xml
    Model Precision: FP32
    Batch size: 1
    Validation dataset: /home/user/Data/SSD-data/testonly/VOCdevkit
    Validation approach: Object detection network
    
    Average infer time (ms): 166.49 (6.01 images per second with batch size = 1)
    Average precision per class table: 
    
    Class   AP
    1   0.796
    2   0.839
    3   0.759
    4   0.695
    5   0.508
    6   0.867
    7   0.861
    8   0.886
    9   0.602
    10  0.822
    11  0.768
    12  0.861
    13  0.874
    14  0.842
    15  0.797
    16  0.526
    17  0.792
    18  0.795
    19  0.873
    20  0.773
    Mean Average Precision (mAP): 0.7767

The Mean Average Precision (mAP) value is also reported in a table on the SSD author's page and in the arXiv paper.

Intel® Math Kernel Library (Intel® MKL) 2018 Update 2 ScaLAPACK Symmetric Eigensolver Enhancements


Introduction

The symmetric eigenvalue problem arises in the context of numerous scientific applications. Intel MKL provides functions to solve the symmetric eigenvalue problem in the LAPACK domain for shared memory applications, and in the ScaLAPACK domain for distributed memory applications. Intel MKL 2018 Update 2 provides newly optimized ScaLAPACK eigensolver functions that deliver up to three times the performance over the same functions from earlier Intel MKL releases.

Algorithm

Let A be a real symmetric or complex Hermitian matrix of order N. A scalar λ is called an eigenvalue and a nonzero column vector z the corresponding eigenvector if

A*z = λ*z.

λ is real whenever A is real symmetric or complex Hermitian.

The goal of the symmetric eigenproblem is to compute the eigenvalues λ and optionally, the corresponding eigenvectors z for a given symmetric or Hermitian matrix A. This task can be formulated as a decomposition of A into the following form:

A = Z*Λ*Z^H,

where Z is an orthogonal matrix containing the eigenvectors of A, and Λ is a diagonal matrix containing the eigenvalues of A. Such a decomposition can be found for any real symmetric or complex Hermitian matrix.

The traditional first step in computing this factorization is to reduce A to tridiagonal form:

A = Q*T*Q^H

Then, an iterative method finds the eigenvalues and eigenvectors of the tridiagonal matrix:

T = V*Λ*V^H

The final step (called backward transformation) is needed only if eigenvectors are requested. In this step, the eigenvectors of the initial matrix are computed from the orthogonal matrices Q and V as:

A = Q*T*Q^H = Q*V*Λ*V^H*Q^H = Z*Λ*Z^H, where Z = Q*V.

This is the traditional method, used in the reference implementations in Netlib LAPACK and ScaLAPACK. Reduction to tridiagonal form (implemented by the ?SYTRD routine in LAPACK, and P?SYTRD in ScaLAPACK) is the most time-consuming part of this method. About one-half of the operations in the reduction are performed by Level-3 BLAS (symmetric rank-2k update), while the rest use Level-2 BLAS. For Level-3 BLAS, the numbers of floating-point (FP) and memory operations are estimated as O(n^3) and O(n^2), respectively, where n is the dimension of A, while for Level-2 BLAS both FP and memory operations are O(n^2). The domination of FP operations over memory accesses in Level-3 BLAS allows effective data re-use and close-to-peak performance on computers with a multi-level memory hierarchy. Level-2 BLAS is memory-bound, making it the main bottleneck of the eigensolver.
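
Stated compactly, the arithmetic-intensity argument from the previous paragraph is:

\frac{\text{FP ops}}{\text{memory ops}}\bigg|_{\text{Level-3 BLAS}} = \frac{O(n^3)}{O(n^2)} = O(n),
\qquad
\frac{\text{FP ops}}{\text{memory ops}}\bigg|_{\text{Level-2 BLAS}} = \frac{O(n^2)}{O(n^2)} = O(1).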

In 1996, Bischof, Lang and Sun published the Successive Band Reduction (SBR) approach, which addresses this bottleneck. In the SBR approach, the original matrix is first reduced to banded form, and then to tridiagonal form:

A = Q1*B*Q1^H = Q1*(Q2*T*Q2^H)*Q1^H = Q*T*Q^H, where B is banded.

The full-to-band reduction takes the majority of FP operations in the overall tridiagonal reduction, and can be implemented entirely using Level-3 BLAS (or PBLAS for ScaLAPACK). Details explaining the specifics of this algorithm can be found in many papers available online.

Additional performance typically comes at a cost, and in the case of the SBR algorithm, internal memory consumption, number of FP operations and code complexity are increased when compared to the traditional one-step algorithm.

Practical Suggestions for Intel MKL Eigensolver Users

Since Intel MKL 11.1 Update 3, two LAPACK drivers have been enabled with the SBR algorithm: ?SYEV and ?SYEVD. A few SBR-enabled eigensolvers, ?SYEV[X|D]_2STAGE, were introduced in Netlib LAPACK 3.7.0, which was released about two years after Intel MKL 11.1 Update 3. While these *_2STAGE implementations were integrated into Intel MKL 2018 Update 2, this was done strictly for compatibility with Netlib LAPACK. The Intel MKL LAPACK eigensolvers ?SYEV[D] support the general case when both eigenvalues and eigenvectors are computed, while the *_2STAGE functions only support the eigenvalues-only case (JOBZ='N'); the ?SYEV[D] functions also provide better performance, and it is therefore recommended to use ?SYEV[D] instead of the *_2STAGE functions.
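
For reference, the shared-memory ?SYEVD driver mentioned above can be called through the LAPACKE C interface shipped with Intel MKL; the following is a minimal sketch with an arbitrary 3x3 example matrix (not a benchmark):

#include <mkl_lapacke.h>
#include <cstdio>
#include <vector>

// Compute all eigenvalues and eigenvectors of a small real symmetric matrix with
// the divide-and-conquer driver (dsyevd). For sufficiently large problems, Intel MKL
// can dispatch the SBR-based path internally under this same API.
int main() {
    const lapack_int n = 3;
    std::vector<double> a = {   // symmetric 3x3 matrix; the upper triangle is used ('U')
        4.0, 1.0, 2.0,
        1.0, 3.0, 0.0,
        2.0, 0.0, 5.0 };
    std::vector<double> w(n);   // eigenvalues, in ascending order

    lapack_int info = LAPACKE_dsyevd(LAPACK_ROW_MAJOR, 'V', 'U', n, a.data(), n, w.data());
    if (info != 0) { std::printf("dsyevd failed: %d\n", (int)info); return 1; }

    for (lapack_int i = 0; i < n; ++i)
        std::printf("lambda[%d] = %f\n", (int)i, w[i]);
    // On exit, 'a' holds the orthonormal eigenvectors of the matrix (stored as its columns).
    return 0;
}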

In Intel MKL 2018 Update 2, the SBR algorithm was implemented in the ScaLAPACK symmetric eigensolver drivers P?SYEVD/P?HEEVD, P?SYEVX/P?HEEVX and P?SYEVR/P?HEEVR. Additionally, redistribution of a user input matrix into a more optimal grid was added to the EVX drivers, and improved in the EVD drivers, as it was observed that some applications sub-optimally distribute matrices passed to ScaLAPACK routines in block-cyclic manner (https://software.intel.com/en-us/mkl-developer-reference-c-scalapack-array-descriptors). A good rule of thumb for a ScaLAPACK distributed matrix is to have the shape of process grid similar to the matrix shape. For example, for square matrices the grid should be close to square, and for an MxN matrix with M>>N (tall-and-skinny matrix) it is good to have process grid of size PxQ, where P>>Q. In most cases, the reasonable block size for block-cyclic distribution should be somewhere between 24 and 64. Experiments show that the time needed for the matrix redistribution is negligible when compared to the computational time, but additional memory is needed for redistribution. As memory has become less of a limiting factor in High Performance Computing (HPC), it was decided to increase demand for additional memory (WORK/RWORK arrays) when significant performance speed-up is expected.

When a new algorithm is added into Intel MKL, it is preferable to integrate it under an existing API (rather than creating new entry point(s)), as there is no need to change source code on the user side and performance improvements are available immediately after relinking the new Intel MKL version. This principle was used when integrating SBR in both LAPACK and ScaLAPACK, and there are now conditions inside existing eigensolvers that determine whether the traditional or the SBR algorithm should be used. Experiments show that it makes sense to use SBR for larger matrices. For example, the matrix size should be at least 3000-4000 for a single-node run (<50 MPI ranks) and about 13000 for a run on 400 MPI processes (these thresholds are provided just as an example and are subject to change).

There are four ScaLAPACK drivers available in MKL for solving the symmetric eigenvalue problem. They are all based on the same reduction approach. The difference between them is the method used for finding eigenvalues and eigenvectors of the tridiagonal matrix T. Some drivers support more options compared to others (see table below).

*redist indicates that internal matrix redistribution is supported

The EVD (divide and conquer) driver is a balanced option providing both good stability and performance for the JOBZ=’V’ case (eigenvectors are needed). If only eigenvalues are needed (JOBZ=’N’), either EVX (bisection followed by inverse iteration) or EVR (Multiple Relatively Robust Representations, or MRRR, algorithm) will be more suitable.

EVX is typically the fastest driver and scales very well, but it should be used with caution for matrices with highly clustered eigenvalues. After computing the eigenvalues, EVX splits the spectrum of the matrix evenly across MPI processes in intervals. This approach works only if a cluster belongs entirely to a single interval; otherwise an error is returned. In Intel MKL 2018 Update 2, this technique was improved by allowing EVX to split the spectrum into uneven intervals, taking into account information about the clusters. There is still a limitation that the size of a cluster (and hence an interval) be no more than N/4, otherwise an error is returned (the chance of having a matrix with such large clusters is low in real applications). A slight downside of this approach is that the amount of memory needed for the computations depends on the number and size of clusters in the spectrum. As this information is not known in advance, Intel MKL allocates extra memory internally when it is needed.

EVD performance is usually slightly behind that of EVX, and EVR struggles to scale when the number of MPI processes is large (several hundred). The EV driver is the slowest and does not support either SBR or matrix redistribution, but it is also the most reliable. It should be used only if all other eigensolvers have failed.

The SBR approach has not yet been implemented for the case when only part of the eigenvalue spectrum is requested. As a result, finding all the eigenvalues and eigenvectors with EVX or EVR can sometimes be faster than finding only a subset.

Performance Results

The charts below demonstrate ScaLAPACK eigensolver performance improvements when comparing Intel MKL 2018 Update 2 versus Intel MKL 2018 Update 1. All experiments were carried out on a TACC* Stampede 2* cluster equipped with 2 x 24-core Intel® Xeon® Platinum 8160 processors. See the configuration section below for more details. The sequential threading layer of Intel MKL was linked.

Chart 1. Intel MKL ScaLAPACK symmetric eigensolver performance1 for matrix N=20000.

Chart 2. Intel MKL ScaLAPACK symmetric eigensolver performance1 for matrix N=50000.

Experiments with Real Applications

The open source applications CP2K* and MiniDFT* were chosen to validate the improvements in Intel MKL ScaLAPACK symmetric eigensolver performance.

CP2K* is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. Reference: https://www.cp2k.org/

MiniDFT* is a plane-wave density functional theory (DFT) mini-app for modeling materials. Reference: http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/minidft/

Below you can find a few details about CP2K* and MiniDFT* runs.

In order to estimate how the actual computations were improved we excluded the initialization/preparation steps from the applications’ total time. This means that we used the “scf_env_do_scf_inner_loop” and “electrons” timings from CP2K* and MiniDFT* output logs, respectively, for the time reported in the chart.

Chart 3. CP2K* and MiniDFT* performance1 with Intel MKL.

One can see that with Intel MKL 2018 Update 2 it is possible to improve CP2K* performance by 13% and MiniDFT* performance by 8%, compared to the previous Intel MKL release.

Test Configuration and Optimization Notices1

Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configuration: TACC* Stampede 2* cluster equipped with Intel® Xeon® Platinum 8160 nodes, 48 cores on two sockets (24 cores/socket), 2.1 GHz nominal clock rate (1.4-3.7GHz depending on instruction set and number of active cores), 192GB (2.67GHz) RAM. The interconnect is a 100Gb/sec Intel Omni-Path (OPA) network; Software: Intel icc/ifort 17.0, Intel MKL 2018 Update 1 and Intel MKL 2018 Update 2. Sequential version of Intel MKL was used (no OpenMP* threading). Benchmark Source: Intel Corporation.

Optimization notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. All rights reserved.

*Other names and brands may be claimed as the property of others.

Code Sample: An Approach to Parallel Processing with Unreal Engine*

File(s):Download (Intel Software Github)
License:Intel Sample Source Code License Agreement
Optimized for... 
Operating System:Windows® 10 (64 bit)
Hardware:GPU required
Software:
(Programming Language, tool, IDE, Framework)
Microsoft Visual Studio* 2017, Unreal Engine* 4, C#
Prerequisites:

Familiarity with Visual Studio, Unreal Engine API, 3D graphics, parallel processing.

Tutorial:Unreal Engine 4 - Parallel Processing a School of Fish

Introduction

The idea behind this project was to demonstrate parallel processing in gaming with Unreal Engine* and how to perform gaming-related physics using this game engine. In this domain, realism is an important indicator of success. In order to mimic the real world, many things need to happen at the same time, which requires parallel processing. This code and accompanying article (see References below) cover the development of a flocking algorithm, which is then demonstrated as schools of fish in an application. The application can be run single- or multi-threaded in order to see the differences in performance. Furthermore, performance is improved by performing the physics calculations on the GPU. The article also covers the overall complexity of the algorithm and how increasing the number of fish affects performance. The code and article cover the following topics:

  1. Overview of a flocking algorithm
  2. Overview of the algorithm implementation
  3. Single thread, multi-thread or GPU

Get Started

Overview of a flocking algorithm

In this example, a flock was defined as a school of fish. For each member, the algorithm needs to handle cohesion, alignment, and separation. Each fish was considered to “swim” within a school if it was within a certain distance of any other fish in the school. Members of a school do not act as individuals, but only as members of a flock, sharing the same parameters, such as speed and direction.

  • Cohesion: Fish search for their neighbors in a radius defined as the “Radius of Cohesion”. The current positions of all neighbors are summed and the result is divided by the number of neighbors, thus, the center of mass of the neighbors is obtained. This is the point to which the fish strive for cohesion. To determine the direction of movement of the fish, the current position of the fish is subtracted from the result obtained earlier, and then the resulting vector is normalized.
  • Separation: Fish search for their neighbors in a radius defined as the “Separation Radius”. To calculate the motion vector of an individual fish in a specific separation direction from a school, the difference in the positions of the neighbors and its own position is summed. The result is divided by the number of neighbors and then normalized and multiplied by -1 to change the initial direction of the fish to swim in the opposite direction of the neighbors.
  • Alignment: Fish search for their neighbors in a radius defined as the “Radius of Alignment”. The current speeds of all neighbors are summed, then divided by the number of neighbors. The resulting vector is normalized.
  • Reversal: The fish can only swim in a given space whose boundaries are specified. The moment a fish crosses a boundary must be identified; if a fish contacts a boundary, its direction is changed to the opposite vector, thereby keeping the fish within the defined space.

These four basic principles of behavior for each fish in a school are combined to calculate the position, speed, and acceleration of each fish. These values must be calculated for each fish in every frame.

In the algorithm, the concept of weight coefficients was introduced to increase or decrease the influence of each of the three steering behaviors (cohesion, separation, and alignment). No weight coefficient was applied to Reversal, because fish were not permitted to swim outside of the defined boundaries; for this reason, Reversal had the highest priority. The algorithm also enforced limits on maximum speed and acceleration. A simplified sketch of these steering computations is shown below.
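
Below is a minimal, hypothetical C++ sketch of how these steering vectors might be combined for a single fish. The small Vec3 helper, the Fish struct, and the weight parameters kCoh, kSep, and kAlign are illustrative stand-ins rather than the exact types used in the sample project, and the neighbor search (and its per-behavior radii) is assumed to have been done already.

#include <cmath>
#include <vector>

// Minimal 3D vector type used only for this illustration.
struct Vec3 {
    float x = 0.0f, y = 0.0f, z = 0.0f;
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    Vec3 operator*(float s) const { return {x * s, y * s, z * s}; }
    Vec3 normalized() const {
        const float len = std::sqrt(x * x + y * y + z * z);
        return len > 0.0f ? Vec3{x / len, y / len, z / len} : Vec3{};
    }
};

struct Fish { Vec3 position, velocity; };

// Combine cohesion, separation, and alignment for one fish, following the
// description above. The neighbors vector is assumed to already hold the
// fish found within the relevant radius.
Vec3 SteeringDirection(const Fish& self, const std::vector<Fish>& neighbors,
                       float kCoh, float kSep, float kAlign)
{
    if (neighbors.empty())
        return self.velocity.normalized();

    Vec3 sumPos, sumOffset, sumVel;
    for (const Fish& n : neighbors) {
        sumPos    = sumPos + n.position;
        sumOffset = sumOffset + (n.position - self.position);
        sumVel    = sumVel + n.velocity;
    }
    const float inv = 1.0f / static_cast<float>(neighbors.size());

    // Cohesion: steer toward the neighbors' center of mass.
    const Vec3 cohesion = ((sumPos * inv) - self.position).normalized();
    // Separation: steer away from the neighbors (note the -1 factor).
    const Vec3 separation = (sumOffset * inv).normalized() * -1.0f;
    // Alignment: match the neighbors' average velocity.
    const Vec3 alignment = (sumVel * inv).normalized();

    // Weighted blend; Reversal (the boundary check) would override this result.
    return (cohesion * kCoh + separation * kSep + alignment * kAlign).normalized();
}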

Overview of the algorithm implementation

To calculate the state of the fish in a school, double buffering is used. Fish states are stored in an array of size N x 2, where N is the number of fish and 2 is the number of copies of the states. The algorithm is implemented using two nested loops. In the inner loop, the direction vectors are calculated for the three types of behavior (cohesion, separation, and alignment). In the outer loop, the final calculation of the new state of each fish is made based on the results of the inner loop, the weight coefficients of each type of behavior, and the maximum values of speed and acceleration. The article covers each of these loops in detail, as well as the compute shader used to calculate the position of each fish.

Single-thread, multi-thread or GPU

As was mentioned above, the example can be run in single- or multi-threaded mode, which is easily accomplished using ParallelFor.  ParallelFor can be used in either of two modes, depending on the state of the isSingleThread Boolean variable:

ParallelFor(cnt, [&agents, currentStatesIndex, previousStatesIndex, kCoh, kSep, kAlign, rCohesion, rSeparation, rAlignment, maxAccel, maxVel, mapSz, DeltaTime, isSingleThread](int32 fishNum) {
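
For context, here is a hedged sketch of what the full dispatch around that fragment might look like. It uses Unreal Engine's ParallelFor overload that takes a bForceSingleThread flag; the FFishState struct and the double-buffer indexing are simplified placeholders rather than the exact code from the sample.

#include "CoreMinimal.h"
#include "Async/ParallelFor.h"

// Simplified placeholder for the per-fish state kept in the N x 2 buffer.
struct FFishState
{
    FVector Position;
    FVector Velocity;
    FVector Acceleration;
};

void UpdateSchool(TArray<FFishState>& States,   // size = FishCount * 2 (double buffer)
                  int32 FishCount,
                  int32 CurrentStatesIndex,     // buffer written this frame (0 or 1)
                  int32 PreviousStatesIndex,    // buffer read this frame (the other one)
                  float DeltaTime,
                  bool bIsSingleThread)
{
    ParallelFor(FishCount,
        [&States, FishCount, CurrentStatesIndex, PreviousStatesIndex, DeltaTime](int32 FishNum)
        {
            // Read this fish's state from the previous frame's buffer...
            const FFishState& Prev = States[PreviousStatesIndex * FishCount + FishNum];

            // ...compute cohesion/separation/alignment against the previous
            // states of the other fish (inner loop omitted here)...
            FFishState Next = Prev;
            Next.Position += Prev.Velocity * DeltaTime;

            // ...and write the result only into the current buffer.
            States[CurrentStatesIndex * FishCount + FishNum] = Next;
        },
        bIsSingleThread);   // true = run the loop body serially on the calling thread
}

Because each iteration reads only the previous buffer and writes only its own slot in the current buffer, no locking is needed between iterations, which is what makes the loop safe to parallelize.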

Because the position of each fish and its neighbors needs to be calculated for each frame, increasing the number of fish greatly increases the number of calculations. It is not surprising that the application running in multiple threads outperforms the single-threaded mode. Splitting the work further by utilizing the GPU improves performance even more. Again, this is covered in more detail in the article.

References

Nikolay Lazarev, Integrated Computing Solutions, Inc., Unreal Engine 4 - Parallel Processing a School of Fish, 2018

Updated Log

Created March 20, 2018

Announcing the 2018 Annual SPDK Summit

SPDK Summit 2018 in San Jose - May 15-16

Discover the SPDK Experience

Are you interested in learning how the Storage Performance Development Kit (SPDK) can help accelerate your storage solutions?

If you are new to SPDK, no worries! We have various sessions on the agenda that introduce you to the project, along with a hands-on lab where you'll learn as you code. Already an avid SPDK developer or user? Great! Participate in our developer meetup to brainstorm and share ideas to improve the project as well as discuss new features, concepts and BKMs.

Join us at the 2018 Annual Storage Performance Development Kit (SPDK) Summit, May 15-16, at the Dolce Hayes Mansion in San Jose, CA.

Registration for the event is free!

About the Summit

The event kicks off with a keynote by Intel Data Center Group Vice President Jennifer Huffstetler. The agenda continues with an awesome lineup that includes talks by various SPDK community members, including Alibaba, eBay, Oracle, Cisco, and Nutanix, who are eager to share their experiences, lessons learned, challenges, and the benefits of using SPDK. SPDK developers will present new content on various SPDK features, highlighting their use cases and benefits. You will also get to see SPDK in action with live demos and the hands-on lab. For more info on the SPDK project, please visit http://SPDK.io

The registration link is live for the 2018 Annual Storage Performance Development Kit US Summit. Please use the link below to complete your registration and hotel room booking; summit details and the agenda are also available there.

Register Now

To guarantee room availability, please register before May 2. The deadline to register for the event is May 8.


Intel® MKL support for largest/smallest Eigenvalue and Sparse SVD Problem

Introduction

Intel MKL Extended Eigensolver functionality [1], based on the accelerated subspace iteration FEAST algorithm, is a high-performance solution for obtaining all the eigenvalues, and optionally all the associated eigenvectors, within a user-specified interval.
To extend the available functionality, we propose new routines, available in the Intel MKL 2019 Beta release, for finding the K largest/smallest eigenvalues or singular values of a sparse matrix.

With the help of the new routines, users of the Extended Eigensolver can obtain a portion of the extremal eigenvalues of a standard/generalized eigenproblem or find the truncated SVD decomposition of a large sparse matrix. To achieve the best convergence for any workload, we have implemented two alternative approaches: a subspace projection algorithm based on FEAST [2] and a classical Krylov subspace iteration method [3]. Both approaches belong to the class of projection methods that construct bases for particular subspaces and then obtain the corresponding Ritz values and vectors. In the Krylov subspace technique, the dimensions of the subspaces grow as the iterations proceed, and restart techniques are used to obtain a good approximation while keeping the dimension of the dense eigenproblem small. In contrast, the FEAST-based projection method projects onto subspaces of a fixed dimension that depends only on the number of eigenvalues to find. Each method is beneficial on certain workloads and, for simplicity, we implemented heuristics that decide which algorithm to use. However, users can also specify a preferred approach.

FEAST-based approach

The underlying subspace projection method can be divided into two steps. In the first step, we produce an interval [a, b] that contains the largest/smallest eigenvalues in question, using classical and recent methods that estimate eigenvalue counts [4]. In the second step, we approximate the spectral projector onto the invariant eigenspace corresponding to [a, b] using the FEAST algorithm.

Filtering techniques are implemented using a rational expansion for the eigenvalue problem and Chebyshev polynomials for the SVD. In the first method, the projector is constructed by integrating the resolvent of the eigenproblem along a contour in the complex plane enclosing the interval [a, b]. In this case the projector is approximated by a rational function of the matrix, and a sparse linear system must be solved on each iteration of the algorithm. In the second method, the resulting projector is expanded as a polynomial function of the matrix; thus, the hotspot of the algorithm is sparse matrix-dense matrix multiplication.

The subspace projection method can be beneficial compared to classic Krylov methods in cases when a large portion of eigenvalues is required, as its convergence does not depend on the number of eigenvalues in the subspace.

Krylov subspace iteration approach

To ensure fast convergence when only a few eigenpairs (singular values) are required we implemented the Krylov-Schur algorithm - an effective and robust restarting scheme. The Krylov-Schur method[3] can be seen as an improvement of the implicitly restarted Arnoldi algorithm which employs reordered Schur decompositions to perform restarts and deflations in a numerically reliable manner.

For the best possible performance on Intel architecture, all linear algebra operations are performed using the highly optimized Intel MKL Inspector-Executor Sparse BLAS, Intel MKL PARDISO, and Intel MKL LAPACK functionality.
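
As a purely illustrative aside (this is not the FEAST or Krylov-Schur algorithm used by the Extended Eigensolver, and no MKL API is called here), the toy C++ power iteration below shows the basic idea behind such projection methods: repeatedly apply the matrix to a vector and renormalize, so that the vector converges toward the eigenvector of the largest-magnitude eigenvalue while the Rayleigh quotient approximates that eigenvalue.

#include <cmath>
#include <vector>

// Toy power method on a dense, symmetric, row-major n x n matrix A.
// Production solvers (FEAST, Krylov-Schur) iterate whole subspaces with
// filtering and restarts, but the project-and-normalize idea is the same.
// Assumes A*x does not vanish during the iteration.
double PowerIteration(const std::vector<double>& A, int n, int iterations = 200)
{
    std::vector<double> x(n, 1.0), y(n, 0.0);
    double lambda = 0.0;

    for (int it = 0; it < iterations; ++it) {
        // y = A * x
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j)
                s += A[i * n + j] * x[j];
            y[i] = s;
        }

        // Rayleigh quotient estimate: lambda ~= (x' A x) / (x' x)
        double xy = 0.0, xx = 0.0;
        for (int i = 0; i < n; ++i) {
            xy += x[i] * y[i];
            xx += x[i] * x[i];
        }
        lambda = xy / xx;

        // Normalize y and use it as the next iterate.
        double norm = 0.0;
        for (int i = 0; i < n; ++i)
            norm += y[i] * y[i];
        norm = std::sqrt(norm);
        for (int i = 0; i < n; ++i)
            x[i] = y[i] / norm;
    }
    return lambda;
}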

Functionality overview

The new functionality is a complement to the existing Extended Eigensolver functionality and enables users to solve the following problems:

Eigenvalue problems:

Find all or some of the numbers Lambda and the corresponding vectors X such that:

AX = Lambda*X,

A = A^T

(Standard eigenvalue problem)

or

AX = Lambda*BX,

A = A^T,  B = B^T > 0

(Generalized eigenvalue problem)

Singular value problem:

Find all or some of the numbers SIGMA and the corresponding vectors X such that:

A*A^T*X = SIGMA*X or A^T*A*X = SIGMA*X

Notes: Currently, only the following features are supported:

  • problems with real spectrum.
  • input matrix in CSR and BSR formats.
  • single and double real precisions.
  • C and Fortran 90 interfaces.
  • Intel® TBB and OpenMP* parallelism.

For more information on the API, see the Intel MKL Reference Manual.

For examples of usage see:

path-to-mkl/examples/solvers_eec/source/dexample_extremal_ev_c.c – C example for finding largest eigenvalues of standard eigenvalue problem

path-to-mkl/examples/solvers_eec/source/dexample_extremal_gv_c.c – C example for finding largest eigenvalues of generalized eigenvalue problem

path-to-mkl/examples/solvers_eec/source/dexample_extremal_svd_c.c – C example for finding largest singular values of sparse matrix

path-to-mkl/examples/solvers_eef/source/dexample_largest_eig_f.f90 – F90 example for all the above functionality

Performance

The chart below shows a performance comparison of SLEPc run times with those of the new Extended Eigensolver mkl_sparse_d_ev routine on matrices from the SuiteSparse Matrix Collection (formerly the University of Florida collection) [5]. Each column represents the ratio Time SLEPc / Time MKL Extended Eigensolver. Three runs were performed for each matrix: finding the 1, 200, and 500 largest eigenvalues. Since SLEPc supports only MPI parallelism, runs on 40 MPI processes were compared to OpenMP* parallel runs of the Extended Eigensolver with 40 threads.

On the chart below, users can see the benefits that can be achieved by switching from the dense functionality for finding the largest eigenvalues to the new sparse Extended Eigensolver solution. The performance improvement depends on the matrix density and the number of eigenvalues to find, but it can be significant. For the sparse solution, the mkl_sparse_d_ev routine was used; for the dense solution, a combination of the LAPACK routines dsytrd, dorgtr, and dsteqr was used.

References

[1] Intel Math Kernel Library Extended Eigensolver

[2] P. T. P. Tang, E. Polizzi, FEAST as a Subspace Iteration EigenSolver Accelerated by Approximate Spectral Projection, SIAM Journal on Matrix Analysis and Applications (SIMAX), 35, 354-390 (2014)

[3] Stewart, G. W. (2001). A Krylov–Schur Algorithm for Large Eigenproblems. SIAM J. Matrix Anal. Appl., 23(3):601–614.

[4] E. Di Napoli, E. Polizzi, Y. Saad, Efficient Estimation of Eigenvalue Counts in an Interval

[5] Suite Sparse Matrix Collection https://sparse.tamu.edu/

Using Intel® Advisor on a MacOS* system

Introduction

It is a common use case for high performance, parallel and vectorized code to be run on dedicated remote systems like clusters with strict time schedules and limited capabilities for visualization and data manipulation. Tools for performance assessment are often limited to running from the command line, and many conveniences like GUI interfaces and access to online documentation resources are often unavailable on these types of systems.

The purpose of this article is to show how the Intel® Advisor command line can be used on a remote Linux* or Microsoft Windows* system to collect data, and how the results can then be analyzed with the GUI on a MacOS* developer workstation.

Getting Started

To run an Intel® Advisor analysis on a remote (target) system and view the results on a MacOS* computer, you need to perform the following steps:

  1. Install the Intel® Advisor GUI on a local MacOS* system
  2. Install the Intel® Advisor command line on a target system (cluster)
  3. Build your application for analysis with debug information enabled and with inline debug information included (see Intel® Advisor help for more)
  4. Recommendation: the application for analysis, along with its binaries, symbol information and source code, should be located on a shared drive visible to both local and remote machines

Remote analysis workflow

  1. Collect data on the target. The following command will run a survey analysis:
    advixe-cl -collect survey -project-dir /user/test/vec_project /user/test/vec_samples/vec_samples
    This will analyze the vec_samples application and create a result project at /user/test/vec_project.
  2. Next, retrieve and view the results. You can either pack your results or simply copy the result project to a shared drive.
    • Packing Results:
      1. You can pack your results using a command like this one:
        advixe-cl --snapshot --project-dir /user/test/vec_project --pack --cache-sources --cache-binaries -- /tmp/my_proj_snapshot
      2. Copy my_proj_snapshot.advixeexpz to the MacOS* host and open it in the GUI, then view the result:
        advixe-gui my_proj_snapshot.advixeexpz
    • Copying Results:
      1. Copy the vec_project directory to the shared drive.
      2. Open the project in the GUI:
        advixe-gui vec_project
      3. Set up the binary, symbol and source paths for the analyzed application in the shared location using the menu File > Project properties > Binary/Symbol Search and Source Search tabs. Then view the result.

Roofline analysis

Let’s walk through the Intel® Advisor Roofline analysis as an example. Start the Intel® Advisor GUI on the local machine. Click the Get Command Line  button on the Workflow pane next to the desired analysis type – in this case, we will start with a Roofline analysis. A dialog window opens, containing the command for launching this type of analysis with the current settings.

The command line for getting the Roofline chart consists of two commands: a Survey collection and a Trip Counts collection with FLOP data. Parameters with search paths to binary, symbol, and source files are also included in the auto-generated command line to reflect the project settings.

Copy the line to the clipboard, paste it into your job script or command line, and then start the analysis.

For this case, where the results will be analyzed on a MacOS* host system, it is recommended not to use the -no-auto-finalize command line option (which is normally used to reduce collection and finalization time). We suggest doing all the finalization on the target system along with the analysis: since a MacOS* workstation most likely does not have the same versions of the compiler, runtimes, math libraries, and other parts of the analyzed application stack, finalization on the target system will better capture all these details.

Memory Access Patterns and Dependencies Analysis

After getting Survey and/or Roofline analysis results, you may want to perform deeper analysis on certain loops to detect inefficient memory access patterns or to determine whether they have data dependencies between iterations. The following commands illustrate these types of analyses. Here, the search directory parameters are omitted for brevity, but be sure to include them so that analysis results are correctly mapped to sources and binaries.

advixe-cl -project-dir /tmp/roofline_project -mark-up-loops -select foo.cpp:34,bar.cpp:192
advixe-cl -collect map -project-dir /tmp/roofline_project -- /tmp/shared_folder/vec_sample/vec_samples
advixe-cl -collect dependencies -project-dir /tmp/roofline_project -- /tmp/shared_folder/vec_sample/vec_samples

To limit the analysis scope to only the two specific loops of interest and to reduce analysis time, the -mark-up-loops command is used prior to the analyses in the examples above. This command accepts a comma-separated list of loops in filename:linenumber format, and these loops remain selected for future tripcounts, map, and dependencies analyses unless otherwise specified.

Alternatively, loops can be specified for the duration of only one map or dependencies command using the -mark-up-list option in the collection command. This option takes a comma-separated list of loop IDs. These IDs can be retrieved from Intel® Advisor command line Survey reports, or from the GUI by selecting loops with the checkboxes on the Survey report tab and getting the corresponding command line from the Workflow window after making the selection.

advixe-cl -collect map -mark-up-list=58,72 -project-dir /tmp/roofline_project -- /tmp/shared_folder/vec_sample/vec_samples

Reviewing results on a MacOS* system

Finally, copy the result generated on the remote machine to the local machine for viewing, using either of the two methods described earlier. Once the results have been opened, the full content of the GUI should be available on your MacOS* system. You will be able to view the results of Intel® Advisor analyses, identify vectorization inefficiencies and optimization opportunities, and study the Roofline chart, with all these performance observations mapped to source code and assembly.

Notes and limitations

There are a few limitations in the MacOS* result viewer in the Intel® Advisor 2019 Beta version worth noting:

  • The Python API is not currently supported
  • A new analysis cannot be run on a MacOS* system
  • There is no MacOS*-specific documentation yet, but most of the result-viewing tips in the Linux* documentation are also relevant for viewing results on MacOS* systems.

Summary

Intel® Advisor is a must-have tool for getting the most performance out of your HPC applications. Starting with the 2019 Beta, you can visualize Intel® Advisor results on your MacOS* system. Follow the remote workflow described in this article to get the most out of viewing Intel Advisor results on MacOS*.

Architecture Agnostic Spin-Wait Loops

To fully utilize the power of today's multicore processors, game developers are using more advanced tasking systems which distribute the work across multiple threads in a thread pool. As the thread count increases, so does the chance of contention between the threads on constructs such as job queue locks and other shared resources. There are many ways to work around this contention, but a common construct is the spin-wait loop.

while (!acquire_lock())
{
	// Spin for a specific number of iterations
	for (int i = 0; i < max_spin_count; i++)
	{
		// Pause intrinsic
		_mm_pause();
	}
}

The _mm_pause intrinsic (the PAUSE instruction) is used here because it gives the processor a hint that the calling thread is in a spin-wait loop. It delays the execution of the next instruction, so while the thread waits the processor is not under demand and parts of the pipeline are not used, thus saving power.

The latency of the pause instruction had been similar on most Intel platforms for several generations, and due to this historical consistency, many developers tuned their spin loops with that latency in mind. However, starting with the 6th Generation Intel® Core™ processor family, the latency of the pause instruction was increased by roughly an order of magnitude to provide better power-saving opportunities in many scenarios.

As a result of this latency change, the fixed-count spin loop above would now consume an order of magnitude more cycles, which could have a detrimental impact on the performance of your application. To avoid issues with future architectural changes to this instruction, spin-wait loops should be checked to ensure they are not implemented with a fixed count of pause instructions. An appropriate modification to the above spin-wait loop would be:

while (!acquire_lock())
{
	// __rdtsc intrinsic is used to read the time stamp counter.
	// This allows the loop to run for a fixed number of cycles.
	uint64_t prev = __rdtsc();
	do
	{
		// Pause intrinsic
		_mm_pause();
	} while ((__rdtsc() - prev) < max_spin_time);
}

While the above spin-wait loop is very simple (software developers typically use more advanced spin loops with exponential backoff and similar techniques), it shows how to make software more robust against future architectural changes in instruction latencies. A hedged sketch of such a backoff loop is shown below.
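
As a rough illustration of those more advanced loops, the sketch below layers a simple exponential backoff on top of the time-based pause loop. The acquire_lock() placeholder and the starting/maximum spin times are illustrative assumptions, not part of any particular library, and a production implementation would typically also fall back to an OS-level wait after some threshold.

#include <cstdint>
#include <emmintrin.h>   // _mm_pause
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC/Clang
#endif

// Placeholder for the lock-acquisition call used in the examples above.
bool acquire_lock();

void lock_with_backoff(uint64_t max_spin_time)
{
	// Start with a short wait and double it after each failed attempt,
	// capping at max_spin_time so a single wait never grows unbounded.
	uint64_t spin_time = 64;

	while (!acquire_lock())
	{
		const uint64_t prev = __rdtsc();
		while ((__rdtsc() - prev) < spin_time)
		{
			_mm_pause();
		}
		spin_time = (spin_time * 2 < max_spin_time) ? spin_time * 2 : max_spin_time;
	}
}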

 

Please read the following article for a more detailed look into the _mm_pause instruction and spin-wait loops: Benefitting Power and Performance Sleep Loops

Further information about programming for Intel® architecture can be found in the Software Development Manuals: Intel® 64 and IA-32 Architectures Software Developer Manuals

Challenges and Tradeoffs on the Road to AR

The original article is published by Intel Game Dev on VentureBeat*: Challenges and tradeoffs on the road to AR. Get more game dev news and related topics from Intel on VentureBeat.

Anyone involved in virtual reality over the course of the past few years, whether as a developer of VR, as a user of VR, or simply tracking the industry's progress, will agree there's a word they've heard a few times too many: Holodeck*. The well-trod Star Trek concept has become a threadbare metaphor for a supposed end-point for VR technology.

While aspirational visions serve a purpose, they can also do us a disservice. The reality is that we are a very long way from that Holodeck vision, and that's OK. VR is already serving many useful purposes with near-term solutions that don't attempt to fool all our senses to the point of a complete suspension of disbelief. Most of the industry, it seems, has come to accept this, as have most VR users. We have, collectively, come to terms with the fact that great product solutions can exist in the near term that deliver some portion of the Holodeck promise, while leaving other portions to the fictions of Star Trek and other sci-fi.

It is surprising, then, when looking at augmented reality [1], that so many believe in the promise of a "Holodeck of AR": sleek and stylish glasses delivered via hardware and software magic that, rather than bringing us to any imaginable universe, instead bring any imaginable augmentation of the senses to our real world. Moreover, many believe this is deliverable in the near-term time horizon.

While solutions spanning the immersive technologies domain (AR, VR) will share dependence on common underlying technologies, augmented reality is in many ways a harder problem. AR can be thought of as a whole bouquet of thorny technical problems, each of which is its own very deep rabbit hole.

As with VR, AR involves an input-output loop that needs to execute sufficiently quickly to fool the conscious and subconscious to a degree where the results seem congruous with the surrounding world and the user's sense of what seems natural. What's more, in order to dovetail with the surrounding world, the solution may need to communicate with and draw from surrounding information sources. The sophistication of the processing that the solution may need to perform may vary by use case. And the solution needs to be embodied in something that a user can wear or carry in a manner suitable to their situation.

This is where the challenge becomes apparent. The sheer number of possible inputs and outputs that one can imagine, the depth of each that might be required, the sophistication of the processing that may be required for a given task, and the desired attributes for the embodiment of that solution (price, form factor, etc), make this a boundless problem.

Attributes of AR Solutions

For a sampling of the technical challenges facing AR, see the illustration below, which attempts to present the wide variety of attributes that an AR solution may embody. Titled the "Attributes of Augmented Reality" [2], this list, while almost certainly incomplete, is meant to illustrate the breadth of challenging problems to address. I've divided them into four main areas:

  • Sensing: Seeing, hearing, sampling, and otherwise deriving the state of the world surrounding the user.
  • Processing: Making sense of all of that data, determining what it means in the context of the computational tasks, simulations, and/or applications at hand, and making decisions about what to do next.
  • Augmenting: Taking the output of this processing and presenting it back to the user's senses in a way that augments their sense of their environs.
  • Embodying: The attributes of the physical manifestation of the device or devices that deliver this solution.

This is an admittedly over-simplified division; and the sub-categories within each area are only a subset, to which many working within the field could add. This, in a way, is the point: Solutions that do all of these things, let alone do them well, cheaply, and unobtrusively, are a long way off.

Attributes of Augmented Reality

Even more challenging still is the number of problems in the space that are ones for which solutions do not yet even exist. I like to think of the problems as falling within three distinct domains:

Problems at the intersection of power, performance, and time. For those of us that work in Silicon Valley, these are the easiest to understand. For known solutions to problems, they are simply a matter of "how long before Moore's Law allows us to do this in real-time, within a certain power envelope?"

Problems requiring breakthroughs in science. This is a more challenging category of problems, requiring breakthroughs in limitations of existing technologies — or more often — multiple breakthroughs. Examples in recent years include image-based inside-out 6DOF-tracking, or Waveguide display technologies. Lightfield displays are an example that feels further out on the edge of today's R&D. While predicting when these problems will be solved is much harder, there's a certain faith that people in the field have enough smart people in labs around the world working on these problems to make progress in solving them.

Problems requiring breakthroughs in design, user experience, and social norms. I sometimes encounter folks who believe that if we tackle problems in the two categories above, the third category will be resolved in short order. Personally, I think this is the hardest category of the three. We can look at many technology transitions and see that there was a sort of "maximum rate of absorption" at which the design community could adapt to the new paradigm (e.g., the half-decade of 3-finger swipes, swirly gestures, and other touchscreen UI experiments before the dust settled on what most apps use today on smartphones).

Similarly, there's an analogous societal component — it takes time for people to get used to intrusions of technology (real or perceived) on their lives. (Google Glass* learned this lesson painfully.)

Specialization Versus Jack of All Trades

Until a point in the far future where we can deliver all of the attributes of AR at extremely high quality, inexpensively, and seamlessly, we're going to see interim solutions that are forced to make tradeoffs between them. This is a Good Thing. I hold a strong conviction that the path to success in this space is in doing fewer things extremely well, not many things in a compromised fashion.

It's likely we'll see AR solutions that tackle particular problems in point solution devices. We'll see solutions that make compromises on some attributes in order to exceed expectations on others. We'll see solutions that complement existing screens rather than replace them. And like with VR, we'll see solutions that leverage the power of existing devices (PCs, game consoles, smartphones, etc.).

Fostering an Environment for Progress

If we take the view that solutions will need to decide on different tradeoffs for different optimal solutions for particular problems, customer segments or form factors — and that we want many solutions to make attempts at different flavors of AR solutions — then how to encourage this?

The first step is to acknowledge that the "AR Holodeck" is not likely to arrive in the near term, and that interim, specialized solutions are not only OK, but may be preferred. Second is to foster an environment that allows a multitude of solutions to materialize — through open platforms and open standards. Finally, the industry requires collaboration — as entrants solve a problem in one domain, to share that solution with others to allow cross-pollination. Through these kinds of efforts, we may get our "holodeck of AR" eventually, but we'll have been using AR for years already by the time that happens.

About the Author

Kim Pallister manages the VR Center of Excellence at Intel. The opinions expressed in this article are his own and do not necessarily represent the views of Intel Corporation.

Footnotes

1. I'm going to avoid getting into the AR/MR nomenclature debate. For purposes of this article and the illustrative Attributes of AR poster – I'm covering the full spectrum of cases where a solution would supplement a user's environment with spatial elements, regardless of how seamlessly or realistically the solution attempts to integrate them into the environment.

2. To give credit where it's due: I owe thanks to the folks at Ziba Design for helping lay out the design in a far more cohesive way than I originally had it on my whiteboard. Also, a huge thanks to John Manoogian III for his creation of the *brilliant* Cognitive Bias Codex, from which I took inspiration.

Array Shape Check: New in Intel® Fortran Compiler 19.0 BETA

The array shape checking feature implemented in the Intel® Fortran Compiler 19.0 BETA checks for array shape conformance where the Fortran standard requires it. When enabled, the compiler checks contexts at compile time and generates code that verifies at run time that array shapes conform in the contexts where conformance is required. Try this compiler option to help debug programs that use arrays!

To enable array shape checking, compile with -check shape (Linux* and macOS*) or /check:shape (Windows*). Using this code as an example,

program t3
  implicit none
  real(8) :: a(20)=1,b(20)=2
  integer :: n=10,m=20
  a(1:10)=3
  b(1:m)=b(1:m)+a(1:n)
  print *,b
end program t3

the array shape mismatch is detected at runtime. This example output is on Linux and is similar for Windows.

ifort -check shape t3.f90
a.out
forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 1 of array B is 20 and the corresponding extent of array A is 10

Image              PC                Routine            Line        Source     
a.out              0000000000405400  Unknown               Unknown  Unknown
a.out              0000000000402CA9  Unknown               Unknown  Unknown
a.out              0000000000402C1E  Unknown               Unknown  Unknown
libc-2.17.so       00007F3285A74AF5  __libc_start_main     Unknown  Unknown
a.out              0000000000402B29  Unknown               Unknown  Unknown

The severe error can be changed to a warning by also compiling with -warn shape (Linux and macOS) or /warn:shape (Windows); the program then prints the warning and traceback and, in this case, runs to completion.

ifort -check shape -warn shape t3.f90
a.out
forrtl: warning (406): fort: (33): Shape mismatch: The extent of dimension 1 of array B is 20 and the corresponding extent of array A is 10

Image              PC                Routine            Line        Source     
a.out              0000000000405420  Unknown               Unknown  Unknown
a.out              0000000000402D1B  Unknown               Unknown  Unknown
a.out              0000000000402C1E  Unknown               Unknown  Unknown
libc-2.17.so       00007F51932C3AF5  __libc_start_main     Unknown  Unknown
a.out              0000000000402B29  Unknown               Unknown  Unknown
   5.00000000000000        5.00000000000000        5.00000000000000
   5.00000000000000        5.00000000000000        5.00000000000000
   5.00000000000000        5.00000000000000        5.00000000000000
   5.00000000000000        3.00000000000000        3.00000000000000
   3.00000000000000        3.00000000000000        3.00000000000000
   3.00000000000000        3.00000000000000        3.00000000000000
   3.00000000000000        3.00000000000000

Add -traceback to get more information about the runtime failure. In the traceback in the example below, notice that the shape mismatch is on line 6.

ifort -check shape -traceback t3.f90
a.out
forrtl: severe (408): fort: (33): Shape mismatch: The extent of dimension 1 of array B is 20 and the corresponding extent of array A is 10

Image              PC                Routine            Line        Source     
a.out              00000000004055C0  Unknown               Unknown  Unknown
a.out              0000000000402E58  MAIN__                      6  t3.f90
a.out              0000000000402C1E  Unknown               Unknown  Unknown
libc-2.17.so       00007F13720E8AF5  __libc_start_main     Unknown  Unknown
a.out              0000000000402B29  Unknown               Unknown  Unknown

With the Fortran 2003 standard for an array assignment, if the LHS (left hand side) is an allocatable array, by default the compiler does the following:

  • If the LHS is not allocated, allocate it to the size of the RHS (right hand side).
  • If the LHS is allocated, but is not the same size as the RHS, then deallocate the LHS and reallocate to the size of the RHS.
  • Do the actual array assignment.

As a result, for assignments with an allocatable array on the LHS, there can be no LHS/RHS shape violation. However, if you specify -assume norealloc_lhs (Linux and macOS) or /assume:norealloc_lhs (Windows) or -nostandard-realloc-lhs (Linux and macOS) or /nostandard-realloc-lhs (Windows), then the Fortran 90/95 behavior for allocatable arrays is used, and the allocatable array on the LHS must have the same size as the RHS.

For example, given the following code:

PROGRAM TR
  IMPLICIT NONE
  INTEGER, ALLOCATABLE :: A(:),B(:)
  ALLOCATE(A(20))
  ALLOCATE(B(10))
  A(:) = 10
  B = A
  WRITE(*,*) B(1)
END PROGRAM TR

There is no shape violation, by default, since B will be automatically re-sized to the size of A. If -assume norealloc_lhs is specified, however, a violation does occur, and will be reported, if array shape checking is enabled.

When the array shape check compiler option is enabled, these contexts are checked:

  • right-hand and left-hand side of intrinsic and elemental defined assignments,
  • the operands of intrinsic and elemental defined binary operations,
  • two or more array arguments to ELEMENTAL procedures,
  • the ARRAY= and MASK= arguments to intrinsic procedures as required,
  • and the arguments to the intrinsic module procedures IEEE_SET_FLAG and IEEE_SET_HALTING_MODE.

Array shape checking is also enabled by including -check (Linux and macOS) or /check (Windows) in the compile command.

Add this compiler option to your debugging toolkit!

 

† An array's shape is determined by its rank and the extents of its dimensions. The rank is the number of dimensions of the array, and the number of elements in a dimension is called the extent of the array in that dimension.

 
