
Getting Started Guide for the Intel® Speech Enabling Developer Kit


Designed to enable commercial Smart Home device manufacturers to create new experiences with speech recognition capabilities using the Amazon Alexa* Voice Service on Intel® silicon, the Intel® Speech Enabling Developer Kit can be used to create a diverse ecosystem of Smart Home devices with Alexa. Categories of these devices include wireless speakers, entertainment devices, and smart appliances. The kit includes a dual DSP with inference engine, an 8-mic circular array, and technology for Alexa wake word recognition, beam forming, noise reduction, and acoustic echo cancellation.

Follow along with the chapters of this getting started guide to get your developer kit up and running. We walk you through hardware assembly and setup as well as firmware and software setup. You will set up a Raspberry Pi* board as the speech analytic engine host system running the Raspbian* operating system (OS).

There are two phases to set up the Intel® Speech Enabling Developer Kit:

  1. Hardware setup
  2. Operating system setup and software downloads

If you have already completed the hardware setup instructions included in the box, you can skip to Step 4: Connect Cable to Raspberry Pi.

Phase 1: Hardware Setup

The hardware components in the Intel® Speech Enabling Developer Kit include:

  • Dual DSP with inference engine
  • Digital microphone (DMIC) board with eight microphones
  • Raspberry Pi* connector cable 
  • 6mm screws (x3)
  • Washers (x6)
  • 40mm female-female standoffs (x3)

Additional hardware is needed but not included in the kit, such as a Raspberry Pi 3 board, a microSD card, a power supply, a USB keyboard and mouse, an HDMI monitor, an Ethernet cable, and speakers (see the setup steps below).

You can also purchase a Raspberry Pi starter kit, like the Vilros Raspberry Pi 3 Kit, which contains a board, case, power supply and heat sink. 

Contents of Kit

Note: The Raspberry Pi Connector cable may be shorter than the one pictured.


DSP Board Top


DSP Board Bottom



DMIC Board Top


Raspberry Pi connector cable

Assembly

Step 1: Screw and Washer Placement

Start with the DSP board with bottom side facing up. Place one washer on the first screw and insert in the screw slot. Then place another washer on the screw. Attach one of the female-female standoffs to the screw. Repeat this process with the remaining screws, washers and standoffs.

Note: Be careful not to overtighten the screws.

Step 2: Invert Setup for DMIC Board Placement

Turn the setup over so that it is standing on the 40mm female-female standoffs. Place the DMIC board into the connectors on the DSP board.

Note: Ensure that Intel's logos face the same direction on the DSP board and DMIC board and that the three large holes are aligned. All pins should line up and the boards should sit directly over one another. If the alignment is off, even by one pin, the kit will not work.

 

Step 3: Raspberry Pi Connector

Invert the setup and insert the Raspberry Pi connector cable on the board.

Step 4: Connect Cable to Raspberry Pi

Step 5: Connect Power to DSP Board

Note: The LEDs will turn blue to indicate the power is connected correctly.

Step 6: Connect speakers

Connect the speakers to the 3.5mm audio jack on the DSP board.

Completed cable setup

This image shows the correct cable setup for connecting the needed peripherals to the Raspberry Pi and the DSP board.

Phase 2: Operating System Setup and Software Downloads

Now that you have the hardware set up, you are ready to set up the Raspberry Pi as the host system for the developer kit. To do this, you’ll flash the Raspbian* OS onto the Raspberry Pi board. This is done in two steps:  

  • First you will need to download the latest version of Raspbian onto your computer and then program the microSD card, using your laptop or PC. 
  • Second, you will move the microSD card to the Raspberry Pi and boot the OS. 

Raspbian* OS Setup

For the OS setup, make sure you have a USB keyboard and mouse connected to the Raspberry Pi’s USB ports. Make sure that the Raspberry Pi’s HDMI output is connected to an HDMI monitor. 

To download and install the OS and additional required packages, you will need to establish an internet connection to both your laptop or PC and to the Raspberry Pi.  

  • Make sure your laptop or PC has an internet connection. 
  • Connect one end of your Ethernet cable to Raspberry Pi’s Ethernet port and the other end of the Ethernet cable to an Ethernet switch/hub or a router that provides a direct internet connection (i.e. no proxy servers to get access to internet). 
  1. On your laptop or PC, install the base Raspberry Pi OS image: https://downloads.raspberrypi.org/raspbian_latest. For more information on installing the OS images, please visit: https://www.raspberrypi.org/documentation/installation/installing-images/README.md 
  2. Load the OS image from the laptop or PC to the micro SD card using a card reader and dedicated software, like Win32DiskImager for Windows*, dd or Etcher for Linux* or Mac. You will need to use a micro SD to SD card adapter or a micro SD to USB adapter depending on what ports are available on the PC/Mac.
  3. Once the image is written to the microSD card, remove the card from the laptop or PC and insert it into the Raspberry Pi. The microSD card slot is located on the bottom of the Raspberry Pi 3 board. 
  4. Plug the USB cable connected to a wall adapter into the micro USB port on Raspberry Pi and power on. Do not plug the USB cable powering the Raspberry Pi into your laptop or PC.

Change Keyboard Layout (Optional)

The default keyboard layout in the Raspbian image is UK English. If you are using a keyboard with a different layout, such as US English, follow these steps to update the keyboard configuration.

Open a shell prompt (or command line) and type:
sudo raspi-config

  1. Select Internationalization.
  2. Select Keyboard Setup.
  3. Choose your preferred layout.
  4. Choose your preferred language, e.g. English (US) for American English.

Raspbian OS Preparations

Open a shell prompt (or command line) and type:

sudo raspi-config
  1. Select Interfacing Options. 
  2. Select SPI.
  3. Enable SPI interface.
  4. Allow the SPI kernel module to be loaded automatically.
  5. Select Finish and press Enter.
  6. Select Yes to reboot or manually reboot in shell:
sudo reboot

Security

For security purposes, it is strongly advised that you change the default password for default user pi and set the password for the root user. You can learn more about updating the default password here.

Note: The remaining tasks will be completed on the Raspberry Pi browser. If you can see the Raspberry Pi desktop, you should use it for all remaining steps. 

Get an Amazon Developer Account 

Create an Amazon developer account here: https://developer.amazon.com/. Click Sign In and select Create your Amazon Developer account.
Create a device profile following these instructions: https://github.com/alexa/alexa-avs-sample-app/wiki/Create-Security-Profile  

Note: You might notice slight differences between the user interface and the instructions posted on GitHub. Please continue to follow the instructions as stated on GitHub. 

Install and configure AVS device SDK 

There are two options to do this:

  1. Download the sources and dependent libraries, then build the kernel, driver, and SDK from the sources. This option (known as “from scratch”) can take 3 to 4 hours to complete.
  2. Download pre-built kernel, driver, SDK and dependent libraries and install them. This option (known as “use prebuilt”) takes about 30 minutes to complete.

TIP: You may find it easier to open the Getting Started Guide on the Raspberry Pi so that you can copy and paste long commands: https://software.intel.com/en-us/articles/getting-started-guide-for-the-intel-speech-enabling-developer-kit.

  1. Open a terminal window on the Raspberry Pi. 
  2. Execute the following commands to download the installation script:
    cd ~
    wget https://raw.githubusercontent.com/intel-iot-devkit/avs-device-sdk-intel-speech-enabling-kit/master/install_avs_sdk.sh
  3. Execute one of the following commands, depending on the installation option you choose:
    Build from scratch (3 to 4 hours):
    sudo bash ./install_avs_sdk.sh --from-scratch
    Use prebuilt components (30 mins):
    sudo bash ./install_avs_sdk.sh --use-prebuilt
  4. The script will prompt you for the device credentials. Enter the device credentials you created in your Amazon developer portal. Press Enter on the keyboard after every entry.
    TIP: Log in to the Amazon developer portal using your developer account in a browser on the Raspberry Pi desktop and keep the page for the device/product you created open. You can then copy and paste the credentials from the browser window into the terminal script.
  5. Once the credentials are entered, the script will continue with the installation option you chose. Upon completion of installation, the script will launch the Raspberry Pi’s browser with the URL (https://localhost:3000).
  6. Press Continue or log in to your Amazon Developer account as prompted by the website. Wait for the script to complete.
  7. If you selected the “from scratch” option, you may optionally run unit tests to make sure all the components work correctly. If you did not select the “from scratch” option, you should skip this step.
    In terminal window, run the following steps to execute the series of unit tests on your prototype:
    cd /home/pi/sdk-folder/sdk-build
    make all tests
    You should see and hear your prototype run through a series of audio tests as well as functional tests. These tests should take around 4 minutes and result in 571 tests complete with 100% success. As a developer, any time you modify your client’s on-device SDK software, you should run these Unit Tests to ensure nothing was unintentionally broken.
  8. Now you’re ready to launch your client! Start the AVS Sample App with the following commands:
    cd sdk-folder/sdk-build/SampleApp/src/
    TZ=UTC ./SampleApp ../../Integration/AlexaClientSDKConfig.json
  9. Now you can interact with the client using voice commands. Try giving it some commands, such as:
  • “Alexa, what time is it?”
  • “Alexa, what is 1+1?”
  • “Alexa, what’s the weather in New York City today?”
  • “Alexa, do you know rap?”
You should now have a working Alexa prototype!
To exit the Sample App, simply press CTRL-C, or enter “q” into the terminal window and press Enter.
 
If the device is not responding, or if you have any technical questions or issues, please contact Intel Customer Support.
 

Motion Heatmap Using OpenCV in Python*


This sample application is useful for visualizing movement patterns over time. For example, it could be used to see the usage of entrances to a factory floor over time, or the patterns of shoppers in a store. A minimal sketch of the pipeline appears after the materials list below.

What You’ll Learn

  • Background subtraction
  • Application of a threshold
  • Accumulation of changed pixels over time
  • Add a color/heat map

Gather Your Materials

  • Python* 2.7 or greater
  • OpenCV version 3.3.0 or greater
  • The vtest.avi video from https://github.com/opencv/opencv/blob/master/samples/data/vtest.avi
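The following is a minimal sketch of the pipeline listed above, assuming OpenCV 3.x with Python bindings and vtest.avi in the working directory; the variable names and output file are illustrative and are not taken from the GitHub sample itself.

import cv2
import numpy as np

capture = cv2.VideoCapture('vtest.avi')            # the test video from the materials list
subtractor = cv2.createBackgroundSubtractorMOG2()  # background subtraction
accumulated = None                                 # running count of changed pixels
last_frame = None

while True:
    ok, frame = capture.read()
    if not ok:
        break
    last_frame = frame
    mask = subtractor.apply(frame)                           # foreground (motion) mask
    _, mask = cv2.threshold(mask, 2, 1, cv2.THRESH_BINARY)   # application of a threshold
    if accumulated is None:
        accumulated = np.zeros(mask.shape, dtype=np.float32)
    accumulated += mask                                      # accumulation of changed pixels over time

# normalize the accumulation, apply a color/heat map, and overlay it on the last frame
normalized = cv2.normalize(accumulated, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
heatmap = cv2.applyColorMap(normalized, cv2.COLORMAP_HOT)
overlay = cv2.addWeighted(last_frame, 0.7, heatmap, 0.3, 0)
cv2.imwrite('motion_heatmap.png', overlay)
capture.release()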

This article continues on GitHub

Use TensorFlow* for Deep Learning Training & Testing on a Single-Node Intel® Xeon® Scalable Processor


Introduction

This document provides step-by-step instructions on how to train and test a deep learning model on a single-node Intel® Xeon® Scalable processor platform, using the TensorFlow* framework with the CIFAR-10 image recognition dataset. It provides beginner-level instructions, and both training and inference are performed on the same system. The steps have been verified on Intel Xeon Scalable processors, but should work on any recent Intel Xeon processor-based system. None of the software pieces used in this document were performance optimized. This document is a follow-on to the article Deep Learning Training and Testing on a Single-Node Intel® Xeon® Scalable Processor System Using Intel® Optimized Caffe*.

This document is targeted toward a beginner-level audience who want to learn how to proceed with training and testing a deep learning dataset using TensorFlow framework once they have Intel Xeon CPU-based hardware. The document assumes that the reader has basic Linux* knowledge and is familiar with the concepts of deep learning training. The instructions can be confidently used as they are, or can be the foundation for enhancements and/or modifications.

There are various ways to install TensorFlow. You can install it using binary packages or from GitHub* sources. This document describes one of the ways that was successfully deployed and tested on a single Intel Xeon Scalable processor system running CentOS* 7.3. Some other installation methods can be found in references 2 and 18. The goal of this document is not to give an elaborate description of how to reach state-of-the-art performance; rather, it's to dip a toe into TensorFlow and run a simple train and test using the CIFAR-10 dataset on a single-node Intel Xeon Scalable processor system.

This document is divided into six major sections including the introduction. Section II details hardware and software bill of materials used to implement and verify the training. Section III covers installing CentOS Linux as the base operating system. Section IV covers the details on installing and deploying TensorFlow using one of the many ways to install it. Sections V and VI enlist the steps needed to train and test the model with the CIFAR-10 dataset.

The hardware and software bill of materials used for the verified implementation is listed in Section II. Users can try a different configuration, but the configuration in Section II is recommended. Intel® Parallel Studio XE Cluster Edition is an optional installation for a single-node implementation. It provides you with most of the basic tools and libraries in one package. Starting with Intel Parallel Studio XE Cluster Edition from the beginning shortens the learning curve needed for a multi-node implementation of the same training and testing, as this software is instrumental in a multi-node deep learning implementation.

Hardware and Software Bill of Materials

Item | Manufacturer | Model/Version

Hardware

Intel® Server Chassis | Intel | R1208WT
Intel® Server Board | Intel | S2600WT
(2x) Intel® Xeon® Scalable processor | Intel | Intel Xeon Gold 6148 processor
(6x) 32 GB LRDIMM DDR4 | Crucial* | CT32G4LFD4266
(1x) Intel® SSD 1.2 TB | Intel | S3520

Software

CentOS Linux* Installation DVD | | 7.3.1611
Intel® Parallel Studio XE Cluster Edition | | 2017.4
TensorFlow* | | setuptools-36.7.2-py2.py3-none-any.whl

Installing the Linux* Operating System

This section requires the following software component: CentOS-7-x86_64-*1611.iso. The software can be downloaded from the CentOS website.

The DVD ISO was used for implementing and verifying the steps in this document, but the reader can use the Everything ISO or Minimal ISO, if preferred.

  1. Insert the CentOS 7.3.1611 install disc/USB. Boot from the drive and select Install CentOS 7.
  2. Select Date and Time.
  3. If necessary, select Installation Destination.
    1. Select the automatic partitioning option.
    2. Click Done to return home. Accept all defaults for the partitioning wizard if prompted.
  4. Select Network and host name.
    1. Enter “<hostname>” as the hostname.
      1. Click Apply for the hostname to take effect.
    2. Select Ethernet enp3s0f3 and click Configure to set up the external interface.
      1. From the General section, check Automatically connect to this network when it’s available.
      2. Configure the external interface as necessary. Save and Exit.
    3. Select the toggle to ON for the interface.
    4. Click Done to return home.
  5. Select Software Selection.
    1. In the box labeled Base Environment on the left side, select Infrastructure server.
    2. Click Done to return home.
  6. Wait until the Begin Installation button is available, which may take several minutes. Then click it to continue.
  7. While waiting for the installation to finish, set the root password.
  8. Click Reboot when the installation is complete.
  9. Boot from the primary device.
  10. Log in as root.

Configure YUM*

If the public network implements a proxy server for Internet access, Yellowdog Updater Modified* (YUM*) must be configured in order to use it.

  1. Open the /etc/yum.conf file for editing.
  2. Under the main section, append the following line:
    proxy=http://<address>:<port>
    Where <address> is the address of the proxy server and <port> is the HTTP port.
  3. Save the file and Exit.

Disable updates and extras. Certain procedures in this document require packages to be built against the kernel. A future kernel update may break the compatibility of these built packages with the new kernel, so we disable repository updates and extras to provide further longevity to this document.

This document may not be used as is when CentOS updates to the next version. To use this document after such an update, it is necessary to redefine repository paths to point to CentOS 7.3 in the CentOS vault. To disable repository updates and extras:

yum-config-manager --disable updates --disable extras

Install EPEL

Extra Packages for Enterprise Linux (EPEL) provides 100 percent free, high-quality add-on software packages for Linux distributions. To install EPEL (use the latest version for all packages):

yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

Install GNU* C Compiler

Check whether the GNU Compiler Collection* is installed. It should be part of the development tools install. You can verify the installation by typing:

gcc --version or whereis gcc

Install TensorFlow* Using virtualenv (see Reference 18)

  1. Update to the latest distribution of EPEL:
    yum -y install epel-release
  2. To install TensorFlow, you must have the following dependencies installed10:
    1. NumPy*: a numerical processing package that TensorFlow requires.
    2. Devel*: this enables adding extensions to Python*.
    3. Pip*: this enables installing and managing certain Python packages.
    4. Wheel*: this enables managing Python compressed packages in wheel format (.whl).
    5. Atlas*: Automatically Tuned Linear Algebra Software.
    6. Libffi*: a library that provides a Foreign Function Interface (FFI), allowing code written in one language to call code written in another language. It provides a portable, high-level programming interface to various calling conventions11.
  3. Install dependencies:
    sudo yum -y install gcc gcc-c++ python-pip python-devel atlas atlas-devel gcc-gfortran openssl-devel libffi-devel python-numpy
  4. Install virtualenv.
    There are various ways to install TensorFlow18. In this document we will use virtualenv. Virtualenv is a tool to create isolated Python environments16:
    pip install --upgrade virtualenv
  5. Create a virtualenv in your target directory:
    virtualenv --system-site-packages <targetDirectory>  
    Example: virtualenv --system-site-packages tensorflow
  6. Activate your virtualenv18:
    source ~/<targetDirectory>/bin/activate 
    Example: source ~/tensorflow/bin/activate
  7. Upgrade your packages, if needed:
    pip install --upgrade numpy scipy wheel cryptography
  8. Install the latest version of the Python compressed TensorFlow package:
    pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
    (This specific wheel was the one that worked in this setup; the other versions did not.) Or simply run:
    pip install --upgrade tensorflow

Screenshot of code or command prompt
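With the virtualenv still activated, a quick sanity check like the one below (illustrative, not part of the official instructions) confirms that TensorFlow imports and can run a trivial graph; tf.Session is the session-style API used by the TensorFlow 0.x/1.x wheels referenced above.

import tensorflow as tf

print(tf.__version__)                     # report the installed TensorFlow version

# build and run a trivial graph using the TensorFlow 0.x/1.x session API
hello = tf.constant('TensorFlow is installed')
with tf.Session() as sess:
    print(sess.run(hello))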

Train a Convolutional Neural Network (CNN) using a CIFAR-10 dataset3

  1. Download the CIFAR-10 training data into the /tmp/ directory:
    Download the CIFAR-10 Python version from references 4 and 8: https://www.cs.toronto.edu/~kriz/cifar.html
  2. Unzip the tar file in the /tmp/ area as the python script (cifar10_train.py) looks for data in this directory:
    tar -zxf <dir>/cifar-10-python.tar.gz
  3. Change directory to tensorflow:
    cd tensorflow
  4. Make a new directory:
    mkdir git_tensorflow
  5. Change directory to the one created in last step:
    cd git_tensorflow
  6. Get a clone of the tensorflow repository from GitHub9:
    git clone https://github.com/tensorflow/tensorflow.git
  7. If the Models folder is missing from the tensorflow/tensorflow directory, you can clone the models repository from reference 9
    (https://github.com/tensorflow/models.git):
    cd tensorflow/tensorflow
    git clone https://github.com/tensorflow/models.git
  8. Upgrade TensorFlow to the latest version or you might see errors when training your model:
    pip install --upgrade tensorflow
  9. Change directory to CIFAR-10 dir to get the training and evaluation Python scripts14:
    cd models/tutorials/image/cifar10
  10. Before running the training code, you are advised to check the cifar10_train.py code and change steps from 100K to 60K if needed, as well as logging frequency from 10 to whatever you prefer.
    For this document, tests were done for both 100K steps and 60K steps, for a batch size of 128, and logging frequency of 10.

  11. Now, run the training Python script to train your network:
    python cifar10_train.py

This will take a few minutes and you will see something like the image below:

Screenshot of code or command prompt

Testing Script and Dataset Terminology:

In the neural network terminology:

  • One epoch = one forward pass and one backward pass of all the training examples.
  • Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need. TensorFlow (TF) pushes all of those through one forward pass (in parallel) and follows with a back-propagation on the same set. This is one iteration, or step.
  • Number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass equals one forward pass plus one backward pass (we do not count the forward pass and backward pass as two different passes).
  • The steps parameter tells TF to run X of these iterations to train the model.

Example: if you have 1000 training examples, and your batch size is 500, then it will take two iterations to complete one epoch.

To learn more about the differences between epochs, batch size, and iterations, see reference 15.

In the cifar10_train.py script:

  • Batch size is set to 128. It is the number of images to process in a batch.
  • max_steps is set to 100,000. It is the number of iterations across all epochs. In the GitHub code there is a typo; instead of 100K, the number shows 1000K. Please update it before running.
  • The CIFAR-10 binary dataset in reference 4 has 60,000 images: 50,000 images to train and 10,000 images to test. The batch size is 128, so the number of batches needed to train for one epoch is 50,000/128 ~ 391 batches.
  • cifar10_train.py uses 256 epochs, so the number of iterations for all the epochs is ~391*256 ~ 100K iterations, or steps (see the short calculation below).
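The arithmetic above can be reproduced in a few lines of Python; the numbers mirror the CIFAR-10 values described for cifar10_train.py and are for illustration only.

train_images = 50000        # CIFAR-10 training images
batch_size = 128            # batch size used by cifar10_train.py
epochs = 256                # epochs assumed above

iterations_per_epoch = train_images / float(batch_size)   # ~390.6 batches per epoch
total_steps = iterations_per_epoch * epochs                # ~100,000 iterations (steps)

print('iterations per epoch: ~%d' % round(iterations_per_epoch))
print('total steps for %d epochs: ~%dK' % (epochs, round(total_steps / 1000.0)))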

Evaluate the Model

To evaluate how well the trained model performs on a hold-out data set, we will be using the cifar10_eval.py script8:

python cifar10_eval.py

Once the expected accuracy is reached, you will see precision@1 = 0.862 printed on your screen when you run the above command. The evaluation can be run while the training script above is still running and nearing its final number of steps, or after the training script has finished.
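precision@1 here is simply the fraction of test images whose top-1 predicted class matches the true label; with the 10,000-image CIFAR-10 test set, a result of 0.862 corresponds to roughly 8,620 correctly classified images (illustrative arithmetic only):

test_images = 10000          # CIFAR-10 test set size
correct_top1 = 8620          # example count consistent with precision@1 = 0.862

print('precision@1 = %.3f' % (correct_top1 / float(test_images)))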

Screenshot of code or command prompt

Sample Results

Notice that the cifar10_train.py script shows the following results:

Screenshot of code or command prompt

I added a similar-looking result below that was achieved with the system described in Section II of this document. Please be advised that these numbers are only for educational purposes and no specific CPU optimizations were done.

System | Step Time (sec/batch) | Accuracy

2S Intel® Xeon® Gold processors | ~0.105 | 85.8% at 60K steps (~2 hrs)
2S Intel® Xeon® Gold processors | ~0.109 | 86.2% at 100K steps (~3 hrs)

Once you have finished training and testing with your CIFAR-10 dataset, the same Models directory contains examples for the MNIST* and AlexNet* benchmarks. It could be educational to go into the MNIST and AlexNet directories and try running the Python scripts there to see the results.

References:

  1. Install TensorFlow on CentOS7, https://gist.github.com/thoolihan/28679cd8156744a62f88
  2. Installing TensorFlow on Ubuntu*, https://www.tensorflow.org/install/install_linux
  3. Install TensorFlow on CentOS7, http://www.cnblogs.com/ahauzyy/p/4957520.html
  4. The CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html
  5. Tensorflow, MNIST and your own handwritten digits, http://opensourc.es/blog/tensorflow-mnist
  6. TensorFlow Tutorial, https://github.com/Hvass-Labs/TensorFlow-Tutorials
  7. Tutorial on CNN on TensorFlow, https://www.tensorflow.org/tutorials/deep_cnn
  8. CIFAR-10 Details, https://www.tensorflow.org/tutorials/deep_cnn
  9. TensorFlow Models, https://github.com/tensorflow/models.git
  10. Installing TensorFlow from Sources, https://www.tensorflow.org/install/install_sources
  11. Libffi, https://sourceware.org/libffi/
  12. Performance Guide for TensorFlow, https://www.tensorflow.org/performance/performance_guide#optimizing_for_cpu
  13. What is batch size in neural network? https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
  14. Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
  15. Epoch vs Batch Size vs Iterations, https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
  16. Virtualenv, https://virtualenv.pypa.io/en/stable/
  17. CPU Optimizations: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
  18. Download and Setup, https://www.tensorflow.org/versions/r0.12/get_started/os_setup

Roofline with Callstacks

Note: Some screenshots in this article show orange dots. This is not a default setting, and these dots would be red or yellow by default. The orange category between red and yellow was added through the customization menu.

The Roofline with Callstacks is an extension of the existing Cache-Aware Roofline feature in Intel® Advisor. This feature was officially introduced in 2018 Update 1, though it was available as a preview under the name Hierarchical Roofline in the initial 2018 release. The name has since been changed to avoid confusion with a different feature with a similar name.

The mechanical difference between the original Cache-Aware Roofline and the Roofline with Callstacks is the treatment of self data vs total data. Self data is data (memory accesses, FLOPs, and duration) related only to the loop or function itself, excluding data originating in other functions or loops called by it. Total data includes data from called functions and loops in addition to that originating in the outer function or loop.

for (int x = 0; x < X_MAX; x++)
{
 /* These three lines count toward both the self and total
  * data for the outer loop. */
 bar[x * 2] = x * 42.0 + 7.0;
 bar[(x * 2) - 1] = (x - 1.0) * 42.0 + 7.0;
 bar[x] = 23.9 * 83.9 / 31.2;
 /* Operations performed in foobar() and the inner loop
  * only count toward the total data for the outer loop. */
 foobar();
 for (int y = 0; y < Y_MAX; y++)
 {
  foo[y*2] = y * 42.0 + 7.0;
 }
}

In Advisor’s original Cache-Aware Roofline, only self data was shown, and relationships between dots were not indicated. The Roofline with Callstacks makes use of total data in addition to the self data, and provides both a navigable callstack and visual indicators to allow easy identification of related dots.
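The self/total relationship can be illustrated with a small sketch outside of Advisor: each node in a call tree records its own (self) data, and its total data is the self data plus the totals of everything it calls. This is conceptual Python, not Advisor's implementation, and the timing numbers are made up.

class Node:
    def __init__(self, name, self_seconds, children=None):
        self.name = name
        self.self_seconds = self_seconds      # time spent in this loop/function itself
        self.children = children or []        # loops/functions it calls directly

    def total_seconds(self):
        # total data = self data + total data of every callee
        return self.self_seconds + sum(c.total_seconds() for c in self.children)

# mirrors the C++ example above: an outer loop that calls foobar() and an inner loop
inner_loop = Node('inner loop', 0.8)
foobar     = Node('foobar()',   1.5)
outer_loop = Node('outer loop', 2.0, [foobar, inner_loop])

print(outer_loop.self_seconds)      # 2.0 -> what the original Cache-Aware Roofline plotted
print(outer_loop.total_seconds())   # 4.3 -> what a collapsed dot in Roofline with Callstacks represents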

Use Cases

A simple roofline chart, displayed without callstacks on.
Animation of a simple roofline with callstacks turned on, with a dot being collapsed.

The ability to use total data gives the Roofline with Callstacks a means of adjusting the granularity of the data. In Advisor’s original implementation of the Roofline, it would have been impossible to get a sense of where the outer loop of the above example, as a whole, would stand. Only the inner loop, the self-data of the outer loop, and foobar() – which does not call anything else – would have been shown.

The original Cache-aware Roofline provided insights into which loops were worth the effort to optimize, in and of themselves, but a cluster of small dots all called from the same location may have evaded notice.

A large roofline node with call arrows leading to several small nodes.

The Roofline with Callstacks allows simplification of a dot-heavy Roofline. Several small loops may share an origination point, and it can be beneficial to collapse them into one overall representation of the dot cluster rather than trying to read a chart with dozens of small dots.

By revealing their shared origination point, it allows a developer to investigate the source of the loops rather than just the loops themselves, potentially uncovering a design inefficiency higher up the call chain which could be the root cause of the smaller loops’ poor performance.

The Roofline with Callstacks is also extremely useful for getting a more accurate view of functions or loops that behave differently when called under different circumstances. Where the original Cache-aware Roofline rendered one dot per function, the inclusion of calling information differentiates instances of a function or loop that have different callstacks, thus allowing their behavior and traits to be analyzed separately.

Running the Analysis

Note: For users with the 2018 initial release, the preview feature will need to be enabled by setting the environment variable ADVIXE_EXPERIMENTAL=roofline_ex before starting Advisor to collect and/or view Roofline with Callstack data. For users with 2018 update 1 or later, this setup step is not necessary.

In the GUI

As with the standard roofline, the analysis can be run using either the Run Roofline shortcut button or by running a Survey analysis followed by a Trip Counts analysis with FLOPs (the Trip Counts themselves are optional). In either case, the “Enable Roofline with Callstacks” checkbox under the Run Roofline shortcut button must be checked.

The "Enable Roofline with Callstacks" checkbox is functionally identical to the “Collect stacks” checkbox in the Trip Counts and FLOPs Analysis tab of the Project Properties, and toggling one checkbox will toggle the other.

When viewing the results, be sure to check the “Show Roofline with Callstacks” checkbox in the upper right corner, next to the roofline options. If this checkbox is not visible, widen your roofline display until it appears.

On the Command Line

To collect Roofline with Callstacks information on the command line in 2018 update 1, simply use:

advixe-cl -collect roofline -stacks -project-dir MyResults -- MyExecutable

If you prefer to collect the survey and trip counts analyses separately, or wish to add FLOPS data to a standard survey report you’ve already been working with, use the flop and stacks flags in the command line:

advixe-cl -collect survey -project-dir MyResults -- MyExecutable
advixe-cl -collect tripcounts -flop -stacks -project-dir MyResults -- MyExecutable 

Collection of actual trip counts data is completely optional, as it's the FLOPs that are important.

Note: Neither the roofline collection type nor the flags given above exist in the 2018 initial release. For this version, check that your environment variable is set, then run a standard survey collection followed by a tripcounts collection with the following flags:

advixe-cl -collect survey -project-dir MyResults -- MyExecutable
advixe-cl -collect tripcounts -flops-and-masks -callstack-flops -project-dir MyResults -- MyExecutable

These flags are deprecated in update 1, in favor of the -flop and -stacks flags.

Reading the Chart

The Roofline with Callstacks adds several more symbols and visual indicators to the Roofline chart. For information regarding the basics of reading a Roofline in general, see the Intel® Advisor Roofline article.

B, the selected loop in this Roofline with Callstacks, is called by A, indicated by the blue line, and calls C and D, indicated by the black arrows.

One of the most obvious new features is the caller/callee arrows. Upon selection of any given dot on the chart, its direct caller and callees will be indicated by these lines. In the image to the left, the orange dot B is selected, and has a blue line with a backwards arrowhead indicating that the gray dot A is the caller, while black lines with normal arrowheads at the end indicate that the selected loop directly calls the yellow loop C and the gray loop D.

In this image of the Roofline with Callstacks, the yellow dot is selected. The blue line indicates that the orange dot on the right calls it. You can see both of these dots in the callstack pane on the right.

The other most obvious feature is the Callstack display on the right. This list displays the entire call chain for the selected loop (but excludes its callees). As you can see in the image to the right, each level of the stack has a dot next to it, which displays the current color of that dot on the chart.

Clicking an entry in this list will cause the corresponding dot to flash on the Roofline chart, for easy identification.

A more detailed call tree can be found in the lower pane, under the Top Down tab. Rather than a single call chain, this displays the whole tree. Branches can be expanded and collapsed, allowing you to find less directly connected nodes.

Similarly, the dots in the Roofline chart can also be expanded and collapsed with their own plus and minus buttons. As with the call tree, everything under a given node, even if it’s not a direct callee, will be hidden when that node is collapsed.

Collapsing or expanding dots switches whether the dot’s display is based on the self data or the total data. Loops and functions that have no self data will simply be grayed out when expanded and in color when collapsed. Nodes that do have self data display at the coordinates, size, and color appropriate to that data when expanded, but have a gray halo of the size associated with their total time. When these loops are collapsed, they will change to the size and color appropriate to their total time, and if applicable, will move to reflect the total performance and total AI.

In this animation, two dots in a roofline with callstacks are collapsed and uncollapsed. One has no self-data, while the other does. Together they display the possible behaviors of collapsed and uncollapsed loops.

The grey information box on the left in the above animation is not present in Advisor, and was edited into the image for additional clarity. However, the information it displays can be found within Advisor in the Code Analytics tab in the lower pane, located next to the Top Down tab. The Code Analytics tab contains a variety of helpful metrics in several collapsible sections. Information on both self and total AI, memory accesses, elapsed time, and FLOPs can be found in the FLOPS collapsible.

As a final note, the cross icon in the Roofline chart represents the application as a whole, being placed at the coordinates of the Total AI and Total GFLOPS of the entire program. As the root node of the chart has the entire application in its call tree, its total data is that of the entire program. Thus, the root node will always collapse to the location of the cross.

CPU Performance Optimization & Differentiation for VR Applications Using Unreal Engine* 4


By: Wenliang Wang

Virtual reality (VR) can bring an unprecedented sense of user immersion, but at the same time, due to characteristics such as binocular rendering, low latency, high resolution, and forced vertical synchronization (vsync), VR puts great pressure on CPU render threads, logic threads, and graphics processing unit (GPU) computation1. How to effectively analyze VR application performance bottlenecks and optimize CPU threading to improve the degree of parallelization of worker threads, thereby reducing GPU wait time and improving utilization, is key to determining whether a VR application runs smoothly, is free of dizziness, and is immersive.

The Unreal Engine* 4 (UE*4) is one of two major game engines currently used by VR developers. Understanding the CPU thread structure and associated optimization tools of UE4 can help in developing better UE4-based VR applications. This paper covers the CPU performance analysis and debugging instructions, thread structure, optimization methods, and tools on UE4. It also covers how to make full use of idle computing resources of the CPU core to enhance performance of VR content, and provide corresponding performance of audio and visual content based on the different configurations of the various game players. The goal is to make a game that has the best immersive VR experience.

Why Optimize the PC VR Game

Asynchronous timewarp (ATW), asynchronous spacewarp (ASW), and asynchronous reprojection are technologies provided by the VR runtime that can generate a composite frame when a frame drop occurs in a VR application, by inserting frames; this is equivalent to reducing the delay. However, these are not perfect solutions, and each has different limitations: ATW and asynchronous reprojection can compensate for the motion-to-photon (MTP) delay generated by rotational movement, but if the head position moves or there are moving objects on the screen, even with ATW and asynchronous reprojection the MTP delay cannot be reduced. In addition, ATW and asynchronous reprojection need to be inserted between the draw calls of a GPU. When a draw call is too long (for example, post-processing) or the time left for ATW and asynchronous reprojection is insufficient, the frame insertion will fail. ASW locks the frame rate at 45 frames per second (fps) when rendering cannot keep up, giving each frame 22.2 milliseconds (ms) to render, and inserts a composite frame between two rendered frames using traditional image motion estimation, as shown in Figure 1.


Figure 1:  ASW interpolation effect.

In a synthesized frame, rapid movement or transparent parts of the frame produce deformation (for example, the areas within the red circles in Figure 1); violent illumination changes are also prone to estimation errors. When consecutive frames are inserted using ASW, picture shaking can easily be felt by users. These VR runtime technologies are not good solutions to the problem of frequent frame drops. Developers should ensure that VR applications can run stably at 90 fps in most cases, and rely on the above methods only to cover occasional frame drops.

Introduction to Unreal Engine* 4 Performance Debugging Instructions

Applications developed with UE4 can query various real-time performance data via stat commands in the console2-3. The stat unit instruction allows you to see the total frame rendering time (Frame), rendering thread consumption time (Draw), logical thread consumption time (Game), and GPU consumption time (GPU), from which you can see which part is restricting the frame rendering time, as shown in Figure 2. Combined with the show or showflag instructions, individual features can be dynamically switched on and off to observe the impact of each feature on rendering time and find the factors that impact performance; during this process, the pause command can be executed to suspend the logical thread while observing the result.

It should be noted that the GPU consumption time includes both GPU work time and GPU idle time, so even if it shows that the GPU spent the longest time in the stat unit, it does not necessarily mean that the problem is on the GPU. It is possible that a CPU bottleneck could cause the GPU to be in an idle state most of the time, and extends the time it takes for the GPU to complete a frame rendering. So, there is a need to combine other tools, such as GPUView* 4, to analyze the CPU and GPU time chart from which to locate the bottleneck position.

screenshot of frame statistics
Figure 2:  Stat unit statistics.

In addition, because VR runs with forced vertical synchronization, as soon as the frame render time exceeds 11.1 ms, even by only 0.1 ms, the frame takes two full vertical synchronization cycles to complete. As a consequence, even a slight scene change can easily slow down the performance of a VR application. For a better result, use the -emulatestereo command with the resolution set to 2160 x 1200 and the screen percentage ratio (screenpercentage) set to 140, which can be used to analyze performance without a VR headset attached and with vertical synchronization disabled.
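Because vertical synchronization is forced, the displayed frame time is effectively quantized to whole multiples of the 11.1 ms refresh interval. The following short Python sketch (illustrative only, assuming a 90 Hz headset) shows how exceeding the budget by even 0.1 ms doubles the displayed frame time:

import math

REFRESH_HZ = 90.0
VSYNC_MS = 1000.0 / REFRESH_HZ             # ~11.1 ms refresh interval

def displayed_frame_time_ms(render_ms):
    # with forced vsync, a frame occupies a whole number of refresh intervals
    intervals = max(1, int(math.ceil(render_ms / VSYNC_MS)))
    return intervals * VSYNC_MS

for render_ms in (9.0, 11.2, 14.0):
    shown = displayed_frame_time_ms(render_ms)
    print('%.1f ms of work -> %.1f ms displayed (%.0f fps)'
          % (render_ms, shown, 1000.0 / shown))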

The performance data associated with the rendering thread can be seen through stat scenerendering, including the number of draw calls, visibility culling time, light processing time, and so on. For visibility culling, the stat initviews instruction can be used to further understand and analyze the processing time of each part, including frustum culling, precomputed visibility culling, and dynamic occlusion culling.

To judge the efficiency of each culling pass, enter the stat sceneupdate command to see the time it takes to update the world scene, including adding, updating, and removing lights. In addition, you can write the frame rendering information into the log by using the stat dumphitches instruction whenever the render time of a frame exceeds t.HitchThreshold.

To match game effects to different PC performance levels, stat physics, stat anim, and stat particles are frequently used instructions related to CPU performance, corresponding to physics computation time (cloth simulation, destruction effects, and so on), skinned mesh computation time, and CPU particle computation time. Because these computations can be assigned to different worker threads for parallel processing in UE4, they can be scaled accordingly so that the VR application effectively adapts to different levels of hardware. As a result, the VR immersive experience and overall performance can be enhanced as the number of CPU cores increases.

In addition, you can enter the console commands stat startfile and stat stopfile to collect real-time running data for a designated time period, and then use the Stats Viewer in the UE4 session frontend to view CPU thread utilization and the call stack, find the CPU hot spots, and carry out the corresponding optimization, as shown in Figure 3. The functionality is similar to the Windows* Performance Analyzer (WPA) in the Windows* Assessment and Deployment Kit (ADK).

Screenshot of UE4 built in Stats Viewer
Figure 3:  The Stats Viewer built in the UE4 session frontend.

CPU Optimization Techniques for UE*4 VR Applications

In the process of VR development, when encountering CPU performance problems, not only do we need to find out where the bottleneck is, but we also have to master the tools provided in UE4 that can help to optimize the bottleneck. By understanding the usage, effects, and differences of each tool we can quickly identify and select the most appropriate strategy to optimize the performance of VR applications. In this section we will focus on the UE4 optimization tools.

Rendering Thread Optimization

Due to performance, bandwidth, and multisample anti-aliasing (MSAA) considerations, current VR applications usually use forward rendering instead of deferred rendering. In the forward rendering pipeline of UE4, in order to reduce GPU overdraw, the prepass phase before the base pass forces the use of early-z to generate the depth buffer, which reduces the GPU workload submitted before the base pass. In addition, DirectX* 11 rendering is essentially single-threaded, and its multi-threading capability is poor. When there are significant numbers of draw calls or primitives in the VR scene, the culling calculation time becomes longer. As a result, the calculation phase before the base pass is likely to produce GPU bubbles due to rendering thread bottlenecks, reducing GPU utilization and triggering frame drops. The optimization of rendering threads is of vital importance to VR development.

Figure 4 shows an example of a VR game that is limited by the CPU rendering thread. The VR game runs on HTC Vive*, with an average frame rate of 60 fps. Although the GPU appears to be the main performance bottleneck from the console command stat unit, each frame of the rendering thread (Draw) takes a very long time. From the frame timing in SteamVR*, it can be clearly seen that the CPU even has a late start, which means the workload of the rendering thread is very heavy. (In SteamVR, the calculation of the rendering thread of a frame starts 3 ms before the vertical synchronization at the beginning of the frame, which is known as the running start. The intention is to trade 3 ms of extra delay for letting the rendering thread work in advance, so that the GPU can start working immediately after the vertical synchronization of the frame to maximize GPU efficiency. If the rendering thread of one frame is still working 3 ms before the next vertical sync, it blocks the running start of the next frame, which is called a late start. A late start delays the rendering thread work, resulting in GPU bubbles.)

In addition, in the SteamVR frame timing it can be seen that the GPU's "Other" time is higher every other frame; from the following analysis, it can be seen that this is actually the GPU bubble time before the prepass.

If we use GPUView to analyze the scene in Figure 4, we get the result in Figure 5, where the red arrows refer to the times at which the CPU rendering thread starts. Because of the running start, the first red arrow starts counting 3 ms before the vertical sync; but when the time reaches the vertical sync, the GPU still has no work to do until 3.5 ms after it, where the GPU works briefly, followed by 1.2 ms of idle time. Only after that can the CPU submit the prepass work to the CPU context queue, and only 2 ms after the completion of the prepass can the base pass work be submitted to the CPU context queue for the GPU to execute.

In Figure 5, the locations indicated by the red circles are GPU idle time; the total idle time (also known as GPU bubbles) adds up to nearly 7 ms, which directly leads to the frame drop, as the GPU rendering is not able to finish within 11.1 ms. As a result, two vertical synchronization cycles are needed to complete the work of this frame. We can use WPA to analyze the call stack of the rendering thread during the GPU bubbles and find out which functions cause the bottleneck1. The second red arrow refers to the start time of the rendering thread for the next frame. Because a frame drop occurred in this frame, the rendering thread of the following frame gains a full extra vertical synchronization cycle for its calculation.

When the GPU of the next frame starts working after a vertical sync, the rendering thread has already filled the CPU context queue, so the GPU has enough work to do without generating GPU bubbles. As long as there are no GPU bubbles, the frame can complete rendering in about 9 ms, so the next frame will not drop. Three vertical sync cycles are thus needed to complete the rendering of two frames, which is why the average frame rate is 60 fps.

The analysis from Figure 5 shows that in this example, the GPU is not actually a performance bottleneck; so long as the real bottleneck of the CPU rendering thread is solved, the VR game can reach 90 fps. In fact, we found that rendering thread bottlenecks exist in most development of VR applications with UE4, so familiarization with the following tools for UE4 rendering thread optimization can greatly improve the performance of the VR application.

Screenshot of game
Figure 4:  An example of a VR game with a CPU rendering thread bottleneck, which shows the SteamVR* statistics on time consumption of CPU and GPU per frame.

Screenshot of game
Figure 5:  The time view of GPUView* for the Figure 4 example; you can see the CPU rendering thread bottlenecks leading to the GPU idle, which triggered the frame drop.

Instanced Stereo Rendering

VR doubles the number of draw calls due to binocular rendering, which can easily lead to rendering thread bottlenecks. Instanced stereo rendering submits only one draw call per object; the GPU then applies the corresponding transformation matrices for the left and right eyes so that the object is drawn to both views. This effectively transfers CPU work to the GPU: it increases the GPU vertex shader work but saves half the draw calls; therefore, it typically reduces the rendering thread load, resulting in a performance increase of about 20 percent for VR applications, unless the number of draw calls in the VR scene is low (<500). You can turn instanced stereo rendering on or off in the project settings.

Visibility Culling

Rendering thread bottlenecks in VR applications usually have two major causes: one is static mesh processing and the other is visibility culling. Static mesh processing can be optimized by merging draw calls or meshes, while visibility culling needs to reduce the number of primitives or the amount of dynamic occlusion culling.

Visibility culling bottlenecks are particularly severe in VR applications because, to reduce latency, the CPU rendering thread calculation for each frame is started only 3 ms before vertical sync (running start/queue ahead), while the UE4 InitViews stage (including visibility culling and setting up dynamic shadows) generates no GPU work. Once InitViews takes more than 3 ms, it produces GPU bubbles and reduces GPU utilization, likely causing dropped frames, so visibility culling needs to be a major focus of VR optimization.

Visibility culling in UE4 consists of four parts; the sequence, according to the calculation complexity from low to high, is:

  1. Distance culling
  2. View frustum culling
  3. Precomputed occlusion culling
  4. Dynamic occlusion culling, including hardware occlusion queries and hierarchical z‑buffer occlusion

During design, the best approach is to remove the majority of primitives using culling methods 1 through 3 as much as possible, in order to reduce the InitViews bottleneck, because the computing work of method 4 (dynamic occlusion culling) is much greater than the other three. The following sections focus on view frustum culling and precomputed occlusion culling.

View Frustum Culling

In UE4, view frustum culling for a VR application is performed separately for the left and right eye cameras, which means that all primitives in the scene must be traversed twice to complete the full view frustum culling. But we can change the UE4 code to implement super-frustum culling5, namely merging the left eye and right eye view frustums into one, completing view frustum culling in a single pass and saving the rendering thread roughly half of the view frustum culling time.

Precomputed Occlusion Culling

After distance culling and view frustum culling, we can use precomputed occlusion culling to further reduce the number of primitives that need to be sent to the GPU for dynamic occlusion culling, to reduce the time the rendering thread spends processing visibility culling and, at the same time, to reduce the frame popping phenomenon of the dynamic occlusion system (because the query result of GPU occlusion culling takes one frame before it returns, it is likely to produce visibility errors when the view angle rotates quickly or when an object is near a corner).

Precomputed occlusion culling essentially trades increased memory usage and lighting build time for lower rendering thread occupation; the more data is precomputed, the larger the memory occupied and the longer it takes to decode the pre-stored data each frame. However, VR scenes are generally smaller than those of traditional games, most of the objects in the scene are static, and there is a limit to the user's movable area, all of which are favorable factors for precomputed occlusion culling; it is an optimization that should be done for VR application development.

In practice, precomputed occlusion culling automatically divides the entire scene into visibility cells of the same size, based on the parameter settings, covering all possible locations of the view camera. For each cell, precomputed occlusion culling stores the primitives that are guaranteed to be 100 percent occluded from that cell. At runtime, a lookup table (LUT) is read to find the primitives to be removed at the current location; primitives stored as precomputed occlusions do not need to go through dynamic occlusion culling again at runtime.
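Conceptually, the precomputed data behaves like a lookup table keyed by the cell containing the camera. The sketch below is purely illustrative (the cell size, names, and data layout are invented) and does not reflect UE4's actual data structures.

CELL_SIZE = 2.0   # illustrative visibility cell size, in meters

def cell_of(position):
    # map a camera position to the visibility cell that contains it
    x, y, z = position
    return (int(x // CELL_SIZE), int(y // CELL_SIZE), int(z // CELL_SIZE))

# offline (light build): for each cell, store primitives that are always fully occluded there
precomputed_occlusion = {
    (0, 0, 0): {'statue_far', 'pillar_back'},
    (1, 0, 0): {'pillar_back'},
}

def primitives_to_render(camera_position, all_primitives):
    # runtime: one table lookup removes the statically occluded primitives;
    # whatever remains still goes through dynamic occlusion culling on the GPU
    occluded = precomputed_occlusion.get(cell_of(camera_position), set())
    return [p for p in all_primitives if p not in occluded]

print(primitives_to_render((0.5, 0.3, 0.0), ['statue_far', 'pillar_back', 'table']))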

We can use the console command stat initviews to see Statically Occluded Primitives, which shows how many primitives are removed by precomputed occlusion culling, use Decompress Occlusion to view the per-frame decoding time of the stored data, and use Precomputed Visibility Memory in stat memory to check the memory usage of the pre-stored data. Since Occluded Primitives includes primitives removed by both precomputed and dynamic occlusion culling, increasing the ratio of Statically Occluded Primitives to Occluded Primitives (to more than 50 percent) helps to significantly reduce InitViews time. The detailed setup steps and limitations of precomputed occlusion culling in UE4 can be found in references 6-7.

Screenshot of precomputed occlusion culling example
Figure 6:  Precomputed occlusion culling example.

Static Mesh Actor Merging

The Merge Actors tool in UE4 can automatically merge multiple static meshes into a single mesh to reduce draw calls, and you can select in the settings whether or not to merge materials, light maps, or physics data, according to actual needs. The setup process is described in reference 8. In addition, there is another tool in UE4, the Hierarchical Level of Detail (HLOD)9; the difference is that HLOD only merges objects at distant levels of detail (LODs).

Instancing

Multiple copies of the same mesh or object in a scene (such as haystacks or boxes) can be drawn using instanced meshes. Only one draw call needs to be submitted; when drawing, the GPU performs the corresponding coordinate transformation based on the location of each object. If there are many identical meshes in a scene, instancing can effectively reduce the draw calls issued by the rendering thread. Instancing can be set in the blueprint (Blueprint API -> Components -> InstancedStaticMesh (ISM))10. If you want different LODs for each instantiated object, you can use hierarchical ISM (HISM).

Monoscopic Far-Field Rendering

Limited by interpupillary distance, the human eye perceives depth differently for objects at different distances. With an average human interpupillary distance of 65 mm, the strongest depth sensation is between 0.75 m and 3.5 m; depth beyond eight meters is not easy to perceive, and sensitivity drops further as distance increases.

Based on this feature, Oculus* and Epic Games* introduced monoscopic far-field rendering in the forward rendering pipeline of UE 4.15, allowing objects in a VR application to be rendered monoscopically or stereoscopically depending on the distance of each object from the view camera11. If there are many distant objects in the scene, rendering them monoscopically can reduce the scene's rendering and pixel shading cost.

For example, each frame of the Oculus Sun Temple scene reduces rendering costs by 25 percent with monoscopic far-field rendering. It is worth noting that the current monoscopic far-field rendering in UE4 can only be used on Gear VR*; PC VR support will be included in a newer version of UE4. The detailed setup method for monoscopic far-field rendering in UE4 can be found in reference 12. You can also view the contents of the stereoscopic or monoscopic buffer by entering the console command vr.MonoscopicFarFieldMode 0-4.
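The core idea reduces to a per-object distance test: objects beyond a chosen far-field threshold are rendered once and shared by both eyes, while nearer objects are rendered per eye. The sketch below is illustrative only; the 8 m threshold comes from the depth perception discussion above and is not a UE4 API or default.

import math

FAR_FIELD_METERS = 8.0   # illustrative threshold; beyond this, stereo depth cues are weak

def distance(camera, obj):
    return math.sqrt(sum((c - o) ** 2 for c, o in zip(camera, obj)))

def split_render_lists(camera_position, objects):
    # nearer objects are rendered per eye (stereo); farther objects are rendered once (mono)
    stereo, mono = [], []
    for name, position in objects:
        if distance(camera_position, position) < FAR_FIELD_METERS:
            stereo.append(name)
        else:
            mono.append(name)
    return stereo, mono

objects = [('sword', (0.5, 0.0, 1.2)), ('mountain', (0.0, 0.0, 250.0))]
print(split_render_lists((0.0, 0.0, 0.0), objects))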

Logical Thread Optimization

In the UE4 VR rendering pipeline, the logical thread is calculated one frame earlier than the rendering thread; the rendering thread generates proxies based on the results of the previous frame's logical thread and renders them accordingly, ensuring that the data being rendered does not change while the logic thread is updated, and the updated results are reflected on screen in the next frame. Since the logical thread is calculated one frame in advance in UE4, it does not become a performance bottleneck unless it takes more than one vertical sync period (11.1 ms). The problem is that in UE4 the logical thread and the rendering thread each run on a single thread, and gameplay blueprints, actor ticking, artificial intelligence, and other calculations are all handled by the logical thread. If there are many actors or interactions in the scene that cause the logical thread to take more than one vertical synchronization cycle, it needs to be optimized. Here are two performance optimization techniques for logical threads.

Blueprint Nativization

In the default UE4 blueprint execution path, a virtual machine (VM) runs the blueprint, and the cost of the VM results in a performance loss. UE 4.12 introduced blueprint nativization: all blueprints, or a selected subset (inclusive/exclusive), can be compiled directly into C++ code and loaded dynamically as a run-time DLL, avoiding the VM overhead and improving the efficiency of the logic thread. Detailed settings can be found in reference 13.

It should be noted that if the blueprint itself has already been optimized (for example, heavy calculations are implemented directly in C++), the performance gain from blueprint nativization is limited. Also, a UFUNCTION generated from a blueprint cannot be inlined; functions that are called repeatedly should either use blueprint math nodes (which are inlined) or go through a UFUNCTION that calls an inline C++ function, as sketched below. The best approach, of course, is to move the work onto other threads14-15.
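
A minimal sketch of that pattern, with illustrative names only (not code from any project discussed here): the hot math lives in a FORCEINLINE helper, and a thin BlueprintCallable UFUNCTION simply forwards to it.

// Illustrative blueprint function library; UMyMathLibrary and DampedValue are
// assumed names, not engine APIs.
#include "CoreMinimal.h"
#include "Kismet/BlueprintFunctionLibrary.h"
#include "MyMathLibrary.generated.h"

// Inline helper: a small, hot function that other C++ code can call at full speed.
FORCEINLINE float DampedValue(float Current, float Target, float DeltaTime, float Speed)
{
    return Current + (Target - Current) * FMath::Clamp(DeltaTime * Speed, 0.0f, 1.0f);
}

UCLASS()
class UMyMathLibrary : public UBlueprintFunctionLibrary
{
    GENERATED_BODY()

public:
    // Thin UFUNCTION wrapper callable from blueprints; the wrapper itself cannot
    // be inlined, but the work it forwards to is.
    UFUNCTION(BlueprintCallable, Category = "Math")
    static float DampedValueBP(float Current, float Target, float DeltaTime, float Speed)
    {
        return DampedValue(Current, Target, DeltaTime, Speed);
    }
};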

Skeleton Meshing

If too many actors cause a logic thread bottleneck in the scene, then in addition to lowering skeletal mesh LODs and reducing animation ticking, you can handle actor interaction behavior hierarchically, based on LOD or on distance to the viewer. Sharing some skeletal resources among several LODs is another viable option16.

CPU Differentiation of UE4 VR Application

The sections above describe several CPU optimization techniques for VR applications, but optimization alone can only ensure that a VR application does not drop frames or cause motion sickness; it cannot by itself further enhance the experience. To enhance the VR experience, you must make the greatest possible use of the computing power the hardware provides, and translate those computing resources into content, effects, and picture quality for the end user, which requires the CPU to deliver differentiated content according to its capability. Following are five techniques for CPU differentiation.

Cloth Painting

In UE4, cloth simulation (Cloth Painting) is performed mainly on worker threads assigned by the physics engine, so its impact on the logic thread is small. Cloth must be simulated every frame, even when the cloth is outside the visible screen area, because the result is needed to determine whether an update will be displayed, so the computation cost is relatively stable. The cloth simulation level can therefore be selected according to the available CPU capacity17.

Destructible Mesh

In UE4, destructible meshes are likewise computed mainly on worker threads assigned by the physics engine; this part can be strengthened if a high-performance CPU is available. The results include more objects that can be destroyed, more fragments generated when they break, and fragments that persist in the scene for a longer time. Destructible meshes greatly enhance the richness of the scene and make the experience more immersive; the setup process is described in reference 18.

CPU Particles

CPU particles are a module that is relatively easy to scale. Although the CPU can simulate fewer particles than the GPU, maximizing the use of multi-core CPU computing power reduces the burden on the GPU, and CPU particles offer the following unique capabilities. They can:

  • Glow
  • Be set to the particle material and parameters (metal, transparent material, and so on)
  • Be controlled by a specific gravitational trajectory (can be affected by the point, line, or other particles to attract)
  • Produce shadows

During the development process, you can set the corresponding CPU particle effect for different CPUs.

Two screenshots for side-by-side comparison
Figure 7:  Particles differentiation in the Hunting Project*.

Steam Audio*

For VR applications, in addition to the picture, another important element for creating an immersive experience is audio. Directional 3D audio is an effective way to enhance VR immersion. Oculus has introduced the Oculus Audio* SDK19 to simulate 3D audio, but that SDK's environmental sound simulation is relatively simple, and it has not been widely adopted. Steam Audio*20 is a new 3D audio SDK offered by Valve*, which supports Unity* 5.2 or newer and UE 4.16 or newer, and provides a C language interface. Steam Audio has the following features:

  • Provides 3D audio effects based on real physical simulation, supports directional audio filtering with head-related transfer functions (HRTF) and ambient sound effects (including sound occlusion, real-world audio transmission, reflection, and mixing); it also supports access to the inertial data of the VR headset.
  • It is possible to set the material and parameters (scattering coefficient, absorption rate for different frequencies, and so on) for each object in the scene. The simulation of environmental sound can be processed in real time or by baking, according to the computing power of the CPU.
  • Many of the settings or parameters in the ambient sound can be adjusted according to the quality or performance requirements such as HRTF interpolation methods, the number of audio ray traces and the number of reflections, and the form of mixing.
  • Compared to the Oculus Audio SDK, which only supports the shoebox room model and does not support sound occlusion, Steam Audio's 3D audio simulation is more realistic and complete, providing finer quality control.
  • Free, and not bound to a specific VR headset or platform.

Steam Audio collects the state and information of the sound sources and the listener from the UE4 logic thread, and uses worker threads to perform ray tracing and environmental reflection simulation for the sound. The computed impulse response is then passed to the audio rendering thread for the corresponding filtering and mixing of each sound source, and the result is output to the headset by the operating system's audio thread (such as XAudio2 on Windows*).

The entire process runs on CPU threads. Since adding 3D audio does not increase the load on the rendering thread or the logic thread, the performance of the original game is unaffected; thus, it is very well suited for enhancing a VR experience. The detailed setup process can be found in the Steam Audio documentation21.

Scalability

The scalability settings of UE4 are a set of tools that trade picture quality against performance through parameters, so an application can fit different computing platforms22. For the CPU, scalability is mainly reflected in the following parameters:

  • View distance: Distance culling scale ratio (r.ViewDistanceScale 0 – 1.0f)
  • Shadows: Shadow quality (sg.ShadowQuality 0 - 3)

Screenshot of shade differentiation in a Tencent* VR game
Figure 8:  Shade differentiation in the Tencent* VR game Hunting Project*.

  • Foliage: Amount of foliage rendered at a time (FoliageQuality 0 - 3)

Screenshot of foliage differentiation in a Tencent* VR game
Figure 9:   Foliage differentiation in the Tencent* VR game Hunting Project*.

  • Skeletal mesh LOD bias (r.SkeletalMeshLODBias)
  • Particle LOD bias (r.ParticleLODBias)
  • Static mesh LOD distance scale (r.StaticMeshLODDistanceScale).

Summary

Within the limits of its length, this article describes a variety of CPU performance analysis tools, optimization methods, and differentiation techniques; to learn more, refer to the reference section. Proficiency with these analysis tools and techniques lets you find bottlenecks quickly and optimize accordingly, which is very important for VR applications. In addition, by putting otherwise idle multi-threaded resources to work, you can give an application better picture quality and performance, providing a better VR experience.

Reference

  1. Performance Analysis and Optimization for PC-Based VR Applications: From the CPU’s Perspective:
    https://software.intel.com/en-us/articles/performance-analysis-and-optimization-for-pc-based-vr-applications-from-the-cpu-s
  2. Unreal Engine Stat Commands: https://docs.unrealengine.com/latest/INT/Engine/Performance/StatCommands/index.html
  3. Unreal Engine 3 Console Commands: https://docs.unrealengine.com/udk/Three/ConsoleCommands.html
  4. GPUView: http://graphics.stanford.edu/~mdfisher/GPUView.html
  5. The Vanishing of Milliseconds: Optimizing the UE4 renderer for Ethan Carter VR: http://www.gamasutra.com/blogs/LeszekGodlewski/20160721/272886/The_Vanishing_of_Milliseconds_Optimizing_the_UE4_renderer_for_Ethan_Carter_VR.php
  6. Precomputed Visibility Volumes: http://timhobsonue4.snappages.com/culling-precomputed-visibility-volumes
  7. Precomputed Visibility: https://docs.unrealengine.com/udk/Three/PrecomputedVisibility.html
  8. Unreal Engine Actor Merging: https://docs.unrealengine.com/latest/INT/Engine/Actors/Merging/
  9. Unreal Engine Hierarchical Level of Detail: https://docs.unrealengine.com/latest/INT/Engine/HLOD/index.html
  10. Unreal Engine Instanced Static Mesh: https://docs.unrealengine.com/latest/INT/BlueprintAPI/Components/InstancedStaticMesh/index.html
  11. Hybrid Mono Rendering in UE4 and Unity: https://developer.oculus.com/blog/hybrid-mono-rendering-in-ue4-and-unity/
  12. Hybrid Monoscopic Rendering (Mobile): https://developer.oculus.com/documentation/unreal/latest/concepts/unreal-hybrid-monoscopic/
  13. Unreal Engine Nativizing Blueprints: https://docs.unrealengine.com/latest/INT/Engine/Blueprints/TechnicalGuide/NativizingBlueprints/
  14. Unreal Engine Multi-Threading: How to Create Threads in UE4: https://wiki.unrealengine.com/Multi-Threading:_How_to_Create_Threads_in_UE4
  15. Implementing Multithreading in UE4: http://orfeasel.com/implementing-multithreading-in-ue4/
  16. Unreal Engine Skeleton Assets: https://docs.unrealengine.com/latest/INT/Engine/Animation/Skeleton/
  17. Unreal Engine Clothing Tool: https://docs.unrealengine.com/latest/INT/Engine/Physics/Cloth/Overview/
  18. How to Create a Destructible Mesh in UE4: http://www.onlinedesignteacher.com/2015/03/how-to-create-destructible-mesh-in-ue4_5.html
  19. Oculus Audio SDK Guide: https://developer.oculus.com/documentation/audiosdk/latest/concepts/book-audiosdk/
  20. A Benchmark in Immersive Audio Solutions for Games and VR: https://valvesoftware.github.io/steam-audio/
  21. Download Steam Audio: https://valvesoftware.github.io/steam-audio/downloads.html
  22. Unreal Engine Scalability Reference: https://docs.unrealengine.com/latest/INT/Engine/Performance/Scalability/ScalabilityReference/

About the Author

Wenliang Wang is a senior software engineer for Intel's Software and Services Group. He works with VR content developers for Intel CPU performance optimization and differentiation, sharing the CPU optimization experience to make more efficient use of the CPU. Wenliang is also responsible for the implementation, analysis, and optimization of multimedia video codecs and real-time applications, and has over 10 years of experience in video codecs, image analysis algorithms, computer graphics, and performance optimization, and has published numerous papers in the industry. Wenliang graduated from the Department of Electrical Engineering and Communication Engineering Research Institute of Taiwan University.

Get Started Using the DPDK Traffic Management API


This article describes the new Data Plane Development Kit (DPDK) API for traffic management (TM) that was introduced in DPDK release 17.08. This API provides a generic interface for Quality of Service (QoS) TM configuration, which is a standard set of features provided by network interface cards (NIC), network processing units (NPU), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), multicore CPUs, and so on. TM includes features such as hierarchical scheduling, traffic shaping, congestion management, packet marking, and so on.

This API is generic, as it is agnostic of the underlying hardware, software, or mixed HW/SW implementation. It is exposed as an extension for the DPDK ethdev API, similar to the flow API. A typical DPDK function call sequence is included to demonstrate implementation.

Main Features

Hierarchical Scheduling

The TM API allows users to select strict priority (SP) and weighted fair queuing (WFQ) for the hierarchical nodes subject to specific implementation support being available. The SP and WFQ algorithms are available at the level of each node of the scheduling hierarchy, regardless of the node level/position in the tree. The SP algorithm is used to schedule between sibling nodes with different priority, while WFQ is used to schedule between groups of siblings that have the same priority.

Example: As shown in Figure 1, the root node (node 0) has three child nodes with different priorities. Hence, the root node will schedule the child nodes using the SP algorithm based on their priority, with zero (0) as the highest priority. Node 1 has three child nodes with IDs 11, 12, and 13, and all the child nodes have the same priority (that is, priority 0). Thus, they will be scheduled using the WFQ mechanism by node 1. The WFQ weight of a given child node is relative to the sum of the weights of all its sibling nodes that have the same priority, with one (1) as the lowest weight.


Figure 1: Hierarchical QoS scheduler.

Traffic Shaping

The TM API provides support for selecting single-rate and dual-rate shapers (rate limiters) for the hierarchy nodes, subject to the specific implementation support available. Each hierarchy node has zero or one private shaper (only one node using it) and/or zero, one, or several shared shapers (multiple nodes use the same shaper instance). A private shaper is used to perform traffic shaping for a single node, while a shared shaper is used to perform traffic shaping for a group of nodes. The configuration of private and shared shapers for a specific hierarchy node is done through the definition of shaper profile, as shown below:

/**
  * Shaper (rate limiter) profile
  */
struct rte_tm_shaper_params shaper_params = {
	/** Committed token bucket */
	.committed = {
		/* bucket rate (bytes per second) */
		.rate = 0,
		.size = TOKEN_BUCKET_SIZE
	},
	/** Peak token bucket */
	.peak = {
		.rate = TM_NODE_SHAPER_RATE,
		.size = TOKEN_BUCKET_SIZE
	},
	/** framing overhead bytes */
	.pkt_length_adjust = RTE_TM_ETH_FRAMING_OVERHEAD_FCS,
};

Figure 2: Shaper profile parameters.

As shown in Figure 1, each non-leaf node has multiple inputs (its child nodes) and single output (which is input to its parent node). Thus, non-leaf nodes arbitrate their inputs using scheduling algorithms (for example, SP, WFQ, and so on) to schedule input packets to its output while observing the shaping (rate limiting) constraints.

Congestion Management

The congestion management algorithms that can be selected through the API are Tail Drop, Head Drop and Weighted Random Early Detection (WRED). They are made available for every leaf node in the hierarchy, subject to the specific implementation supporting them.

During congestion, the Tail Drop algorithm drops the new packet while leaving the queue unmodified, as opposed to the Head Drop algorithm, which drops the packet at the head of the queue (the oldest packet waiting in the queue).

The Random Early Detection (RED) algorithm works by proactively dropping more and more input packets as the queue occupancy builds up. Each hierarchy leaf node with WRED enabled as its congestion management mode has zero or one private WRED context (only one leaf node using it) and/or zero, one, or several shared WRED contexts (multiple leaf nodes use the same WRED context). A private WRED context is used to perform congestion management for a single leaf node, while a shared WRED context is used to perform congestion management for a group of leaf nodes. The configuration of WRED private and shared contexts for a specific leaf node is done through the definition of WRED profile as shown below:

/**
  * Weighted RED (WRED) profile
  */
struct rte_tm_wred_params wred_params = {
	/* RED parameters per packet color */
	.red_params = {
		[RTE_TM_GREEN] = {.min_th = 48, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9},
		[RTE_TM_YELLOW] = {.min_th = 40, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9},
		[RTE_TM_RED] = {.min_th = 32, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}
	},
};

Figure 3: WRED profile parameters.
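
Once the WRED parameters are defined, the profile is registered with the port and, if needed, attached to a shared WRED context. A minimal sketch in the style of the call sequence shown later; the WRED_PROFILE_ID and SHARED_WRED_CONTEXT_ID macros are illustrative:

/* Create WRED profile (leaf nodes reference it when WRED is their congestion mode) */
status = rte_tm_wred_profile_add(PORT_ID, WRED_PROFILE_ID, &wred_params, &error);
CHECK((status != 0), "WRED profile add error");

/* Create shared WRED context for a group of leaf nodes (if desired) */
status = rte_tm_shared_wred_context_add_update(PORT_ID, SHARED_WRED_CONTEXT_ID,
	WRED_PROFILE_ID, &error);
CHECK((status != 0), "shared WRED context add error");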

Packet Marking

The TM APIs are provided to support various types of packet marking such as virtual local area network drop eligible indicator (VLAN DEI) packet marking (IEEE 802.1Q), IPv4/IPv6 explicit congestion notification marking of TCP and Stream Control Transmission Protocol packets (IETF RFC 3168), and IPv4/IPv6 differentiated services code point packet marking (IETF RFC 2597).
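
Marking is enabled per port with one call per marking type; the three integer flags select which packet colors (green, yellow, red) get marked. A minimal sketch, assuming the underlying implementation advertises support for these features in its capability set:

/* Enable VLAN DEI marking (IEEE 802.1Q) for yellow and red packets only */
status = rte_tm_mark_vlan_dei(PORT_ID, 0 /* green */, 1 /* yellow */, 1 /* red */, &error);
CHECK((status != 0), "VLAN DEI marking error");

/* Enable IPv4/IPv6 ECN marking (IETF RFC 3168) for all colors */
status = rte_tm_mark_ip_ecn(PORT_ID, 1, 1, 1, &error);
CHECK((status != 0), "ECN marking error");

/* Enable IPv4/IPv6 DSCP marking (IETF RFC 2597) for yellow and red packets only */
status = rte_tm_mark_ip_dscp(PORT_ID, 0, 1, 1, &error);
CHECK((status != 0), "DSCP marking error");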

Capability API

The TM API allows users to query the capability information (that is, critical parameter values) that the traffic management implementation (HW/SW) is able to support for the application. The information can be obtained at port level, at a specific hierarchical level, and at a specific node of the hierarchical level.
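
A minimal sketch of the three query levels, reusing the illustrative PORT_ID, NODE_PIPE_LEVEL, and NODE_ID_PIPE macros from the call sequence below; which capability fields are then inspected (for example n_nodes_max or dynamic_update_mask) depends on what the application needs to check:

struct rte_tm_capabilities cap;
struct rte_tm_level_capabilities level_cap;
struct rte_tm_node_capabilities node_cap;

/* Port-level capabilities, e.g. cap.n_nodes_max, cap.n_levels_max */
status = rte_tm_capabilities_get(PORT_ID, &cap, &error);
CHECK((status != 0), "port capability query error");

/* Capabilities of one hierarchy level and of one specific node */
status = rte_tm_level_capabilities_get(PORT_ID, NODE_PIPE_LEVEL, &level_cap, &error);
CHECK((status != 0), "level capability query error");

status = rte_tm_node_capabilities_get(PORT_ID, NODE_ID_PIPE(0, 0), &node_cap, &error);
CHECK((status != 0), "node capability query error");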

Steps to Set Up the Hierarchy

Initial Hierarchy Specification

The scheduler hierarchy is specified by incrementally adding nodes to build up the scheduling tree. The scheduling tree consists of leaf nodes and non-leaf nodes.

Each leaf node sits on top of a scheduling queue of the current Ethernet port. Therefore, the leaf nodes have predefined IDs in the range of 0... (N–1), where N is the number of scheduling queues of the current Ethernet port. The non-leaf nodes have their IDs generated by the application outside of the above range, which is reserved for leaf nodes. The unique ID that is assigned to each node when the node is created is further used to update the node configuration or to connect child nodes to it.

The first node that is added to the hierarchy becomes the root node (node 0, Figure 1), and all the nodes that are subsequently added have to be added as descendants of the root node. The parent of the root node has to be specified as RTE_TM_NODE_ID_NULL, and there can only be one node with this parent ID (that is, the root node).

During this phase, some limited checks on the hierarchy specification are conducted, usually limited in scope to the current node, its parent node, and its sibling nodes. At this time, since the hierarchy is not fully defined, there is typically no real action performed by the underlying implementation.

Hierarchy Commit

The hierarchy commit API is called during the port initialization phase (before the Ethernet port is started) to freeze the start-up hierarchy. This function typically performs the following steps:

  • It validates the start-up hierarchy that was previously defined for the current port through successive node add API invocations.
  • Assuming successful validation, it performs all the necessary implementation-specific operations to install the specified hierarchy on the current port, with immediate effect once the port is started.

Run-Time Hierarchy Updates

The TM API provides support for on-the-fly changes to the scheduling hierarchy; thus, operations such as node add/delete, node suspend/resume, parent node update, and so on can be invoked after the Ethernet port has been started, subject to the specific implementation supporting them. The set of dynamic updates supported by the implementation is advertised through the port capability set.
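
For example, a node could be suspended and later resumed, or moved under a different parent, provided the capability set reports support; a minimal sketch reusing the illustrative node ID macros from the call sequence below:

/* Temporarily stop scheduling traffic through one pipe node */
status = rte_tm_node_suspend(PORT_ID, NODE_ID_PIPE(0, 0), &error);
CHECK((status != 0), "node suspend error");

/* ...later, resume it */
status = rte_tm_node_resume(PORT_ID, NODE_ID_PIPE(0, 0), &error);
CHECK((status != 0), "node resume error");

/* Move the node under a different subport, keeping priority 0 and weight 1 */
status = rte_tm_node_parent_update(PORT_ID, NODE_ID_PIPE(0, 0), NODE_ID_SUBPORT(1),
	0, 1, &error);
CHECK((status != 0), "node parent update error");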

Typical DPDK Function Call Sequence

Ethernet Device Configuration

/* Device configuration */
int status = rte_eth_dev_configure(PORT_ID, N_RXQs, N_TXQs, port_conf);
CHECK((status != 0), "port init error");

/* Init TXQs */
for (j = 0; j < N_TXQs; j++) {
	status = rte_eth_tx_queue_setup(PORT_ID, j, N_TX_DESC, SOCKET_ID, TX_CONF);
	CHECK((status != 0), "txq init error");
}

Create and Add Shaper Profiles

/* Define shaper profile parameters for hierarchical nodes */
struct rte_tm_shaper_params params = {
	.committed = {.rate = COMMITTED_TB_RATE, .size = COMMITTED_TB_SIZE,},
	.peak = {.rate = PEAK_TB_RATE, .size = PEAK_TB_SIZE,},
	.pkt_length_adjust = PACKET_LENGTH_ADJUSTMENT,
};

/* Create shaper profile */
status = rte_tm_shaper_profile_add(PORT_ID, SHAPER_PROFILE_ID, &params, &error);
CHECK((status != 0), "shaper profile add error");

/* Create shared shaper (if desired) */
status = rte_tm_shared_shaper_add_update(PORT_ID, SHARED_SHAPER_ID, SHAPER_PROFILE_ID, &error);
CHECK((status != 0), "shared shaper add error");

Create the Initial Scheduler Hierarchy

/* Level 1: Port */
struct rte_tm_node_params port_params = {
	.shaper_profile_id = SHAPER_PROFILE_ID_PORT,
	.shared_shaper_id = NULL,
	.n_shared_shapers = 0,
	.nonleaf = {
		.wfq_weight_mode = NULL,
		.n_sp_priorities = 0,
	},
	.stats_mask = TM_NODE_STATS_MASK,
};
status = rte_tm_node_add(PORT_ID, NODE_ID_PORT, RTE_TM_NODE_ID_NULL, 0, 1,
	NODE_PORT_LEVEL, &port_params, &error);
CHECK((status != 0), "root node add error");

/* Level 2: Subport */
for (i = 0; i < N_SUBPORT_NODES_PER_PORT; i++) {
	struct rte_tm_node_params subport_params = {
		.shaper_profile_id = SHAPER_PROFILE_ID_SUBPORT(i),
		.shared_shaper_id = NULL,
		.n_shared_shapers = 0,
		.nonleaf = {
			.wfq_weight_mode = NULL,
			.n_sp_priorities = 0,
		},
		.stats_mask = TM_NODE_STATS_MASK,
	};
	status = rte_tm_node_add(PORT_ID, NODE_ID_SUBPORT(i), NODE_ID_PORT, 0, 1,
		NODE_SUBPORT_LEVEL, &subport_params, &error);
	CHECK((status != 0), "subport node add error");
}

/* Level 3: Pipe */
for (i = 0; i < N_SUBPORT_NODES_PER_PORT; i++)
for (j = 0; j < N_PIPES_NODES_PER_SUBPORT; j++) {
	struct rte_tm_node_params pipe_params = {
		.shaper_profile_id = SHAPER_PROFILE_ID_PIPE(i, j),
		.shared_shaper_id = NULL,
		.n_shared_shapers = 0,
		.nonleaf = {
			.wfq_weight_mode = NULL,
			.n_sp_priorities = 0,
		},
		.stats_mask = TM_NODE_STATS_MASK,
	};
	status = rte_tm_node_add(PORT_ID, NODE_ID_PIPE(i, j), NODE_ID_SUBPORT(i), 0, 1,
		NODE_PIPE_LEVEL, &pipe_params, &error);
	CHECK((status != 0), "pipe node add error");
}

/* Level 4: Traffic Class */
for (i = 0; i < N_SUBPORT_NODES_PER_PORT; i++)
for (j = 0; j < N_PIPES_NODES_PER_SUBPORT; j++)
for (k = 0; k < N_TRAFFIC_CLASSES; k++) {
	struct rte_tm_node_params tc_params = {
		.shaper_profile_id = SHAPER_PROFILE_ID_TC(i, j, k),
		.shared_shaper_id = {SHARED_SHAPER_ID_SUBPORT_TC(i, k)},
		.n_shared_shapers = 1,
		.nonleaf = {
			.wfq_weight_mode = NULL,
			.n_sp_priorities = 0,
		},
		.stats_mask = TM_NODE_STATS_MASK,
	};
	status = rte_tm_node_add(PORT_ID, NODE_ID_PIPE_TC(i, j, k), NODE_ID_PIPE(i, j), 0, 1,
		NODE_TRAFFIC_CLASS_LEVEL, &tc_params, &error);
	CHECK((status != 0), "traffic class node add error");
}

/* Level 5: Queue */
for (i = 0; i < N_SUBPORT_NODES_PER_PORT; i++)
for (j = 0; j < N_PIPES_NODES_PER_SUBPORT; j++)
for (k = 0; k < N_TRAFFIC_CLASSES; k++)
for (q = 0; q < N_QUEUES_PER_TRAFFIC_CLASS; q++) {
	struct rte_tm_node_params queue_params = {
		.shaper_profile_id = SHAPER_PROFILE_ID_TC(i, j, k, q),
		.shared_shaper_id = NULL,
		.n_shared_shapers = 0,
		.leaf = {
			.cman = RTE_TM_CMAN_TAIL_DROP,
		},
		.stats_mask = TM_NODE_STATS_MASK,
	};
	status = rte_tm_node_add(PORT_ID, NODE_ID_QUEUE(i, j, k, q), NODE_ID_PIPE_TC(i, j, k),
		0, weights[q], NODE_QUEUE_LEVEL, &queue_params, &error);
	CHECK((status != 0), "queue node add error");
}

Freeze and Validate the Startup Hierarchy

status = rte_tm_hierarchy_commit(PORT_ID, clear_on_fail, &error);
	CHECK ((status != 0), "traffic management hierarchy commit error") ;

Start the Ethernet Device

status = rte_eth_dev_start(PORT_ID);
	CHECK ((status != 0), "device start error") ;

Summary

This article discusses the DPDK Traffic Management API that provides an abstraction layer for HW-based, SW-based, or mixed HW/SW-based traffic management implementations. The API is exposed as part of the DPDK ethdev framework. Furthermore, API usage is demonstrated by building and enabling the hierarchical scheduler on the Ethernet device.

Additional Information

Details on Traffic Management APIs can be found at the following links:

About the Authors

Jasvinder Singh is a Network Software Engineer with Intel. His work is primarily focused on development of data plane functions, libraries and drivers for DPDK.

Wenzhuo Lu is a Software Engineer for DPDK at Intel. He is a DPDK contributor and maintainer on dpdk.org.

Cristian Dumitrescu is a Software Architect for Data Plane at Intel. He is a DPDK contributor and maintainer on dpdk.org.

Increasing Efficiency and Uptime with Predictive Maintenance


In many manufacturing plants today, monitoring is a highly manual process. FOURDOTONE Teknoloji analyzes data from sensors to enable manufacturers to respond immediately to problems, and predict when machines are likely to fail.

Executive Summary

Downtime can be expensive, and in a tightly coupled manufacturing line a problem with one machine can have an impact on the entire factory. For many factories, avoiding downtime is a matter of luck rather than science: machine inspections are infrequent, and only capture what’s visible to the eye.

4.1 Industrial Internet of Things Platform (4.1 IIoTP) from FOURDOTONE Teknoloji enables manufacturers to be more responsive and proactive in their maintenance, so they can aim to minimize downtime. Data is gathered from the machines and analyzed in the factory, enabling an immediate response to emergencies or imminent problems. In the cloud, machine-learning algorithms are used to analyze the combined data from all of the machines, so that future maintenance requirements can be predicted. That enables the manufacturer to plan its maintenance to avoid downtime, and optimize its maintenance costs.

The technology provides a foundation for continuous improvement, and enables manufacturers to cut the risk of unplanned downtime.

FOURDOTONE Teknoloji was founded in 2014 in Turkey and specializes in Industry 4.0 projects. The company works on hardware independent projects including condition monitoring, big data process optimization, predictive maintenance and the digital factory. The company serves enterprises in Turkey, Central and Eastern Europe, and the Middle East.

Business Challenge: Avoiding Downtime in Industry

For manufacturing plants, downtime can have a huge impact on the business. A fault in a single machine could halt the production line. For plants that operate around the clock, that time can never be recovered. An unexpected drop in output can result in the business disappointing customers who are depending on its deliveries. It can have a direct impact on revenue, with orders lost and product unavailable to sell.

In many manufacturing businesses, unplanned downtime is hard to avoid. Maintenance remains reactive. Companies are unable to monitor and analyze their plant in real time, so they don’t know that there is anything wrong until a machine grinds to a halt. Without any reliable data on the past, they are unable to make any predictions on when machines are likely to fail.

Efforts to manage the plant are labor intensive, and prone to missing important signals. People might go from machine to machine, checking with the naked eye for any anomalies and collecting data with clipboards. Manual observation and often infrequent checks make it difficult to detect potential problems except by luck. If a check isn't carried out in the right place at the right time, it won't find a problem that might already be affecting performance, and might ultimately result in an outage.

The Machinery Monitoring Challenge

One organization facing these challenges is the Scattolini Turkey plant. It manufactures floats and tippers for commercial vehicles. The company is headquartered in Italy, and produces more than 200 types of equipment from its plant in Valeggio sul Mincio, and its seven other sites worldwide. Its plant in Turkey manufactures parts for vans.

Uptime is critical for its operations and its profitability. The company wants to transform from reactive maintenance to predictive maintenance: ensuring it can intervene before there is any downtime. A single day’s outage can cost as much as 35,000 EUR, including the cost of repair.

Its existing regime is based on manual inspection with monthly vibration measurements and reactive maintenance. The company needs a way to:

  • Gather data from its plant of over 30 machines, without requiring a visit to them. The machines include cranes, fans, and pumps;
  • Monitor the levels of liquid chemical ingredient pools;
  • Identify any problems as and when they occur, enabling an immediate response to minimize downtime;
  • Model likely future failures, so the maintenance team can carry out any repairs or replacements before there is an outage.

Solution Value: Enabling Predictive Maintenance

4.1 IIoTP gathers data wirelessly from the shop floor and analyzes it. In the case of the Scattolini Turkey plant, the data gathered includes axial vibration, surface temperature of motors, pressure of hydraulic and pneumatic systems, liquid levels in tanks and pools, and the status of the main energy line. The solution enables the plant to have access to more information, and on a more timely basis, than was previously possible. As a result, the company has a clearer insight into the status of its plant and its maintenance requirements. The data is analyzed in two stages: first, if there is an anomaly in the incoming data, an alert is raised immediately. An SMS message or email is sent to the predefined user group. In the event that there is no response, or there is a safety issue, the software can be configured to automatically shut down the machine.

The second stage of analysis takes place in the cloud. Combining the data from all the machines, 4.1 IIoTP can predict likely future outages and maintenance requirements. This approach uses machine-learning techniques to compare what is known about past failures, with current data about the plant and its equipment. By replacing parts before a likely failure, Scattolini can avoid unplanned outage.

The team at the Scattolini Turkey plant can use a cross-platform portal on computers, phones or tablets to monitor the state of the plant in real-time.

By reducing the amount of manual monitoring and transforming the factory to become proactive with its maintenance, Scattolini estimates that it has reduced its costs for maintenance operations by 15 percent.

Solution Architecture: Predictive Maintenance

To enable predictive maintenance, 4.1 IIoTP provides a mechanism for collecting data from the shop floor, analyzing it immediately for anything requiring a prompt response, and carrying out in-depth analysis in the cloud for predictive maintenance.

The machines to be monitored are fitted with battery or DC-powered wireless industrial sensors, manufactured for precision and operation in harsh environments.

4.1 IIoTP uses Intel® IoT Gateway Technology in the Dell Edge* Gateway 5000 to collect data from the sensors. Both wired and wireless connections are supported. The rugged and fanless gateway device is based on the Intel® Atom™ processor, which provides intelligence for devices at the edge of the network. Compute power at the edge enables fast decision making, which can be critical at sites such as heavy industrial plants, fast-moving production lines, and chemical substance storage facilities.

Supervisory control and data access (SCADA) industrial control systems generally show current data. 4.1 IIoTP adds the ability to view historical data, and to automatically detect anomalies and threshold violations at the edge. Alerts can be raised by email or SMS, and 4.1 IIoTP can also intervene directly, changing the configuration of the machine or powering it down. This capability is provided by libraries and frameworks that enable 4.1 IIoTP to control the programmable logic controllers (PLCs). Leading PLCs are supported, including those from Siemens, Omron and Mitsubishi.

Additionally, data is sent to the cloud with 256-bit encryption. 4G/GPRS mobile broadband communication technologies are used to send data to the cloud, because they are more stable than Wi-Fi in industrial environments. In the cloud, data from all the gateways can be collected in one place and analyzed with machine-learning algorithms. This analysis can be used to predict likely machine failures, and to identify other opportunities for efficiency and quality gains. New rules can be created through machine learning for the analysis at the edge, to enable continuous improvement.


Figure 1. Using machine learning and edge analysis, 4.1 IIoTP enables an immediate response to urgent issues, and an in-depth analysis in the cloud to support continuous improvement

The cloud infrastructure is built on Amazon Web Services (AWS*). Amazon Kinesis Streams* are designed to continuously capture large volumes of data, and this service is used by 4.1 IIoTP to collect the data from the monitored devices. That data is also added to Amazon Simple Storage Service* (Amazon S3*), where it serves as a data lake, bringing together data from different types of monitored devices. Amazon Elastic MapReduce* (Amazon EMR*) is used to set up Spark* clusters, and map S3 objects to the EMR file system, so that batch analyses can be performed using the data in the S3 buckets. The predictive models are run in EMR, and data can also be consumed from Kinesis streams to enable real-time analysis on sensor data. The database, API and web servers are also hosted on AWS, using Amazon Elastic Compute Cloud* (Amazon EC2*). Users can remotely monitor the platform using a visual interface, on their choice of Internet-enabled device.

Conclusion

Manufacturers can use 4.1 IIoTP together with sensors fitted to the machines, to get an insight into the current and future health of their factory equipment. Analysis at the edge enables a prompt response in the case of an emergency, power outage or technical fault. Machine-learning algorithms in the cloud can analyze all the data generated by all the machines over time to refine the rules for raising alerts, and provide insight into the optimal time to maintain or replace a machine. This intelligence enables the manufacturer to avoid unplanned downtime, reduce the labor costs associated with monitoring machines manually, and optimize the cost of parts and maintenance. In turn, this enables manufacturers to increase the reliability and predictability of their manufacturing infrastructure.

Find the solution that’s right for your organization. Contact your Intel representative or visit www-ssl.intel.com/content/www/us/en/industrial-automation/overview.html

Intel Solutions Architects are technology experts who work with the world’s largest and most successful companies to design business solutions that solve pressing business challenges. These solutions are based on real-world experience gathered from customers who have successfully tested, piloted, and/or deployed these solutions in specific business use cases. Solutions architects and technology experts for this solution brief are listed on the front cover.

Learn More

Smart Transportation Robots Streamline Manufacturing Operations


SEIT* autonomous mobile robots, running on Intel® technology, enable manufacturers to improve flexibility and efficiency of intralogistics transportation.

Executive Summary

To remain competitive, manufacturers must focus on achieving new growth while driving down costs. Key to achieving this is greater flexibility and a dramatic upturn in operational efficiency across the manufacturing process. One area ripe for improvement is intralogistics transportation.

Many manufacturers still rely on automated guided vehicles (AGVs) to undertake repetitive transport tasks; but, rigid in nature, these do not support today's demand-driven, dynamic manufacturing environments. Intelligent autonomous mobile robots (AMRs), like SEIT* from Milvus Robotics, offer a viable and cost-effective alternative.

This solution brief describes how to solve business challenges through investment in innovative technologies.

If you are responsible for…

  • Business strategy:
    You will better understand how autonomous mobile robots will enable you to successfully meet your business outcomes.

  • Technology decisions:
    You will learn how an autonomous mobile robot solution works to deliver IT and business value.


Figure 1. SEIT AMR from Milvus Robotics

Solution Benefits

  • Efficient operation - Fully autonomous rather than automated, SEIT* AMRs choose and decide the best route to take to optimize workflow and travel time.
  • Safe navigation - SEIT AMRs have the intelligence to navigate safely around people and objects with LiDAR and some additional sensors, and a built-in collision avoidance system.
  • Fast deployment - SEIT AMRs do not depend on any physical infrastructure such as wires or tapes, meaning common failures like gaps in track lines do not occur, costs are reduced, and robots can be up and running in just a couple of hours.

Succeeding in a Fiercely Competitive Sector

Manufacturers operate in a highly challenging market segment. In some low-cost labor countries, wage rates are rising rapidly. Volatile resource prices, a looming shortage of highly skilled talent, and heightened supply-chain and regulatory risks create an environment that is far more uncertain than it was before the Great Recession1.

At the same time, customer expectations are rising and demand for high-quality customized products and services is greater than ever. To compound matters, competition in the manufacturing sector is fierce, particularly within and from Asia. Manufacturers must remain highly focused on achieving new growth and driving down costs to remain competitive.

To realize these ambitions, manufacturers need to dramatically improve operational efficiency. Inflexible legacy equipment struggles to respond quickly to consumer demand and sometimes unpredictable disruptions. Investing in digital technologies is crucial for driving down costs and creating demand-driven and responsive business models.

Industry 4.0, the latest phase in the digitization of the manufacturing sector, is creating new ways for manufacturers to deliver value. Harnessing the power of the Internet of Things (IoT), manufacturers can now automate and track every step of their production from the receipt of raw materials all the way through to delivery at the customer. They can monitor, collect, process and analyze huge volumes of data every step of the way. From this data, they can then derive insight to improve operational efficiency and productivity, increase flexibility and agility, and ultimately drive down costs.

Streamlining Intralogistics Transportation

Manufacturers work hard to optimize, automate and integrate the logistical flow of materials within the walls of their fulfillment centers, distribution centers, and warehouses. While some still rely on traditional methods of transportation – forklifts and pallets – many have sought to improve intralogistics by rolling out AGVs.

AGVs reduce the need for workers to carry out non-value add activities on the shop floor by undertaking repetitive transportation jobs. They follow magnetic or optical wires dug into the floor or take reference from reflectors placed on the walls and can tow objects behind them or carry materials on a bed. AGVs are used in nearly every industry, including pulp, paper, metals, newspaper, and general manufacturing.

While they offer many benefits, AGVs require large upfront infrastructure investment and are limited to predefined routes because they need fixed references to operate, all of which brings an innate rigidity. Today's factories, however, are far from static. As manufacturers adapt to meet customers' ever-changing desires and needs, flexibility is critical. AGVs, unfortunately, are unable to provide this. More recently, AGVs' inability to keep up with the demands of dynamic factory environments has led to a surge in human intervention, which in turn has increased transportation costs. Manufacturers needed another solution.

Solution Value: Agile, Cost-effective, Autonomous Transportation

Using the sensory and processing powers enabled through Industry 4.0, SEIT AMRs from Milvus Robotics provide a much more flexible, efficient and integrated transportation compared to AGVs. Autonomous rather than automated, SEIT AMRs have the intelligence to decide and act according to changing environmental conditions. They choose and decide the best routes to take to optimize workflow and travel time, and can safely navigate around obstacles.

Capable of sharing a space with human workers, SEIT AMRs can integrate with existing management systems and take orders from them. They can also communicate with robotic arms or a conveyor to undertake loading and unloading. Multiple SEIT AMRs can work harmoniously in the same facility, thanks to vehicle tracking and/or fleet management systems. The best robot is selected for a job according to already programmed jobs, distance to destination, and battery level. Thus, throughput can be optimized in facilities where there would otherwise be bottlenecks.

As they map the environment they work in through a process of natural navigation, they do not need bands, rails, or any other infrastructure investment. They can be up and running in a couple of hours. Technicians just need to create a map, define destination points, and construct workflows. This process does not require any third-party vendor intervention or additional training.

Built with industrial grade components, SEIT AMRs are designed to withstand the rigors of industrial environments and can safely handle payloads up to 1500 kg with a maximum speed of 1.5 m/s and a zero turning radius.

SEIT AMRs are controlled via Milvus Fleet Manager*, a web-based platform built on a RESTful* API that allows users to request data, form new jobs and mission flows, and trigger actions from any automation platform. It is the main interface for communicating with the machines, which are grouped as a machine-to-machine (M2M) network. Any authorized person can access the controls from any Wi-Fi-connected device, such as a cell phone, tablet, or computer, get real-time information, and connect the robots to the rest of the facility's production systems to create a fully trackable flow that optimizes productivity. Factories can also create their own custom application modules for communication, data transfer, and tracking over the internet, including conditional dynamic operations.

Depending on the implementation, AMRs can provide a return on investment after just one or two years, as they increase productivity, streamline operations, reduce accidents and eliminate CAPEX.

Manufacturers across sectors, from FMCG to home appliances, and from Turkey to the United States, have rolled out SEIT AMRs.

Solution Architecture: SEIT AMRs, Running on Intel® Technology


Figure 2. SEIT AMRs, running on Intel® technology, improve operational efficiency

Milvus Robotics collaborates with Intel to optimize the operation of its SEIT AMRs.

Each robot contains an Intel® NUC, which provides the necessary processing power to drive the navigation system. The Intel NUC is a mini PC with the power of a desktop in a 4x4 form factor. It features a customizable board and chassis ready to accept the required memory, storage and operating system. Running on the Intel® Core™ i7 processor, small, light and battery-powered, it perfectly fits Milvus Robotics’ requirements. Built-in WiFi capability on the Intel NUC ensures fast and reliable data transfer and communication between each robot and all other systems for route optimization, while built-in Bluetooth is used to control more simple communications such as door opening.

SEIT AMRs use 2D Light Detection and Ranging (LiDAR) to underpin some safety elements but alone it is not enough. To ensure 3D space detection, each robot is also kitted out with Intel® RealSense™ technology. This provides the robot with computer vision so it can recognize objects or people while navigating fulfillment centers, distribution centers, and warehouses.

Conclusion

Manufacturers tasked with keeping pace with ever-changing customer demands for new and personalized products and services, while driving down costs, are looking for ways to increase agility and streamline operations.

Intralogistics transportation has relied on the use of AGVs for nearly fifty years, but they no longer support increasing requirements for highly adaptive manufacturing processes. Cognitive and capable of delivering dynamic and efficient transport in increasingly congested industrial operations, AMRs present a viable and cost-effective alternative to traditional material-handling systems like AGVs.

Solutions Proven By Your Peers

Intel Solutions Architects are technology experts who work with the world’s largest and most successful companies to design business solutions that solve pressing business challenges. These solutions are based on real-world experience gathered from customers who have successfully tested, piloted, and/or deployed these solutions in specific business use cases. Solutions architects and technology experts for this solution brief are listed on the front cover.

Learn More

Solution product company:

Intel products mentioned in the paper:

Find Out How You Could Harness the Power of the Internet of Things

Find the solution that is right for your organization. Contact your Intel representative or visit https://www.intel.co.uk/content/www/uk/en/internet-of-things/overview.html


Intel® MPI Library 2019 Technical Preview


The Intel® MPI Library 2019 Technical Preview provides an opportunity to explore the new multi-endpoint technology being implemented for Intel® MPI Library 2019, as well as other technical capabilities which will be updated for the 2019 version.  The Technical Preview is installed alongside an installation of Intel® MPI Library 2018 Update 1 or later, either as a standalone tool or as part of Intel® Parallel Studio XE.  It is installed by default in the following location:

/opt/intel/parallel_studio_xe_2018.1.038/compilers_and_libraries_2018/linux/mpi_2019

The Technical Preview is only available for Linux*.

Automatic Defect Inspection Using Deep Learning for Solar Farm


Executive Summary

Intel's Software and Services Group (SSG) engineers recently collaborated with Honeywell Corporation to launch a joint project: the first commercial unmanned aerial vehicle (UAV) inspection service for solar farms. The Intel SSG team provided the automatic defect inspection system, which adopts Intel® Distribution of Caffe*-based1 deep learning technology. The results proved that deep learning technology can be applied as a general solution for numerous inspection services in the market. They also showed that Intel® Xeon® Scalable processors, along with Intel-optimized deep learning frameworks and functions in the Intel® Math Kernel Library (Intel® MKL)2, can provide sufficiently competitive performance for deep learning training and inference workloads.

Background

A solar farm, also referred to as a photovoltaic power station, is a large-scale photovoltaic system designed to supply merchant power into the electricity grid. To ensure efficient power output, solar farms are usually far away from cities and sited in agricultural areas with complex terrain. Routine inspection and maintenance is a herculean task for solar farms. Because of the hostile environment, solar panels can develop defects, and broken solar panel units reduce power output efficiency. Moreover, according to the U.S. Department of Labor, utility inspection worker is one of the top ten most dangerous jobs in the United States. The big challenge, then, is how to improve inspection efficiency while keeping workers safe.

solar farm inspection

Figure 1:  On-location shot of a solar farm inspection by the Intel® Falcon™ 8+ UAV.

The collaboration project combines the advanced commercial Intel® Falcon™ 8+ UAV system3 and Intel® optimized deep learning platforms and technologies with Honeywell's leadership in aerospace safety. The combined solution uses multiple Intel Falcon 8+ UAVs to patrol and inspect solar farm infrastructure. The automatic defect inspection system based on the Intel Distribution of Caffe deep learning framework developed by the Intel SSG team can greatly improve the efficiency for data analysis, and reduce the workload of skilled workers. Usually, most of the deep learning workloads are accelerated with graphics processing units. Initially, Honeywell colleagues doubted whether the Intel® architecture platform could provide the capability to support their requirements. Eventually, with optimization provided by the Intel SSG team, this collaboration project provided a positive answer to this question and convinced the Honeywell colleagues of the Intel architecture platform’s computation power for deep learning tasks.

Solutions

In the solar panel defect inspection system, the SSG team uses the Faster R-CNN*4 object detection model, implemented with the Intel Distribution of Caffe that is optimized for Intel® architecture platforms. This topology is not supported by the original BVLC (Berkeley Vision and Learning Center) Caffe5; the Intel Distribution of Caffe enabled and accelerated it for Intel® processors. Faster R-CNN can detect an object's location, size, and type. We used the ZFNet*-based6 Faster R-CNN model in our solution.

The training process is supervised learning. The number of sample images available for training was very limited, because Honeywell had few real images captured from UAVs. In order to ensure high accuracy and robustness of the Faster R-CNN model with this limited training set, we used dataset augmentation and an ensemble method in our solution. We took about 300 original solar panel images captured by thermal imaging cameras on UAVs as input, labeled each one as either passed or defective, and marked the positions of the defects; then we rotated each image 36 times, 10 degrees each time, for data augmentation (a rotation sketch is shown after the workflow chart below). The image size is 640 x 480. We did not augment the thermal images at different scales, since the defect detection is size- and position-sensitive. We fed the augmented input into a Faster R-CNN model for training. For the output, we classified each original image by taking an ensemble of the results of its 36 augmented images. The workflow is as follows:

automatic defect inspection system

Figure 2:  Workflow chart of the automatic defect inspection system.
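
As an illustration only (not the project's actual code), the rotation step of this augmentation could be sketched in C++ with OpenCV; the function name, file paths, and output naming scheme are assumptions:

// Rotate one 640x480 thermal image 36 times in 10-degree steps and save every variant.
#include <opencv2/opencv.hpp>
#include <string>

void augment_by_rotation(const std::string &in_path, const std::string &out_prefix)
{
    cv::Mat src = cv::imread(in_path);
    if (src.empty())
        return;  // skip unreadable files

    const cv::Point2f center(src.cols / 2.0f, src.rows / 2.0f);
    for (int step = 0; step < 36; ++step) {
        const double angle = 10.0 * step;  // rotation angle in degrees
        cv::Mat rot = cv::getRotationMatrix2D(center, angle, 1.0);
        cv::Mat dst;
        cv::warpAffine(src, dst, rot, src.size());
        cv::imwrite(out_prefix + "_" + std::to_string(step) + ".png", dst);
    }
}

The defect labels and bounding boxes would be rotated with the same transform so that the augmented annotations stay aligned with the images.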

Results

The Intel Xeon Platinum 8180 processor can complete the training within six hours; the detection accuracy for solar panels is 96.3 percent, and the mean Average Precision (mAP) is 68.9 percent, despite some adverse environmental influences. To illustrate the inspection results of this system, the following sample images are shown. The image on the left is the original thermal image captured by the UAV; the image on the right is the detection result produced by the inspection system. The classification and detection results are very accurate in this sample. Meanwhile, the confidence ratio of the inspection results also provides valuable information for other decision-making systems in the subsequent process.

inspection effect of the automatic defect inspection system

Figure 3:  The inspection effect of the automatic defect inspection system.

In this application, utility workers use the inspection report to guide the engineering team in replacing the defective solar panel units. The exact broken part of a panel cannot be replaced on its own; the solar panel usually has to be replaced as a whole. That is why end users care most about the detection rate, that is, whether there are defects in a given solar panel. The inference performance is 188 images per second. Both the training and inference performance met the throughput requirements from Honeywell and the utility workers.

Our automatic defect inspection system is very efficient for collected image analysis, and reduces manual workloads. The traditional manual inspection solution can only support the inspection frequency of once in three months. With the UAV-based inspection solution, we can support the inspection frequency of once a week if we still ask skilled utility workers to deal with the raw data, and we can support the inspection frequency of three times a week if adopting the automatic defect inspection system to deal with the raw data collected by UAVs.

| HW | TFlops | SW | Batch Size | Images/Second | Time to Train |
|---|---|---|---|---|---|
| Intel® Xeon® processor E5-2699 v3 (b) | 3.1 | Caffe* optimized for Intel® architecture with the Intel® Math Kernel Library (Intel® MKL) | 1 | 0.60 | 9.3 hrs |
| Intel Xeon Platinum 8180 processor (2.5 GHz) (a) | 8.2 | Caffe optimized for Intel architecture with the Intel MKL | 1 | 0.93 | 6 hrs |

Conclusion

This automatic defect inspection application for solar farms demonstrates that deep learning technology can be applied to solve real-world problems, such as unmanned inspection in harsh or dangerous environments. The architecture of Faster R-CNN can learn the sophisticated features from the input images for classification and detection tasks. This is a general solution for numerous inspection services in the markets, which can be used for oil and gas inspection, such as pipeline seepage and leakage; utilities inspection, like transmission lines and substations; and even for crisis response to emergencies. The UAVs can fly high up for close-up inspections without putting workers in danger. And the automatic defect inspection system can greatly improve the efficiency of mass data analysis without the help of skilled workers.

This project also proved that the newly launched Intel Xeon Scalable processors can provide the necessary computation capability for deep learning training and inference workloads (in addition to classical machine learning and some other artificial intelligence algorithms). Caffe optimized for Intel architecture platforms takes advantage of the optimized deep learning functions in the Intel MKL; this has increased performance on a single server by over 100 times. The detailed configuration is listed in the "Configuration Details" section. Recent advances in distributed algorithms have enabled the use of hundreds of servers to further reduce the time to train. These significant performance improvements make heavy deep learning training workloads on Intel architecture a reality.

Acknowledgements

Special thanks to the I2R (Ideas2Reality) program7 for supporting this project with funding, training, coaching, and business connections: Kapil Kane, Rebecca Gu, Martin Daffner, and Jerry Hu; to Yan Yin and Zhenyu Tang for support in the UAV area; and to Young Wang, Frank Zhang, Jiong Gong, and Haihao Shen for support in the deep learning area.

References

  1. Fork of BVLC/Caffe
  2. Intel® Math Kernel Library
  3. Intel® Falcon™ 8+ UAV system
  4. S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in neural information processing systems. 2015
  5. BVLC / Caffe
  6. M.D. Zeiler, R. Fergus, Visualizing and understanding Convolutional Networks. In European conference on computer vision. 2014. 
  7. I2R (Ideas2Reality) program

Configuration Details

a. Platform: 2S Intel Xeon Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance

Intel® Distribution of Caffe:, revision c7ed32772affaf1d9951e2a93d986d22a8d14b88. Intel C++ compiler ver. 17.0.2 20170213, Intel® Math Kernel Library version 2018.0.20170425. Intel® Distribution of Caffe run with “numactl -l“.

BVLC-Caffe: revision 91b09280f5233cafc62954c98ce8bc4c204e7475. BLAS: atlas ver. 3.10.1.

b. Platform: 2S Intel Xeon CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.el7.x86_64. OS drive: Seagate* Enterprise ST2000NX0253 2 TB 2.5" Internal Hard Drive.

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine,compact,1,0', OMP_NUM_THREADS=36; CPU frequency set with cpupower frequency-set -d 2.3G -u 2.3G -g performance

Intel® Distribution of Caffe: revision c7ed32772affaf1d9951e2a93d986d22a8d14b88. Intel C++ compiler ver. 17.0.2 20170213, Intel® Math Kernel Library version 2018.0.20170425. Intel® Distribution of Caffe run with "numactl -l".

BVLC-Caffe: revision 91b09280f5233cafc62954c98ce8bc4c204e7475. BLAS: atlas ver. 3.10.1.

Speed System & IoT Device Application Development with New Intel® System Studio 2018


Intel System Studio Usages

Simplify System Bring-up, Boost Performance & Power Efficiency, Strengthen Reliability

Development for system and IoT developers just got a bit easier. Intel just released  Intel® System Studio 2018, an all-in-one, cross-platform, comprehensive tool suite for system and IoT device application development. This new release (an update from the 2017 version) provides new tools, libraries, code samples, and capabilities that help shorten the development cycle so developers can bring their products to market faster, boost performance and power efficiency, and strengthen reliability for intelligent systems and IoT device applications running on Intel® processor-based platforms.

See below for details about what's new in this release, as well as Intel System Studio's common usages, who uses this product, and the available editions.

Download Intel® System Studio 2018 Now
Also new, developers have access to a FREE 90-day renewable commercial license with public community forum support, and paid license offerings providing Priority Support with confidential access to Intel engineers for technical questions.

 

What's New in Intel System Studio 2018

  • New libraries and code samples help shorten the development cycle. By using the Intel® Data Analytics Acceleration Library, developers can speed edge analytics processing and machine learning.
  • New IoT connection tools including advanced cloud connectors and access to 400+ sensors.
  • Support for the latest Intel® processors - utilize Intel® AVX-512 instructions to optimize system and code performance.1
  • Debug capabilities and enhanced workflows that ease system validation for target devices, automate tracing, ensure reliable edge-to-cloud data exchange, and more.
  • Free 90-day renewable commercial license, which can be refreshed an unlimited number of times to use the latest version.
  • New ability to customize your software download - get only the tools you need.

To receive product updates, users must register or set up their account with the Intel® Software Development Products Registration Center.

New Features & Capabilities Details

Below are more details on the new features and capabilities. You can find more information in the tool suite and in the individual component tools' release notes.

Eclipse* IDE for Intel System Studio 2018

  • Created an Intel version of the Eclipse* IDE for Intel System Studio 2018
  • Created modular Eclipse IDE structure for contribution to the Intel System Studio product
  • Integrated Intel System Studio for IoT Edition into Intel System Studio 2018
  • Improved remote Linux* OS target support 
    • Added Eclipse Target Communication Framework support for target connection
    • Added basic Makefile support
  • Added wizards for Intel® C++ Compiler integration
    • Added local compiler integration for Linux hosts
    • Added cross-compilation integration with support for Linux and Android* OS targets
  • Improved general user experience
    • Custom perspectives
    • Implemented wizards focusing on Intel System Studio use cases
    • Disabled unsupported wizards

For help creating your first cross compiling project see this article: Cross Development

Intel® C++ Compiler 18.0 

  • Control-Flow Enforcement Technology (CET) support
  • New option -Qimf-use-svml to enforce short vector math library (SVML)
  • Compile-time dispatching for SVML calls
  • All -o* options replaced with -qo* options
  • Support of hardware based Profile Guided Optimization (PGO)
  • Features from OpenMP* TR4 Version 5.0 Preview 1
  • Support for more new features in OpenMP* 4.0 or later
  • New C++17 features supported
  • Support for the atomic keyword introduced in C++11
  • New option –qopt-zmm-usage that defines the level of ZMM registers usage

See also

Intel® Math Kernel Library 2018 (Intel® MKL)

  • BLAS Features
    • Introduced 'compact GEMM' and 'compact TRSM' functions to work on groups of matrices and added service functions to support the new format.
    • Introduced optimized integer matrix-matrix multiplication routine to work with quantized matrices for all architectures.
  • BLAS Optimizations
    • Optimized GEMM_S8U8S32 and GEMM_S16S16S32 for Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® AVX-512, with support for the AVX512_4FMAPS and AVX512_4VNNIW instruction groups.
  • Deep Neural Network
    • Added support for non-square pooling kernels.
    • Optimized conversions between plain (nchw, nhwc) and internal data layouts.
  • LAPACK
    • Added improvements and optimizations for small matrices (N<16).
    • Added ?gesvd, ?geqr/?gemqr, ?gelq/?gemlq optimizations for tall-and-skinny and short-and-wide matrices.
    • Added optimizations for the ?pbtrs routine.
    • Added optimizations for the ?potrf routine for the Intel® Threading Building Blocks (Intel® TBB) layer.
    • Added optimizations for CS decomposition routines: ?dorcsd and ?orcsd2by1.
    • Introduced factorization and solve routines based on Aasen's algorithm: ?sytrf_aa/?hetrf_aa, ?sytrs_aa/?hetrs_aa.
    • Introduced new (faster) _rk routines for symmetric indefinite (or Hermitian indefinite) factorization with the bounded Bunch-Kaufman (rook) pivoting algorithm.
  • ScaLAPACK
    • Added optimizations (2-stage band reduction) for p?syevr/p?heevr routines for JOBZ=’N’ (eigenvalues only) case.
  • FFT
    • Introduced Verbose support for FFT domain, which enables users to capture the FFT descriptor information for Intel® MKL.
    • Improved performance of 2D real-to-complex and complex-to-real transforms for Intel® Xeon® processors supporting Intel AVX-512.
    • Improved performance of 3D complex-to-complex transforms for Intel Xeon processors supporting Intel AVX-512.
  • Intel Optimized High Performance Conjugate Gradient Benchmark
    • New version of benchmark with Intel MKL API
  • Sparse BLAS
    • Introduced a symmetric Gauss-Seidel preconditioner.
    • Introduced a symmetric Gauss-Seidel preconditioner with ddot calculation of the resulting and initial arrays.
    • Introduced a sparse matvec routine with ddot calculation of the resulting and initial arrays.
    • Introduced a sparse syrk routine with both OpenMP and Intel® TBB support.
    • Improved performance of Sparse MM and MV functionality for Intel AVX-512 instruction set.
  • Direct Sparse Solver for Cluster
    • Added support for a transpose solver.
  • Vector Mathematics
    • Added 24 functions including optimizations for processors based on Intel AVX-512.
  • Data Fitting
    • Cubic spline-based interpolation in the ILP64 interface was optimized by up to 8x on Intel Xeon processors supporting Intel AVX-512.

See also:

Intel® Data Analytics Acceleration Library (Intel® DAAL)

  • Introduced API modifications to streamline library usage and enable consistency across functionality.
  • Introduced support for Decision Tree for both classification and regression. The feature includes calculation of Gini index and Information Gain for classification, and mean squared error (MSE) for regression split criteria, and Reduced Error Pruning.
  • Introduced support for Decision Forest for both classification and regression. The feature includes calculation of Gini index for classification, variance for regression split criteria, generalization error, and variable importance measures such as Mean Decrease Impurity and Mean Decrease Accuracy.
  • Introduced support for varying learning rate in the Stochastic Gradient Descent algorithm for neural network training.
  • Introduced support for filtering in the Data Source including loading selected features/columns from CSV data source and binary representation of the categorical features
  • Extended Neural Network layers with Element Wise Add layer.
  • Introduced new samples that allow easy integration of the library with Spark* MLlib
  • Introduced service method for enabling thread pinning; performance improvements in various algorithms on Intel Xeon processors supporting Intel AVX-512.

For more information on Intel® DAAL see: Introduction to Intel® DAAL 

Intel® Integrated Performance Primitives 2018 (Intel® IPP)

  • Optimized functions for LZ4 data compression and decompression, a fast compression algorithm suitable for applications where speed is key - especially in communication channels.
  • Optimized functions for GraphicsMagick*, a popular image processing toolbox, so customers using this function can achieve improved performance using drop-in optimization with Intel® IPP functions.
  • Removed the cryptography code dependency on the main package.
  • Extended support of platform-aware APIs, which automatically detect whether image and vector lengths are 32-bit or 64-bit, provide 64-bit parameters for image dimensions and vector length, and abstract this away from the user.

See also: Building a faster LZ4 with Intel® Integrated Performance Primitives

Intel® Threading Building Blocks 2018 (Intel® TBB)

  • The this_task_arena::isolate() function is now a fully supported feature. Also, the this_task_arena::isolate() function and the task_arena::execute() method were extended to pass on the value returned by the executed functor (this feature requires C++11), and the task_arena::enqueue() and task_group::run() methods were extended to accept move-only functors; a minimal usage sketch follows this list.
  • Added support for Android* NDK r15, r15b.
  • Added support for Universal Windows Platform*.
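
The following minimal sketch (assuming the TBB 2018 headers and a C++11 compiler; the data size and arena concurrency are arbitrary example values) shows this_task_arena::isolate() and task_arena::execute() returning the value of the executed functor:

    #include <tbb/task_arena.h>
    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>
    #include <vector>
    #include <iostream>

    int main() {
        std::vector<double> data(1000000, 0.5);

        // isolate() forwards the functor's return value (C++11 required).
        double sum = tbb::this_task_arena::isolate([&]() -> double {
            return tbb::parallel_reduce(
                tbb::blocked_range<std::size_t>(0, data.size()), 0.0,
                [&](const tbb::blocked_range<std::size_t>& r, double acc) {
                    for (std::size_t i = r.begin(); i != r.end(); ++i) acc += data[i];
                    return acc;
                },
                [](double a, double b) { return a + b; });
        });

        // execute() also returns the functor's result; this arena is limited to 4 threads.
        tbb::task_arena arena(4);
        double doubled = arena.execute([&] { return sum * 2.0; });

        std::cout << sum << " " << doubled << "\n";
        return 0;
    }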

IoT connection tools: MRAA & UPM Libraries

  • Includes more than 400 sensor and actuator libraries, with a built-in GUI for exploring the repository
  • Support for these libraries included for Ubuntu*, Wind River Linux*, and Wind River Pulsar*
  • Additional samples included which show how to leverage MRAA and UPM in combination with various cloud services.

See also : Developing with Intel® System Studio - Sensor libraries 

Intel® VTune™ Amplifier 2018 

  • Easier Analysis of Remote Linux Systems
    • Automated install of Intel® VTune™ Amplifier collectors on a remote Linux target.
  • Enhanced Python* Profiling
    • Locks and Waits analysis tunes threaded performance of mixed Python* and native code.
    • Preview: Memory consumption analysis. Python, C, C++.
  • Optimize Private Cloud-Based Applications 
    • Profile inside Docker & Mesos containers.
    • Attach to running Java services and daemons.
  • Media Developers: GPU In-kernel Profiling
    • Analyze GPU kernel execution to find memory latency or inefficient kernel algorithms.
  • Easier Threading Optimization of Applications Using Intel TBB
    • Advanced threading analysis extends classification of high overhead and spin time. 
  • Latest Processors
    • New Intel® processors including Intel Xeon Scalable processor.
  • Cross OS Analysis for All Supported OSes
    • Download other OSes as needed. e.g., collect data on Linux, then analyze it on Windows* or macOS*.

See also:

Energy Analysis/Intel® SoC Watch

  • Added Eclipse* Plug-in for Energy analysis [Preview]

See also:Energy analysis in Intel® System Studio 2018

Intel® Inspector 2018

  • Support for C++17 std::shared_mutex and Windows SRW Locks, that enable threading error analysis for applications with read/write synchronization primitives.
  • Support for cross-OS analysis to all license types. The installation packages for additional operating systems can be downloaded from registrationcenter.intel.com.
  • Microsoft Visual Studio 2017* integration and support.

Intel® Graphics Performance Analyzers

  • Multi-Frame Analyzer Feature Pack 1
  • Trace Analyzer PA Replacement
  • 8th Gen Intel Core Processor (formerly Kaby Lake Refresh) Windows 10 support
  • Windows Redstone 3 support

Intel® System Debugger 2018

  • Added new method for connecting to target systems, called Target Connection Agent.
  • Support for Intel Atom® Processor C3xxx target added for both Windows and Linux hosts.
  • Support for Intel Xeon Scalable Processor / Intel® C620 Series chipset target added for Windows host.
  • Support for 8th generation Intel® Core Processor / Intel® 100 Series Chipset added for Windows host.
  • Support for 8th gen Intel Core processor / Intel® Z370 Series Chipset target added for Windows host.

See also: Using the Target Connection Agent with Intel® System Debugger

Intel® Debug Extensions for WinDbg*

  • WinDbg* supports Windows Driver Kit (WDK) version 1703. Added support for a new eXDI callback (DBGENG_EXDI_IOCTL_V3_GET_NT_BASE_ADDRESS_VALUE) to locate the Windows key structure KdVersionBlock.
  • Extended Intel® Debug Extensions for WinDbg* for Intel® Processor Trace plug-in to support Windows public symbol information.
  • Extended Intel Debug Extensions for WinDbg* for Intel Processor Trace plug-in to support ring 3 tracing.
  • Extended Intel Debug Extensions for WinDbg* for Intel Processor Trace plug-in to support decoding Intel® Processor Trace data from crash dump.

GNU* GDB and source

  • Added visualizer for PKeys hardware register and GS_base and FS_base system registers in Linux.
  • Added Python* call backs for Intel® Processor Trace.

For questions or technical support, visit Intel® Software Products Support

About Intel System Studio

 
Intel System Studio has 3 Editions
  • Composer
  • Professional
  • Ultimate

This comprehensive tool suite helps streamline development so developers can move from prototype to production faster. It is used by device manufacturers, system integrators, and embedded and IoT application developers on solutions that benefit from improved systems and IoT applications, including industrial and manufacturing, health care, retail, smart cities/buildings/homes, transportation, office automation, and many more. Learn more.

1Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice revision #20110804.


Intel® System Studio Release Notes

Intel® System Studio 2018 System Requirements


This page provides the information on the system requirements for the Intel® System Studio 2018 product.

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

You can register and download the Intel® System Studio 2018 package  here

Intel® System Studio supports development for Android*, Embedded Linux*, Yocto Project*, and Wind River Linux* deployment targets from a Linux*, Windows*, or macOS* host.

Note: Individual Intel® System Studio components may have a broader range of supported features/targets than the suite level support covers. See the system requirements of the individual components for detailed information.

Supported Host Operating Systems

Below is the list of distributions supported by most components.  Intel System Studio supports Intel® 64 Host architectures only.  

Linux* Host:

  • Red Hat Enterprise* Linux* 6, 7
  • Ubuntu* 14.04 LTS, 16.04 LTS
  • Fedora* 24, 25, 26
  • Wind River* Linux* 7, 8, 9
  • openSUSE* Leap 42.2
  • SUSE LINUX Enterprise Server* 12 SP1
  • CentOS* 7.1

Windows* Host:

  • Microsoft Windows* 7, 8.x, 10

macOS* Host:

  • macOS*  10.12, 10.13

In most cases, Intel® System Studio will install and work on a standard Linux* OS distribution based on a current Linux* kernel version without problems, even if it is not listed above. You will, however, receive a warning during installation for Linux* OS distributions that are not listed.

Supported Target Operating Systems

Linux target:

  • Red Hat Enterprise* Linux* 6 and 7
  • CentOS* 7.1
  • Yocto project* 1.7, 2.0, 2.2 based environment
  • Ubuntu* 14.04, 16.04
  • Wind River* Linux* 7, 8, 9 based environment
  • SUSE LINUX Enterprise Server* 11, 12

Windows target:

  • Microsoft Windows* 7, 8.x, 10 (PC & Embedded)

Android target:

  • Android* M, N

Supported target hardware platforms

  • Development platform based on the Intel® Atom™ processor Z5xx, N4xx, N5xx, D5xx, E6xx, N2xxx, D2xxx, E3xxx, Z2xxx, Z3xxx, C2xxx, or Intel® Atom™ processor CE4xxx, CE53xx and the Intel® Puma™ 6 Media Gateway
  • Intel® Atom™ Processors X Series Cxxx, Exxx, Zxxx
  • Development platform based on a 2nd, 3rd, 4th, 5th, 6th or 7th generation Intel® Core™ processor
  • Intel® Xeon® processors based on 2nd, 3rd, 4th, 5th or 6th generation
  • Intel® Xeon® Scalable processors series

Space Requirements by Component

  • Intel® C/C++ Compiler: minimum RAM 1 GB (host); recommended RAM 2 GB (host); disk space 4 GB (host, all features), 13 MB (IA-32 target) / 15 MB (IA-64 target)
  • GNU* GDB: minimum RAM 1 GB; recommended RAM 2 GB; disk space 200 MB
  • Intel® Inspector: minimum RAM 2 GB; recommended RAM 4 GB; disk space 350 MB
  • Intel® Integrated Performance Primitives: minimum RAM 1 GB; recommended RAM 4 GB; disk space 2-4 GB
  • Intel® Math Kernel Library: minimum RAM 1 GB; recommended RAM 4 GB; disk space 2.3 GB
  • Intel® System Debugger: minimum RAM 1 GB; recommended RAM 2 GB; disk space 1.4 GB
  • Intel® VTune™ Amplifier: minimum RAM 2 GB; recommended RAM 4 GB; disk space 650 MB
  • Docker* build workflow: minimum RAM 4 GB; disk space 20 GB for Docker images and containers

Prerequisites by Component

Intel® System Studio 2018 might also require installation of webkitgtk for using Eclipse*:

  • Linux* host -
    • RedHat/Fedora: dnf install webkitgtk
    • Ubuntu/Debian: apt-get install libwebkitgtk-1.0.0

Intel® C/C++ Compiler

  • Linux* target - 
    • Linux Developer tools component installed, including gcc, g++ and related tools
      • gcc versions 4.3 - 6.3 supported
      • binutils versions 2.20-2.26 supported
    • Development for a 32-bit target on a 64-bit host may require optional library components (ia32-libs, lib32gcc1, lib32stdc++6, libc6-dev-i386, gcc-multilib, g++-multilib) to be installed from your Linux distribution.

Docker* based application workflow

  • Using Intel® System Studio to target Ubuntu Desktop with the free "Community Edition" (CE) version of Docker* requires Docker version 1.13.0 (Jan 2017 release) or later. We recommend that you install the latest version of Docker on your development system to ensure expected functionality.

Intel® VTune™ Amplifier

  • Linux* target 
    • Linux* Kernel version has to be 2.6.32 or higher for Intel® VTune Amplifier power and performance analysis.
    • Kernel Configuration

Intel® System Debugger

  • Linux* host - 
    • Install fxload package for all types of target communication
      • Ubuntu*: sudo apt-get install fxload
      • Fedora*: sudo yum install fxload 
  • Windows* host - 
    • Microsoft .NET Framework 4 (dotNetFx40_Full_x86_x64.exe) and Microsoft .NET Framework 3.5 SP1 runtime (pre-installed by default on Microsoft* Windows* 7)
      • Download Microsoft .NET Framework 4 web installer from Microsoft.com.
      • Run dotNetFx40_Full_x86_x64.exe 

 

Development environments supported

Eclipse* IDE

An Intel flavor of the Eclipse* IDE is available for Intel® System Studio 2018. Check out the What's new page for more details 

Microsoft Visual Studio* integration

To use the Microsoft Visual Studio development environment or command-line tools to build IA-32 or Intel® 64 architecture applications, one of:

  • Microsoft Visual Studio 2017* Professional Edition (or higher edition) with 'Desktop development with C++' component installed
  • Microsoft Visual Studio 2015* Professional Edition (or higher edition) with 'Common Tools for Visual C++ 2015' component installed 
  • Microsoft Visual Studio Community 2015* with 'Common Tools for Visual C++ 2015' component installed 
  • Microsoft Visual Studio 2013* Professional Edition (or higher edition) with C++ component installed
  • Microsoft Visual Studio Community 2013* with C++ component installed

To use command-line tools only to build IA-32 architecture applications, one of:

  • Microsoft Visual C++ Express 2015 for Windows Desktop*
  • Microsoft Visual C++ Express 2013 for Windows Desktop*

To use command-line tools only to build Intel® 64 architecture applications, one of:

  • Microsoft Visual C++ Express 2015 for Windows Desktop*
  • Microsoft Visual C++ Express 2013 for Windows Desktop*

Component system requirements 

  • Intel® C/C++ Compiler: Windows host, Linux host, Windows target
  • GNU* GDB: Windows host, Linux host
  • Intel® Data Analytics Acceleration Library (Intel® DAAL): all hosts
  • Intel® Debug Extensions for WinDbg*: all hosts
  • Energy Analysis: Windows target, Linux target, Android target
  • Intel® Graphics Performance Analyzers (Intel® GPA): all hosts
  • Intel® Inspector: all hosts
  • Intel® Integrated Performance Primitives (Intel® IPP): all hosts
  • Intel® Math Kernel Library (Intel® MKL): all hosts
  • Intel® System Debugger (System Debug): Windows host, Linux host
  • Intel® System Debugger (System Trace): Windows host, Linux host
  • Intel® Threading Building Blocks (Intel® TBB): all hosts
  • Intel® VTune™ Amplifier & Performance Snapshots: all hosts (hardware requirements), all hosts (software requirements)
  • MRAA IO Communication Layer: MRAA
  • UPM Sensor and Actuator Library: UPM

Intel® System Studio 2018 Release Notes


This page provides the Release Notes for the Intel® System Studio 2018 product.

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

You can register and download the Intel® System Studio 2018 package  here.


Intel® System Studio 2018

To find out What's New in the Beta, see this page: What's New? 

Release notes by studio component:

  • Intel® System Studio (full product): All Hosts
  • Intel® C/C++ Compiler: Linux Host, Windows Host, Windows Target
  • GNU* GDB: Linux Host, Windows Host
  • Intel® Math Kernel Library (Intel® MKL): All Hosts
  • Intel® Integrated Performance Primitives (Intel® IPP): All Hosts
  • Intel® Threading Building Blocks (Intel® TBB): All Hosts
  • Intel® Data Analytics Acceleration Library (Intel® DAAL): All Hosts
  • MRAA IO Communication Layer / UPM Sensor and Actuator Library: MRAA, UPM
  • Intel® VTune™ Amplifier: All Hosts
  • Intel® SoC Watch: Linux Target, Windows Target, Android Target
  • Intel® Inspector: All Hosts
  • Intel® Graphics Performance Analyzers (Intel® GPA): All Hosts
  • Intel® System Debugger (System Debug & System Trace): System Debug Linux Host, System Debug Windows Host, System Trace Linux Host, System Trace Windows Host
  • Intel® Debug Extensions for WinDbg*

To check the system requirements, visit: System Requirements


Intel® VTune™ Amplifier Tutorials


The following tutorials are quick paths to start using the Intel® VTune™ Amplifier. Each demonstrates an end-to-end workflow you can ultimately apply to your own applications.

Each tutorial below lists its duration, documentation links, and sample code, followed by a short description of what you will learn.

Finding Hotspots
Duration: 10-15 minutes

C++ Tutorial
Windows* OS: HTML | PDF
Linux* OS: HTML | PDF
Sample code: tachyon_vtune_amp_xe

Fortran Tutorial
Windows* OS: HTML | PDF
Linux* OS: HTML | PDF
Sample code: nqueens_fortran

Identify where your application is spending time, detect the most time-consuming program units and how they were called.

Analyzing Locks and Waits
Duration: 10-15 minutes

C++ Tutorial
Windows* OS: HTML | PDF
Linux* OS: HTML | PDF
Sample code: tachyon_vtune_amp_xe

Identify locks and waits preventing parallelization.

Identifying Hardware Issues
Duration: 10-15 minutes

C++ Tutorial
Windows* OS: HTML | PDF
Linux* OS: HTML | PDF
Sample code: matrix_vtune_amp_xe

Identify the hardware-related issues in your application such as data sharing, cache misses, branch misprediction, and others.

Analyzing Disk Input/Output Waits
Duration: 10-15 minutes

C++ Tutorial
Linux* OS: HTML | PDF
Sample code: diskio

Analyze an I/O bound application that uses the system file cache and user buffer to work with the I/O device.

Identifying False Sharing
Duration: 10-15 minutes

C Tutorial
Linux* OS: HTML | PDF
Sample code: linear_regression

Identify false sharing.

Analyzing an OpenMP* and MPI Application
Duration: 60+ minutes

C++ Tutorial
Linux* OS: HTML
Sample code: https://github.com/CardiacDemo/Cardiac_demo

Identify issues in a hybrid OpenMP and MPI application using MPI Performance Snapshot, Intel Trace Analyzer and Collector, and Intel VTune Amplifier.

Enabling Performance Collection on an Embedded Linux* System
Duration: 60+ minutes

C++ Tutorial
Linux* OS: HTML | PDF
Sample code: tachyon_vtune_amp_xe

Configure a remote Linux embedded system built with the Yocto Project* 2.1 environment for application analysis with VTune Amplifier sampling drivers. Analyze where your application is spending time and identify the most time-consuming program units with Advanced Hotspots analysis.

Finding Hotspots on an Android* Platform
Duration: 10-15 minutes

C++ Tutorial
Windows* OS: HTML | PDF
Linux* OS: HTML | PDF
Sample code: tachyon_vtune_amp_xe

Configure and run a remote Basic Hotspots analysis on an Android target system.

Analyzing Energy Usage on an Android* Platform
Duration: 10-15 minutes

Tutorial
Linux* OS: HTML | PDF
Windows* OS: HTML | PDF

Run the Energy analysis with the Intel SoC Watch collector (available with the Intel System Studio) directly on the target Android system and view the collected data with the VTune Amplifier installed on the host Windows* or Linux* system.

Analyzing Energy Usage on a Windows* Platform
Duration: 20-30 minutes

Tutorial
Windows* OS: HTML | PDF
Sample code: Pi_Console.exe

Run the Energy analysis of an idle system and a sample application with the Intel SoC Watch collector (available with Intel System Studio) directly in the target Windows* system. Copy the results to the Windows host system and view the collected data with VTune Amplifier.

NOTE: For more end-to-end analysis use cases, explore the Intel VTune Amplifier Cookbook.


Download Documentation: Intel® System Studio (Current and Previous)


This page provides downloadable documentation packages for all editions of Intel® System Studio. 

Each package includes documentation for Intel System Studio components, such as Intel C++ Compiler, libraries (e.g., Intel Math Kernel Library, Intel Integrated Performance Primitives), performance analyzers (e.g., Intel VTune Amplifier, Intel Inspector), and others. The full list of included components and respective documentation formats is available in the readme file in each package.

The packages provide downloadable copies of the web documentation formats and do not include the documents shipped offline with the product.

To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

 

2018

 

CPU Performance Optimization & Differentiation for VR Applications Using Unreal Engine* 4


By: Wenliang Wang

Virtual reality (VR) can bring an unprecedented sense of immersion, but characteristics such as binocular rendering, low latency, high resolution, and forced vertical synchronization (vsync) place great pressure on CPU render threads, logic threads, and graphics processing unit (GPU) computing1. Effectively analyzing the bottlenecks of VR application performance and optimizing CPU threading to improve parallelization across worker threads, thereby reducing GPU wait time and improving GPU utilization, is key to whether a VR application runs smoothly, avoids motion sickness, and stays immersive.

Unreal Engine* 4 (UE*4) is one of the two major game engines currently used by VR developers. Understanding UE4's CPU thread structure and associated optimization tools helps in developing better UE4-based VR applications. This paper covers CPU performance analysis and debugging instructions, thread structure, optimization methods, and tools in UE4. It also covers how to make full use of idle CPU core resources to enhance VR content, and how to scale audio and visual quality to the different hardware configurations of game players. The goal is a game that delivers the best immersive VR experience.

Why Optimize the PC VR Game

Asynchronous timewarp (ATW), asynchronous spacewarp (ASW), and asynchronous reprojection are technologies provided by the VR runtime that generate a composite frame when the VR application drops a frame; inserting frames in this way is equivalent to reducing the latency. However, these are not perfect solutions, and each has different limitations: ATW and asynchronous reprojection can compensate for the motion-to-photon (MTP) delay generated by rotational movement, but if the head position moves or there are moving objects on the screen, the MTP delay cannot be reduced even with ATW and asynchronous reprojection. In addition, ATW and asynchronous reprojection need to be inserted between GPU draw calls; when a draw call takes too long (for example, post-processing) or the time left for ATW and asynchronous reprojection is insufficient, the frame insertion fails. ASW locks the frame rate at 45 frames per second (fps) when rendering cannot keep up, giving each frame 22.2 milliseconds (ms) to render, and inserts a composite frame between two rendered frames using traditional image motion estimation, as shown in Figure 1.

Screenshot of game frame
Screenshot of game frame
Screenshot of game frame

Figure 1:  ASW interpolation effect.

In a synthetic frame, rapid movement or transparent parts of the frame produce deformation (for example, the areas within the red circles in Figure 1), and abrupt illumination changes are also prone to estimation errors. When consecutive frames are inserted using ASW, users can easily perceive picture shaking. These VR runtime technologies are not good solutions to the problem of frequent frame drops; developers should ensure that VR applications can run stably at 90 fps in most cases, and rely on the above methods only to cover occasional frame drops.

Introduction to Unreal Engine* 4 Performance Debugging Instructions

Applications developed with UE4 can query various real-time performance data via stat commands in the console2-3. The stat unit instruction shows the total frame rendering time (Frame), rendering thread time (Draw), logic thread time (Game), and GPU time (GPU), from which you can see which part is restricting the frame rendering time, as shown in Figure 2. Combined with the show or showflag instructions, features can be dynamically switched on and off to observe their impact on rendering time and find the factors that affect performance; during this process the pause command can be executed to suspend the logic thread and observe the result.

It should be noted that the GPU consumption time includes both GPU work time and GPU idle time, so even if stat unit shows that the GPU takes the longest time, it does not necessarily mean that the problem is on the GPU. It is possible that a CPU bottleneck causes the GPU to be idle most of the time, which extends the time it takes for the GPU to complete a frame. So there is a need to combine other tools, such as GPUView*4, to analyze the CPU and GPU time chart and locate the bottleneck.

screenshot of frame statistics
Figure 2:  Stat unit statistics.

In addition, because VR runs with forced vertical synchronization, as soon as the frame render time exceeds 11.1 ms, even by 0.1 ms, the frame takes two full vertical synchronization cycles to complete. As a consequence, a slight scene change can easily slow down a VR application. For a better result, use the -emulatestereo command line option with the resolution set to 2160 x 1200 and the screen percentage (screenpercentage) set to 140, which allows performance analysis without a VR headset attached and with vertical synchronization disabled.
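
As a rough illustration of that setup (the executable name is hypothetical, and flags may differ slightly between engine versions), a packaged build could be launched with the first line below, and the screen percentage then set from the in-game console with the second line:

    MyVRGame.exe -emulatestereo -ResX=2160 -ResY=1200
    r.ScreenPercentage 140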

The performance data associated with the rendering thread can be seen through stat scenerendering, including the number of draw calls, visibility culling time, light processing time, and so on. For visibility culling, the stat initviews instruction can be used for further understanding and analysis of the processing time of each part, including frustum culling, precomputed visibility culling, and dynamic occlusion culling.

To judge the efficiency of scene updates, enter the stat sceneupdate command to see the time it takes to update the world scene, including adding, updating, and removing lights. In addition, the stat dumphitches instruction writes frame rendering information into the log whenever a frame's render time exceeds t.HitchThreshold.

To match game effects to different PC capability levels, stat physics, stat anim, and stat particles are frequently used CPU-related instructions, corresponding to physics computation time (cloth simulation, destruction effects, and so on), skinned mesh computation time, and CPU particle computation time. Because these computations can be assigned to different worker threads for parallel processing in UE4, they can be scaled accordingly so that the VR application adapts effectively to different levels of hardware. As a result, VR immersion and overall performance can be enhanced as the number of CPU cores increases.

In addition, you can enter the console commands stat startfile and stat stopfile to collect real-time running data for a designated period, and then use the Stats Viewer in the UE4 session frontend to view CPU thread utilization and the call stack, find the CPU hot spots, and carry out the corresponding optimization, as shown in Figure 3. The functions are similar to Windows* Performance Analyzer (WPA) in the Windows* Assessment and Deployment Kit (ADK).

Screenshot of UE4 built in Stats Viewer
Figure 3:  The Stats Viewer built in the UE4 session frontend.

CPU Optimization Techniques for UE*4 VR Applications

In the process of VR development, when encountering CPU performance problems, not only do we need to find out where the bottleneck is, but we also have to master the tools provided in UE4 that can help to optimize the bottleneck. By understanding the usage, effects, and differences of each tool we can quickly identify and select the most appropriate strategy to optimize the performance of VR applications. In this section we will focus on the UE4 optimization tools.

Rendering Thread Optimization

Due to performance, bandwidth, and multisample anti-aliasing (MSAA) considerations, current VR applications usually use forward rendering instead of deferred rendering. However, in the UE4 forward rendering pipeline, in order to reduce GPU overdraw, the prepass phase before the base pass forces the use of early-z to generate the depth buffer, which means relatively little GPU work is submitted before the base pass. In addition, DirectX* 11 rendering is essentially single threaded with poor multi-threading capability, so when there are large numbers of draw calls or primitives in the VR scene, culling time becomes long. As a result, the phase before the base pass is likely to produce GPU bubbles due to rendering thread bottlenecks, reducing GPU utilization and triggering frame drops. Optimizing the rendering thread is therefore of vital importance to VR development.

Figure 4 shows an example of a VR game that is limited by the CPU rendering thread. The game runs on HTC Vive* with an average frame rate of 60 fps. Although the GPU appears to be the main performance bottleneck according to the console command stat unit, the rendering thread (Draw) takes a very long time each frame. From the frame timing in SteamVR*, it can be clearly seen that the CPU even has a late start, which means the workload of the rendering thread is very heavy. (In SteamVR, the rendering thread calculation for a frame starts 3 ms before the vertical synchronization at the beginning of the frame, which is known as the running start. The intention is to trade 3 ms of extra latency for letting the rendering thread work in advance, so that the GPU can start working immediately after the frame's vertical synchronization and run at maximum efficiency. If a frame's rendering thread work is still not finished 3 ms before the next vertical sync, it blocks the running start of the next frame, which is called a late start. A late start delays the rendering thread's work, resulting in GPU bubbles.)

In addition, in the SteamVR frame timing it can be seen that the GPU "Other" time is higher every other frame; from the following analysis, it can be seen that this is actually the GPU bubble time before the prepass.

If we use GPUView to analyze the scene in Figure 4, we get the result in Figure 5, where the red arrows indicate the times at which the CPU rendering thread starts. Because of the running start, the first red arrow starts 3 ms before the vertical sync; but when the vertical sync arrives, the GPU still has no work to do until 3.5 ms into the frame, where the GPU works briefly, followed by 1.2 ms of idle time. Only after that can the CPU submit the prepass work to the CPU context queue, and 2 ms after the completion of the prepass, the base pass work can be submitted to the CPU context queue for the GPU to execute.

In Figure 5, the locations indicated by the red circles are GPU idle time; this idle time (also known as GPU bubbles) adds up to nearly 7 ms, which directly causes the frame drop, since the GPU cannot finish rendering within 11.1 ms. As a result, two vertical synchronization cycles are needed to complete the work of this frame. We can use WPA to analyze the rendering thread's call stack during the GPU bubbles and find out which functions cause the bottleneck1. The second red arrow indicates the start time of the rendering thread for the next frame; because this frame dropped, the rendering thread of the following frame gains a full vertical synchronization cycle for its calculation.

When the GPU of the next frame starts working after a vertical sync, the rendering thread has already filled the CPU context queue, so the GPU has enough work to do and no GPU bubbles are generated. As long as there are no GPU bubbles, the frame can be rendered in 9 ms, so the next frame does not drop. Three vertical sync cycles are therefore needed to complete the rendering of two frames, which is why the average frame rate is 60 fps.

The analysis of Figure 5 shows that in this example the GPU is not actually the performance bottleneck; as long as the real bottleneck on the CPU rendering thread is solved, the VR game can reach 90 fps. In fact, we found that rendering thread bottlenecks exist in most VR applications developed with UE4, so familiarity with the following UE4 rendering thread optimization tools can greatly improve VR application performance.

Screenshot of game
Figure 4:  An example of a VR game with a CPU rendering thread bottleneck, which shows the SteamVR* statistics on time consumption of CPU and GPU per frame.

Screenshot of game
Figure 5:  The time view of GPUView* for the Figure 4 example; you can see the CPU rendering thread bottlenecks leading to the GPU idle, which triggered the frame drop.

Instanced Stereo Rendering

VR doubles the number of draw calls due to binocular rendering, which can easily lead to rendering thread bottlenecks. With instanced stereo rendering, each object is submitted with only one draw call, and the GPU applies the corresponding transformation matrices for the left and right eye views so that the object is drawn for both eyes. This effectively transfers CPU work to the GPU: it increases GPU vertex shader work but halves the number of draw calls, so it typically reduces the rendering thread load and results in a performance increase of about 20 percent for VR applications, unless the number of draw calls in the VR scene is low (<500). You can turn instanced stereo rendering on or off in the project settings.
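
For reference, the project setting maps to a console variable that can also be set in the project's DefaultEngine.ini; the section and key below reflect the commonly used form and should be checked against your engine version:

    [/Script/Engine.RendererSettings]
    vr.InstancedStereo=True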

Visibility Culling

Rendering thread bottlenecks in VR applications usually have two major causes: static mesh drawing and visibility culling. Static mesh drawing can be optimized by merging draw calls or meshes, while visibility culling requires reducing the number of primitives or the amount of dynamic occlusion culling.

Visibility culling bottlenecks are particularly severe in VR applications because, to reduce latency, VR limits the CPU rendering thread to starting its per-frame calculation only 3 ms before vertical sync (running start/queue ahead), and the UE4 InitViews stage (which includes visibility culling and dynamic shadow setup) generates no GPU work. Once InitViews takes more than 3 ms, it produces GPU bubbles and reduces GPU utilization, likely causing dropped frames, so visibility culling needs to be a major focus of optimization in VR.

Visibility culling in UE4 consists of four parts; the sequence, according to the calculation complexity from low to high, is:

  1. Distance culling
  2. View frustum culling
  3. Precomputed occlusion culling
  4. Dynamic occlusion culling, including hardware occlusion queries and hierarchical z‑buffer occlusion

During design, the best approach is to remove the majority of primitives using culling methods 1 through 3 as much as possible, in order to reduce the InitViews bottleneck, because the computation cost of method 4 (dynamic occlusion culling) is much greater than the other three. The following focuses on view frustum culling and precomputed occlusion culling.

View Frustum Culling

In UE4, view frustum culling for a VR application is done separately for the left and right eye cameras, which means that every primitive in the scene must be tested twice to complete view frustum culling. But we can change the UE4 code to implement super-frustum culling5, that is, merging the left eye and right eye frusta into one frustum, which saves the rendering thread roughly half of its view frustum culling time in a scene.
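
The idea can be illustrated with a small, engine-independent sketch (a conceptual example, not the UE4 implementation; the plane ordering and the symmetric-eye assumption are simplifications): the merged frustum keeps the left eye's left plane and the right eye's right plane, so each bounding sphere is tested once instead of twice.

    #include <array>
    #include <cstddef>
    #include <vector>

    struct Plane  { float nx, ny, nz, d; };    // plane n.p + d = 0, normal pointing into the frustum
    struct Sphere { float x, y, z, radius; };
    using Frustum = std::array<Plane, 6>;      // [0]=left, [1]=right, [2]=top, [3]=bottom, [4]=near, [5]=far

    // Merge the two eye frusta: keep the left eye's left plane and take the right eye's
    // right plane; the remaining planes are assumed shared for a symmetric eye setup.
    Frustum MakeSuperFrustum(const Frustum& leftEye, const Frustum& rightEye) {
        Frustum merged = leftEye;
        merged[1] = rightEye[1];
        return merged;
    }

    bool IsVisible(const Frustum& f, const Sphere& s) {
        for (const Plane& p : f) {
            const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
            if (dist < -s.radius) return false;   // completely outside this plane
        }
        return true;                              // inside or intersecting every plane
    }

    // One culling pass over the scene instead of one pass per eye.
    std::vector<std::size_t> CullScene(const Frustum& superFrustum, const std::vector<Sphere>& bounds) {
        std::vector<std::size_t> visible;
        for (std::size_t i = 0; i < bounds.size(); ++i)
            if (IsVisible(superFrustum, bounds[i])) visible.push_back(i);
        return visible;
    }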

Precomputed Occlusion Culling

After distance culling and view frustum culling, we can use precomputed occlusion culling to further reduce the number of primitives that need to be sent to the GPU for dynamic occlusion culling. This reduces the time the rendering thread spends processing visibility culling and, at the same time, reduces the frame popping of the dynamic occlusion system (because the GPU occlusion query result takes one frame to return, visibility errors are likely when the view rotates quickly or when an object is near a corner).

Precomputed occlusion culling trades increased memory usage and lighting build time for a lower rendering thread load; the larger the scene, the more memory the precomputed data occupies and the longer it takes to decode the pre-stored data. However, VR scenes are generally smaller than those of traditional games, most objects in the scene are static, and the user's movable area is limited, all of which favor precomputed occlusion culling; it is also an optimization that must be done for VR application development.

In practice, precomputed occlusion culling automatically divides the entire scene into visibility cells of the same size, based on the parameter settings, covering all possible locations of the view camera. For each cell, it stores the primitives that are guaranteed to be occluded from that cell. At runtime, a lookup table (LUT) provides the primitives to be culled at the current location, and those stored primitives do not need to go through dynamic occlusion culling again.

We can use the console command Stat InitViews and look at Statically Occluded Primitives to see how many primitives are removed by precomputed occlusion culling, use Decompress Occlusion to view the per-frame decoding time of the stored data, and use Precomputed Visibility Memory in Stat Memory to check the memory usage of the pre-stored data. Occluded Primitives includes the primitives removed by both precomputed and dynamic occlusion culling; increasing the proportion of Statically Occluded Primitives to Occluded Primitives (above 50 percent) helps to significantly reduce InitViews time. The detailed setup steps and limitations of precomputed occlusion culling in UE4 can be found in6-7.

Screenshot of precomputed occlusion culling example
Figure 6:  Precomputed occlusion culling example.

Static Mesh Actor Merging

The Merge Actors tool in UE4 can automatically merge multiple static meshes into one mesh to reduce draw calls, and you can select in its settings whether to merge materials, light maps, or physics data according to actual needs; the setup process is described in reference8. In addition, there is another tool in UE4, Hierarchical Level of Detail (HLOD)9; the difference is that HLOD only merges objects at distant levels of detail (LODs).

Instancing

When the same mesh or object appears many times in a scene (such as haystacks or boxes), it can be implemented using instanced meshes. Only one draw call needs to be submitted; the GPU performs the corresponding coordinate transformation based on each object's location when drawing. If there are many identical meshes in a scene, instancing can effectively reduce the draw calls issued by the rendering thread. Instancing can be set in the blueprint (BlueprintAPI -> Components -> InstancedStaticMesh (ISM))10. If you want each instanced object to have its own LODs, you can use hierarchical ISM (HISM).
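
As a rough C++ illustration (the actor class AHaystackField, the component name, and the 8 x 8 grid are hypothetical; the static mesh and material would normally be assigned in the editor), an instanced static mesh component can be populated like this:

    #include "CoreMinimal.h"
    #include "GameFramework/Actor.h"
    #include "Components/InstancedStaticMeshComponent.h"
    #include "HaystackField.generated.h"

    UCLASS()
    class AHaystackField : public AActor
    {
        GENERATED_BODY()

    public:
        AHaystackField()
        {
            // One component holds every haystack; all instances share a single draw call.
            Instances = CreateDefaultSubobject<UInstancedStaticMeshComponent>(TEXT("Haystacks"));
            RootComponent = Instances;
        }

        virtual void BeginPlay() override
        {
            Super::BeginPlay();
            for (int32 i = 0; i < 64; ++i)   // hypothetical 8 x 8 grid of instances
            {
                const FVector Location(300.f * (i % 8), 300.f * (i / 8), 0.f);
                Instances->AddInstance(FTransform(Location));
            }
        }

        UPROPERTY(VisibleAnywhere)
        UInstancedStaticMeshComponent* Instances;
    };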

Monoscopic Far-Field Rendering

Limited by interpupillary distance, the human eye perceives depth differently at different distances. With the average human interpupillary distance of 65 mm, depth perception is strongest between 0.75 m and 3.5 m; depth beyond about eight meters is hard to perceive, and sensitivity continues to drop as the distance grows.

Based on this characteristic, Oculus* and Epic Games* introduced monoscopic far-field rendering in the forward rendering pipeline of UE 4.15, allowing objects in a VR application to be rendered monoscopically or stereoscopically depending on their distance from the view camera11. If there are many distant objects in the scene, rendering them monoscopically reduces scene rendering and pixel shading costs.

For example, each frame of the Oculus Sun Temple scene reduces rendering costs by 25 percent with monoscopic far-field rendering. It is worth noting that monoscopic far-field rendering in the current UE4 can only be used on GearVR*; PC VR support will be included in a newer version of UE4. The detailed setup for monoscopic far-field rendering in UE4 can be found in reference12. You can also view the contents of the stereoscopic or monoscopic buffer by entering the console command vr.MonoscopicFarFieldMode 0-4.

Logical Thread Optimization

In the UE4 VR rendering pipeline, the logic thread is calculated one frame ahead of the rendering thread: the rendering thread generates proxies based on the previous frame's logic thread results and renders accordingly, ensuring that the data does not change during rendering, while the logic thread updates in parallel and its results are reflected on screen through the next frame. Since the logic thread runs one frame in advance, it does not become a performance bottleneck unless it takes more than one vertical sync period (11.1 ms). The problem is that in UE4 the logic thread and the rendering thread each run on a single thread, and gameplay blueprints, actor ticking, artificial intelligence, and other calculations are all handled by the logic thread. If many actors or interactions in the scene cause the logic thread to take more than one vertical synchronization cycle, it needs to be optimized. Here are two performance optimization techniques for logic threads.

Blueprint Nativization

In UE4's default blueprint execution process, a virtual machine (VM) is used to run the blueprint, and the VM overhead results in performance loss. UE 4.12 introduced blueprint nativization; all or part of the blueprints (inclusive/exclusive) can be compiled directly into C++ code and dynamically loaded as a runtime DLL, avoiding the VM overhead and improving logic thread efficiency. Detailed settings can be found in reference13.

It should be noted that if the blueprint itself has already been optimized (for example, by implementing the heavy computation modules directly in C++), the performance improvement from blueprint nativization is limited. Also, a UFUNCTION called from a blueprint cannot be inlined; frequently called functions can instead use blueprint math nodes (which are inlined) or a UFUNCTION that forwards to an inline function. The best way, of course, is to assign the work directly to other threads14-15.
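
A minimal sketch of the wrapper pattern (the class and function names are hypothetical): the Blueprint-callable UFUNCTION stays a thin wrapper, while native callers use the FORCEINLINE helper directly.

    #include "CoreMinimal.h"
    #include "GameFramework/Actor.h"
    #include "MyMathActor.generated.h"

    UCLASS()
    class AMyMathActor : public AActor
    {
        GENERATED_BODY()

    public:
        // Thin wrapper exposed to Blueprints; the call itself cannot be inlined across the VM boundary.
        UFUNCTION(BlueprintCallable, Category = "Math")
        float DistanceSquared(FVector A, FVector B) const { return DistanceSquaredInline(A, B); }

        // Native C++ code paths call this directly and get real inlining.
        FORCEINLINE float DistanceSquaredInline(const FVector& A, const FVector& B) const
        {
            return FVector::DistSquared(A, B);
        }
    };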

Skeleton Meshing

If too many actors cause logic thread bottlenecks in the scene, then in addition to lowering skeletal mesh LODs and animation ticking, you can also use an LOD- or distance-based hierarchical approach to handle the interaction behavior between actors. Sharing some skeletal resources among several LODs is another viable option16.

CPU Differentiation of UE4 VR Application

The above describes several CPU optimization techniques for VR applications, but optimization can only ensure that VR applications do not drop frames or cause motion sickness; it cannot by itself further enhance the experience. If you want to enhance the VR experience, you must make the greatest possible use of the computing power provided by the hardware, and translate these computing resources into content, effects, and picture quality for the end user, which requires the CPU to provide correspondingly differentiated content based on its computing power. Following are five techniques for CPU differentiation.

Cloth Painting

UE4 cloth painting is computed mainly on worker threads assigned by the physics engine, so the impact on the logic thread is small. Cloth simulation must be calculated every frame, even if the cloth is not within the visible screen area, because the result is needed to determine whether the update will be displayed, so the computation is relatively stable. The appropriate cloth simulation level can therefore be selected according to CPU capability17.

Destructible Mesh

In UE4, destructible meshes are computed mainly on worker threads assigned by the physics engine; this part can be scaled up if a high-performance CPU is available. The results include more objects that can be destroyed, more fragments produced by destruction in the scene, and fragments that persist for a longer time. Destructible meshes greatly enhance the richness of a scene and the sense of immersion; the setup process is described in reference18.

CPU Particles

CPU particles are a module that is relatively easy to scale; although the CPU can handle fewer particles than the GPU, maximizing the use of multi-core CPU computing power can reduce the burden on the GPU, and CPU particles have the following unique capabilities. They can:

  • Glow
  • Use specific particle materials and parameters (metallic, transparent materials, and so on)
  • Be controlled by specific gravity trajectories (attracted by points, lines, or other particles)
  • Produce shadows

During the development process, you can set the corresponding CPU particle effect for different CPUs.

Two screenshot for side to side comparison
Figure 7:  Particles differentiation in the Hunting Project*.

Steam Audio*

For VR applications, in addition to the picture, another important element for creating an immersive experience is the audio. Directional 3D audio is an effective way to enhance VR immersion. Oculus has introduced the Oculus Audio* SDK19 to simulate 3D audio, but its environmental sound simulation is relatively simple and the SDK is not widely adopted. Steam Audio*20 is a new 3D audio SDK offered by Valve* that supports Unity* 5.2 or newer and UE 4.16 or newer, and provides a C language interface. Steam Audio has the following features:

  • Provides 3D audio effects based on real physical simulation, supports directional audio filtering with head-related transfer functions (HRTF) and ambient sound effects (including sound occlusion, real-world audio transmission, reflection, and mixing), and supports access to the inertial data of the VR headset.
  • The material and parameters (scattering coefficient, absorption rate for different frequencies, and so on) can be set for each object in the scene. The simulation of environmental sound can be processed in real time or by baking, according to the computing power of the CPU.
  • Many of the settings or parameters for ambient sound can be adjusted according to quality or performance requirements, such as HRTF interpolation methods, the number of audio ray traces, the number of reflections, and the form of mixing.
  • Compared to the Oculus Audio SDK, which supports only the shoebox model and does not support sound occlusion, Steam Audio's 3D audio simulation is more realistic and complete, providing finer quality control.
  • Free, and not bound to a particular VR headset or platform.

Steam Audio collects the source and listener status and information from the UE4 logic thread, and uses worker threads for ray tracing and simulating environmental reflections of the sound. The calculated impulse response is then passed to the audio rendering thread for the corresponding filtering and mixing of the sound source, and the result is output to the headphones by the operating system's audio thread (such as XAudio2 on Windows*).

The entire process is done by CPU threads. Because adding 3D audio does not increase the load on the rendering thread or the logic thread, the original game's performance is not affected, which makes it very suitable for enhancing a VR experience. The detailed setup process can be found in the Steam Audio documentation21.

Scalability

The scalability settings of UE4 are a set of parameters that trade picture quality against performance so that applications can fit different computing platforms22. For the CPU, scalability is mainly reflected in the following parameters (a minimal sketch of applying them from game code follows the list):

  • View distance: distance culling scale ratio (r.ViewDistanceScale 0 – 1.0f)
  • Shadows: Shadow quality (sg.ShadowQuality 0 - 3)

Screenshot of shade differentiation in a Tencent* VR game
Figure 8:  Shade differentiation in the Tencent* VR game Hunting Project* .

  • Foliage: amount of foliage rendered at a time (FoliageQuality 0 - 3)

Screenshot of foliage differentiation in a Tencent* VR game
Figure 9:   Foliage differentiation in the Tencent* VR game Hunting Project*.

  • Skeletal mesh LOD bias (r.SkeletalMeshLODBias)
  • Particle LOD bias (r.ParticleLODBias)
  • Static mesh LOD distance scale (r.StaticMeshLODDistanceScale).

Summary

This article describes a variety of CPU performance analysis tools, optimization methods, and differentiation techniques; due to space limitations, refer to the reference section to learn more. Proficiency with a variety of CPU performance analysis tools and techniques helps you quickly find bottlenecks and optimize accordingly, which is very important for VR applications. In addition, by making use of idle multithreaded resources during optimization, you can help the application achieve better visuals and performance, providing a better VR experience.

Reference

  1. Performance Analysis and Optimization for PC-Based VR Applications: From the CPU’s Perspective:
    https://software.intel.com/en-us/articles/performance-analysis-and-optimization-for-pc-based-vr-applications-from-the-cpu-s
  2. Unreal Engine Stat Commands: https://docs.unrealengine.com/latest/INT/Engine/Performance/StatCommands/index.html
  3. Unreal Engine 3 Console Commands: https://docs.unrealengine.com/udk/Three/ConsoleCommands.html
  4. GPUView: http://graphics.stanford.edu/~mdfisher/GPUView.html
  5. The Vanishing of Milliseconds: Optimizing the UE4 renderer for Ethan Carter VR: http://www.gamasutra.com/blogs/LeszekGodlewski/20160721/272886/The_Vanishing_of_Milliseconds_Optimizing_the_UE4_renderer_for_Ethan_Carter_VR.php
  6. Precomputed Visibility Volumes: http://timhobsonue4.snappages.com/culling-precomputed-visibility-volumes
  7. Precomputed Visibility: https://docs.unrealengine.com/udk/Three/PrecomputedVisibility.html
  8. Unreal Engine Actor Merging: https://docs.unrealengine.com/latest/INT/Engine/Actors/Merging/
  9. Unreal Engine Hierarchical Level of Detail: https://docs.unrealengine.com/latest/INT/Engine/HLOD/index.html
  10. Unreal Engine Instanced Static Mesh: https://docs.unrealengine.com/latest/INT/BlueprintAPI/Components/InstancedStaticMesh/index.html
  11. Hybrid Mono Rendering in UE4 and Unity: https://developer.oculus.com/blog/hybrid-mono-rendering-in-ue4-and-unity/
  12. Hybrid Monoscopic Rendering (Mobile): https://developer.oculus.com/documentation/unreal/latest/concepts/unreal-hybrid-monoscopic/
  13. Unreal Engine Nativizing Blueprints: https://docs.unrealengine.com/latest/INT/Engine/Blueprints/TechnicalGuide/NativizingBlueprints/
  14. Unreal Engine Multi-Threading: How to Create Threads in UE4: https://wiki.unrealengine.com/Multi-Threading:_How_to_Create_Threads_in_UE4
  15. Implementing Multithreading in UE4: http://orfeasel.com/implementing-multithreading-in-ue4/
  16. Unreal Engine Skeleton Assets: https://docs.unrealengine.com/latest/INT/Engine/Animation/Skeleton/
  17. Unreal Engine Clothing Tool: https://docs.unrealengine.com/latest/INT/Engine/Physics/Cloth/Overview/
  18. How to Create a Destructible Mesh in UE4: http://www.onlinedesignteacher.com/2015/03/how-to-create-destructible-mesh-in-ue4_5.html
  19. Oculus Audio SDK Guide: https://developer.oculus.com/documentation/audiosdk/latest/concepts/book-audiosdk/
  20. A Benchmark in Immersive Audio Solutions for Games and VR: https://valvesoftware.github.io/steam-audio/
  21. Download Steam Audio: https://valvesoftware.github.io/steam-audio/downloads.html
  22. Unreal Engine Scalability Reference: https://docs.unrealengine.com/latest/INT/Engine/Performance/Scalability/ScalabilityReference/

About the Author

Wenliang Wang is a senior software engineer in Intel's Software and Services Group. He works with VR content developers on Intel CPU performance optimization and differentiation, sharing CPU optimization experience to make more efficient use of the CPU. Wenliang is also responsible for the implementation, analysis, and optimization of multimedia video codecs and real-time applications; he has over 10 years of experience in video codecs, image analysis algorithms, computer graphics, and performance optimization, and has published numerous papers in the industry. Wenliang graduated from the Department of Electrical Engineering and Communication Engineering Research Institute of Taiwan University.

Explore Unity Technologies ML-Agents* Exclusively on Intel® Architecture


Abstract

This article describes how to install and run Unity Technologies ML-Agents* in CPU-only environments. It demonstrates how to:

  • Train and run the ML-Agents Balance Balls example on Windows* without CUDA* and cuDNN*.
  • Perform a TensorFlow* CMake build on Windows optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2).
  • Create a simple Amazon Web Services* (AWS) Ubuntu* Amazon Machine Image* environment from scratch without CUDA and cuDNN, build a “headless” version of Balance Balls for Linux*, and train it on AWS.

Introduction

Unity Technologies released their beta version of Machine Learning Agents* (ML-Agents*) in September 2017, offering an exciting introduction to reinforcement learning using their 3D game engine. According to Unity’s introductory blog, this open SDK will potentially benefit academic researchers, industry researchers interested in “training regimes for robotics, autonomous vehicle, and other industrial applications,” and game developers.

Unity’s ML-Agents SDK leverages TensorFlow* as the machine learning framework for training agents using a Proximal Policy Optimization (PPO) algorithm. There are several example projects included in the GitHub* download, as well as a Getting Started example and documentation on how to install and use the SDK.

One downside of the SDK for some developers is the implied dependencies on CUDA* and cuDNN* to get the ML-Agents environment up and running. As it turns out, it is possible not only to explore ML-Agents exclusively on a CPU, but also to perform a custom build of TensorFlow on a Windows* 10 computer that includes optimizations for Intel® architecture.

In this article we show you how to:

  • Train and run the ML-Agents Balance Balls (see Figure 1) example on Windows without CUDA and cuDNN.
  • Perform a TensorFlow CMake build on Windows* optimized for Intel® Advanced Vector Extensions 2 (Intel® AVX2).
  • Create a simple Amazon Web Services* (AWS) Ubuntu* Amazon Machine Image* (AMI) environment from scratch without CUDA and cuDNN, build a “headless” version of Balance Balls for Linux*, and train it on AWS.


Figure 1. Trained Balance Balls model running in Unity* software.

 

Target Audience

This article is intended for developers who have had some exposure to TensorFlow, Unity software, Python*, AWS, and machine learning concepts.

System Configurations

The following system configurations were used in the preparation of this article:

Windows Workstation

  • Intel® Xeon® processor E3-1240 v5
  • Microsoft Windows 10, version 1709

Linux Server (Training)

  • Intel® Xeon® Platinum 8180 processor @ 2.50 GHz
  • Ubuntu Server 16.04 LTS

AWS Cloud (Training)

  • Intel® Xeon® processor
  • Ubuntu Server 16.04 LTS AMI

In the section on training ML-Agents in the cloud we use a free-tier Ubuntu Server 16.04 AMI.

Install Common Windows Components

This section describes the installation of common software components required to get the ML-Agents environment up and running. The Unity ML-Agents documentation contains an Installation and Setup procedure that links to a webpage instructing the user to install CUDA and cuDNN. Although this is fine if your system already has a graphics processing unit (GPU) card that is compatible with CUDA and you don’t mind the extra effort, it is not a requirement. Either way, we encourage you to review the Unity ML-Agents documentation before proceeding. 

There are essentially three steps required to install the common software components:

  1. Download and install Unity 2017.1 or later from the package located here.
  2. Download the ML-Agents SDK from GitHub. Extract the files and move them to a project folder of your choice (for example, C:\ml-agents).
  3. Download and install the Anaconda* distribution for Python 3.6 version for Windows, located here.

Install Prebuilt TensorFlow*

This section follows the guidelines for installing TensorFlow on Windows with CPU support only. According to the TensorFlow website, “this version of TensorFlow is typically much easier to install (typically, in 5 or 10 minutes), so even if you have an NVIDIA* GPU, we recommend installing this version first.” Follow these steps to install prebuilt TensorFlow on your Windows 10 system:

  1. In the Start menu, click the Anaconda Prompt icon (see Figure 2) to open a new terminal.


    Figure 2. Windows* Start menu.

  2. Type the following commands at the prompt:

    > conda create -n tensorflow-cpu python=3.5
    > activate tensorflow-cpu
    > pip install --ignore-installed --upgrade tensorflow

  3. As specified in the TensorFlow documentation, ensure the installation worked correctly by starting Python and typing the following commands:

    > python
    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello')
    >>> sess = tf.Session()
    >>> print (sess.run(hello))

  4. If everything worked correctly, 'Hello' should print to the terminal as shown in Figure 3.


    Figure 3. Python* test output.

    You may also notice a message like the one shown in Figure 3, stating “Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2.” This message may vary depending on the Intel® processor in your system; it indicates TensorFlow could run faster on your computer if you build it from sources, which we will do in the next section.

  5. To close Python, at the prompt, press CTRL+Z.
     
  6. Navigate to the python subdirectory of the ML-Agents repository you downloaded earlier, and then run the following command to install the other required dependencies:

    > pip install .

  7. Refer to the Building Unity Environment section of the “Getting Started with Balance Ball Example” tutorial to complete the ML-Agents tutorial.

Install TensorFlow from Sources

This section describes how to build an optimized version of TensorFlow on your Windows 10 system.

The TensorFlow website states, “We don't officially support building TensorFlow on Windows; however, you may try to build TensorFlow on Windows if you don't mind using the highly experimental Bazel on Windows or TensorFlow CMake build.” However, don’t let this discourage you. In this section we provide instructions on how to perform a CMake build on your Windows system.

The following TensorFlow build guidelines complement the Step-by-step Windows build instructions shown on GitHub. To get a more complete understanding of the build process, we encourage you to review the GitHub documentation before continuing. 

  1. Install Microsoft Visual Studio* 2015. Be sure to check the programming options as shown in Figure 4.


    Figure 4. Visual Studio* programming options.

  2. Download and install Git from here. Accept all default settings for the installation.
     
  3. Download and extract swigwin from here. Change folders to C:\swigwin-3.0.12 (note that the version number may be different on your system).
     
  4. Download and install CMake version 3.6 from here. During the installation, be sure to check the option Add CMake to the system path for all users.
     
  5. In the Start menu, click the Anaconda Prompt icon (see Figure 2) to open a new terminal. Type the following commands at the prompt:

    > conda create -n tensorflow-custom36 python=3.6
    > activate tensorflow-custom36

  6. Run the following command to set up the environment:

    > "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat"

    (Note: If vcvarsall.bat is not found, try following the instructions provided here.)
     
  7. Clone the TensorFlow repository and create a working directory for your build:

    > cd /
    > git clone https://github.com/tensorflow/tensorflow.git
    > cd tensorflow\tensorflow\contrib\cmake
    > mkdir build
    > cd build

  8. Type the following commands (Note: Be sure to check the paths and library version shown below on your own system, as they may be different):

    > cmake .. -A x64 -DCMAKE_BUILD_TYPE=Release ^
    -DSWIG_EXECUTABLE=C:\swigwin-3.0.12/swig.exe ^
    -DPYTHON_EXECUTABLE=C:/Users/%USERNAME%/Anaconda3/python.exe ^
    -DPYTHON_LIBRARIES=C:/Users/%USERNAME%/Anaconda3/libs/python36.lib ^
    -Dtensorflow_WIN_CPU_SIMD_OPTIONS=/arch:AVX2

  9. Build the pip package, which will be created as a .whl file in the directory .\tf_python\dist (for example, C:\tensorflow\tensorflow\contrib\cmake\build\tf_python\dist\tensorflow-1.4.0-cp36-cp36m-win_amd64.whl).

    > C:\Windows\Microsoft.NET\Framework64\v4.0.30319\MSBuild /p:Configuration=Release tf_python_build_pip_package.vcxproj

    (Note: Be sure to check the path to MSBuild on your own system as it may be different.)
     
  10. Install the newly created TensorFlow build by typing the following command:

    > pip install C:\tensorflow\tensorflow\contrib\cmake\build\tf_python\dist\tensorflow-1.4.0-cp36-cp36m-win_amd64.whl

  11. As specified in the TensorFlow documentation, ensure the installation worked correctly by starting Python and typing the following commands:

    > python
    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello')
    >>> sess = tf.Session()
    >>> print (sess.run(hello))

  12. If everything worked correctly, 'Hello' should print to the terminal. Also, we should not see any build optimization warnings like we saw in the previous section (see Figure 5).


    Figure 5. Python* test output.

  13. To close Python, at the prompt, press CTRL+Z.
     
  14. Navigate to the python subdirectory of the ML-Agents repository you downloaded earlier, and then run the following command to install the other required dependencies:

    > pip install .

  15. Refer to the Building Unity Environment section of the “Getting Started with Balance Ball Example” tutorial to complete the ML-Agents tutorial.

Train ML-Agents in the Cloud

The ML-Agents documentation provides a guide titled “Training on Amazon Web Service” that contains instructions for setting up an EC2 instance on AWS for training ML-Agents. Although this guide states, “you will need an EC2 instance which contains the latest Nvidia* drivers, CUDA8, and cuDNN,” there is a simpler way to do cloud-based training without the GPU overhead.

In this section we perform the following steps:

  • Create an Ubuntu Server 16.04 AMI (free tier).
  • Install prerequisite applications on Windows for interacting with the cloud server.
  • Install Python and TensorFlow on the AMI.
  • Build a headless Linux version of the Balance Balls application on Windows.
  • Export the Python code in the PPO.ipynb Jupyter Notebook* to run as a stand-alone script in the Linux environment.
  • Copy the python directory from Windows to the Linux AMI.
  • Run a training session on AWS for the ML-Agents Balance Balls application.

  1. Create an account on AWS if you don’t already have one. You can follow the steps shown in this section with an AWS Free Tier account; however, we do not cover every detail of creating an account and configuring an AMI, because the website contains detailed information on how to do this.
  2. Create an Ubuntu Server 16.04 AMI. Figure 6 shows the machine instance we used for preparing this article.


    Figure 6. Linux* Server 16.04 LTS Amazon Machine Image*.

  3. Install PuTTY* and WinSCP* on your Windows workstation. Detailed instructions and links for installing these components, connecting to your Linux instance from Windows using PuTTY, and transferring files to your Linux instance using WinSCP are provided on the AWS website.
     
  4. Log in to the Linux Server AMI using PuTTY, and then type the following commands to install Python and TensorFlow:

    > sudo apt-get update
    > sudo apt-get install python3-pip python3-dev
    > pip3 install tensorflow
    > pip3 install image

    Note: The next steps assume you have already completed the ML-Agents Getting Started with Balance Ball Example tutorial. If not, be sure to complete these instructions and verify you can successfully train and run a model on your local Windows workstation before proceeding.
     
  5. Ensure your Unity software installation includes Linux Build Support. You need to explicitly specify this option during installation, or you can add it to an existing installation by running the Unity Download Assistant as shown in Figure 7.


    Figure 7. Unity* software Linux* build support.

  6. In Unity software, open File – Build Settings and make the following selections:
    • Target Platform: Linux
    • Architecture: x86_64
    • Headless Mode: Checked
  7. These settings are shown in Figure 8.


    Figure 8. Unity* software build settings for headless Linux operation.

  8. After clicking Build, create a unique name for the application and save it in the repository’s python folder (see Figure 9). In our example we named it Ball3DHeadless.x86_64 and will refer to it as such for the remainder of this article.


    Figure 9. Build Linux* application.

  9. In order to run through a complete training session on the Linux AMI we will export the Python code in the PPO.ipynb Jupyter Notebook so it can run as a stand-alone script in the Linux environment. To do this, follow these steps:

    - In the Start menu, to open a new terminal, click the Anaconda Prompt icon (Figure 2).
    - Navigate to the python folder, and then type jupyter notebook on the command line.
    - Open the PPO.ipynb notebook, and then click File – Download As – Python (.py). This will save a new file named “ppo.py” in the Downloads folder of your Windows computer.
    - Change the filename to “ppo-test.py” and then copy it to the python folder in your ML-Agents repository.
    - Open ppo-test.py in a text editor, and then change the env_name variable to "Ball3DHeadless":
    - env_name = "Ball3DHeadless" # Name of the training environment file.
    - Save ppo-test.py, and then continue to the next step.

  10. Once the application has been built for the Linux environment and the test script has been generated, use WinSCP to copy the python folder from your ML-Agents repository to the Ubuntu AMI. (Details on transferring files to your Linux instance using WinSCP are provided on the AWS website.)
  11. In the PuTTY console, navigate to the python folder and run the following commands:

    > cd python
    > chmod +x Ball3DHeadless.x86_64
    > python3 ppo-test.py

    If everything went well, you should see the training session start up as shown in Figure 10.


    Figure 10. Training session running on an Amazon Web Services* Linux* instance.

Summary

In the output shown in Figure 10, notice that the time (in seconds) is printed to the console after every model save. Code was added to the ppo-test.py script for this article in order to get a rough measure of the training time between model saves.

To instrument the code we made the following modifications to the Python script:

import numpy as np
import os
import tensorflow as tf
import time # New Code
.
.
.
trainer = Trainer(ppo_model, sess, info, is_continuous, use_observations, use_states)
timer_start = time.clock() # New Code
.
.
.
save_model(sess, model_path=model_path, steps=steps, saver=saver)
print(" %s seconds " % (time.clock() - timer_start)) # New Code
timer_start = time.clock() # New Code
.
.
.

Using this informal performance metric, we found that the average difference in training time between a prebuilt TensorFlow GPU binary and prebuilt CPU-only binary on the Windows workstation was negligible. The training time for the custom CPU-only TensorFlow build was roughly 19 percent faster than the prebuilt CPU-only binary on the Windows workstation. When training was performed in the cloud, the AWS Ubuntu Server AMI performed roughly 29 percent faster than the custom TensorFlow build on Windows.

Installing the Intel® Distribution for Python* and Intel® Performance Libraries with pip and PyPI


The Intel® Distribution for Python* has added the option of installing the distribution and specialized packages from the Python Package Index (PyPI) using pip. The packages require pip version 9.0.1, and are available using the following instructions:

Intel® Runtime Packages

Package Name     pip command                 Platform Availability
mkl              pip install mkl             Linux, Win, macOS
ipp              pip install ipp             Linux, Win, macOS
impi             pip install impi            Linux, Win
daal             pip install daal            Linux, Win, macOS
intel-openmp     pip install intel-openmp    Linux, Win, macOS

 

Performance Packages

To avoid dependency issues with NumPy or SciPy, uninstall them before installing the Intel® variants of the packages; use pip uninstall numpy scipy -y to remove the packages first.

Package Name     pip command                 Platform Availability
numpy            pip install intel-numpy     Linux, Win
scipy            pip install intel-scipy     Linux, Win

 

Note: If you install intel-numpy, you also get mkl_fft and mkl_random (along with NumPy).
Similarly, if you install intel-scipy, you also get intel-numpy along with SciPy.
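
A quick way to confirm that the Intel variant is actually in use is to inspect NumPy's build configuration from Python. This is a minimal check (the exact section names printed vary by NumPy version); when intel-numpy is active it should report MKL-based BLAS/LAPACK entries:

import numpy as np

print(np.__version__)
np.show_config()  # look for MKL-related entries, e.g., libraries = ['mkl_rt', ...]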

Specialized NumPy packages

In order to utilize these packages, the standard NumPy installation must be removed first using the command: pip uninstall numpy -y

Package Name     pip command                 Platform Availability
mkl_fft          pip install mkl_fft         Linux, Win
mkl_random       pip install mkl_random      Linux, Win

 

Development only packages

 

Package Name     pip command                 Platform Availability
mkl-devel        pip install mkl-devel       Linux, Win, macOS
ipp-devel        pip install ipp-devel       Linux, Win, macOS
daal-devel       pip install daal-devel      Linux, Win, macOS

 

Troubleshooting

If installation fails while running pip install for any package with the following error message:

zlib.error: Error -5 while decompressing data: incomplete or truncated stream

retry after running the following command: rm -rf ~/.cache/pip

 

 

Large Matrix Operations with SciPy* and NumPy*: Tips and Best Practices


Introduction

Large matrix operations are the cornerstones of many important numerical and machine learning applications. In this article, we provide some recommendations for using operations in Scipy or Numpy for large matrices with more than 5,000 elements in each dimension.

General Advice for Setting up Python*

Use the latest version of Intel® Distribution for Python* (version 2018.0.0 at the time of this writing), and preferably Python (version 3.6) for better memory deallocation and timing performance. This means downloading the Miniconda* installation script (that is, Miniconda3-latest-Linux-x86_64.sh) with Python (version 3.6). After downloading the script, the following lines are some example bash commands that you can use to execute the script for the installation of conda*, a Python package manager that we will use throughout this guide:

INSTALLATION_DIR=$HOME/miniconda3
bash <DOWNLOAD_PATH>/Miniconda3-latest-Linux-x86_64.sh -b -p $INSTALLATION_DIR -f
CONDA=${INSTALLATION_DIR}/bin/conda
ACTIVATE=${INSTALLATION_DIR}/bin/activate

For the installation of the Python packages from Intel, you can create a conda environment called idp as follows:

$CONDA create -y -c intel -n idp intelpython3_core python=3 

To activate the conda environment for use, run

source $ACTIVATE idp

Then you will be able to use the installed packages from the Intel Distribution for Python.

Tips for Using the Matrix Operations

These recommendations may help you obtain faster computational performance for large matrix operations on compatible Intel® processors. From our benchmarks, we see great speedups of these large matrix operations when used in parallel on the Intel processors that belong to the server class, such as the Intel® Xeon® processors and Intel® Xeon Phi™ processors. These speedups are a result of the parallel computation at the multithreading layer of the Numpy and Scipy libraries.

We based the five tips described below on the performance observed for the matrix operation benchmarks, which are version controlled. After activating the conda environment for the Intel Distribution for Python, you can install our benchmarks following these steps for the bash command line:

git clone https://github.com/IntelPython/ibench.git
cd ibench
python setup.py install

An example SLURM job script for running the benchmark on an Intel Xeon Phi processor (formerly code-named Knights Landing) can be found at the following link, while an example SLURM job script for an Intel Xeon processor can be found at the following link.

Tip 1: Tune the relevant environmental variable settings

To make the best use of the multithreading capabilities of server-class Intel processors, we recommend you tune the threading and memory allocation settings accordingly. Factors that can affect the performance of the matrix operations include:

  • Shape and size of the input matrix
  • Matrix operation used
  • Amount of computational resources on the system
  • Usage pattern of computational resource of the specific code base that runs the matrix operation(s)

For Intel® Xeon® Phi™ processors

Some example (bash environmental) settings are listed below as a baseline for tuning the performance.

The first set of parameters determines how many threads will be used.

export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS / 4  ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS / 4  ))

The parameters OMP_NUM_THREADS and MKL_NUM_THREADS specify the number of threads for the matrix operations. On an Intel Xeon Phi processor node dedicated for the computation job, we recommend using one thread per available core as a starting point for tuning the performance.

The second set of parameters specifies the behavior of each thread.

export KMP_BLOCKTIME=800
export KMP_AFFINITY=granularity=fine,compact
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t

The parameter KMP_BLOCKTIME specifies how long a thread should stay active after the completion of a compute task. When KMP_BLOCKTIME is longer than the default value of 200 milliseconds, there will be less overhead for waking up the thread for subsequent computation(s).

The KMP_AFFINITY parameter dictates the placement of neighboring OpenMP thread context. For instance, the setting of compact assigns the OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where the <n> OpenMP thread was placed. A detailed guide for KMP_AFFINITY setting can be found at this link.

The KMP_HW_SUBSET setting specifies the number of threads placed onto each processor core.

The last parameter controls the memory allocation behavior.

export HPL_LARGEPAGE=1

The parameter HPL_LARGEPAGE enables large page size for the memory allocation of data objects. Having a huge page size can enable more efficient memory patterns for large matrices due to better translation lookaside buffer behavior.

For Intel® Xeon® processors (processors released with a code-name newer than Haswell)

Some (bash environmental) example settings are:

export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS ))
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t
export HPL_LARGEPAGE=1
export KMP_BLOCKTIME=800
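
If you prefer to configure the run from within the application itself, the same environment variables can also be set from Python before NumPy is imported, since the threading libraries typically read them at initialization. A minimal sketch, assuming an MKL-backed NumPy from the Intel Distribution for Python and an illustrative thread count of 16:

import os

# Set before importing NumPy so MKL/OpenMP pick these values up at initialization.
os.environ.setdefault("OMP_NUM_THREADS", "16")   # illustrative: one thread per physical core
os.environ.setdefault("MKL_NUM_THREADS", "16")
os.environ.setdefault("KMP_BLOCKTIME", "800")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact")

import numpy as np

a = np.random.rand(5000, 5000)
b = np.random.rand(5000, 5000)
c = a @ b  # matrix multiplication runs with the thread settings above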

Tip 2: Use Fortran* memory layout for Numpy arrays (assuming matrix operations are called repeatedly)

Numpy arrays can use either Fortran memory layout (column-major order) or C memory layout (row-major order). Many Numpy and Scipy APIs are implemented with LAPACK and BLAS, which require Fortran memory layout. If a C memory layout Numpy array is passed to a Numpy or Scipy API that uses Fortran order internally, it will perform a costly transpose first. If a Numpy array is used repeatedly, convert it to Fortran order before the first use.

The Python function that can enable this memory layout conversion is numpy.asfortranarray. Here is a short code example:

import numpy as np
matrix_input = np.random.rand(5000, 5000)
matrix_fortran = np.asfortranarray(matrix_input, dtype=matrix_input.dtype)
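
Building on this, an informal sketch (timings will vary by operation, library build, and system) that recreates the arrays above, verifies the memory layout, and compares an LU factorization on C-ordered versus Fortran-ordered copies of the same data:

import time
import numpy as np
import scipy.linalg

matrix_input = np.random.rand(5000, 5000)          # C order by default
matrix_fortran = np.asfortranarray(matrix_input)   # Fortran-order copy

print(matrix_input.flags['C_CONTIGUOUS'], matrix_fortran.flags['F_CONTIGUOUS'])  # True True

for name, m in (("C order", matrix_input), ("Fortran order", matrix_fortran)):
    start = time.perf_counter()
    scipy.linalg.lu(m)                              # LAPACK works in Fortran layout internally
    print(name, time.perf_counter() - start, "seconds")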

Tip 3: Save the result of a matrix operation in the input matrix (kwargs: overwrite_a=True)

It is natural to obtain large outputs from matrix operations that have large matrices as inputs. The creation of additional data structures can add overhead.

Many Scipy matrix linear algebra functions have an optional parameter called overwrite_a, which can be set to True. This option makes the function provide the result by overwriting an input instead of allocating a new Numpy array. Using the LU decomposition from the Scipy library as an example, we have:

import scipy.linalg
scipy.linalg.lu(a=matrix, overwrite_a=True)
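
Several other SciPy decompositions accept the same keyword; a brief sketch (note that after overwrite_a=True the input array should be treated as destroyed and not reused):

import numpy as np
import scipy.linalg

a = np.asfortranarray(np.random.rand(5000, 5000))
lu, piv = scipy.linalg.lu_factor(a, overwrite_a=True)  # 'a' may now hold factorization workspace
q, r = scipy.linalg.qr(np.asfortranarray(np.random.rand(5000, 5000)), overwrite_a=True)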

Tip 4: Use the TBB or SMP Python modules to avoid the oversubscription of threads

Efficient parallelism is a known recipe for unlocking the performance on an Intel processor with more than one core. The reasoning about parallelism within Python, however, is sometimes less than transparent. Individual Python libraries may implement their own mechanism and level of parallelism. When a combination of Python libraries is used, it can result in an unintended usage pattern of the computational resources. For example, some Python modules (for example, multiprocessing) and libraries (for example, Scikit-Learn*) introduce parallelism into some functions by forking multiple Python processes. Each of these Python processes can spin up a number of threads as specified by the library.

Sometimes the number of threads can be bigger than the available amount of CPU resources in a way that is known as thread oversubscription. Thread oversubscription can slow down a computational job. Within the Intel Distribution for Python, several modules help to manage threads more efficiently by avoiding oversubscription. These are the Static Multi-Processing library (SMP) and the Python module for Intel® Threading Building Blocks (Intel® TBB).
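
As a concrete illustration of this pattern, the following sketch (a hypothetical workload) forks worker processes that each call into MKL-threaded NumPy, so the total thread count can easily exceed the available cores unless it is managed:

import multiprocessing as mp
import numpy as np

def heavy_task(seed):
    rng = np.random.RandomState(seed)
    a = rng.rand(2000, 2000)
    return np.linalg.svd(a, compute_uv=False)[0]  # each call may spawn its own MKL/OpenMP threads

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:            # 8 processes x N threads per process
        print(pool.map(heavy_task, range(8)))

Running such a script through the SMP or Intel TBB modules described below coordinates these thread pools instead of letting them oversubscribe the machine.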

In Table 1, Intel engineers Anton Gorshkov and Anton Malakhov provide recommendations for which module to use for a Python application based on the parallelism characteristics, in order to achieve good thread subscription for parallel Python applications. These recommendations can serve as a starting point for tuning the parallelism.

Table 1. Module recommendations based on parallelism characteristics.

                          Outermost parallelism level
Innermost parallelism     Balanced work,         Balanced work,          Unbalanced work
level                     low subscription       high subscription
Low subscription          Python                 Python with SMP         Python with Intel® Threading Building Blocks
High subscription         KMP_COMPOSABILITY

Below we give the exact command line instructions for each entry in the table.

The SMP and Intel TBB modules can be easily installed in an Anaconda* or a Miniconda installation of Python using the conda installer commands:

conda install -c intel smp
conda install -c intel tbb

More specifically, we explain the bash commands for each possible scenario outlined in Table 1. Assuming the Python script we want to run is called PYTHONPROGRAM.py, the syntax for invoking it at a bash command line is:

python PYTHONPROGRAM.py                  # No change 

If we want to use the Python modules with the script, we invoke one of the Python modules by supplying one of them as the argument of the -m flag of the Python executable:

python -m smp PYTHONPROGRAM.py           # Use SMP with Python script
python -m tbb PYTHONPROGRAM.py           # Use TBB with Python script

Additionally, the TBB module has various options. For example, we can supply a -p flag to the TBB module to limit the maximum number of threads:

python -m tbb -p $MAX_NUM_OF_THREADS PYTHONPROGRAM.py

And for a Python script with high thread subscription in the inner parallelism level and unbalanced work, we can set the variable KMP_COMPOSABILITY for a bash shell as follows:

KMP_COMPOSABILITY=mode=exclusive python PYTHONPROGRAM.py

For additional resources on Intel TBB and SMP, we recommend the SciPy 2017 tutorial given by Anton Malakhov, or the tutorial Unleash the Parallel Performance of Python* Programs.

Tip 5: Turn on transparent hugepages for memory allocation on Cray Linux*

The Linux* OS has a feature called hugepages that can accelerate programs with a large memory footprint by 50 percent or more. For many Linux systems, the hugepages functionality is transparently provided by the kernel. Enabling hugepages for Python on Cray Linux requires some extra steps. Our team at Intel has verified that hugepages are needed to achieve the best possible performance for some large matrix operations (> 5000 x 5000 matrix size). These operations include LU, QR, SVD, and Cholesky decompositions.

First, on the relevant Cray system, check that the system hugepages module is loaded. For example, if Tcl* modules are used, you can find and load the system hugepages module via instructions similar to:

module avail craype-hugepages2M
module load craype-hugepages2M

You will also need to run the one-line installation instruction of the hugetlbfs package within a Conda Python environment or installation:

conda install -c intel hugetlbfs

After that the Python binary within the conda environment will allocate the memory for objects using hugepages.
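
To confirm that hugepages are available to the process, a simple Linux-only check from Python is to read the hugepage counters from /proc/meminfo:

with open("/proc/meminfo") as f:
    print([line.strip() for line in f if "Huge" in line])  # e.g., HugePages_Total, Hugepagesize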
