Channel: Intel Developer Zone Articles

Using Intel MKL BLAS and LAPACK with PETSc


Overview

Summary
This document describes how to build the Portable Extensible Toolkit for Scientific Computation (PETSc) with Intel® Math Kernel Library (Intel® MKL) BLAS and LAPACK.

PETSc (http://www.mcs.anl.gov/petsc/index.html) is a set of libraries that provides functions for building high-performance large-scale applications. PETSc library includes routines for vector manipulation, sparse matrix computations, distributed arrays, linear and non-linear solvers, and extensible PDE solvers.

This application note focuses on building PETSc for Intel® Architecture Processors including IA-32 and Intel® 64 architecture, running Linux*.

Version Information
This document applies to Intel® MKL 2018 Update 2, Intel® Composer XE for Linux*, and PETSc 3.8.4.

Step 1 – Download

PETSc releases are available for download from the PETSc web site at http://www.mcs.anl.gov/petsc/download/index.html.
To get Intel® MKL, go to https://software.intel.com/en-us/intel-mkl/.

Step 2 – Configuration

  1. Use the following command to extract the PETSc files. A new folder petsc will be created:
    $ git clone -b maint https://bitbucket.org/petsc/petsc petsc
    or
    $ tar -xvzf petsc-3.8.4.tar.gz
  2. Change to the PETSc folder:

$ cd petsc-3.8.4

  3. Set the PETSC_DIR environment variable to point to the location of the PETSc installation folder. For bash shell use:

$ export PETSC_DIR=$PWD

Step 3 – Build PETSc

PETSc includes a set of Python configuration scripts that support the use of various compilers, MPI implementations, and math libraries. The examples below show options for configuring PETSc to link to Intel MKL BLAS and LAPACK functions. Developers need to ensure that other options are configured appropriately for their system. See the PETSc installation documentation for details: http://www.mcs.anl.gov/petsc/documentation/installation.html#blaslapack

Intel MKL

Intel provides BLAS and LAPACK via the Intel MKL library. It usually works with the GNU and Intel compilers on Linux and with the Microsoft and Intel compilers on Windows. It can be specified to the PETSc configure script with, for example: --with-blaslapack-dir=/opt/intel/mkl

If the above option does not work, determine the correct library list for your compilers using the Intel MKL Link Line Advisor and specify it with the configure option --with-blaslapack-lib.

  1. Invoke the configuration script with the following options to build PETSc with Intel MKL (installed to the default location /opt/intel/mkl).
    • For Intel processors with Intel 64 use the following option:

$ ./config/configure.py
...
--with-blas-lapack-dir=/opt/intel/mkl/lib/intel64

  • For 32-bit (IA-32) Intel processors use the following option:

$ ./config/configure.py
...
--with-blas-lapack-dir=/opt/intel/mkl/lib/ia32

    Note: If the configure step fails with an error about linking against some other Intel MKL library, try specifying the BLAS/LAPACK library instead of the directory: --with-blas-lapack-lib="$MKLROOT/lib/<intel64|ia32>/libmkl_rt.so"
  2. Use the makefile to build PETSc:

    $ make all

Step 4 - Run PETSc

Run the PETSc tests to verify the build worked correctly:

$ make test

Appendix A – System Configuration

PETSc build and testing was completed on a system with an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz running Ubuntu* 16.04.4 LTS.

The Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Feature on Intel® Xeon® Scalable Processors

$
0
0

Introduction

Intel® Xeon® Scalable processors support the increasing demands in performance with Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which is a set of new instructions that can accelerate performance for demanding computational workloads.

The full specification of the Intel® AVX-512 instruction set consists of several separate subsets:

  • Foundation Instructions
  • Conflict Detection Instructions (CDI)
  • Byte (char or int8) and Word (short or int16) Instructions
  • Double-word (int32 or int) and Quad-word (int64 or long) Instructions
  • Vector Length Extensions (VLE)

A more detailed description of the above subsets can be found at Improve Performance Using Vectorization and Intel® Xeon® Scalable Processors and "AVX-512 May Be a Hidden Gem" in Intel Xeon Scalable Processors.

Intel AVX-512 may increase the vectorization efficiency of our codes, both for current hardware and for future generations of parallel hardware. This is not only because the new instructions can operate on 512-bit registers, but also because they offer new features that benefit vectorization and can operate on 32 of these vector registers. In Intel Xeon Scalable processors, the new features offered by Intel AVX-512 are also available when operating on registers of different sizes, which makes them useful to a larger number of applications. This functionality is offered by an additional orthogonal capability: the vector length extensions.

What are Vector Length Extensions?

Compared to the Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set, Intel AVX-512 doubles the number of vector registers, and each vector register can pack twice the number of single-precision or double-precision floating-point numbers (Intel AVX2 offers 256-bit registers). This means more work can be achieved per CPU cycle, because the registers can hold more data to be processed simultaneously.

However, as not all applications might benefit from the extended 512-bit registers (applications containing few vectorized loops or low trip counts, for example), the VLE orthogonal feature allows applications to take advantage of most Intel AVX-512 instructions on shorter vector lengths: 128-bit (XMM registers) and 256-bit (YMM registers), on top of 512-bit (ZMM registers). These Intel AVX-512 instructions, while running on different vector lengths, still can take advantage of the larger number of registers per core (32) and opmask registers (8).

Using the new functionality in VLE, applications have more options for optimization, because most Intel AVX-512 instructions can be used while using different vector lengths as constrained by the algorithm or data.
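As an illustration of constraining vector length per loop, the OpenMP SIMD `simdlen` clause lets a developer request a preferred number of lanes. This is a generic, hedged sketch (not from the Intel tutorial; the function name is illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Elementwise add with a requested SIMD width of 8 float lanes
// (i.e., 256-bit YMM registers rather than 512-bit ZMM).
// simdlen is a hint; the pragma is ignored harmlessly by compilers
// when OpenMP SIMD support is not enabled.
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    #pragma omp simd simdlen(8)
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

With the Intel compiler, -qopt-zmm-usage steers this choice globally for the translation unit, while `simdlen` expresses it for an individual loop.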

Example

To demonstrate the VLE orthogonal feature, we use an example code that computes the histogram of an image. The description of the algorithm and code is available in the Improve Vectorization Performance with Intel® AVX-512 tutorial, where the sample code can also be downloaded. This code example is used only for demonstration purposes.

This code shows an example of loops that are not vectorized by the Intel® C++ Compiler using the Intel AVX2 instruction set architecture (ISA), due to data dependencies caused by indirect referencing in the array computing the histogram. However, when using the Intel AVX-512 ISA, the compiler is able to vectorize these loops using instructions from the CDI subset.

The CDI subset includes functionality to detect data conflicts in vector registers, and stores this information in mask vectors, which are used in the vector computations. As explained in the tutorial, the result is that only the elements of the array without conflicts (identical grayscale values) are processed simultaneously.
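The core of that dependency can be seen in a scalar sketch of the histogram update (variable names here are illustrative, not the tutorial's exact source):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar histogram of 8-bit grayscale values. Two pixels with the same
// value increment the same bin, so when the loop is vectorized, lanes
// can collide on the same memory location. AVX2 cannot vectorize this
// safely; AVX-512 CDI detects such conflicts at run time so the loop
// can still be vectorized.
std::vector<int> histogram(const std::vector<unsigned char>& image) {
    std::vector<int> hist(256, 0);
    for (std::size_t i = 0; i < image.size(); ++i)
        hist[image[i]]++;   // indirect, possibly conflicting update
    return hist;
}
```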

In the next three experiments, this sample code will be compiled using the Intel C++ Compiler, using three different sets of options. The last experiment shows that the CDI functionality is still used, even when we direct the compiler to use YMM (256-bit) registers, instead of ZMM (512-bit) registers.

Experiment 1. Let us first compile the example code using the Intel AVX2 flag:

icpc Histogram_Example.cpp -O3 -restrict  -qopt-report -qopt-report-file=runAVX2.optrpt -xCORE-AVX2 -lopencv_highgui -lopencv_core -lopencv_imgproc -o runAVX2

Notice that we have compiled the code using the Intel C++ Compiler option -qopt-report to generate an optimization report. Here is a section of the optimization report showing that the loop computing the filter and histogram (loop in line 107) has not been vectorized:

LOOP BEGIN at Histogram_Example.cpp(107,5)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
LOOP END

Experiment 2. However, compiling the code with Intel AVX-512, adding the Intel C++ Compiler option -qopt-zmm-usage=high to use ZMM (512-bit) registers, the optimization report shows that the loop, as expected, has been vectorized:

icpc Histogram_Example.cpp -O3 -restrict  -qopt-report -qopt-report-file=runAVX512.optrpt -xCORE-AVX512 -qopt-zmm-usage=high -lopencv_highgui -lopencv_core -lopencv_imgproc -o runAVX512

LOOP BEGIN at Histogram_Example.cpp(107,5)
remark #15300: LOOP WAS VECTORIZED
LOOP END

The option -qopt-zmm-usage=high/low used above was added to Intel® compilers starting with version 18.0 to enable more flexible use of single instruction, multiple data (SIMD) vectorization on the Intel Xeon processor Scalable family. This option should be used on top of the -xCORE-AVX512 option, as shown above. It may also be combined with the -qopt-report option, which asks the compiler to generate an optimization report that helps developers understand compiler-generated optimizations and look for more optimization opportunities. More information about these new features in Intel compilers can be found in Tuning SIMD vectorization when targeting Intel Xeon Processor Scalable Family.

By changing the option -qopt-report to -qopt-report=5, the compiler will generate a more detailed vectorization report. In particular, we can see in the report that the compiler has generated code to use ZMM registers to store 16 floats.

LOOP BEGIN at Histogram_Example.cpp(107,5)
	(…)
	remark #15416: vectorization support: irregularly indexed store was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory   [ Histogram_Example.cpp(122,8) ]
	remark #15415: vectorization support: irregularly indexed load was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory   [ Histogram_Example.cpp(122,8) ]
	remark #15305: vectorization support: vector length 16
	remark #15300: LOOP WAS VECTORIZED
	(…)
LOOP END

Furthermore, taking a look at the assembly code (generated by the Intel C++ Compiler by using the -S option), we notice that the compiler has vectorized the computation of the histogram in the loop (line 122 in the source code):

hist2[ int(image2[position]) ]++;

by using the conflict detection instructions from the CDI subset on ZMM registers:

vpbroadcastmw2d %k2, %zmm6     			           #122.8
vpconflictd %zmm4, %zmm2{%k2}{z}              		   #122.8
vpandd    %zmm6, %zmm2, %zmm5                  		   #122.8

The above example shows the expected result of compiling this code with the combination of compiler options that make the compiler use ZMM registers: -xCORE-AVX512 -qopt-zmm-usage=high.
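The behavior of vpconflictd can be modeled in scalar code: for each lane, the instruction produces a bitmask marking which earlier lanes hold an equal value, so lanes with a zero mask can be updated together. This is a simplified sketch of the instruction's documented semantics, ignoring write masking:

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of vpconflictd on 32-bit lanes: out[i] has bit j set
// exactly when src[j] == src[i] for j < i. A lane whose mask is zero
// has no conflict with earlier lanes and can be processed in the
// current vector pass; conflicting lanes are retried in later passes.
void conflict_detect(const std::uint32_t* src, std::uint32_t* out, int lanes) {
    for (int i = 0; i < lanes; ++i) {
        out[i] = 0;
        for (int j = 0; j < i; ++j)
            if (src[j] == src[i])
                out[i] |= (1u << j);
    }
}
```

For example, for histogram indices {5, 9, 5, 5} the masks are {0b000, 0b000, 0b001, 0b101}: lanes 0 and 1 are conflict-free, lane 2 conflicts with lane 0, and lane 3 conflicts with lanes 0 and 2.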

Experiment 3. However, using the combination of options: -xCORE-AVX512 -qopt-zmm-usage=low (NOTE: -qopt-zmm-usage=low is the default for -xCORE-AVX512) tells the compiler that the program is unlikely to benefit from using ZMM registers, and most likely will use shorter registers:

icpc Histogram_Example.cpp -O3 -restrict  -qopt-report -qopt-report-file=runAVX512.optrpt -xCORE-AVX512 -qopt-zmm-usage=low  -lopencv_highgui -lopencv_core -lopencv_imgproc -o runAVX512

LOOP BEGIN at Histogram_Example.cpp(107,5)
remark #15300: LOOP WAS VECTORIZED
remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
LOOP END

Changing the option -qopt-report to -qopt-report=5 again, the compiler gives more details:

LOOP BEGIN at Histogram_Example.cpp(107,5)
	(…)
remark #15416: vectorization support: irregularly indexed store was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory   [ Histogram_Example.cpp(122,8) ]
remark #15415: vectorization support: irregularly indexed load was generated for the variable <hist2[*(image2+position*4)]>, masked, part of index is read from memory   [ Histogram_Example.cpp(122,8) ]
remark #15305: vectorization support: vector length 8
remark #15300: LOOP WAS VECTORIZED
(…)
LOOP END

which shows that this time the compiler is generating code that uses YMM (256-bit) registers. However, by looking at the assembly code, we notice that this loop is still being vectorized using the conflict detection instructions, but now operating on YMM registers:

vpbroadcastmw2d %k2, %ymm5  			                #122.8
vpconflictd %ymm3, %ymm2{%k2}{z}                  		#122.8
vpand     %ymm5, %ymm2, %ymm4                     		#122.8

which is the result of the VLE orthogonality being used. The VLE orthogonal feature allows this application to take advantage of the CDI on shorter vectors (256-bit YMM registers in this case), which is not possible by just using Intel AVX2 instructions, as shown in Experiment 1, above. Again, even while using YMM registers, this application still can take advantage of the larger number of registers per core (32) and opmask registers (8) present on Intel Xeon Scalable processors.

Conclusion

Intel AVX-512 is available in Intel Xeon Scalable processors. This new instruction set can accelerate performance for several workloads and usages because it offers enhanced vector processing capabilities, such as a larger number of registers per core, as well as vector operations that can operate on wider 512-bit registers.

To make the new features in Intel AVX-512 available to a larger number of applications, the new VLE orthogonal feature lets applications use most of the Intel AVX-512 instructions on shorter vector lengths, while still taking advantage of the larger number of registers. This new feature benefits applications that naturally perform SIMD operations on 128-bit or 256-bit registers.

The VLE orthogonal feature was demonstrated here using a code sample showing a clear benefit when using Intel AVX-512 instructions. Specifically, it benefits from automatic vectorization using CDI. VLE potentially broadens the applicability of these kinds of applications by letting them operate on shorter registers (XMM or YMM), while still taking advantage of the conflict detection instructions for vectorization, as well as the larger number of registers.

Speed System & IoT Application Development with New Intel® System Studio 2018 Update 1


Intel System Studio Usages

Simplify System Bring-up, Boost Performance, Strengthen Reliability

Just released! Get Intel® System Studio 2018 Update 1 and tap into new features that make system and IoT application development easier.

  • Move from prototype to product easier with new capabilities that seamlessly import apps from Arduino Create* to Intel System Studio. Take advantage of System Studio’s analyzers and debug tools.
  • Developers can create, build, and edit native Java apps using Intel System Studio – use the easy cloud connectors and innovate with access to sensor libraries.
  • Use code samples easily through the new Project Creation Wizard, which automatically sets configuration options for you. 

FREE Download
See below for technical details about what's new in this release.

About Intel System Studio
Intel System Studio is an all-in-one, cross-platform, comprehensive tool suite for system and IoT device application development. It helps system engineers and developers shorten the development cycle so products can be brought to market faster, boost performance and power efficiency, and strengthen reliability for intelligent systems and IoT device applications running on Intel® processor-based platforms.

Who needs it?
The tools suite is used by device manufacturers, system integrators, and embedded and IoT application developers on solutions that benefit from improved systems and IoT applications, including industrial and manufacturing, health care, retail, smart cities/buildings/homes, transportation, office automation, and many more.

Free Download
Developers can download the Ultimate Edition FREE, using a renewable 90-day commercial license for the latest version available, with public community forum support. Paid license offerings providing Priority Support with confidential access to Intel engineers for technical questions are also available.

 

What's New in Intel System Studio 2018 Update 1

  • Move from prototype to product easier with new capabilities that seamlessly import applications from Arduino Create* to Intel System Studio. Take advantage of System Studio’s advanced analyzers and debug tools for advanced system development. More tools and libraries also now support the Up Squared* Grove* IoT Development Kit. Learn more.
     
  • Java* support– Developers can now create, build, run and edit native Java applications using Intel System Studio. Through the Project Creation Wizard, use the cloud connectors and access Intel IoT sensor libraries. Java examples can also be enabled for the Up Squared* Grove* IoT Development Kit. 
     
  • Easier access to code samples, automated configuration– Using code samples is now easier using the new Project Creation Wizard. All required configuration options are automatically set when sample projects are created.
     
  • Connect with various cloud service providers’ APIs more simply via the cloud connector API explorer.
     
  • Includes the latest updates for many of the performance libraries, and analysis and debugger tools.

To automatically receive product updates, users must register or set up their account with the Intel® Software Development Products Registration Center.

Technical Details

Start development or optimization easily with these Getting Started guides. You can find more information in the documentation and in the individual component tools' release notes.

Eclipse* IDE 

  • Eclipse IDE on Linux Ubuntu* 16.04.4 LTS now depends on GTK3; Ubuntu 17 and 18 will continue to use GTK2.
  • Platform Manager now performs verbose Docker* image builds. Project Creation Wizards and Sensor Explorer have been streamlined for better user experience.
  • Support was added for development of Wind River Linux LTS* 17 applications. This is only supported on a Linux* host and does not support creation of an LTS 17 kernel image.

For help creating your first cross compiling project see this article: Cross Development

For a video showing how to create a project using the new container based workflow see this page: Getting Started with Samples

Intel® C++ Compiler 18.0 

  • More stable integration with Microsoft Visual Studio* 2017
  • Fixes previously reported issues

See also

Intel® Data Analytics Acceleration Library (Intel® DAAL)

  • A host application interface is added to Intel® DAAL, which enables algorithm-level cancellation of a computation via a user-defined callback. This interface is available in the Decision Forest and Gradient Boosting Trees algorithms. New example code is provided.
  • New technical preview for experimental Intel DAAL and its extension library
    • Introduced distributed k-Nearest Neighbors classifiers for both training and prediction. Included new sample that demonstrates how to use this algorithm with Intel® MPI Library.
    • Developed an experimental extension library on top of the existing pyDAAL package that provides an easy-to-use API for Intel DAAL neural networks. The extension library allows users to configure and train neural network models in a few lines of code, and to use existing TensorFlow* and Caffe* models at the inference stage.
  • Gradient Boosting Trees training algorithm was extended with inexact splits calculation mode. It is applied to continuous features that are bucketed into discrete bins, and the possible splits are restricted by the buckets borders.
  • Intel® Threading Building Blocks (Intel® TBB) dependency is removed in library sequential mode.

For more information on Intel® DAAL see: Introduction to Intel® DAAL

Intel® Math Kernel Library (Intel® MKL)

Intel® Integrated Performance Primitives (Intel® IPP)

IoT connection tools: MRAA & UPM Libraries

MRAA IO Communication Layer

  • New APIs for sysfs onboard LED control using the gpio-leds driver
  • Restructured and cleaned-up basic examples

UPM Sensor and Actuator Library

  • Extended LED library to support the new MRAA gpio-leds APIs
  • Cleaned-up doxygen tags in headers and class names in JSON library files to facilitate integration with the Sensor Explorer

See also : Developing with Intel System Studio - Sensor libraries 

Intel® VTune™ Amplifier 

  • New CPU/FPGA interaction analysis (Technical Preview) to assess the balance between the CPU and FPGA on systems with a discrete Intel® Arria® 10 FPGA running OpenCL™ applications
  • New Graphics Rendering analysis (Technical Preview) for CPU/GPU utilization of your code running on the Xen* virtualization platform installed on a remote embedded target
  • Support for the sampling command-line analysis on remote QNX* embedded systems via an Ethernet connection

See also:

Energy Analysis/Intel® SoC Watch

       Intel® SoC Watch for Windows

  • Added support for the Intel platform code named Gemini Lake
  • Resolved several issues

       Intel SoC Watch for Linux/Android*

  • Added support for the Intel® platform code named Gemini Lake.
  • New feature group "sstate" added: measures both operating system (Sx) and hardware (S0ix) platform sleep states on platforms that support both.

See also: Energy analysis in Intel® System Studio 2018 Update 1

Intel® Inspector

  • Deadlock detection on std::shared_mutex (C++17 standard)
  • New OS support - Fedora Core* 27, Ubuntu* 17.10, Microsoft Windows* 10 RS3
  • Bug fixes

Intel® System Debugger

See also: Using the Target Connection Agent with Intel® System Debugger

Intel® Debug Extensions for WinDbg*

  • Support for event-based breakpoints to debug ACPI Machine Language (AML)
  • Added feature to collect BSOD information with the get_bsod_info script

GNU* GDB and source

  • GDB Server is supported on Wind River Linux LTS 17.
  • The GDB Server binaries from the WindriverLinux9 directory can also be used for Wind River Linux LTS 17.

For questions or technical support, visit Intel® Software Products Support

1Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice revision #20110804.

Accelerating Media, Video & Computer Vision Processing: Which Tool Do I Use?


Intel has a multitude of awesome software development tools, including ones for innovating and optimizing media applications, immersive video including 360° and virtual reality, graphics, integrating visual understanding, and more. But sometimes, it's hard to figure out just which development tool or tools are best for your particular needs and usages.

Below you'll find a few insights to help you get to the right Intel software tool faster for media and video solutions, so you can focus on the really fun stuff - like building new competitive products and solutions, improving media application performance or video streaming quality for devices from edge to cloud, or even transitioning to more efficient formats like HEVC. 


Intel® Media SDK

Developing for:

  • Intel® Core™ or Intel® Core™ M processors 
  • Select SKUs of Intel® Celeron™, Intel® Pentium® and Intel® Atom® processors with Intel® HD Graphics supporting Intel® Quick Sync Video
  • Client, mobile and embedded devices - desktop or mobile media applications
  • OS - Windows* and Embedded Linux*
  • An Open Source version is also available at Github under the MIT license 

Uses & Needs

  • Fast video playback, encode, processing, media formats conversion or video conferencing
  • Accelerated processing of RAW video or images
  • Screen capture
  • Audio decode & encode support
  • Used with smart cameras across drones, phones, editors/players, network video recorders, and connected cars
  • Supports HEVC, AVC, MPEG-2 and audio codecs

Free Download


Intel® Media Server Studio

Three editions are available:

FREE Community

Essentials

Professional

Developing for:

Format Support: HEVC, AVC, MPEG-2, and MPEG-Audio

Uses & Needs

  • High-density and fast video decode, encode, transcode
  • Optimize performance of Media/GPU pipeline 
  • Enhanced graphics programmability or visual analytics (for use with OpenCL™ applications)
  • Low-level control over encode quality
  • Debug, analysis and performance/quality optimization tools
  • Speed transition to real-time 4K HEVC
  • Need to measure visual quality (Video Quality Caliper)
  • Looking for an enterprise-grade telecine interlace reverser (Premium Telecine Interlace Reverser)
  • Audio codecs
  • Screen capture

 Free Download & Paid Edition Options 


Intel® Collaboration Suite for WebRTC

This Client SDK builds on top of the W3C standard WebRTC APIs to accelerate development of real-time communications (RTC), including broadcast, peer-to-peer, conference mode communications, and online gaming/VR streaming. 

Use with Android*, web (JavaScript*-built), iOS*, and Windows* applications.

Free Download

Intel® SDK for OpenCL™ Applications

Developing for:

General purpose GPU acceleration on select Intel® processors (see technical specifications). OpenCL primarily targets execution units. An increasing number of extensions are being added to Intel processors to make the benefits of Intel's fixed-function hardware blocks accessible to OpenCL applications.

Free Download


Intel® Computer Vision SDK

Accelerate computer vision solutions:

  • Easily harness the performance of computer vision accelerators from Intel
  • Add your own custom kernels into your workload pipeline
  • Quickly deploy computer vision algorithms with deep-learning support using the included Deep Learning Deployment Toolkit Beta
  • Create OpenVX* workload graphs with the intuitive and easy-to-use Vision Algorithm Designer

Free Download


Altera® Software (now part of Intel) 
Video & Image Processing Suite MegaCore Functions (part of Intel® Quartus® Prime Software Suite IP Catalog)

Developing for:

  • All Altera FPGA families
  • Video and image processing applications, such as video surveillance, broadcast, video conferencing, medical and military imaging and automotive displays

Uses & Needs

  • For design, simulation, verification of hardware bit streams for FPGA devices
  • Optimized building blocks for deinterlacing, color space conversion, alpha blending, scaling, and more

Intel® C for Media and related tools

We have recently open-sourced the C for Media (CM) runtime together with the Intel media driver and the CM compiler. The CM package is also available on 01.org.

The source code can be accessed via the links below:

  • Intel Media Driver for VAAPI and Intel® C for Media Runtime: available at GitHub 
  • Intel C for Media Compiler and examples: available at GitHub 
  • Intel Graphics Compiler: available at GitHub 

 

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Wind River Helix Device Cloud* Application Deployment: POC Retail Vending Machine


Intro

Securely and easily managing an IoT software solution on multiple gateways across the world can be a challenge. However, for gateways running Wind River Helix Device Cloud* there is a clear path to follow that diminishes the challenge. The Wind River Helix Device Cloud allows for complete device lifecycle management, from deploying, to monitoring, to updating, to decommissioning. Wind River Helix Device Cloud also has telemetry capabilities, allowing it to receive and store data in the cloud, as well as act on data using triggers and alarms. This article explores a proof of concept that leverages the capabilities of Helix Device Cloud (HDC) and uses the UP Squared* board as the vending machine gateway. HDC allows the stock of the vending machine to be monitored, automated triggers to be set up to restock it, and the stock data to be viewed in a dashboard.

To learn more about the Helix Device Cloud:
https://www.helixdevicecloud.com

To learn more about the UP Squared Board: 
https://software.intel.com/en-us/iot/hardware/up-squared-grove-dev-kit

Figure 1: High Level Component Diagram with UP Squared, Grove* Sensors, and Helix Device Cloud

Helix Device Cloud*

This section will be a brief overview of the Helix Device Cloud and the pieces needed for this POC. For a more in-depth guide see https://knowledge.windriver.com/en-us/000_Products/040/080/000/000_Wind_River_Helix_Device_Cloud_Getting_Started
There are two main parts to configure in the Helix Device Cloud: the Thing Definition and the Application. Both of these are found under the developer tab (see Figure 2).
A Thing Definition defines the characteristics of the IoT solution: attributes, alarms, properties, methods, and more.
  • Attributes are for string data that is typically static, like OS version or MAC id.
  • Properties are for integer values that will change over time such as temperature. 
  • Alarms are for events and alerts, like signaling the temperature is too high. 
  • Methods define a behavior to invoke on the device or client application, like rebooting the device or sending more stock to the vending machine. Methods are implemented in code on the device level but can be triggered from the cloud. 
The Application defines how the thing is going to be used, with different levels of access and privilege. With this model, there can be multiple Applications for one Thing Definition, for example one with 'Org Admin' privileges and one without. Each client application running on the device will also receive its own unique device id. This, combined with the application name, creates the key for the application in the cloud. The recommended model is to keep the Device Manager application separate from the client application, so that permissions like 'reboot' and 'decommission' remain separate.
Figure 3, located below, outlines these key parts.
Figure 2: Where to find Application and Thing Definitions
Figure 3: Diagram of Components on Device and in the Cloud

Set-up

This article assumes that chocolate bar vending machines have been deployed in various locations, and that they are controlled by a gateway with the HDC agent installed and a Device Manager application running. The Device Manager has the ability to send and receive files, reboot the device, remotely log in, update device software, and more. The POC uses the UP Squared board running Ubuntu* 16.04 Server OS (installed by default on the board). The vending machine functionality is provided by the Wind River Helix Device Cloud Python agent and Grove* sensors from Seeed* Studio.

The UP Squared board has a button sensor to indicate a purchase of the product. It also uses LEDs as indicators: a green LED turns on when a purchase is successful, and a red LED turns on when the product is out of stock. A temperature sensor monitors the vending machine's temperature to see if the chocolate bars are in danger of melting. In addition, a motion sensor counts traffic passing by the vending machine and turns on a blue LED when motion is detected. The software for the vending machine is written in Python* and uses the HDC Python device cloud API.
For instructions on how to install and configure the HDC Agent and the sample Device Manager Application on Ubuntu, refer to this guide in the Wind River Knowledge Library: 
Additionally, see the link below for how to import the sample Device Manager Application and Thing Definition into HDC.
To interface with the Grove sensors, MRAA and UPM need to be installed on the UP Squared board. If they’re not present, use these commands:
sudo add-apt-repository ppa:mraa/mraa
sudo apt-get update
sudo apt-get install libmraa1 libmraa-dev mraa-tools python-mraa python3-mraa
sudo apt-get install libupm-dev libupm-java python-upm python3-upm node-upm upm-examples

Code 1: Commands to Install MRAA and UPM on Ubuntu

In the Helix Device Cloud, the Thing Definition needs to be added for the POC Vending Thing. Figure 4 shows the full overview of the thing, including the alarms and states to configure. There is a chocobar_out_of_stock alarm for alerting that all the chocolate bars have been sold. Figure 5 shows the telemetry properties; the data being collected are telemetry_motion, telemetry_temp, and telemetry_stock_chocobars. Figure 6 shows the method, method_restock, which needs to be configured so that the chocolate bars can be restocked. The fields labeled ‘Key’ are what will be used in the code in the client application. Note that the Thing Definition can also be exported and imported as a JSON file, so alternatively import thing_def.json as shown in Code 2 and update it with your org id instead of entering everything manually.

Figure 4: POC Vending Thing Definitions and Alarms View

Figure 5: POC Vending Thing Definitions Properties

Figure 6: POC Vending Thing Definitions Methods

[{
  "ownerOrgId": "yourOrgIdHere",
  "key": "poc_vending_thing",
  "name": "POC Vending Thing",
  "version": 13,
  "autoDefProps": true,
  "autoDefAttrs": true,
  "properties": {
    "telemetry_motion": {"name": "auto:telemetry_motion", "calcAggregates": false},
    "telemetry_stock_chocobars": {"name": "auto:telemetry_stock_chocobars", "calcAggregates": false},
    "telemetry_temp": {"name": "auto:telemetry_temp", "calcAggregates": false}
  },
  "alarms": {
    "chocobar_out_of_stock": {
      "name": "Chocolate Bars Out of Stock",
      "states": [
        {"name": "Out of Stock", "color": "#E61717"},
        {"name": "Stock Available", "color": "#3BBD1B"}
      ]
    },
    "high_temp": {
      "name": "High Temperature Alarm",
      "states": [
        {"name": "Melted", "color": "#EB0505"},
        {"name": "Normal", "color": "#46E61E"}
      ]
    }
  },
  "methods": {
    "method_restock": {
      "name": "Restock Chocolate Bars",
      "description": "Send more stock to the vending machine",
      "notificationVariables": {
        "num_chocobars_sent": {"name": "num_chocobars_sent", "type": "int", "uiType": "text"}
      }
    }
  }
}]

Code 2: thing_def.json for Importing the Thing Definition
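
Rather than editing the exported file by hand, the org id substitution from Code 2 can be scripted. A minimal sketch (using a literal excerpt here for illustration; in practice load the full thing_def.json from disk):

```python
import json

# Excerpt of the exported Thing Definition; in practice, read thing_def.json from disk.
exported = '[{"ownerOrgId": "yourOrgIdHere", "key": "poc_vending_thing"}]'
thing_defs = json.loads(exported)

# Substitute your organization id before importing the file into HDC.
for thing in thing_defs:
    thing["ownerOrgId"] = "my_org_id"

print(json.dumps(thing_defs, indent=2))
```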

Next, configure the Application definition and link it to the POC Vending Thing Definition as per Figure 7. An Application token will automatically be generated after creation, see Figure 8. The token will be used on the device so the client application will know which Application and Thing Definition it will be using.

Figure 7: POC Vending Application

Figure 8: POC Vending Application View

Vending Machine Telemetry

The data collected from the vending machine is where the real value comes into play. The gateway application collects motion, temperature, and inventory data, and sends it to the Helix Device Cloud. The application is a Python script, ‘iot-poc-vending.py’, that will be turned into a service. Then in HDC, a variety of triggers and alarms can be set up to handle the values coming in. For example, if inventory runs out, a trigger can be set up to automatically send more inventory out to the machine. The Grove sensors supply the data to upload. To interface with sensors through the Grove shield, add the lines below to the code. This ties into MRAA and GROVEPI: GROVEPI allows the sensors to talk to the gateway, and MRAA handles the IO pin communications. Note that root access is required to access the shield by default, so the Python script must be run with sudo. 

import upm
import mraa
# Interface with Grove Sensors
mraa.addSubplatform(mraa.GROVEPI, "0")

Code 3: Line to Have MRAA use GROVEPI

Using GROVEPI will shift all the pin numbers by 512, so pin A0 for the temperature sensor is really pin 512 + 0.  
Grove shield pins:
  • Temperature sensor: A0
  • Button sensor: D8
  • Motion sensor: D7
  • Blue motion indicator LED: D2
  • Red out of stock indicator LED: D4
  • Green purchase indicator LED: D3
temperature_sensor = grove.GroveTemp(512 + 0)
button_sensor = grove.GroveButton(512 + 8)
motion_sensor = upmMotion.BISS0001(512 + 7)
blue_motion_led = grove.GroveLed(512 + 2)
red_stock_led = grove.GroveLed(512 + 4)
green_stock_led = grove.GroveLed(512 + 3)

Code 4: Grove Sensor Initialization Code
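
The 512 shift from Code 4 can be wrapped in a small helper so the physical pin labels stay readable. This is a convenience sketch, not part of the original application; the names are illustrative:

```python
# MRAA subplatform pins start at an offset of 512 once GROVEPI is added.
GROVEPI_OFFSET = 512

def grovepi_pin(pin):
    """Translate a Grove shield pin label (e.g. 0 for A0, 8 for D8) to its MRAA pin number."""
    return GROVEPI_OFFSET + pin

# A0 temperature sensor maps to MRAA pin 512, D8 button to 520.
print(grovepi_pin(0), grovepi_pin(8))
```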

The program’s loop will gather the sensor data, handle items being purchased, publish alarms as needed, and then send data to HDC every 10 seconds.

counter = 0
while running and client.is_alive():
    counter += 1

    # purchase flow
    green_stock_led.off()
    customer_purchase = button_sensor.value()
    if stock_chocobars > 0:
        red_stock_led.off()
        if customer_purchase:
            client.info("Customer purchasing item")
            green_stock_led.on()
            stock_chocobars -= 1
    else:
        red_stock_led.on()
        client.alarm_publish("chocobar_out_of_stock", 0)
    current_motion = motion_sensor.value()
    if current_motion:
        motion += 1
        blue_motion_led.on()
    else:
        blue_motion_led.off()
    celsius = temperature_sensor.value()
    fahrenheit = celsius * 9.0 / 5.0 + 32.0
    if fahrenheit >= 90:
        client.alarm_publish("high_temp", 0)
    if counter >= TELEMINTERVAL:
        send_telemetry()
        # Reset counter after sending telemetry
        counter = 0
    sleep(1)

Code 5: The Main Loop of the Program
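
The temperature handling inside the loop can be factored into pure functions, which makes the melt threshold from Code 5 easy to test off-device. An illustrative refactor, not part of the original script:

```python
# 90 F is the alarm threshold used in the main loop.
MELT_THRESHOLD_F = 90.0

def to_fahrenheit(celsius):
    """Convert a Grove temperature reading from Celsius to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0

def is_melting(celsius):
    """True when the converted reading should raise the high_temp alarm."""
    return to_fahrenheit(celsius) >= MELT_THRESHOLD_F

print(to_fahrenheit(30), is_melting(30), is_melting(35))
```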

Sending telemetry to HDC takes one line of code using telemetry_publish. If the property is not already registered in HDC, the publish will fail on the first try, and then the property will be auto-registered (if enabled in the Thing Definition; refer back to Figure 4).

# temperature telemetry to send
client.info("Publishing Property: %s to %s", fahrenheit, "telemetry_temp")
ts = datetime.datetime.utcnow()
status = client.telemetry_publish("telemetry_temp", fahrenheit, cloud_response, timestamp=ts)
# Log response from cloud
if cloud_response:
    if status == iot.STATUS_SUCCESS:
        client.log(iot.LOGINFO, "Telemetry Publish - SUCCESS")
    else:
        client.log(iot.LOGERROR, "Telemetry Publish - FAIL")

Code 6: Code Needed for each Telemetry Metric

Properties can be seen in the thing view in the Helix Device Cloud in graph form.

Figure 9: Thing View in HDC

Methods

Methods can be called from the cloud directly, or called automatically with Triggers, which will be discussed in the next section. 
In addition to configuring the method in the cloud, the function to call must be registered in the client application so HDC knows which function to execute when it is called.

client.action_register_callback("method_restock", method_restock)

Code 7: Register the Callback Function for the Method

Next, add the function defining what to do when the method call is received from the cloud. Here it mimics restocking the chocolate bars, so the number sent is added to the current stock value.

def method_restock(client, params):
    """
    Restocks Chocolate Bars
    """
    global stock_chocobars

    message = params.get("num_chocobars_sent")
    stock_chocobars += message
	
    p = {}
    p['chocobars'] = "RESTOCKED"

    msgstr = "Chocolate Bars Restocked"
    client.info(msgstr)
    client.event_publish(msgstr)
    return (iot.STATUS_SUCCESS, "", p)

Code 8: Code to Handle Method Received from HDC
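
Because params arrives from the cloud, a more defensive handler might validate num_chocobars_sent before mutating the stock. A sketch of that idea as a pure helper (the function name is illustrative, not part of the HDC API):

```python
def apply_restock(current_stock, params):
    """Return the new stock level, ignoring a missing or invalid num_chocobars_sent value."""
    sent = params.get("num_chocobars_sent")
    try:
        sent = int(sent)
    except (TypeError, ValueError):
        return current_stock  # missing or non-numeric input: leave stock unchanged
    if sent < 0:
        return current_stock  # reject negative restock amounts
    return current_stock + sent

print(apply_restock(2, {"num_chocobars_sent": 10}))
```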

Triggers

Now that the data and methods have been set up in the cloud and on the device, the triggers feature can be leveraged. This will help monitor the vending machine when data is received for conditions that require attention.

The vending machine needs to send out an email alert if the temperature gets too high, as the chocolate inside might melt. To create a new rule, go to Developer -> Triggers and click on New Trigger.

Name the trigger ‘POC Vending High Temperature Alarm’. Right-click on the trigger event and choose the Event type alarm.change; add the Thing Key, an Alarm Key of high_temp, an Alarm State of 0 for melted as configured back in the Thing Definition, and 0 as the Time in condition (see Figure 10). From the Trigger actions menu on the side, go into Networking and drag out the email.send node. Alternatively, http, mqtt, or sms messages could be sent instead. Configure the email node with the message, subject, and to address (see Figure 11). Then, at the bottom of the Trigger actions, expand End and drag out a Success and a Failure node. Lastly, click on the triangles at the bottom of the nodes and drag each to the appropriate node (see Figure 12 for reference). Now that the Trigger has been created, view it and click on Start to activate it. 

Figure 10: Config of Initial Trigger Event

Figure 11: Email Configuration for the Trigger

Figure 12: Node Flow of the High Temperature Alarm Trigger

The Trigger to auto-restock the chocolate bars is very similar, except with an added method.exec node from the Method actions and an alarm.publish node from Alarm. Select the POC Vending Thing as the Thing Definition; this is how it knows which methods are available. Then select method_restock as the method. Add the Thing Key to execute the method on, the Ack Timeout, and the method input (which is num_chocobars_sent). Also add the alarm.publish to change the alarm state of chocobar_out_of_stock to 1 to indicate the bars are back in stock. See Figures 13, 14, and 15 for reference. 

Figure 13: Config of method.exec for Auto Restocking

Figure 14: Config of alarm.publish to Change Alarm State back to Chocobars in Stock

Figure 15: Node Flow of the Method Restock Trigger

Figure 16: Triggers for Restocking and High Temperature are Started

Deployment

There are four files needed to run the client application on the device: iot-vending-connect.cfg, iot-poc-vending.py, device_id, and HDC_VendingMachine.service. The service file turns the code into a service running continuously on the gateway, even after reboot. The device_id file should already be on the device in the device-cloud folder. The iot-vending-connect.cfg file links the application to the cloud and contains the HDC host name and the Application Token; see the getting started guide and Code 9 for more information. Make sure device_id and iot-vending-connect.cfg are in the same directory, and update the config_dir in iot-poc-vending.py to their location.

{
  "cloud": {
    "host": "yourHostName", 
    "port": 8883, 
    "token": "yourAppToken"
  }, 
  "qos_level": 1, 
  "validate_cloud_cert": true
}

Code 9: iot-vending-connect.cfg file
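
A small pre-deployment check can catch a malformed iot-vending-connect.cfg before the service starts. An illustrative sketch using only the standard library (field names follow Code 9; the helper name is an assumption):

```python
import json

REQUIRED_CLOUD_KEYS = {"host", "port", "token"}

def validate_connect_cfg(text):
    """Return the parsed config if its cloud section has the fields from Code 9, else raise ValueError."""
    cfg = json.loads(text)
    missing = REQUIRED_CLOUD_KEYS - set(cfg.get("cloud", {}))
    if missing:
        raise ValueError("cloud section missing: %s" % ", ".join(sorted(missing)))
    return cfg

sample = '{"cloud": {"host": "yourHostName", "port": 8883, "token": "yourAppToken"}, "qos_level": 1, "validate_cloud_cert": true}'
print(validate_connect_cfg(sample)["cloud"]["port"])
```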

#!/usr/bin/python
from __future__ import print_function
import argparse
import errno
import random
import signal
import sys
import os

import math
import mraa
import time, sys, signal, atexit
from upm import pyupm_grove as grove
from upm import pyupm_biss0001 as upmMotion
# Interface with Grove shield
mraa.addSubplatform(mraa.GROVEPI, "0")

import datetime
import time
from time import sleep
head, tail = os.path.split(os.path.dirname(os.path.realpath(__file__)))
sys.path.insert(0, head)
import device_cloud as iot
#from device_cloud import osal

B=3975

temperature_sensor = grove.GroveTemp(512 + 0)
button_sensor = grove.GroveButton(512 + 8)
motion_sensor = upmMotion.BISS0001(512 + 7)
blue_motion_led = grove.GroveLed(512 + 2)
red_stock_led = grove.GroveLed(512 + 4)
green_stock_led = grove.GroveLed(512 + 3)

running = True

# Return status once the cloud responds
cloud_response = False

# Second intervals between telemetry
TELEMINTERVAL = 10

def sighandler(signum, frame):
    """
    Signal handler for exiting app
    """
    global running
    if signum == signal.SIGINT:
        print("Received SIGINT, stopping application...")
        running = False

def method_restock(client, params):
    """
    Restocks Chocolate Bars
    """
    global stock_chocobars

    message = params.get("num_chocobars_sent")
    stock_chocobars += message
	
    p = {}
    p['chocobars'] = "RESTOCKED"

    msgstr = "Chocolate Bars Restocked"
    client.info(msgstr)
    client.event_publish(msgstr)
    return (iot.STATUS_SUCCESS, "", p)

def send_telemetry():
	global motion, stock_chocobars, temperature
	# temperature telemetry to send
	client.info("Publishing Property: %s to %s", fahrenheit, "telemetry_temp")
	ts = datetime.datetime.utcnow()
	status = client.telemetry_publish("telemetry_temp", fahrenheit, cloud_response, timestamp=ts)
	# Log response from cloud
	if cloud_response:
		if status == iot.STATUS_SUCCESS:
			client.log(iot.LOGINFO, "Telemetry Publish - SUCCESS")
		else:
			client.log(iot.LOGERROR, "Telemetry Publish - FAIL")
	# motion telemetry to send
	client.info("Publishing Property: %s to %s", motion, "telemetry_motion")
	ts = datetime.datetime.utcnow()
	status = client.telemetry_publish("telemetry_motion", motion, cloud_response, timestamp=ts)
	motion = 0
	# Log response from cloud
	if cloud_response:
		if status == iot.STATUS_SUCCESS:
			client.log(iot.LOGINFO, "Telemetry Publish - SUCCESS")
		else:
			client.log(iot.LOGERROR, "Telemetry Publish - FAIL")	
	# chocolate bar stock telemetry to send
	client.info("Publishing Property: %s to %s", stock_chocobars, "telemetry_stock_chocobars")
	ts = datetime.datetime.utcnow()
	status = client.telemetry_publish("telemetry_stock_chocobars", stock_chocobars, cloud_response, timestamp=ts)
	# Log response from cloud
	if cloud_response:
		if status == iot.STATUS_SUCCESS:
			client.log(iot.LOGINFO, "Telemetry Publish - SUCCESS")
		else:
			client.log(iot.LOGERROR, "Telemetry Publish - FAIL")

if __name__ == "__main__":
    global motion, stock_chocobars, temperature
    stock_chocobars = 2
    temperature = 0
    motion = 0

    signal.signal(signal.SIGINT, sighandler)

    # Initialize client 
    app_id = "iot-poc-vending"
    client = iot.Client(app_id)

    # Use the .cfg file inside the directory
    config_file = "iot-vending-connect.cfg"
    client.config.config_file = config_file

    # Look in this directory
    config_dir = "/home/upsquared/device_cloud/demo/"
    client.config.config_dir = config_dir

    # Finish configuration and initialize client
    client.initialize()

    # Set action callbacks
    client.action_register_callback("method_restock", method_restock)

    # Connect to Cloud
    if client.connect(timeout=10) != iot.STATUS_SUCCESS:
        client.error("Failed")
        sys.exit(1)

    counter = 0
    while running and client.is_alive():
        counter += 1
			
        #purchase flow
        green_stock_led.off()
        customer_purchase = button_sensor.value()
        if (stock_chocobars > 0):
            red_stock_led.off()
            if (customer_purchase):
                client.info("Customer purchasing item")
                green_stock_led.on()
                stock_chocobars -= 1
        else:
            red_stock_led.on()
            client.alarm_publish("chocobar_out_of_stock", 0)
        current_motion= motion_sensor.value()
        if(current_motion):
            motion +=1
            blue_motion_led.on()
        else:
            blue_motion_led.off()
        celsius = temperature_sensor.value()
        fahrenheit = celsius * 9.0/5.0 + 32.0;	
        if (fahrenheit >= 90):
            client.alarm_publish("high_temp", 0)
        if counter >= TELEMINTERVAL:
            send_telemetry()
            # Reset counter after sending telemetry
            counter = 0
        sleep(1)
		
    client.disconnect(wait_for_replies=True)

Code 10: Full Code for iot-poc-vending.py

The HDC_VendingMachine.service file is shown below. The service starts after network.target, and then runs the Python code. Place this file in /lib/systemd/system/ with sudo cp. 

[Unit]
Description=HDC POC
After=network.target
 
[Service]
ExecStart=/home/upsquared/device_cloud/demo/iot-poc-vending.py
Restart=always
User=root
StandardOutput=journal
StandardError=journal
KillMode=process
KillSignal=SIGINT
 
[Install]
WantedBy=multi-user.target

Code 11: HDC_Vendingmachine.service File

Make the python file and the service file executable, and then enable and start the service. Look in /var/log/syslog for any errors on startup, and also for the info logs from the application.

chmod +x /home/upsquared/device_cloud/demo/iot-poc-vending.py
chmod +x /lib/systemd/system/HDC_VendingMachine.service

Code 12: Make the Python Script and Service File Executable

sudo systemctl enable HDC_VendingMachine.service
sudo systemctl start HDC_VendingMachine.service 

Code 13: Enable and Start the Service

Dashboard

Now that the client app is configured in the cloud and running on the device, a dashboard can be set up to see the data, alarms, and other important information at a glance. There is some customization available for layout and colors; however, it is designed more for testing than as an industrial-grade solution. 

When creating the dashboard, only Name, Thing definition, and Date/Time are required. 

Figure 17: POC Vending Telemetry Dashboard Setup

Next, the layout can be designed as desired with different dashboard widget types. The property graph widget is used for all three of the telemetry properties: temperature, stock, and motion. The current alarm state widget is used for the temperature alarm, and the Alarm History widget for the out-of-stock alarm. 

Figure 18: Dashboard Layout Design

Each widget needs to be configured with the Thing Key and property or alarm to display. 

Figure 19: Property Graph Widget for Temperature

Figure 20: Current Alarm State widget for Temperature Alarm

The final dashboard with the data and alarms looks like Figure 21 below. 

Figure 21: Dashboard View

Summary

Our vending machine code has now been successfully deployed on Helix Device Cloud. Temperature and stock data is being monitored with automated triggers. Motion data can be referenced as time goes on to monitor foot traffic around the vending machine. Any future updates to the program and overall gateway health can be deployed and monitored using HDC. 

Purchase

To purchase HDC, visit https://www.windriver.com/company/contact/index.html or email sales@windriver.com.

About the Author

Whitney Foster is a software engineer at Intel in the Software and Services Group working on scale enabling projects for Internet of Things. 

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries. 

*Other names and brands may be claimed as the property of others
© 2018 Intel Corporation.

Beyond Arduino Create*: Developing UP Squared* Projects in Intel® System Studio


Introduction

Arduino Create* is an online integrated development environment (IDE) that simplifies the process of getting familiar with the UP Squared* Grove* IoT Development Kit. As a cloud-based software development environment, Arduino Create is always up-to-date and provides a streamlined method to configure the UP Squared* (UP2*) board so you can quickly start experimenting with the various sensors provided in the kit.

Many developers may at some point want, or need, to move from Arduino Create to a full-featured production IDE. This article provides guidelines for getting your UP2 board configured for developing applications with Intel® System Studio.  

Get familiar with Arduino Create*

The steps outlined in the following sections of this article assume you have already configured your UP2 board and successfully worked through the Arduino Create Blink Application provided in the Grove IoT Development Kit. If not, check out the UP Squared* Grove IoT Development Kit Getting Started Guide before proceeding.

Install Intel® System Studio on Host Computer

Intel® System Studio is the IDE of choice when developing IoT and gateway applications targeting Intel® processors. As you will see on the Intel® System Studio landing page, it provides access to over 400 sensors, enhanced debugger workflows that automate tracing, additional libraries and code samples, and improved data compression and optimizations for small matrix multiplication.

Intel® System Studio is available with a free 90-day renewable commercial license. After registering here, you are given the option of selecting the development and target operating systems. When you register Intel® System Studio you will receive an email with the serial number to use during installation.

In this article our host development computer is equipped with an Intel® Core™ i7 processor, 8 GB RAM, and Windows® 10 Pro. The target operating system is of course Linux*, as the UP2 board comes preconfigured with Ubuntu* 16.04 Server installed, along with some other support libraries.

Once Intel® System Studio is downloaded to your development computer, refer to Developing C/C++ Projects with Intel® System Studio for additional information on installing Docker*, enabling code samples, etc.

Intel® System Studio Dependencies

Docker*

If you installed Intel® System Studio but Docker is not already installed on your system, you will soon discover that it is required for building and uploading projects targeting the UP2 board. Refer to Installing Docker* on a Windows* Host for complete information on installing Docker. In this article we installed Docker for Windows (stable channel) per the linked guide, enabling virtualization in BIOS, etc.

Microsoft .NET Framework*

You may also need to install Microsoft .NET Framework* 3.5 if it is not already installed on your system. You can download the installer here.

PuTTY*

Although PuTTY* is not strictly a requirement, you will likely find it useful for remotely accessing the UP2 board. You can download PuTTY here.

Configure the UP2 Board* Target Environment

The UP2 board includes HDMI and USB connectors, making it convenient to plug in a monitor, keyboard, and mouse for local configuration changes. The instructions below assume you are working locally in this manner, rather than over a remote SSH connection. (Note: the default username and password for the UP2 board are both upsquared.)

Ubuntu* 16.04 Server Image

As previously mentioned, the UP2 board comes with Ubuntu 16.04 Server preinstalled, so you should be good to go right out of the box. However, in the unlikely event the OS becomes corrupted, or if you simply want to start fresh with a clean installation, you can download and flash the board by following the directions provided here: UP Squared IoT Grove Development Kit Ubuntu 16.04 Server Image.

Configure Root Access on the UP2 Board*

In order for Intel® System Studio to synchronize libraries between the host computer and target board, the UP2 board will need to be configured for root access. One way to achieve this is by entering the following commands to create a root password and unlock the account on the UP2 board:

sudo passwd root
sudo passwd -u root

Next, open the SSH configuration file:

sudo nano /etc/ssh/sshd_config

Locate and change the following line:

From: PermitRootLogin without-password
To: PermitRootLogin yes

Save the file and then restart SSH:

sudo service ssh restart

You can verify remote root access by using PuTTY (assuming you installed it as recommended above). In PuTTY, specify SSH for the connection type using Port 22, and enter the UP2 board’s IP address. Once a connection is made to the UP2 board, enter root for the username along with the password you created in the steps above. Close the PuTTY session once you have verified connectivity.

Build and Upload a Sample Program

In this section we will test our development environment by running the On-Board LED Blink example included in the Intel® System Studio installation. Assuming you have successfully worked through the Arduino Create Blink Application provided in the Grove IoT Development Kit, you should be familiar with the board setup shown in Figure 1. If not, refer to the instructions provided here for more information on how to connect the Grove LED module to the Grove Pi+* breakout board.

Figure 1. UP2 board with Grove Pi+* and LED module

The basic steps for creating a new project in Intel® System Studio, based on the On-Board LED Blink example, are shown below. (Note: Complete instructions for enabling code examples, running Intel® System Studio, creating a new C/C++ project, and running a project are provided here.)
  1. Open Intel® System Studio and select File > New Project.
  2. Under Application Development select Project for building in a container and running on Linux > Next.
  3. In the next window select Ubuntu Linux 16.04 64-Bit (GCC) > Next.
  4. In the next window select C++ > Basic > On-Board LED Blink > Finish.
  5. Install platform support if not already installed.
  6. Run the project by following the directions provided here. Note: When prompted for login information to access the UP2 board, use the root account information created in the last section.

Migrate an Arduino Create* Project to Intel® System Studio

Instructions for migrating an Arduino Create project to Intel® System Studio are contained in the article “Transfer Your Project from Arduino Create* to Intel® System Studio 2018”.

Conclusion

Arduino Create is a great framework for getting up to speed quickly with the UP Squared Grove IoT Development Kit, and it may prove to be your IDE of choice for developing IoT applications for the UP2 board. For other developers, migrating to a full-featured production IDE like Intel® System Studio may be desirable or necessary. Hopefully this short primer will assist you in making the transition.

Learn More

Notices

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications, and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. 

*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

 

 

 

Understanding NUMA using Intel® Optimization for Caffe*


Introduction

In this article we demonstrate how Intel® VTune™ Amplifier can be used to identify and improve performance bottlenecks while running a neural network training workload (for example, training a Canadian Institute for Advanced Research's - CIFAR-10 model) with deep learning framework Caffe*.

For this tutorial we start with the fork of BVLC/Caffe, which is dedicated to improving performance of Caffe running on a CPU, in particular Intel® Xeon® processors. This version of Caffe integrates the latest version of Intel® Math Kernel Library (Intel® MKL) and Intel® Machine Learning Scaling Library (Intel® MLSL) and is optimized for Intel® Advanced Vector Extensions 2 and Intel® Advanced Vector Extensions 512 instructions.

Figure 1. Caffe "time" command—performance difference between BVLC/Caffe and Caffe optimized for Intel® architecture through different layers running on the same test system.

Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Configuration: 2S Intel® Xeon® Platinum 8180 processor @ 2.50 GHz, 28 physical cores/socket – HT ON, 192 GB DDR4 2666 MHz, Red Hat Enterprise Linux Server release 7.3 Linux* version 3.10.0-514.16.1.el7.x86_64, Intel VTune Amplifier 2018 Update 1, Caffe version – 1.1.0, Testing model – CIFAR-10 with CIFAR-10 dataset.
Benchmark Source: Intel Corporation. See below for further notes and disclaimers.

As demonstrated in the following article, Caffe Optimized for Intel® Architecture: Applying Modern Code Techniques and also from Figure 1, we can see that there is a significant difference in performance between BVLC/Caffe and Caffe optimized for Intel architecture both running on an Intel CPU. As a result, we have set the goal of this article to investigate whether we can further improve the performance of Caffe optimized for Intel architecture without modifying the framework (or its sources), and by simply using the currently available Intel software developer tools like Intel VTune Amplifier and Intel® MLSL.

Compiling Caffe*

It is assumed that the Caffe dependencies are successfully installed as per your platform. More information about installing prerequisites can be found here. We have also installed OpenCV 3.1.0 (optional) on the test system. Instructions for installing OpenCV can be found here.

Check out the Caffe source from the intel/caffe Git* repository:

$ git clone https://github.com/intel/caffe.git intelcaffe

Modify the Makefile.config to enable use of OpenCV 3 and Intel MLSL for the Linux* operating system. More about Intel MLSL in a later part of the tutorial.

OPENCV_VERSION := 3
USE_MLSL := 1

Code block 1. Modifications to the Makefile.config.

Steps to compile Caffe:

$ make clean
$ make -j16 -k

CIFAR-10 Training: Profiling and Recommendations

Once Caffe is successfully compiled, we prepare and use one of the example models, CIFAR-10 (with the CIFAR-10 image-classification dataset), to investigate performance bottlenecks of Caffe CIFAR-10 training on a CPU.

The CIFAR-10 dataset consists of 60,000 color images, each with dimensions of 32 × 32, equally divided and labeled into the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The classes are mutually exclusive.
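The binary version of the dataset (the one fetched by get_cifar10.sh) stores each image as a fixed-size record: one label byte followed by 3,072 pixel bytes, 1,024 per channel in R, G, B order. A minimal parsing sketch, using a synthetic record here rather than a real file (install paths vary):

```python
import numpy as np

RECORD_BYTES = 1 + 32 * 32 * 3  # 1 label byte + 3072 pixel bytes

def parse_record(record):
    """Split one CIFAR-10 binary record into (label, 32x32x3 image)."""
    assert len(record) == RECORD_BYTES
    label = record[0]
    # Pixels are stored channel-major: 1024 R, then 1024 G, then 1024 B.
    pixels = np.frombuffer(record, dtype=np.uint8, offset=1)
    image = pixels.reshape(3, 32, 32).transpose(1, 2, 0)  # -> HWC layout
    return label, image

# Synthetic record: label 7 ("horse"), an all-red image.
fake = bytes([7]) + bytes([255] * 1024) + bytes([0] * 2048)
label, img = parse_record(fake)
print(label, img.shape, img[0, 0].tolist())  # 7 (32, 32, 3) [255, 0, 0]
```

The same layout applies to every record in the data_batch files, so a full batch file is simply 10,000 such records back to back.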

Steps to get and prepare the CIFAR-10 dataset:

$ cd $CAFFE_ROOT
$ data/cifar10/get_cifar10.sh
$ examples/cifar10/create_cifar10.sh
$ time examples/cifar10/train_full.sh
.
.
.
I0109 06:09:27.285831 428533 solver.cpp:737]     Test net output #0: accuracy = 0.8224
I0109 06:09:27.285871 428533 solver.cpp:737]     Test net output #1: loss = 0.512778 (* 1 = 0.512778 loss)
I0109 06:09:27.285876 428533 solver.cpp:617] Optimization Done.
I0109 06:09:27.285881 428533 caffe.cpp:345] Optimization Done.

real    14m41.296s
user    805m57.464s
sys     14m20.309s

Code block 2. Baseline elapsed time (training CIFAR-10 model with default settings) – Note: Accuracy of 82.24% and Elapsed time of approximately 14 mins.

With the default hyperparameters, the full training currently takes about 14 minutes using the script provided in the CIFAR-10 example directory, which runs 70k iterations with the default batch size of 100 to achieve 82.24 percent accuracy.

We now run the same example with Intel VTune Amplifier memory access analysis to look for potential bottlenecks. In order to limit the amount of data to be collected with memory access analysis, we run the training for 12k iterations instead of 70k.

With hardware event-based sampling, Intel VTune Amplifier's memory access analysis can be used to identify memory-related issues, such as non-uniform memory access (NUMA) problems and bandwidth-limited accesses, and to attribute performance events to memory objects (data structures); the latter is made possible by instrumenting memory allocations/deallocations and retrieving static/global variables from symbol information.

Additional information about Intel VTune Amplifier training and documentation can be found on the product webpage.

$ source /opt/intel/vtune_amplifier/amplxe-vars.sh
$ amplxe-cl -c memory-access -data-limit=0 -r CIFAR_100b_12kiter_MA examples/cifar10/train_full.sh

Screenshot of memory access data
Figure 2. Summary (memory access—memory usage viewpoint).

Screenshot of UPI bandwidth
Figure 3. Intel® UPI bandwidth (bottom-up view)—moderately high Intel UPI bandwidth.

Issue 1: High remote and local memory ratio

In NUMA machines, memory requests missing last level cache (LLC) may be serviced either by local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. Intel VTune Amplifier defines this metric as a ratio of remote DRAM loads to local DRAM loads. Referring to the above Figures 2 and 3, it can be seen that the remote to local memory ratio is about 40 percent, which is considered to be high, and accounts for higher Intel® Ultra Path Interconnect (Intel® UPI) traffic.
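As a rough sketch of how the metric is derived (the counter names below are illustrative, not VTune Amplifier's actual event names):

```python
def remote_to_local_ratio(remote_dram_loads, local_dram_loads):
    """NUMA metric: remote DRAM loads per local DRAM load.

    Ratios approaching 0.2-0.4 or more usually indicate that threads
    are frequently touching memory owned by the other socket.
    """
    if local_dram_loads == 0:
        return float("inf")
    return remote_dram_loads / local_dram_loads

# Illustrative counts only (not measured): a ratio of 0.40 corresponds
# to the ~40 percent observed in Figures 2 and 3.
print(remote_to_local_ratio(4_000_000, 10_000_000))  # 0.4
```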

Screenshot of memory access hotspots viewpoint
Figure 4. Summary (memory access—hotspots viewpoint).

Issue 2: High imbalance or serial spin time

Imbalance or serial spinning time is CPU time when working threads are spinning on a synchronization barrier consuming CPU resources. This can be caused by load imbalance, insufficient concurrency for all working threads, or waits on a barrier in the case of serialized execution.

Recommendation 1: Reducing remote to local memory access

In order to improve the ratio of remote to local memory access, it is recommended that cores keep frequently accessed data local. One way to achieve this is to identify and modify each of the contributing source code regions to affinitize its data; however, this would require major code restructuring. An alternative approach is to use Intel MLSL out of the box to perform parallel distributed training across the two NUMA nodes, so that each process allocates and accesses its data from the same NUMA domain.

In general, there are two approaches to achieving parallelism in distributed training: data parallelism and model parallelism. The approach used in Caffe optimized for Intel architecture is data parallelism, where different batches of data are trained on different nodes (that is, Message Passing Interface (MPI) ranks/processes). The data is split among all the nodes, but the same model is used on all of them. This means that the total batch size in a single iteration is equal to the sum of the individual batch sizes of all nodes. The primary purpose of Intel MLSL is to allow multinode training and to scale training across hundreds of nodes; however, it does not hurt to use the same library to perform training on the two NUMA nodes of a single 2-socket server. Additional information on how distributed training works can be found here.
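The data-parallel scheme can be sketched in a few lines: each rank computes gradients on its own slice of the global batch, and an allreduce-style average makes the update identical on every rank. This is a schematic in plain NumPy with a toy linear model, not the actual Intel MLSL API:

```python
import numpy as np

rng = np.random.default_rng(0)

GLOBAL_BATCH, PER_NODE_BATCH, N_NODES = 100, 50, 2
assert GLOBAL_BATCH == PER_NODE_BATCH * N_NODES

# Toy model: linear regression with one shared weight vector.
w = np.zeros(8)
X = rng.normal(size=(GLOBAL_BATCH, 8))
y = X @ np.ones(8)

def local_gradient(w, Xb, yb):
    """Mean-squared-error gradient on one node's slice of the batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Each rank sees a disjoint slice of the global batch...
grads = [local_gradient(w,
                        X[r * PER_NODE_BATCH:(r + 1) * PER_NODE_BATCH],
                        y[r * PER_NODE_BATCH:(r + 1) * PER_NODE_BATCH])
         for r in range(N_NODES)]

# ...and an allreduce averages the gradients, so the distributed update
# equals the single-node update over the full batch of 100.
avg_grad = sum(grads) / N_NODES
full_grad = local_gradient(w, X, y)
print(np.allclose(avg_grad, full_grad))  # True
```

This equivalence is why two MPI ranks with a batch size of 50 each reproduce the baseline run with a batch size of 100.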

Using Intel MLSL with Caffe optimized for Intel architecture is as simple as recompiling Caffe with an additional flag, USE_MLSL := 1, in the Makefile.config (this flag was already enabled in the compilation steps shown above, so we can skip recompiling Caffe).

In order to compare performance of the baseline with the distributed training, we use the same total batch size of 100 per iteration of the stochastic gradient descent algorithm. As a result, we set the batch size to be 50 for each node.

For distributed training, the train_full.sh script can be modified as given below. Here, the environment variables OMP_NUM_THREADS and KMP_AFFINITY are set to assign one OpenMP* thread to each physical core with compact affinity (see Thread Affinity Interface). Also, the environment variable I_MPI_PIN_DOMAIN="numa" is used to pin one MPI process to each NUMA node (see Interoperability with OpenMP API).

--- ../ref_intelcaffe/examples/cifar10/train_full.sh    2018-03-12 17:09:17.706664443 -0400
+++ examples/cifar10/train_full.sh      2018-03-13 01:58:50.789193819 -0400
@@ -39,15 +39,21 @@

 TOOLS=./build/tools

+export OMP_NUM_THREADS=28
+export KMP_AFFINITY="granularity=fine,compact,1,0"
+
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver.prototxt $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@

Code block 3. Modifications to the run-script—train_full.sh.

Now we profile the distributed training for the CIFAR-10 model for 12k iterations using Intel MLSL:

$ source external/mlsl/l_mlsl_2017.2.018/intel64/bin/mlslvars.sh intel64
$ amplxe-cl -c memory-access -data-limit=0 -r CIFAR_50b_12kiter_MA examples/cifar10/train_full.sh

Screenshot of memory access data
Figure 5. Memory access summary—low remote/local memory ratio.

Screenshot of memory usage viewpoint low bandwidth
Figure 6. Memory usage viewpoint—low Intel® UPI bandwidth.

Comparing Figures 2 and 5, it can be seen that the overall remote to local memory ratio has dropped significantly, from 40 percent down to 5 percent, thereby reducing the observed total Intel UPI bandwidth. The total elapsed time has also dropped from 150 seconds to 110 seconds.

Screenshot of Hotspots viewpoint moderately high
Figure 7. Hotspots viewpoint (moderately high spin time).

Recommendation 2: Reducing spin time

We see some improvements in the overall imbalance or serial spinning time but it is still significantly high. As a result, the execution performance of this workload does not scale linearly with the increased thread count (in accordance with Amdahl's law). The accounted spin time can be caused by either load imbalance, insufficient concurrency for all working threads, or waits on a barrier in the case of serialized execution.

Achieving any performance gain by improving load balance and/or parallelizing serial execution parts would require significant code refactoring and is beyond the scope of this article. However, to check if some amount of that spin time is caused due to insufficient concurrency and threads waiting for work, we can try reducing the number of worker (OpenMP) threads and see if it reduces the overall thread-spin time.
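Amdahl's law makes this concrete: with a serial (or spinning) fraction s, the best speedup on p threads is 1 / (s + (1 - s) / p), so beyond some thread count the extra threads mostly spin. A quick sketch (the serial fraction here is illustrative, not a measured value for this workload):

```python
def amdahl_speedup(serial_fraction, threads):
    """Upper bound on parallel speedup for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

# With even 10 percent serial time, going from 16 to 28 threads buys
# little: the workload is already near its ceiling of 1/s = 10x.
for p in (1, 16, 28):
    print(p, round(amdahl_speedup(0.10, p), 2))
```

Running this prints 1.0, 6.4, and 7.57, showing why the last 12 threads of each 28-core socket contribute mostly spin time when concurrency is insufficient.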

To conduct this experiment, we rerun the CIFAR-10 training for 12k iterations with Intel VTune Amplifier, but now with the OMP_NUM_THREADS environment variable set to 16. This limits each MPI process running on a NUMA node to a maximum of 16 OpenMP threads.

$ amplxe-cl -c memory-access -data-limit=0 -r CIFAR_50b_12k_16t_MA examples/cifar10/train_full.sh

Figure 8. Spin time comparison between 28 OpenMP threads and 16 OpenMP threads, per MPI process

The above comparison shows an improvement in the elapsed time, from 110 seconds down to 91 seconds, but it also indicates that the workload has concurrency issues that limit the simultaneous execution of a large number of parallel threads. Fixing such issues would require significant code restructuring in Caffe, for example collapsing inner and outer loops in parallel regions, changing the scheduling and/or granularity of work distribution, and identifying and parallelizing other serial regions of code; that effort is beyond the scope of this article.

Optimized Performance

Finally, we would like to verify whether the above changes improve our total elapsed time to train the full model for 70k iterations with the same default hyperparameters. In order to do this, we modify the train_full.sh script as given below:

--- ../ref_intelcaffe/examples/cifar10/train_full.sh    2018-03-12 17:09:17.706664443 -0400
+++ examples/cifar10/train_full.sh      2018-03-13 01:58:50.789193819 -0400
@@ -39,15 +39,21 @@

 TOOLS=./build/tools

+export OMP_NUM_THREADS=16
+export KMP_AFFINITY="granularity=fine,compact,1,0"
+
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver.prototxt $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@

Code block 4. Modifications to the run-script—train_full.sh.

$ time examples/cifar10/train_full.sh
.
.
.
[0] I0312 20:25:35.617869 323789 solver.cpp:563]    Test net output #0: accuracy = 0.8215
[0] I0312 20:25:35.617897 323789 solver.cpp:563]     Test net output #1: loss = 0.514667 (* 1 = 0.514667 loss)
[0] I0312 20:25:35.617908 323789 solver.cpp:443] Optimization Done.
[0] I0312 20:25:35.617914 323789 caffe.cpp:345] Optimization Done.
[1] I0312 20:25:35.617866 323790 solver.cpp:443] Optimization Done.
[1] I0312 20:25:35.617892 323790 caffe.cpp:345] Optimization Done.

real    8m5.952s
user    256m16.212s
sys     2m3.563s

Code block 5. Optimized elapsed time (distributed training for CIFAR-10 model with Intel® MLSL) – Accuracy of 82.15% and Elapsed time of approximately 8 mins.

From code block 5, we see that we can achieve accuracy of 82.15 percent (approximately similar to what we achieved at the beginning of the article) in up to 45 percent less time with distributed training and tuned number of OpenMP threads.
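The "up to 45 percent" figure follows directly from the two timings reported in code blocks 2 and 5:

```python
def reduction_pct(baseline_s, optimized_s):
    """Percentage reduction in elapsed time versus the baseline."""
    return 100.0 * (baseline_s - optimized_s) / baseline_s

baseline = 14 * 60 + 41.296   # 14m41.296s, code block 2
optimized = 8 * 60 + 5.952    # 8m5.952s, code block 5
print(round(reduction_pct(baseline, optimized), 1))  # 44.9

# The 12k-iteration profiling runs show the same pattern: 150 s -> 110 s
# from Intel MLSL alone, matching the ~27 percent in the conclusion.
print(round(reduction_pct(150, 110), 1))  # 26.7
```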

System Configuration

Performance testing for the results provided in this paper were achieved on the following test system. For more information go to Product Performance.

Component              Specification
System                 2-socket server
Host Processor         Intel® Xeon® Platinum 8180 processor @ 2.50 GHz
Physical Cores         28 cores/socket
Host Memory            96 GB/socket
Profiler               Intel VTune Amplifier 2018 Update 1
Host Operating System  Linux* version 3.10.0-514.16.1.el7.x86_64
Caffe version          1.1.0 (GitHub)

Conclusion

In this article we demonstrated how Intel VTune Amplifier can be used to identify NUMA and thread-spinning issues in a neural network training workload running with Caffe. With the help of Intel MLSL and tuning of the number of OpenMP threads, we were able to achieve out-of-the-box performance improvements without actually modifying the Caffe framework or any of the model hyperparameters.

Using distributed training with Intel MLSL, we were able to reduce the ratio of remote to local memory access, and thus improve the elapsed time by approximately 27 percent. Also, by reducing the number of OpenMP threads for the current workload, we were able to achieve even lower elapsed times by reducing inefficient spin time. However, it is important to note that this might not be true for every other training model trained with Caffe. Ideally, even with this workload, we should be able to use all the cores in parallel efficiently without a lot of imbalance or serial spinning time, but that would require some code refactoring.

References

Lightweight Virtualized Containers for Network Function Virtualization (NFV)


Introduction

In late 2017, the Kata Containers* project was announced. This project is in development now and is part of an emerging group of technologies based around the concept of wrapping container workloads in extremely lightweight virtual machines (VMs). These technologies combine the speed and flexibility of containers with the isolation and security of VMs, making them ideal candidates for busy multitenant deployments.

Meanwhile in the network function virtualization (NFV) space, containerization is proving to be an interesting alternative to full-function VMs for virtual network functions (VNFs).

These areas are combining at a fortuitous time. This article introduces the idea of lightweight virtualized containers for NFV usage, explaining how they fit into existing container technology. It also describes how integrating NFV-friendly technologies like the Data Plane Development Kit (DPDK) and Vector Packet Processing (VPP) is not too heavy of a lift for NFV developers and operators.

Demystifying Container Management

The Kata Containers project descends from two other successful virtualized container projects: Intel® Clear Containers and Hyper.sh* container runtime, runv. Launched under the governance of the OpenStack* Foundation, the project takes the best parts of both projects as well as contributions from those who want to pitch in to make the best hypervisor-driven container workload driver stack. The project is currently in development, with the first release anticipated in the first half of 2018. For now (as of this writing), if you'd like to experiment with lightweight virtualized container workloads, we recommend trying out an existing runtime, for example, Intel Clear Containers.

Let's take a look at the components of a container management system, because the parts, especially what's called a runtime, can be a bit overwhelming in the face of rapid development and change. The diagram in Figure 1 comes from the Intel Clear Containers project. We'll reference it as we work through the various generic components.

Components of one container system
Figure 1. Components of one container system (Docker* and Intel® Clear Containers).

The Components of Container Management

The OCI specifications

Before we get into the actual components, it is important to understand the Open Container Initiative (OCI) specifications, which all of the projects under discussion adhere to. There are two major specifications: runtime and image.

The image specification is the easiest one to understand. It defines how to package and distribute a container image that runs a specific workload. Whenever you use docker pull to fetch a container image from the Internet or even from a local container image registry, you are likely fetching an OCI-compliant image. The key takeaway is that OCI container images should run via OCI-compliant runtimes regardless of the underlying container technology.

The runtime specification takes a bit more explanation. It does not specify the API or command set that is used to launch and manage containers. That is the province of the container management system, such as Docker*, or CRI-O (more about this later). It also doesn't specify the technology used to launch container workloads, such as Linux* cgroups or Virtual Machine Managers (VMMs). What it does specify is the characteristics of a runtime as a program that launches and destroys containers, independent of either operating system or hardware. A working implementation of this system was provided to the OCI by Docker, in the form of runc.

Component: runtime

The runc runtime was capable of taking Docker-issued commands and running Linux cgroup-based containers (normal or bare-metal containers). Other container management and orchestration systems, notably Kubernetes* and the CoreOS rkt* project, developed OCI-compliant abstracted runtimes of their own. The abstraction meant that while the runtime implemented the OCI-specified independence from the operating system and hardware, they went a step further and abstracted the container, network, and volume implementations as well.

Container runtimes of multiple varieties
Figure 2. Container runtimes of multiple varieties.

Abstracted runtimes

Abstracted runtimes have led to a much more open container management ecosystem. Docker has also developed an abstracted runtime (our term, not an industry standard) called containerd. It is visible in the above diagram of an Intel Clear Containers implementation. Kubernetes' abstracted runtime is known as CRI-O. There is yet another abstracted runtime in development called cri-containerd that aims to unify Docker and Kubernetes management into a single runtime.

Specification-compliant runtimes

If the abstracted runtimes implement the OCI spec for container management and orchestration systems but abstract away the specifics of container technology, another runtime will be needed to actually launch and destroy containers. These spec-compliant runtimes (again, not an industry term) tend to be aimed at launching and managing a particular kind of container technology. This is where the overloading of the term runtime can get rather confusing.

Examples of this type of runtime are the original runc implementation from Docker, the cc-runtime from Intel Clear Containers, runv from Hyper.sh, and several more. The future Kata Containers runtime will also fall into this category and will work with several of the abstracted runtimes previously mentioned.

Component: shims

Up until this point, we have been discussing runtimes. Abstracted runtimes generally consist of a single process running as a daemon on the host that is launching containers. The specification-compliant runtime that actually launches container processes generally does its job once per container, and then exits. Shims are launched per-container as well and maintain the small number of open communication channels that are needed to keep a container in contact with the management system and available for use. They exit when the container exits.

In Figure 1, two shims are shown: containerd-shim and the Intel Clear Containers shim. The containerd-shim is designed to work with runc to set up the I/O communication with the container process. Since it is not natively set up to work with cc-runtime, the Intel Clear Containers shim is required to broker this interaction.

The Intel Clear Containers shim, called cc-shim, forms a connection between the abstracted runtime and the proxy (see below), which is a necessary component of VM-based container implementations. Since containerd doesn't have a native method of interacting with the proxy, cc-shim or its equivalent in other systems brokers this communication.

In general, a shim component in a container management system performs this kind of translation between other components, on a per-container, persistent basis.

Component: agent

The agent is a unique component of VM-based container systems. It is a daemon that runs inside each container VM, and its purpose is to configure the VM on boot to load and run the container workload correctly. It also maintains communication with the proxy.

Component: proxy

Container management systems that work directly with Linux kernel cgroups (normal containers) can set up I/O channels, networking devices, volume mounts, and so on without needing to communicate with a different operating system running inside a VM. Therefore, they do not need this component. VM-based systems do need a proxy to handle inside-the-VM configuration and structures. For example, mounting a persistent volume requires both external preparation (configuring the hypervisor for the virtual volume device) and internal preparation (the volume mount). The proxy communicates with the agent to handle internal configuration items.

Component: hypervisor and virtual machine

The hypervisor/VMM and VM used in a container system are specialized. The VM needs a highly tuned, lightweight kernel that boots in milliseconds, instead of the full operating systems in ordinary VMs that can take several seconds (or longer) to boot. To achieve this, the hypervisor is tuned to strip out any pieces of device emulation or passthrough that are not useful for the container's operation. For example, only one type of network device needs to be probed for, since the VM kernel supports only that device type. Similarly, CD-ROM drives do not need to be probed for in a container VM.

This is how lightweight container VMs are created and why they function at very close to parity with bare-metal cgroup-based containers. Only the most relevant and needed portions of the VM system are retained. Intel Clear Containers also works with some additional capabilities like Kernel Shared Memory (KSM) to further speed operation. KSM keeps read-only binaries that are shared by all the containers on the system, such as the container kernel, in a single memory range on the host.

Component summary: composability

There are many different moving parts in a container management system. To some degree this is due to the history of how containers came to be popularized and how the various dividing lines have broken down over this history. In general, a goal of many containerization projects is composability, meaning that each of these components can swap in different binaries without reducing or breaking the capability of the overall system. In reality, things are not quite there yet.

In the next section, we'll see how one element of composability makes NFV-friendly workloads not only possible, but also relatively simple to implement in a virtualized container system.

So, What about NFV?

Virtcontainers

Here is an interesting fact: most of the systems and components that we've described in the previous section are written in Go*. There are good reasons for that, but for the NFV world, the real benefit is that container systems that are written in Go can utilize the virtcontainers Go language library to handle networking and volume connections.

Virtcontainers is now a sub-project of Kata Containers. It was brought into that project from Intel Clear Containers, for which virtcontainers was originally developed. Therefore, both Intel Clear Containers and the forthcoming Kata Containers will link against virtcontainers.

Here is the important part: virtcontainers natively supports:

  • SR-IOV (Single Root I/O Virtualization), via vfio devices
  • DPDK poll-mode and vhost devices
  • FD.io VPP

These technologies are critical for the NFV industry. Providing these capabilities out-of-the-box makes it that much easier for NFV to take the leap from cgroup-based containers to VM-based containers.

A Closer Look at Container Networking

Virtcontainers provides support for both the Container Network Model (CNM) and the Container Network Interface (CNI). Docker uses CNM for plug-in-based networking in its container system. The CNI does the same for CoreOS* and Kubernetes.

Let's take a high-level look at how the CNI works with a VM-based containerization system (see Figure 3).

Container Network Interface for virtual-machine-based containers
Figure 3. Container Network Interface (CNI) implementation for virtual-machine-based containers.

As shown in the figure, the generic runtime here is the per-container specification-compliant runtime, that is, cc-runtime or runv or the to-be-named Kata Container runtime. The CNI implementation, libcni, is a part of this runtime.

In step 1, the runtime creates the blue-bordered network namespace, which should be a reasonably familiar feature to NFV operators. This namespace contains all devices associated with the VM. In step 2, the configuration required for the container is read from the CNI configuration files, which is where information specific to the plug-in will be obtained.

The plug-ins for CNI are how networking is actually implemented for all containers on the host system. Native interface plug-ins are available such as bridge, ptp (veth pair), vlan, and so on. In the current state, Intel Clear Containers doesn't support all interface plug-ins, but Kata Containers does aim to support all of them. There is also a wide variety of meta-plug-ins and many different types of third party plug-ins. These plug-ins are how NFV-friendly technologies like those previously mentioned are implemented for CNM/CNI. For example, here are links to the SR-IOV and DPDK-vhostuser plug-in repositories.

All of this is part of the CNI static configuration on the host. Nothing changes for the parts of the networking system that we're setting up for the container, regardless of the plug-in configuration. To continue the outlined process, in step 3 the runtime communicates with the configured plug-in to start network service for the container. A device is created, in this case cni0, and a veth pair is set up between that device and the container's network namespace.

From here, the rest is plumbing for the VM. In step 4, a bridge inside the namespace is created, a tap device is plumbed to the bridge for the VM to use with standard virtio drivers, and the previous veth pair endpoint is plumbed to the bridge as well. With that path for traffic established, in step 5 the VM and container workload are started inside the network namespace.

Conclusion

Container technology continues to be an exciting area of development for data centers and for NFV. Later this year, a Kata Containers release will be available that implements industry-standard lightweight VM-based containers. This will offer the security and isolation of VMs with the speed and flexibility of containers, using the same container management tools.

Until Kata has a release, Intel Clear Containers is available to try out the technology, and most of what we've discussed is available in that project.

NFV developers and operators can take advantage of these systems quickly since NFV-friendly technologies are baked in and are independently available as plug-ins to the CNM and CNI networking interfaces used in Docker, Kubernetes, and other container management and orchestration systems.

About the Author

Jim Chamings is a senior software apps engineer in the Developer Relations Division at Intel. He works with cloud and NFV developers, operators, and other industry partners to help people get the most out of their data centers and cloud installations.

References

Kata Containers

Intel Clear Containers

Hyper.sh

The Openstack Foundation

Open Container Initiative

Kubernetes

CoreOS rkt

CRI-O

Go

virtcontainers

Container Network Model

Container Network Interface

CNI plug-ins

SR-IOV plug-in for CNI

DPDK-vhostuser plug-in for CNI


Intel Software Innovators at GDC 2018


Manic parrot shootouts, Terminator-style storylines, and “how to be a Viking” tutorials were among demos by Intel® Software Innovators at GDC 2018 in the Intel booth – all using the new Hades Canyon NUC for smooth VR gameplay. Shadow puppets and fantastic alien worlds were also on tap in the Intel Indie Lounge. In this article, we'll take a look back at all the Innovator goodness that was showcased at GDC 2018.

Viking Days - Pedro Kayatt, VRMonkey

Intel Software Innovator Pedro Kayatt of VRMonkey Studios showed off his innovative game Viking Days, essentially old school Oregon Trail gameplay applied to Viking daily life. Listen to Pedro talk about this game in the video below:

Parrots on Java - Thomas Endres, TNG Consulting

Combine manic parrots with a first person shooter experience in VR and you've got Parrots on Java, an interactive arcade style game demoed by Thomas Endres and Christoph Bergemann of TNG Consulting.  Listen to Thomas and Christoph discuss this game below:

The End of Tomorrow - Alice Mo, Edsenses

A futuristic dystopian landscape full of Terminator-style robots, zombies, and giant robot spiders? Sign me up! Watch below as Alice and Eric discuss their VR game "The End of Tomorrow":

Shadow Fencer Theatre - Alex Schuster, ShuddaHaddaLottaFun

This innovative game, showcased in the Intel Indie Lounge area at GDC, is a 1-2 player awkward physics sword fighting game set in the world of shadow puppets. More info:

"Take control of a unique cast of characters and prove yourself the best performer on the shadowy screen. Take the stage in Story to fill the role of an apprentice ready to show the world you are the next Shadow Puppet Master. Go solo or against friends in Improv, deciding the cast, stage and direction of the shadow fencer fight. Prove you can make the cut in Marathon, competing in a seemingly endless battle with a limited number of takes."

Watch the teaser trailer below:

The World Next Door - Corey Warning, Rose City Games 

The World Next Door  is a narrative-driven game that centers around Jun, a rebellious teen girl trapped in a parallel world inhabited by magical creatures. Follow this game's creation at The World Next Door

Want to know more about the Intel Software Innovators? 

You can read about our innovator updates, get the full Innovator program overview, meet the innovators and learn more about innovator benefits. We also encourage you to learn more about our Black Belt Software Developer program as well as our Student Ambassador program. Also check out Developer Mesh to learn more about the various projects that our community of innovators are working on.

 

Optimizing the User Experience of VR Games on IA for e-Sports


Finn Wong, Intel Corporation

Edward Wu, Smellyriver

 

Abstract

VR, as a new form of interaction between humans and the virtual world, immerses users by making them feel as if they are experiencing the simulated reality. However, this is not easy to achieve, considering that the frame rendering budget for VR is 11.1 ms per frame (90 fps) and the entire scene must be rendered twice (once for each eye). In this session, we focus on both performance and user experience optimizations for VR games, introducing techniques we used in a premium arena VR game called "Code51" to minimize motion sickness and increase user playtime in VR, as well as the optimizations and differentiations made in Code51 to improve the experience of both players and spectators throughout the game.

 

Introduction

Code51 is the first worldwide mech arena VR game, supporting Oculus Rift, HTC Vive, PSVR, and Pico VR. It allows up to four-versus-four combat among players worldwide and is specifically tailored for VR e-sports, with a nausea-minimizing gameplay design and a built-in VR spectator mode. Code51 has already been released in over 3,000 VR arcades and experience centers in China (ZamerVR & VRLe) and is targeted for release on the PlayStation Store, Oculus Store, and Steam in Q2’18.

Intel worked closely with Smellyriver to optimize the user experience and performance of the game. Moreover, Intel and Smellyriver added richer visual and audio enhancements enabled by Intel Core i7 processors, including features like 3D audio, object destruction, enhanced CPU particles, and additional background objects.

In this article, we’ll describe seven design points in Code51 that can help increase both the immersion and user experience of VR games.

 

Design Points of Immersive VR Games

Moving immersively in VR

Currently there are four classes of immersive motion tracking systems to drive player movements in the virtual world of a game, which are:

  • Teleport + Six DoF Tracking (e.g. Robo Recall)
  • Virtual Cockpit (e.g. EVE: Valkyrie)
  • Locomotion Simulator (e.g. Virtuix Omni)
  • Large Scale Tracking System (e.g. OptiTrack)

All of these solutions have different pros and cons. For Code51, the virtual cockpit was chosen as the way to move in VR for the following reasons:

  1. Moving continuously is an important way to improve immersion in VR, since it matches our real-world experience, and the virtual cockpit is the only approach that allows continuous movement in VR without extra hardware and cost.
  2. Since current sales of premium VR headsets are still modest, a movement scheme that is also compatible with 3DoF headsets reduces the engineering effort of porting the game to those devices.
  3. Code51 is an arena VR game in which a player sits in the cockpit of a mech and fights others in VR. This "sitting down" experience perfectly matches the virtual cockpit scheme and increases immersion.
  4. Playing seated with a gamepad is less tiring than standing with motion controllers, so users can play longer in VR[1]. Fig. 1 shows users playing Code51 with various VR headsets.

Fig. 1. Code51 supports various VR headsets. Left: HTC Vive (6 DoF). Right: Pico VR (3 DoF).

 

Alleviating motion sickness in VR

Motion sickness[2] is one of the main factors preventing users from experiencing VR for long periods of time. Several factors cause it, including:

  • Visual stress caused by the viewer's vergence-accommodation conflict[3]
  • A VR scene without directional clues or references (i.e., no hints about the current moving direction)
  • Low FPS or high MTP latency
  • Acceleration mismatch between what you see and what your body feels
  • Angular velocity
  • Zoom in & zoom out
  • VR blur

To address these factors, several approaches were adopted in Code51:
  • UI Design
    • Put the rendered control panel 1 meter or more away from the user to avoid frequently changing the user's vergence distance.
  • Level Design
    • Provide clear directional cues in the scene so the user consistently knows which direction he or she is moving; this prepares the brain for the movement and reduces dizziness. In practice, avoid adding plainly textured objects that can block the user's whole view, since they interrupt the visual cues for the moving direction and effectively increase the variation in perceived speed.
  • Rendering Performance
    • Optimize performance so Code51 renders stably at 90 fps, minimizing MTP latency.
  • Reduce Acceleration
    • Avoid acceleration in Code51 as much as possible to reduce dizziness; change velocity between levels instantly. For example, apply acceleration only for a short period to initiate an action, then maintain constant speed while jumping or landing.
  • Reduce Angular Velocity
    • Limit the ability to rotate (angular velocity) at high speed; the cockpit acts as a reference frame and helps reduce perceived optical flow.
  • Dynamically reduce the field of view (FOV)
    • Restricting the FOV in VR helps reduce motion sickness[4][5]. Code51 adopts this method: the depth buffer is used to calculate the instantaneous velocities at the 4 corners of the screen, and the vertices are then warped inward accordingly in the stencil buffer to subtly reduce the FOV during movement. The faster the movement, the narrower the FOV, as shown in Fig. 2. This reduces the equivalent magnitude of optical flow perceived by end users.
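The speed-to-FOV mapping behind this technique can be sketched as a simple scale function. This is an illustrative sketch, not Code51's actual implementation; the speed thresholds and the 0.7 minimum scale are made-up example values.

```cpp
#include <algorithm>

// Hypothetical sketch of speed-driven FOV restriction: the faster the
// movement, the narrower the field of view. Thresholds and the minimum
// scale are illustrative values, not Code51's actual tuning.
float FovScaleForSpeed(float speed,
                       float minSpeed = 5.0f,   // below this, keep the full FOV
                       float maxSpeed = 30.0f,  // above this, clamp to the narrowest FOV
                       float minScale = 0.7f) { // narrowest FOV as a fraction of full
    float t = std::clamp((speed - minSpeed) / (maxSpeed - minSpeed), 0.0f, 1.0f);
    return 1.0f - t * (1.0f - minScale);  // linear blend between 1.0 and minScale
}
```

In an engine, a scale factor like this would drive how far the stencil-buffer vignette is warped inward each frame.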

Fig. 2. Dynamic FOV adopted in Code51 to eliminate the optical flow at the edges of the screen, helping to reduce the motion sickness induced in end users.

 

Minimizing network latency

For a VR application it is also important to minimize network latency so that gameplay stays fluent and lag-free, since lag can itself induce dizziness. In Code51, all mech motion is predicted locally and then refined through low-frequency synchronization with the server, which avoids frequent interruptions under poor network conditions. If synchronization fails, the client extrapolates the previous trajectory to obtain the new position.
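The extrapolation fallback amounts to simple dead reckoning: keep integrating the last confirmed velocity until the next successful sync. A minimal sketch, with a hypothetical Vec3 type and function name (not taken from Code51's source):

```cpp
// Dead-reckoning sketch: when a server sync is missed, continue the last
// known trajectory so a remote mech keeps moving smoothly. Vec3 and the
// function name are illustrative, not Code51's actual networking code.
struct Vec3 { float x, y, z; };

Vec3 ExtrapolatePosition(const Vec3& lastPos, const Vec3& lastVel, float dtSinceSync) {
    // position = last confirmed position + last confirmed velocity * elapsed time
    return { lastPos.x + lastVel.x * dtSinceSync,
             lastPos.y + lastVel.y * dtSinceSync,
             lastPos.z + lastVel.z * dtSinceSync };
}
```

When the next authoritative server state does arrive, the predicted position would typically be blended toward it at a low frequency rather than snapped, to avoid visible pops.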

 

Enhancing spectator viewing experience for e-Sports

To create a better viewing experience for the e-Sports audience, a built-in VR spectator mode was implemented in Code51. The spectator mode acts like a client in the game and allows the audience to watch a live battle in either stereo or mono mode. Spectators can watch with or without a VR HMD, and view from any angle and position they want using keyboard, mouse, or gamepad control, as shown in Fig. 3.

Fig. 3. VR spectator mode in Code51. End users can view a live battle in VR or non-VR mode.

 

Maintaining the sharpness of the rendered scene

Since the pixels per degree (PPD) of current VR HMDs is still low compared to conventional displays, it's critical for VR apps to preserve the sharpness of the rendered scene as much as possible to minimize VR blur and reduce dizziness. Three types of anti-aliasing (AA) approaches are currently available to UE4 developers:

  • Temporal AA (TAA)
    • TAA and its derivatives deliver good AA quality for static environments at moderate computational cost. If TAA is used in VR apps, it must be computed in an additional pass for non-opaque objects, since no depth information is available for them. It is also better to adopt conservative parameter settings for TAA in VR to reduce the blur generated during head movement or in dynamic scenes.
  • Multi-Sample AA (MSAA)
    • MSAA is available only in the forward rendering pipeline; it is adopted in Code51 to minimize the VR blur introduced by AA. 4x MSAA is good enough for VR.
  • Screen Space AA (SSAA)
    • SSAA has the highest computational cost but achieves the best quality; 1.4x SSAA is also used in Code51.

 

CPU performance optimization

CPU optimization is also critical to the VR experience, to ensure consistent and fast submits to the graphics pipe and to prevent stalling the VR scene rendering. To minimize the CPU boundedness of a DX11 VR game made with UE4, reduce the render thread workload as much as possible[6]. The RHI thread for DX11 in UE4 (4.20+) helps reduce CPU render thread overhead through the D3D11 deferred rendering context[7], and should be adopted whenever possible.

In addition, there are various approaches to optimize the render thread workload in UE4[8], where some of them are deployed in Code51:

  • Reduce draw calls (in our experience, better to stay below 2,000 for the forward rendering pipeline on premium VR; the more the GPU is optimized (less GPU computation time per frame), the larger this budget can be)
  • Optimize visibility culling (InitView)
    • Modify assets to reduce dynamic occlusion culling (hardware occlusion queries or hierarchical z-buffer culling), since it is computationally inefficient to calculate visibility culling for an object of which only a small portion appears on screen. An example of this optimization is shown in Fig. 4.
    • Use precomputed visibility culling[9] to reduce the number of primitives that must be processed by dynamic occlusion culling at runtime. It is not very efficient in Code51, however, because mechs can jump and fly in the game: the statically occluded primitives (~300) are only 20% of the total occluded primitives when precomputed visibility culling is enabled.
    • Masked Occlusion Culling (Github/Paper) is an occlusion culling approach implemented on the CPU with SIMD acceleration (SSE, AVX, and AVX-512). It is an alternative to hierarchical z-buffer culling that can be parallelized and computed efficiently on modern multi-core CPUs.
  • Reduce dynamic lighting; turn off shadow projection for dynamic lighting, or use capsule shadows instead when the budget allows
  • Use light baking as much as possible
  • Use LOD & HLOD; keep tris < 2.5M for premium VR
  • Use particle LOD, and don't use non-opaque particles at LOD0
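The idea of not spending expensive occlusion work on objects that occupy only a small part of the screen can be sketched with a bounding-sphere screen-coverage test. This is a generic illustration under assumed thresholds, not code from Code51 or UE4:

```cpp
#include <cmath>

// Estimate what fraction of the vertical FOV a bounding sphere covers,
// so per-object occlusion queries can be skipped for tiny objects.
// Generic illustration; not Code51 or UE4 source.
float ProjectedScreenFraction(float sphereRadius, float distance, float fovYRadians) {
    if (distance <= sphereRadius) return 1.0f;  // camera inside or touching the sphere
    float angularDiameter = 2.0f * std::atan(sphereRadius / distance);
    return angularDiameter / fovYRadians;       // fraction of the vertical FOV covered
}
```

An engine might issue the full hardware occlusion query only when this fraction exceeds some small threshold, and otherwise rely on cheaper frustum or distance culling.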

Fig. 4. An asset modified from the version on the left to the one on the right, avoiding culling calculations for objects behind the asset that could be seen through the original.

 

Differentiation and deepening the immersive experience

Last but not least, one key way to optimize the user experience of a VR app is to utilize all the computational resources available on a hardware platform in order to deliver the best experience on that platform. Well-optimized CPU compute and higher-thread-count CPUs let developers employ the CPU to deepen the VR immersion. For example, a user with an Intel Core i7-7700K has more CPU threads and resources than a user with an Intel Core i5-4590 (the minimum CPU spec for Oculus Rift without using ASW). The experience of a VR app can thus be improved by adding extra CPU-intensive visual and audio features that consume those otherwise idle resources.

In Code51, we implemented several CPU-enhanced features to better utilize the CPU resources available on high-end Intel Core i7 CPUs. These features include 3D audio, object destruction, CPU particle enhancement, and additional background effects, as shown in the following video:

Most of the CPU computation for the visual and audio enhancements is offloaded to the worker threads or the audio simulation thread in UE4, including physics simulation performed by PhysX and ray tracing for physically based sound propagation performed by the Steam Audio plugin (occlusion and environmental audio). These features significantly increase the immersion of Code51 on high-end CPUs without a performance drop, because most of the computation is offloaded to idle CPU cores and the impact on the critical rendering path (CPU render thread and GPU) is minimized.

Fig. 5 shows the corresponding frame rate data for Code51. The game runs smoothly on both the Intel Core i7-7700K and the Intel Core i7-7820HK with the enhanced features on, but drops frames significantly on the Intel Core i5-4590 with the same settings, implying that the i5 CPU cannot handle all the enhanced features within 11.1 ms per frame. Users with minimum-spec CPUs can still maintain a good experience by turning off all the enhanced CPU features (the Low quality setting). According to the performance data, the Intel Core i7-7700K shows a 27% fps (frames per second) benefit over the Intel Core i5-4590 when all CPU-enhanced features are turned on in Code51.

Fig. 5. The frame rate of Code51 running on different CPUs (Intel Core i5 and  i7 processors) and quality settings, where Ultra High is the one with CPU enhanced features and Low is the one without. Test systems:  Intel Core i5-4590, NVIDIA GTX1080, 2x4GB DDR3-1600, Windows 10 version 1703; Intel Core i7-7700K, NVIDIA GTX1080, 4x4GB DDR4-2400, Windows 10 version 1703; Intel Core i7-7820HK, NVIDIA GTX1080, 4x4GB DDR4-2400, Windows 10 version 1703.

Figs. 6 and 7 show GPUView screenshots of Code51 in Ultra High quality running on an Intel Core i7-6800K with its cores set to 4C4T and 6C12T respectively (using msconfig), at the same CPU frequency. The charts show that in the 4C4T configuration (a proxy for 4C4T Intel Core i5 processors), frames cannot be rendered within 11.1 ms and frame dropping occurs, while the game runs smoothly in the 6C12T configuration.

Here are the corresponding percentage increases in CPU computation on different threads of Code51, running on the same Intel Core i7-6800K in the two configurations (6C12T vs. 4C4T), as observed in Windows Performance Analyzer (WPA):

Total CPU workload:       43%↑

Render thread:                44%↑

Driver thread:                  10%↑

Game thread:                  13%↑

Worker threads:               89%↑

A significant amount of the CPU work for the enhanced features is offloaded to the worker threads in Code51, leading to the largest computation increase on those threads.

Fig. 6. A GPUView screenshot of Code51 on Ultra High setting running on Intel Core i7-6800K and GTX1080. To demonstrate the scalable experience behavior of Code51, the CPU cores were set to 4C4T by msconfig in this case. Test system:  Intel Core i7-6800K, NVIDIA GTX1080, 4x4GB DDR4-2400, Windows 10 version 1703.

Fig. 7. A GPUView screenshot of Code51 on Ultra High setting running on Intel Core i7-6800K and GTX1080. There is no modification on CPU cores (6C12T) in this case, and Code51 employs the additional CPU compute resources to deliver a richer immersive experience. Test system:  Intel Core i7-6800K, NVIDIA GTX1080, 4x4GB DDR4-2400, Windows 10 version 1703.

 

Conclusion

Making VR users feel comfortable and alleviating motion sickness are two factors critical to the success of an e-sports VR game; only games with long playtime and engaged users can succeed in e-sports. Code51 adopted the various approaches described in this article to reduce motion sickness and deepen the immersive experience, while also creating a spectator mode that improves the viewing experience for the audience. For performance optimization, the render thread is usually the main CPU bottleneck in UE4 DX11 VR apps according to GPUView and WPA; standard game optimization methods and offloading tasks from the render thread to worker threads help here. It is also beneficial to leverage all available CPU resources to make your game stand out from the crowd by adding CPU-intensive features such as those implemented in Code51.

 

More about Code51:

http://www.51hitech.com/values/game

Follow us on Twitter: https://twitter.com/smellyriver

Subscribe to our YouTube channel: Code51 Smellyriver

 

Reference

[1] http://blog.leapmotion.com/taking-motion-control-ergonomics-beyond-minority-report/

[2] https://en.wikipedia.org/wiki/Virtual_reality_sickness

[3] Shibata, Takashi, et al. "Visual discomfort with stereo displays: effects of viewing distance and direction of vergence-accommodation conflict." Stereoscopic Displays and Applications XXII. Vol. 7863. International Society for Optics and Photonics, 2011.

[4] http://engineering.columbia.edu/fighting-virtual-reality-sickness

[5] Fernandes, Ajoy S., and Steven K. Feiner. "Combating VR sickness through subtle dynamic field-of-view modification." 3D User Interfaces (3DUI), 2016 IEEE Symposium on. IEEE, 2016.

[6] Finn Wong, “Performance Analysis and Optimization for PC-Based VR Applications: From the CPU’s Perspective,” 2016.

[7] https://msdn.microsoft.com/en-us/library/windows/desktop/ff476891(v=vs.85).aspx

[8] Finn Wong, “Unreal* Engine 4 VR应用的CPU性能优化和差异化,” 2017.

[9] http://timhobsonue4.snappages.com/culling-precomputed-visibility-volumes

 

About the authors

Finn Wong is a senior software engineer in the Intel Developer Relations Division (DRD). He has been responsible for VR content enabling and technical collaboration since 2015, helping VR developers optimize CPU performance and differentiate CPU content to deliver a truly immersive VR experience to end users. Finn has also been invited to deliver tech talks at various VR conferences, including CGDC, VRCORE, Tencent GDOC, Unreal Circle, Unity Unite, and Vision AR/VR Summit. Before that, Finn worked on performance optimization for H.264/H.265 and RealSense applications at Intel, and he has over 10 years of expertise in video coding, video analytics, computer vision, algorithms, and performance optimization, with several academic papers published in the literature. Finn holds a bachelor's degree in electrical engineering and a master's degree in communication engineering, both from National Taiwan University.

Edward Wu is the CEO and co-founder of Smellyriver Game Studio. He has many years of in-depth game development experience and has led the development of the well-known VR game Code51 since 2016.

Code Sample: Parallel Processing with Direct3D* 12


File(s): Download
License: Intel Sample Source Code License Agreement
Optimized for...
Operating System: Microsoft* Windows® 10 (64 bit)
Hardware: GPU required
Software (Programming Language, tool, IDE, Framework): Microsoft Visual Studio* 2017, Direct3D* 12, C++
Prerequisites: Familiarity with Visual Studio, the Direct3D API, 3D graphics, and parallel processing.
Tutorial: Parallel Processing with Direct3D* 12

Introduction

The idea behind this project was to provide a demonstration of parallel processing in gaming with Direct3D 12. It expands upon the results from the paper "A Comparison of the Intel® Core™ i5 Processor and Intel® Core™ i7 Processor with Visualizations in OpenGL* and Oculus* VR" (see References section) and extends the code there to contain a Direct3D 12 renderer. It also re-implements the previous particle system as a Direct3D 12 compute shader.

  1. Modify code to add a CPU Direct3D 12 renderer
  2. Moving to the GPU
  3. Closer look at differences between CPU and GPU

Get Started

Modify code to add a CPU Direct3D* 12 renderer

The first task is to add a Direct3D 12 "renderer" to the particle system used in the Intel Core i5 vs. Intel Core i7 article. The software design makes this easy, since it cleanly encapsulates the concept of rendering. The first step is to define an interface to the renderer and then write an event loop. To improve performance, I wrote a custom upload heap. Next I look at the compute shader and the actual Direct3D 12 rendering code, before discussing issues surrounding the vertex buffer view.

Moving to the GPU

We can improve performance by moving the renderer from the CPU to the GPU. Even better than dividing processing among separate threads is dividing it between processors: the CPU and the GPU.

After struggling to get every single field in every single structure of the CPU portion correct and consistent, I faced the prospect of two to three times more of that work for the GPU compute problem. I quickly decided the right thing to do was to look for some kind of "helper" framework and chose MiniEngine: A DirectX 12 Engine Starter Kit. I cover how I installed and customized MiniEngine for this project. With MiniEngine, the 500+ lines of code needed to render using the CPU are reduced to about 38 lines of setup code and 31 lines of rendering code (69 lines total) for the GPU, so the work paid off.

The GPU renderer consists of setup and rendering code. Setup consists of configuring the root signature and vertex inputs, and obtaining the formats used for color and depth; finally, I configure the graphics PSO and the view and projection matrices. Rendering is broken into obtaining the context, describing the transitions, and clearing the color and depth buffers before updating the matrix with new values and drawing the frame.

Closer look at differences between CPU and GPU

For best performance I use two buffers of particle data and render one while the other is being updated for the next frame by GPU compute. I briefly talk about this before taking a deeper look at changes required to implement a particle rendering system on the GPU, in particular, differences between the algorithms.
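The two-buffer scheme is a classic ping-pong arrangement: the renderer reads one buffer while GPU compute writes the other, and the roles swap each frame. A minimal index-tracking sketch (illustrative only, not the sample's actual code):

```cpp
// Ping-pong particle buffers: render from one while compute updates the
// other, then swap roles each frame. Illustrative sketch, not sample source.
struct ParticleBufferIndices {
    int renderIndex  = 0;  // buffer the renderer reads this frame
    int computeIndex = 1;  // buffer GPU compute writes this frame

    void SwapAfterFrame() {
        // The freshly updated buffer becomes next frame's render source.
        int tmp = renderIndex;
        renderIndex  = computeIndex;
        computeIndex = tmp;
    }
};
```

In the Direct3D 12 sample, a swap like this would also involve resource barriers transitioning each buffer between the compute-write (UAV) state and the vertex-buffer-read state.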

References

John Stone, Integrated Computing Solutions, Inc., A Comparison of the Intel® Core™ i5 Processor and Intel® Core™ i7 Processor with Visualizations in OpenGL* and Oculus* VR, https://software.intel.com/en-us/articles/compare-intel-core-i5-and-i7-processors-using-custom-visualization-and-vr-benchmark, 2017

John Stone, Integrated Computing Solutions, Inc., Parallel Processing with Direct3D 12, https://software.intel.com/en-us/articles/parallel-processing-with-directx-3d-12, 2017

Update Log

Created March 20, 2018

Developer Success Stories Library


Intel® Parallel Studio XE | Intel® System Studio | Intel® Media Server Studio

Intel® Advisor | Intel® Computer Vision SDK | Intel® Data Analytics Acceleration Library

Intel® Distribution for Python* | Intel® Inspector XE | Intel® Integrated Performance Primitives

Intel® Math Kernel Library | Intel® Media SDK | Intel® MPI Library | Intel® Threading Building Blocks

Intel® VTune™ Amplifier

 


Intel® Parallel Studio XE


Altair Creates a New Standard in Virtual Crash Testing

Altair advances frontal crash simulation with help from Intel® Software Development products.


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Envivio Helps Ensure the Best Video Quality and Performance

Intel® Parallel Studio XE helps Envivio create safe and secured code.


ESI Group Designs Quiet Products Faster

ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


F5 Networks Profiles for Success

F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


Golaem Drives Virtual Population Growth

Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


Lab7 Systems Helps Manage an Ocean of Information

Lab7 Systems optimizes BioBuilds™ tools for superior performance using Intel® Parallel Studio XE and Intel® C++ Compiler.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Massachusetts General Hospital Achieves 20X Faster Colonoscopy Screening

Intel® Parallel Studio helps optimize key image processing libraries, reducing compute-intensive colon screening processing time from 60 minutes to 3 minutes.


Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


NERSC Optimizes Application Performance with Roofline Analysis

NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


Nik Software Increases Rendering Speed of HDR by 1.3x

By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


Novosibirsk State University Gets More Efficient Numerical Simulation

Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Ural Federal University Boosts High-Performance Computing Education and Research

Intel® Developer Tools and online courseware enrich the high-performance computing curriculum at Ural Federal University.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® System Studio


CID Wireless Shanghai Boosts Long-Term Evolution (LTE) Application Performance

CID Wireless boosts performance for its LTE reference design code by 6x compared to the plain C code implementation.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


NERSC Optimizes Application Performance with Roofline Analysis

NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


Daresbury Laboratory Speeds Computational Chemistry Software 

Scientists get a speedup to their computational chemistry algorithm from Intel® Advisor’s vectorization advisor.


Novosibirsk State University Gets More Efficient Numerical Simulation

Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Intel® Computer Vision SDK


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


Intel® Data Analytics Acceleration Library


MeritData Speeds Up a Big Data Platform

MeritData Inc. improves performance—and the potential for big data algorithms and visualization.


Intel® Distribution for Python*


DATADVANCE Gets Optimal Design with 5x Performance Boost

DATADVANCE discovers that Intel® Distribution for Python* outpaces standard Python.
 


Intel® Inspector XE


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Envivio Helps Ensure the Best Video Quality and Performance

Intel® Parallel Studio XE helps Envivio create safe and secured code.


ESI Group Designs Quiet Products Faster

ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


Golaem Drives Virtual Population Growth

Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Intel® Integrated Performance Primitives


JD.com Optimizes Image Processing

JD.com Speeds Image Processing 17x, handling 300,000 images in 162 seconds instead of 2,800 seconds, with Intel® C++ Compiler and Intel® Integrated Performance Primitives.


Tencent Optimizes an Illegal Image Filtering System

Tencent doubles the speed of its illegal image filtering system using SIMD Instruction Set and Intel® Integrated Performance Primitives.


Tencent Speeds MD5 Image Identification by 2x

Intel worked with Tencent engineers to optimize the way the company processes millions of images each day, using Intel® Integrated Performance Primitives to achieve a 2x performance improvement.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® Math Kernel Library


DreamWorks Puts the Special in Special Effects

DreamWorks Animation’s Puss in Boots uses Intel® Math Kernel Library to help create dazzling special effects.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.

 


MeritData Speeds Up a Big Data Platform

MeritData Inc. improves performance―and the potential for big data algorithms and visualization.


Qihoo360 Technology Co. Ltd. Optimizes Speech Recognition

Qihoo360 optimizes the speech recognition module of the Euler platform using Intel® Math Kernel Library (Intel® MKL), speeding up performance by 5x.


Intel® Media SDK


NetUP Gets Blazing Fast Media Transcoding

NetUP uses Intel® Media SDK to help bring the Rio Olympic Games to a worldwide audience of millions.


Intel® Media Server Studio


ActiveVideo Enhances Efficiency

ActiveVideo boosts the scalability and efficiency of its cloud-based virtual set-top box solutions for TV guides, online video, and interactive TV advertising using Intel® Media Server Studio.


Kraftway: Video Analytics at the Edge of the Network

Today’s sensing, processing, storage, and connectivity technologies enable the next step in distributed video analytics, where each camera is itself a server. With Kraftway*, video software platforms can encode up to three 1080p60 streams at different bit rates with close to zero CPU load.


Slomo.tv Delivers Game-Changing Video

Slomo.tv's new video replay solutions, built with the latest Intel® technologies, can help resolve challenging game calls.


SoftLab-NSK Builds a Universal, Ultra HD Broadcast Solution

SoftLab-NSK combines the functionality of a 4K HEVC video encoder and a playout server in one box using technologies from Intel.


Vantrix Delivers on Media Transcoding Performance

HP Moonshot* with HP ProLiant* m710p server cartridges and Vantrix Media Platform software, with help from Intel® Media Server Studio, deliver a cost-effective solution that delivers more streams per rack unit while consuming less power and space.


Intel® MPI Library


Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® Threading Building Blocks


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Johns Hopkins University Prepares for a Many-Core Future

Johns Hopkins University increases the performance of its open-source Bowtie 2* application by adding multi-core parallelism.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Quasardb Streamlines Development for a Real-Time Analytics Database

To deliver first-class performance for its distributed, transactional database, Quasardb uses Intel® Threading Building Blocks (Intel® TBB), Intel’s C++ threading library for creating high-performance, scalable parallel applications.


University of Bristol Accelerates Rational Drug Design

Using Intel® Threading Building Blocks, the University of Bristol helps slash calculation time for drug development—enabling a calculation that once took 25 days to complete to run in just one day.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® VTune™ Amplifier


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


F5 Networks Profiles for Success

F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and Intel® Computer Vision SDK.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Nik Software Increases Rendering Speed of HDR by 1.3x

By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.

Enable Intel® Software Development Tools for HPC Applications Running on Amazon EC2* Cluster


1. Introduction

This article demonstrates how to scale out your high-performance computing (HPC) application compiled with Intel® Software Development Tools to leverage Intel® Xeon® Scalable processors hosted in the Amazon Elastic Compute Cloud* (Amazon EC2*) environment. We use the Cloud Formation Cluster (CfnCluster), an open source tool published by Amazon Web Services* (AWS*), to deploy a fully elastic HPC cluster in the cloud in less than 15 minutes. Once created, the cluster provisions standard HPC tools such as schedulers, a Message Passing Interface (MPI) environment, and shared storage.

The tutorial presented in this article targets two audiences: application developers who use the Intel® C++ Compiler and/or Intel® Fortran Compiler with the Intel® MPI Library to develop HPC applications and want to test how those applications scale across multiple HPC nodes, and application users who want to run binaries precompiled with Intel® Software Development Tools in an HPC environment in the cloud and thereby increase the throughput of their applications.

2. CfnCluster

CfnCluster is a framework that deploys and maintains HPC clusters on AWS.

In order to use the CfnCluster tool to set up an HPC cluster on AWS, you’ll need an AWS account and an Amazon EC2 key pair. On the local workstation, install and configure the AWS Command Line Interface (AWS CLI) and a recent version of Python* (Python 2 >= 2.7.9 or Python 3 >= 3.4).

The process to sign up for AWS and access the Amazon EC2 key pair is outside the scope of this article. The remainder of the article assumes the user has created an AWS account and has access to the Amazon EC2 key pair. For more information on creating an AWS account, refer to the AWS website. For additional information on creating an Amazon EC2 key pair, refer to these steps.

The following steps show you how to install and configure both AWS CLI and CfnCluster.

2.1 Install and configure AWS CLI

Assuming you have a recent version of Python installed on your workstation, one way to install the AWS CLI is with Python’s package manager, pip.

$ pip install awscli --upgrade --user

Once the AWS CLI is installed, configure it using the following command. To do this, you’ll need your AWS access key ID and secret access key. You can also select your preferred region for launching your AWS instances. More information about AWS CLI configuration can be found here.

$ aws configure
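As a sketch of what `aws configure` stores, the credentials can also be written non-interactively. The snippet below reuses the placeholder keys from the sample configuration in Figure 1 and the standard `AWS_SHARED_CREDENTIALS_FILE` environment variable to write a local credentials file and point the AWS CLI at it; substitute your real keys before use.

```shell
# Non-interactive alternative to the `aws configure` prompts: write a
# shared credentials file directly. Key values are placeholders.
export AWS_SHARED_CREDENTIALS_FILE="$PWD/aws-credentials"
cat > "$AWS_SHARED_CREDENTIALS_FILE" <<'EOF'
[default]
aws_access_key_id = AAWSACCESSKEYEXAMPLE
aws_secret_access_key = uwjaueu3EXAMPLEKEYozExamplekeybuJuth
EOF
grep -c '^aws_' "$AWS_SHARED_CREDENTIALS_FILE"   # prints 2: both entries written
```

Any AWS CLI command run in the same shell session will then read credentials from this file instead of ~/.aws/credentials.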

2.2 Install and configure CfnCluster

To install the CfnCluster, use pip as follows:

$ sudo pip install --upgrade cfncluster

Once CfnCluster is installed successfully, configure it using the following command:

$ cfncluster configure

The default location of CfnCluster configuration file is ~/.cfncluster/config. This file can be customized using the editor of your choice to select cluster size, the type of scheduler, base operating system, and other parameters. Figure 1 shows an example of a CfnCluster configuration file.

[aws]
aws_access_key_id = AAWSACCESSKEYEXAMPLE
aws_secret_access_key = uwjaueu3EXAMPLEKEYozExamplekeybuJuth
aws_region_name = us-east-1

[cluster Cluster4c4]
key_name = AWS_PLSE_KEY
vpc_settings = public
initial_queue_size = 4
max_queue_size = 4
compute_instance_type = c5.9xlarge
master_instance_type = c5.large
maintain_initial_size = true
scheduler = sge
placement_group = DYNAMIC
base_os = centos7

[vpc public]
vpc_id = vpc-68df1310
master_subnet_id = subnet-ba37e8f1

[global]
cluster_template = Cluster4c4
update_check = true
sanity_check = true

Figure 1. Sample CfnCluster configuration file

The configuration file shown in Figure 1 includes the following information about the cluster, which will be launched using the CfnCluster tool:

  • AWS access keys and region name for the HPC cluster
  • Cluster parameters
    • Initial queue size and max queue size. Initial queue size is the number of compute nodes made available for use when the cluster is first launched. Since CfnCluster provisions an elastic cluster, the number of compute nodes can vary as needed. Max queue size is the maximum number of compute nodes allowed for the cluster. For this tutorial we launched a four-compute-node cluster.
    • Master and compute instance types. The type of AWS instances launched for the master and compute nodes. Currently available instance types are listed on the AWS instance types webpage. For this article we selected Intel Xeon Scalable processor-based Amazon EC2 C5 instances: a c5.large master node (2 virtual CPUs, or vCPUs) for cluster management and running the job scheduler, and c5.9xlarge instances (36 vCPUs) as compute nodes.
    • Scheduler. The type of job scheduler to be configured and installed for this cluster. By default, CfnCluster launches a cluster with the Sun Grid Engine (SGE) scheduler. Other available options are the Slurm Workload Manager (Slurm) and the Torque Resource Manager (Torque).
    • Placement group. This determines how instances are placed on the underlying hardware. This is important for HPC applications where selecting the correct placement group can improve performance of applications that require the lowest latency possible. Placement group can be NONE (the default), dynamic, or a custom placement group (created by the user). Dynamic allows a unique placement group to be created and destroyed with the cluster.
  • Virtual Private Cloud (VPC) and subnets. A VPC is a virtual network dedicated to an AWS account that is logically isolated from other accounts or users. VPC ID and master subnet ID to be used for launching a CfnCluster can be referred from a user’s AWS console.

Additional information about other available options for CfnCluster parameters can be found on the CfnCluster configuration webpage.

3. Creating CfnCluster

Once the CfnCluster configuration file is verified, launch the HPC cluster as follows:

$ cfncluster create Cluster4c4

Upon successful cluster creation, the CfnCluster tool prints information such as the master node’s public and private IP addresses, which can be used to access the just-launched HPC cluster. It also provides a URL for monitoring cluster utilization with the Ganglia Monitoring System, an open source tool.

Status: cfncluster-Cluster4c4 - CREATE_COMPLETE
Output:"MasterPublicIP"="34.224.148.71"
Output:"MasterPrivateIP"="172.31.18.240"
Output:"GangliaPublicURL"="http://34.224.148.71/ganglia/"
Output:"GangliaPrivateURL"="http://172.31.18.240/ganglia/"

3.1 CfnCluster login

SSH into the master node using the public IP address. For CentOS* as the base operating system, the default user name is centos, and for Amazon Linux AMI* (Amazon Machine Image*), the default user name is ec2-user.

$ ssh centos@MasterPublicIP
Eg. $ ssh centos@34.224.148.71

4. Executing an HPC Job on an HPC Cluster

The performance chart on the Intel® MPI Library product page shows significant improvements over other open source Message Passing Interface (MPI) libraries such as Open MPI* and MVAPICH*. This is because the Intel MPI Library, which implements the high-performance MPI-3.1 standard, focuses on enabling MPI applications to perform better on clusters based on Intel® architecture. However, by default, CfnCluster configures and installs only the open source Open MPI library. To get increased application performance on Amazon EC2 C5 instances based on Intel Xeon Scalable processors, we must install the stand-alone runtime packages required to execute applications compiled with the Intel compilers and the Intel MPI Library: the Intel MPI Library runtime, the Intel compiler runtime, and/or Intel® Math Kernel Library (Intel® MKL).

4.1 Intel® MPI Library runtime installation

The Intel MPI Library runtime package includes everything you need to run applications based on the Intel MPI Library. It is free of charge and available to customers who already have applications enabled with the Intel MPI Library, and it includes the full install and runtime scripts. The runtime package can be downloaded from the Intel® Registration Center webpage.

Once the Intel MPI runtime package is downloaded, unzip and install it using the following commands. Here we assume the downloaded version of the Intel MPI Library is 2018 update 2. By default, the Intel MPI runtime library is installed in the /opt/intel directory, but we want the library to be shared by all nodes in the cluster, so we customize the install to change the install directory to /shared/opt/intel.

With the provisioned cluster, a shared NFS mount is created for the user and is available at /shared. This shared NFS mount is an Amazon Elastic Block Storage* (Amazon EBS*) volume whose content, by default, is deleted when the cluster is torn down. However, it is common to install frequently used HPC application software such as the Intel MPI runtime library to the /shared drive and snapshot the /shared EBS volume so that the same preconfigured software can be deployed on future clusters.

$ tar -xvzf l_mpi-rt_2018.2.199.tgz
$ cd l_mpi-rt_2018.2.199
$ sudo ./install.sh

The next step is to copy or download your precompiled MPI application package to the HPC cluster. For this tutorial we use one of the precompiled Intel MKL benchmarks, the High Performance Conjugate Gradients (HPCG) benchmark, as an example to show how to run an MPI application on the HPC cluster. A free stand-alone version of Intel MKL can be downloaded from the product page.

The HPCG Benchmark project aims to create a new metric for ranking HPC systems. It is intended as a complement to the High Performance LINPACK (HPL) benchmark, which is used to rank the TOP500 supercomputing systems. With respect to application characteristics, HPCG differs from HPL in that it exercises not only the computational power of the system but also its data access patterns. As a result, HPCG is representative of a broad set of important applications.
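HPCG reads its run parameters from an hpcg.dat file placed next to the binary: two free-form comment lines, the local grid dimensions, and the run time in seconds. The values below are illustrative only; check the hpcg.dat shipped with the Intel MKL benchmark package before relying on them.

```text
HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
104 104 104
60
```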

4.2 Intel® Math Kernel Library - installation

$ tar -xvzf l_mkl_2018.2.199.tgz
$ cd l_mkl_2018.2.199
$ sudo ./install.sh
$ cp -r /shared/opt/intel/mkl/benchmarks/hpcg /home/centos/hpcg

If your application does not have Intel MKL dependencies, you can skip the Intel MKL installation and simply copy your application package to the master node. In that case, you might still have to install other runtime dependencies, for example, the Intel compiler runtime library.

$ scp -r /opt/intel/mkl/benchmarks/hpcg  centos@34.224.148.71:/home/centos/hpcg

4.3 Intel compiler runtime library - installation

Redistributable libraries for the 2018 Intel® C++ Compiler and Intel® Fortran Compiler for Linux* can be downloaded for free from the product page. If you have already installed Intel MKL, you can skip installing Intel compiler runtime library.

$ tar -xvzf l_comp_lib_2018.2.199_comp.cpp_redist.tgz
$ cd  l_comp_lib_2018.2.199_comp.cpp_redist
$ sudo ./install.sh

4.4 List of compute nodes

In order to run an MPI job across the cluster, we will use a list of compute nodes (hostfile or machinefile) where the HPC job will be executed. Depending on the type of scheduler, we can configure this hostfile as follows:

// For SGE scheduler
$ qconf -sel | awk -F. '{print $1}' > hostfile

// For Slurm workload manager
$ sinfo -N | grep compute | awk '{print $1}' > hostfile
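On a running cluster these commands query the scheduler directly. As a self-contained illustration of the extraction step, the snippet below feeds two hypothetical node names through the same awk filter, which splits on "." and keeps only the short hostname:

```shell
# Simulated scheduler output (hypothetical node names); on a real cluster,
# pipe `qconf -sel` or `sinfo -N` output instead.
printf 'ip-172-31-18-240.ec2.internal\nip-172-31-18-241.ec2.internal\n' |
  awk -F. '{print $1}' > hostfile   # -F. splits on dots; $1 is the short name
cat hostfile
```

The resulting hostfile contains one short node name per line, ready to pass to mpirun via -machinefile.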

4.5 Job submittal scripts

We also need to create a job submittal file for launching MPI jobs using the scheduler. Figure 2 shows a sample job submittal for the SGE scheduler for launching HPCG across Intel Xeon Scalable processor-based compute instances (Amazon EC2 C5 instances).

#!/bin/sh
#$ -cwd
#$ -j y

# Executable
EXE=/home/centos/hpcg/bin/xhpcg_skx

# MPI Settings
source /shared/opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /shared/opt/intel/impi/2018.2.199/bin64/mpivars.sh intel64 

# Fabrics Settings
export I_MPI_FABRICS=shm:tcp

# Launch MPI application
mpirun -np 4 -ppn 1 -machinefile hostfile -genv KMP_AFFINITY="granularity=fine,compact,1,0" $EXE

Figure 2. Job submittal script – SGE scheduler (job.sh).

For the Slurm workload manager, the sample job submittal script is as follows:

#!/bin/bash
#SBATCH -N 4
#SBATCH -t 00:02:00 #wall time limit

# Executable
EXE=/home/centos/hpcg/bin/xhpcg_skx

# MPI Settings
source /shared/opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /shared/opt/intel/impi/2018.2.199/bin64/mpivars.sh intel64
 
# Fabrics Settings
export I_MPI_FABRICS=shm:tcp

# Launch MPI application
mpirun -np 4 -ppn 1 -machinefile hostfile -genv KMP_AFFINITY="granularity=fine,compact,1,0" $EXE

Figure 3. Job submittal script – Slurm scheduler (job.sh).

4.6 Launching the HPC job

Once the modifications to the job script are completed, execute the MPI application as follows:

// For SGE scheduler
$ qsub job.sh

// For Slurm workload manager
$ sbatch job.sh

4.7 Monitoring the HPC job

The HPC cluster provisioned using the CfnCluster tool also provides access to the Ganglia Monitoring System via the URL specific to your cluster. Using the Ganglia Public URL (http://<PublicIPAddress>/ganglia/), we can access the console view of the cluster to monitor utilization. Figure 4 shows the average cluster utilization for a one-hour time window.

Cluster utilization using Ganglia

Figure 4. Cluster utilization using Ganglia.

Additional information about the Ganglia Monitoring System can be found here.

5. Deleting CfnCluster

Once you log off the master instance, delete CfnCluster as follows:

$ cfncluster list           //To get list of clusters online
$ cfncluster delete Cluster4c4

However, if you would like to reuse the cluster and avoid installing the runtime libraries every time you create a cluster, it may be useful to create a snapshot of the shared drive /shared that resides on an Amazon EBS volume before deleting the cluster. More information on creating an EBS volume snapshot for cluster reusability can be found in this document published by Amazon Web Services.

6. Conclusion

This introductory article on HPC in the cloud demonstrates how the capabilities of Intel Xeon Scalable processors can be leveraged in an AWS cloud environment. We focused on how the CfnCluster tool from AWS can be configured and used to launch a four-compute-node HPC cluster. Using the steps provided in the article, HPC application developers or users can compile their applications using Intel® Parallel Studio XE on their workstations and then deploy and test the scalability of their application in the cloud using stand-alone runtime libraries (free for registered customers) and available job schedulers like SGE or Slurm. However, the application performance will vary depending on the type of compute instances selected, memory size, and the available network throughput between the compute nodes.

7. References

Digital Painting: A Fast-Paced Process


splash art for Final Star Dynasties - Space court
Figure 1. Star Dynasties* splash art.

In concept art, the shorter the time between an abstract idea and a concrete visual, the better. This applies to any industry where concept art is involved, but speed is especially important in the fast-paced development pipelines of the video game industry. Using a digital illustration I made for the indie game Star Dynasties*, I'll outline the steps of conceptualizing, gathering reference, sketching, and finalizing, with a focus on efficiency and speed.

The accompanying video shows some specific ways to use 3D elements, photo textures, and Photoshop* tools. Every artist has their own toolkit of techniques that best fits their style or the needs of a particular project. I hope this walkthrough provides some ideas that you can add to your own toolkit.

The Design Process

Step 1: Planning and Conceptualization

The first step in creating this illustration for Star Dynasties was to determine the purpose of the image. What does the image need to accomplish? In this case, it's the main splash art for the game, and is probably the first visual that a potential player will see. It should therefore represent the mood and perhaps genre of the game, drawing in the types of gamers that would be interested. Mood and genre can be established through color, framing/composition, subject matter, and style. In essence, it's telling a story with an image, which is what most concept art and illustration is about.

According to Star Dynasties creator Glen Pawley, the game is "a character simulation game set in a dark age future, where you play as the leader of a faction of space colonies through multiple generations of your dynasty. Your goal is to ensure that your house survives and grows in influence." It's a strategy game about quasi-medieval politics and personal relationships rather than military strategy, so it needs to have cover art that feels different from a first-person shooter. After the purpose of the image is established, it's time to start collecting some reference.

Collecting a robust library of reference can be very helpful at the start of a project, even before starting on sketches. It helps to establish an understanding between anyone involved with the project, whether it's artist and client, or art team and art director. Reference saves a lot of time and effort, not only as a guide for what to do, but also as a guide for what not to do. Collecting reference from games in the same genre can be a way to expand the creative horizons of your game, or to avoid visuals that have already been done (or overdone). The more that the mood, style, and content can be detailed before pencil hits paper, the less back-and-forth there will be between artist, client, art director, and so on.

Sometimes a small collection of images will suffice, but it may be useful to develop a more detailed and organized style guide if it's a team project involving multiple artists. However, more complete, comprehensive style guides are usually only made after some of the primary art on a project has been created. Besides benefiting the initial ideation, reference is of course helpful throughout the digital painting process, especially for realistic or semi-realistic styles. Reference is easy to obtain through web searches, with tools like the Handy art reference tool, or by making faces at yourself in the mirror. It's all about getting the best results in the quickest way, so easily accessible reference should always be taken advantage of.

Step 2: Composition and Thumbnails

After some initial ideas, goals, and reference have been established, the next step is thumbnail sketches. Sometimes an important illustration like cover art can have dozens of thumbnails and roughs completed before a direction for the final image is selected. In this case, there were only a handful, since the basic direction was determined beforehand: A ruler or royal family is in the foreground, with a space colony in the background, inviting the viewer to imagine themselves in the role of the character.

Final Star Dynasties grayscale rough sketches
Figure 2. Rough sketches, Star Dynasties*.

These initial rough sketches don't depict action or violence, but instead show or hint at the following key elements:

  • The multi-generational family dynasty aspect of the game.
  • The industrial style of the space colonies, which in the lore of the game were originally established to gather resources but are now used as permanent habitats.
  • The military ships, which are repurposed cargo or exploration vessels.
  • Hints at interpersonal relations, with the characters on the right engaging in possible courtroom intrigue.
  • The quasi-medieval theme, reinforced by the coat of arms, the pike-like weapon of the guard, and the royal cut of the central characters' clothing (not to mention the royal blue and gold colors added later on).
  • The authority of the main characters, which is established through a low angle, central placement, and their position surveying their colony/ships.

You may notice the use of 3D elements in these roughs. Using 3D (in this case, SketchUp*) in the thumbnail process can add extra time if all that's needed is a rough scribble. However, the basic spaceship models in these roughs had already been made for previous Star Dynasties illustrations, so it was faster to position them and take some screenshots than it was to sketch them out. Another potential reason for using 3D in the thumbnail process is that it can provide an organic way to explore perspective and composition options.

Zooming and panning in 3D space can open up new opportunities for interesting compositions. Even a very basic blocked-out version of a room, building, or other object can serve this purpose. For example, a castle could be as simple as a handful of 3D boxes and cylinders and still be used to explore potential compositions. Finding the right composition is a vital part of the thumbnail process. Not only does the right composition create a compelling image to look at, it also helps convey the story, idea, and mood.

In the rough images in Figure 2 above, #1 (on the top left) ended up being the final choice for the illustration. There are some problems with the other roughs:

  • #2: the windows act to stop the eye from traveling around the image freely. Where the other images feature the royal family looking down on their colony kingdom, this one has them looking up, which makes them feel less powerful.
  • #3 is good but lacks visual interest in the top right area.
  • #4 may have too much symmetry, which can make for a boring image. It becomes too focused on the middle, and the spaceships are too close in size and shape to the figures. More variety in size and shape is needed in the composition.

Here is how #1 functions compositionally: The symmetry first draws the eye to the characters at the center, and then to what they're looking at. The diagonal movement of the ships and the vertical direction of the banners help to lead the eye around the rest of the image, landing on secondary points of interest like the guard on the left, the nobles talking (or plotting?) on the right, and the background elements that depict the setting.

Step 3: Detailed Illustration

The final part of the process is bringing the image up to the level of detail needed. Some concepts can be rough sketches or rough color speed paintings, but as an illustration, this image needs to be taken to a state of polish. First, I fleshed out some of the 3D scene from the thumbnails using SketchUp. Some concept artists use 3D almost exclusively; some use nothing. A few Photoshop brushes and thirty minutes can generally create a more realized scene or character than thirty minutes in a 3D program. But if high levels of detail and/or realism are required, for example in cinematic key art, 3D renders can become very useful, sometimes even a necessity. In this illustration, SketchUp was used mainly to establish the perspective, and as a guide for the digital painting.

Sometimes I'll paint right on top of a 3D render, using some of the Photoshop coloring techniques that I go over in the video. But that usually involves a finished model from a 3D artist rather than these boxy blocked-out SketchUp creations of my own. SketchUp isn't the most advanced 3D tool out there, and its main use is for man-made shapes, not organic shapes. But it's free (or relatively cheap for the pro version) and intuitive, making it a favorable choice for artists who are inexperienced in 3D, and/or do not use 3D enough to justify the hefty price tags of other programs.

3D scene in SketchUp and final splash
Figure 3. 3D scene in SketchUp*, and final splash art, Star Dynasties*.

After fleshing out the 3D scene and updating the thumbnail rough based on the new screenshot, the next step is to add color. Keeping the major elements like figures, support beams, floor, buildings, ships, and planet on separate layers helps. Photoshop has a variety of ways to colorize a grayscale image. The tools I used in this illustration include gradient maps, color and overlay layers, and the Blend If feature, all of which I go over in the video. Gradient maps in particular are useful to apply cool, desaturated colors to the shadows of an area, and brighter, more vibrant colors to the highlights. A lot of detailing and painting went on after the initial colorization of the roughs for this image, but gradient maps saved time and effort at the start.

After establishing the colors, it's about applying all the details, the poses, lighting, expressions, and so on. The hand and head reference app, Handy art reference tool, is especially helpful because of its easy positioning and lighting tools. In this illustration, the rocky landscape, the cloth banners, and the metal support beams all involve photo textures, either applied as an overlay, or used as a base to paint on top of. Some of the details on the ships are custom shapes. Check out Long Pham's Gumroad* for some industrial and sci-fi custom shape packs. For the visible faces, I used the Handy art reference tool to help get the angle and lighting that I wanted. And the far-right figure's hands are just a photo of my hands with some color adjustments and a bit of painting on top.

That covers the overall process behind this illustration. Working quickly can involve a variety of tools and shortcuts, but it's also important to spend sufficient time on each step, especially when working with a client or a team. The early steps of gathering reference and finding the right composition, design, and style can greatly reduce time-consuming revisions later on in the process. If you're interested in a way to practice speed painting, Facebook groups like Daily Spitpaint can be useful, both as a way to practice and as a way to gain insight into how others tackle digital painting challenges on a tight schedule.

Coarse Pixel Shading with Temporal Supersampling


By Kai Xiao, Gabor Liktor and Karthik Vaidyanathan

The BISTRO scene

Abstract

Decoupled sampling techniques such as coarse pixel shading can lower the shading rate while resolving visibility at the full resolution, thereby preserving details along geometric edges. However, while these techniques can significantly reduce shading costs, they also reduce shading quality. In this paper we extend coarse pixel shading with a temporal supersampling scheme that notably improves image quality. We derive multiple shading and visibility samples by jittering each frame using a novel sequence that produces a suitable distribution of samples in both the shading and visibility domain, and temporally resolve these samples to enhance the shading resolution and reduce aliasing. We demonstrate a substantial reduction in shading cost compared to checkerboard rendering, which is another temporal supersampling technique widely used in games today.



Star Trek*: Bridge Crew and VR's New Frontiers


the bridge of a Star Trek spaceship

Star Trek*: Bridge Crew is the latest example of how new virtual reality (VR) titles are fully utilizing CPU and GPU power to provide immersive gameplay. This game, the newest title from one of the most important science-fiction franchises in gaming, also illustrates many of the key concepts described in Intel’s Guidelines for Immersive Virtual Reality Experiences, making it worth a deeper look.

The most successful VR titles provide an escape into other worlds that feel real. Intel® Developer Zone readers have already learned about Arizona Sunshine* and Lone Echo*, two games that illustrate how successful VR titles provide this immersive quality by following emerging standards documented in the guidelines. Both titles smartly created the illusion of alternate realities—in space or deep in a mine shaft. For this article, Star Trek: Bridge Crew provides another perspective.

Latest in a Long Line

More than 50 years after the 1966 network television premiere of Star Trek*, the franchise continues to (boldly) go from strength to strength. The original series was canceled after just three seasons, but through syndication, spin-off series, novels, comics, magazines, exhibitions, and a staggering 13 feature films, Star Trek has become one of the most recognizable titles in sci-fi history.

Fans have previously gone to great lengths to taste the Star Trek experience in real life, from attending far-flung conventions, to bidding at charity auctions for walk-on roles in the TV and movie productions. But that experience was available only to a lucky few, until now. Dubbed a "real-life holodeck" (the place where members of Star Trek: The Next Generation got to live out their fantasies), the Star Trek: Bridge Crew video game simulates the experience of piloting a Federation starship, using the 2009 reboot film Star Trek as a backdrop to the action.

Developed by Red Storm Entertainment, a Ubisoft Entertainment company, Star Trek: Bridge Crew was originally released in spring 2017 as VR-only for Windows* and Oculus* Rift, and the Sony PlayStation* VR. The non-VR version followed later that year. The game takes place just after the planet Vulcan has been destroyed; the crew is tasked with locating a new home for the few Vulcan people that remain, starting in a region of space occupied by Klingons. As in the movies and TV shows, the characters plot courses and navigate the ship, arm and operate weapons, and control the ship’s power and carry out repairs. Players can be local or online, and all but the captain can be computer-controlled.

Adding VR to this title takes fans closer than ever before to the heart of the Star Trek experience, putting them directly on the bridge. Put on the headset, and you're on the USS Aegis, in the command center, with that familiar widescreen view of space in front of you. Up to four players assume the essential roles of captain, engineer, helmsman, and tactical officer. A solo option is available, but the experience differs significantly; interaction between crew members is an important aspect of the game.

Many VR titles take players into worlds they have never seen before, with no existing frame of reference and no comparisons to be instantly made. But for Star Trek: Bridge Crew, it's not so much "where no game has gone before" as it is "going into a world that's very familiar to a lot of people with very firm ideas on how it should be portrayed." This is a world where fans care greatly about such things as the noises doors make. Will the "rules" apply? And will the game be the "just like being there" experience fans are hoping for? Let's explore.

Figure 1. The USS Aegis warping through space in one of the dazzling exterior scenes from Star Trek*: Bridge Crew.

Realities of Virtual Reality

Intel's guidelines for making VR games, in which players not only feel safe, but also want to continue playing, fall into three main categories:

  • Physical foundation
  • Basic realism
  • Beyond novelty

Safety First

The physical foundation is built on safety, comfort, and occluding the real world. The safety aspect is both physical and social, making sure that players' safety is ensured in the physical world while they interact with the virtual world, and that they are aware of social consequences of actions in the virtual world. Comfort relates to ergonomics, such as well-fitting headsets and easy-to-use controllers, and avoiding the dreaded motion sickness. Occluding the real world is about preventing distraction from the VR experience, which can easily be overlooked.

Cristiano Ferreira, a game-technology engineer at Intel, explained how to avoid one of VR's earliest issues: motion sickness. "The big 'no-no' is suddenly accelerating a player without them causing the movement, unless special conditions are met," he said. One such condition he mentioned is to place the character within an apparently static surrounding vehicle. In Star Trek: Bridge Crew, the crew members keep to their respective stations within the static confines of the deck. There is even an option to inhabit the captain's chair from the original, first-run series.

"You want to feel a sense of grounding in reality, to allow yourself to feel the safety required for complete immersion," Ferreira explained. He offered other examples of unsettling characteristics of an unsafe environment. "If you've ever had a dog running around or a ledge nearby in your VR play area, you know what I'm talking about. [That] can be terrifying."

Basic Realism for New Worlds

The guidelines break basic realism into the following subcategories: graphical integrity, realistic sound, responsive world, and intuitive controls.

Some aspects of Star Trek: Bridge Crew bring up the fascinating notion of the "uncanny valley" referenced in the guidelines. The uncanny valley refers to the theory that when "human replicas approach a convincing level of reality but aren't quite there, they elicit greater feelings of revulsion from observers than mere cartoons." Fortunately, Star Trek: Bridge Crew does not make the mistake of presenting avatars with unsettling hyperrealism. The avatars in the game are obviously avatars, without being creepy or robotic. Ubisoft also maintains graphical integrity by avoiding unsettling bugs, glitches, and other jarring user-experience characteristics that can shatter the illusion. This additional advice from the guidelines is key to maintaining immersion. "Unrealistic conditions in games that are attempting realism quickly bring the player back to reality," Ferreira said.

Figure 2. Players interact mainly with bridge controls, which maintains physical safety.

Intel has found that well-rated VR games typically include high levels of interactivity between the player and the scenery. The ability to engage with and navigate through your surroundings is what contributes to the success of VR games such as Arizona Sunshine* and Lone Echo*. In Star Trek: Bridge Crew, the characters keep to their seats on the bridge, which reduces interactivity, but the game aligns closely to player expectations of how the crew functions, with each actor standing or seated in a specific location engaging in activity related to their rank. In the original series, much of the action on the bridge consisted of punching buttons, moving levers, and addressing commands. The level of interactivity in Star Trek: Bridge Crew is a deliberate, nostalgic nod to the Star Trek universe, and it matches the real and virtual expectations of the Trek-savvy.

While the bridge scenes evoke nostalgia, the space scenes are what truly go where no game has gone before, with beautiful graphics and wonderful sound effects that dazzle players with extraordinary scenes of imagined worlds. "The beauty of VR comes from the ability to experience the unreal with a hint of reality," Ferreira said. "This enables humans to experience things nobody has seen before." Games can break the laws of physics, create visuals not possible in the real world, and enhance those visuals with mesmerizing audio, while still maintaining a basic integrity. Star Trek: Bridge Crew accomplishes this beautifully.

Responsiveness and intuitive controls are the two final keys to basic realism. The intuitive bridge controls are "like an extension of your body," Ferreira enthused, and exhibit the expected responsiveness when navigating through space, firing weapons, or raising shields. "In that sense, you do have interaction with game scenery," he said. One of the keys to the game's success is the intensity of focus, which makes the player feel empowered and active in any scenario.

Figure 3. The engineering console is responsive and intuitive, fitting the guidelines nicely.

Go beyond Novelty by Meeting Expectations

Beyond novelty is the third category of Intel's VR guidelines, and the voyage of Star Trek: Bridge Crew is set on that course. In its VR guidelines, Intel defines going beyond novelty with these bullet points:

  • Smooth onboarding, making that all-important transition from the real to the virtual.
  • Ubiquitous interactivity, placing the expectations of the player center stage.
  • Primacy of VR interactions, emphasizing the uniqueness of this world.

Since most players will be experienced with Star Trek's basic premise and background, onboarding involves simple tasks that establish the game's mission and controls. The differentiated roleplay creates options for the storylines and players' reactions, encouraging players to begin new missions. The environment (and extent of interaction) is as expected, thanks to its adherence to the existing Star Trek paradigm. The controls for each crew member are simple enough to use easily and without frustration, yet intricate enough that they remain engaging and fun. In-depth tutorials are baked into the game, part of that smooth onboarding process that means there is no jarring disconnect as the player transitions into and then moves through the virtual world.

Ubiquitous interactivity is present in the game due to its very nature. Rather than simply blast through endless waves of space invaders, crew members work the controls, interpret orders, plot courses, and develop strategies. The challenges increase, as do the rewards and the responsibilities, but each interaction is logical and true to the genre and the franchise.

The engineering role, for example, involves distributing available power to the shields, engines and phasers, and that demand increases when the ship sees action. A good engineer will be cool, calm, and collected in the face of action, in the model of storied characters such as Lieutenant Commander Montgomery Scott or Lieutenant Commander Geordi La Forge. Players would do well to study their documentation!

The tactical officer is in charge of scanners and weapons, which includes phasers and torpedoes. After scanning an enemy ship, controls enable targeting its engines, shields, or weapons only, just as in the episodes and movies. This targeting works only when the enemy's shields are down. However, with help from a scan by the ship's engineer, a clever tactician can isolate the enemy's shield frequency and disrupt them.

Figure 4. The tactical console can monitor enemy weapons and defenses, and isolate individual systems for attack.

The helmsman plots courses and drives the ship. Controls at their disposal include impulse and warp drives, speed, heading, and vector. If more speed is required than the available power allows, the captain has the option of ordering engineering to reroute power.

Figure 5. Course plotted, the helmsman turns the ship toward the yellow course line.

The captain has access to the same information as the crew but also controls what displays on the main screen, which can repeat his controls for all to see or show a view outside the ship or the destination, with various levels of magnification. The captain's controls can send commands to the corresponding crew members electronically. These commands can also be sent verbally, at the captain's discretion. Whether players channel their inner James Tiberius Kirk or Jean-Luc Picard, the game offers the captain full control of the bridge.

The notion of primacy raised in the guidelines is an important one, and this game is a great illustration of effective primary interactions. As the guidelines state, "software that emphasizes the unique capabilities of VR and puts them to use in the primary interactions will be much more effective in sustaining the user's interest and continuing usage."

Jeremy Bailenson, director of Stanford's Virtual Human Interaction Lab, said developers should ask themselves a basic question as they envision success with primary interactions: Is the experience rare, impossible, too dangerous, or too expensive to do in actual reality? Piloting a starship while wearing that coveted Starfleet insignia checks all the boxes. Fans want to be inside that universe looking out, and their expectations of how it will function and how they will interact within it have been carefully considered.

An Accomplished Mission

As with any solid AAA title, more powerful system specs provide smoother graphics rendering and high frame rates for Star Trek: Bridge Crew—vital to a realistic and immersive VR experience. Part of Ferreira's job at Intel is to analyze games for game developers and help them determine where to make performance improvements. Ferreira says that a common misstep is to attempt to do too much with visual effects that don't add to the story, such as lens flare, motion blur, forced depth-of-field, or chromatic aberration. "It's crucial to be mindful of exactly what's happening on the GPU," he said, in order to preempt performance issues. "Using frame-analysis tools such as Intel® Graphics Performance Analyzers (Intel® GPA) can help you peer under the covers to see exactly what's happening. You'll be surprised at how heavily certain post-effects can bogart your frame budget." You might also find things you can take out.

Star Trek: Bridge Crew is a successful VR game because it adheres to known best practices for a great VR experience, as captured in the Intel VR guidelines. If you are thinking about creating a new VR title, download the guidelines today to give yourself the best possible chance for success.

Resources

Code Sample: An Approach to Parallel Processing with Unity*

File(s):
Download at GitHub*:
License: Intel Sample Source Code License Agreement
Optimized for...
Operating System: Windows® 10 (64 bit)
Hardware: GPU required
Software (Programming Language, tool, IDE, Framework): Microsoft Visual Studio* 2017, Unity* 5.6, C#
Prerequisites: Familiarity with Microsoft Visual Studio, Unity* API, 3D graphics, parallel processing.
Tutorial: An Approach to Parallel Processing with Unity

Introduction

This project demonstrates parallel processing in gaming with Unity* and shows how to perform gaming-related physics using this game engine. In this domain, realism is an important indicator of success. To mimic the actual world, many things need to happen at the same time, which requires parallel processing. Two different applications were created and compared against a single-threaded application running on a single core. This code and the accompanying article (see References below) cover the development of a flocking algorithm, which is then demonstrated as schools of fish in two applications: the first runs on a multi-threaded CPU, and the second performs the physics calculations on the GPU.

  1. Implementation of a flocking algorithm
  2. Coding differences: CPU vs. GPU

Get Started

Implementation of a flocking algorithm

In this example, a flock was defined as a school of fish. For each member, the algorithm needs to account for cohesion, alignment, and separation. Each fish was calculated to "swim" within a school if it was within a certain distance from any other fish in the school. Members of a school do not act as individuals, but only as members of a flock, sharing the same parameters, such as speed and direction.

The complexity of the flocking algorithm is O(n²), where n is the number of fish. To update the movement of a single fish, the algorithm needs to examine every other fish in the environment in order to know whether the fish can 1) remain in a school; 2) leave a school; or 3) join a new school. It is possible that a single fish could "swim" by itself for a time, until it has an opportunity to join a new school. This check is executed for every one of the n fish.

The algorithm can be simplified as:

For each fish (n)
    Look at every other fish (n)
        If this fish is close enough
            Apply rules: Cohesion, Alignment, Separation
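
The loop above can be sketched in C++ as follows. The project's actual code is C#; the struct, function names, and weighting constants here are illustrative only:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Fish { float x, y, vx, vy; };

// One O(n^2) update step: each fish averages the position and velocity of
// neighbors within `radius` (cohesion + alignment) and is pushed away from
// neighbors closer than `minDist` (separation).
void updateFlock(const std::vector<Fish>& read, std::vector<Fish>& write,
                 float radius, float minDist) {
    for (std::size_t i = 0; i < read.size(); ++i) {        // each fish (n)
        float cx = 0, cy = 0, ax = 0, ay = 0, sx = 0, sy = 0;
        int neighbors = 0;
        for (std::size_t j = 0; j < read.size(); ++j) {    // every other fish (n)
            if (i == j) continue;
            float dx = read[j].x - read[i].x;
            float dy = read[j].y - read[i].y;
            float d = std::sqrt(dx * dx + dy * dy);
            if (d < radius) {                              // close enough?
                cx += read[j].x;  cy += read[j].y;         // cohesion
                ax += read[j].vx; ay += read[j].vy;        // alignment
                if (d < minDist) { sx -= dx; sy -= dy; }   // separation
                ++neighbors;
            }
        }
        Fish f = read[i];
        if (neighbors > 0) {
            f.vx += 0.01f * (cx / neighbors - f.x)
                  + 0.05f * (ax / neighbors - f.vx) + 0.1f * sx;
            f.vy += 0.01f * (cy / neighbors - f.y)
                  + 0.05f * (ay / neighbors - f.vy) + 0.1f * sy;
        }
        f.x += f.vx;
        f.y += f.vy;
        write[i] = f;
    }
}
```

A fish with no neighbors in range simply continues along its current velocity, matching the "swim by itself" case described above.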

Data is stored inside two buffers, which represent the state of each fish. The two buffers are used alternately for reading and writing. Two buffers are required to keep the previous state of each fish in memory; this information is then used to calculate the next state of each fish. Before every frame, the current read buffer is read in order to update the scene.

The basic flow through the application is:

  1. Initialize the scene.
  2. For each frame, update the scene
    1. Read the current read buffer
    2. Calculate scene
    3. Render scene
    4. Write to current write buffer
    5. Swap buffers
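
The buffer-swap flow above can be sketched like this (names and the trivial per-element update are illustrative, not the project's actual code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct State { float x, v; };

// Double buffering: the previous frame's states are only read, the next
// frame's states are only written, then the roles swap. No fish ever reads
// a half-updated neighbor within the same frame.
void simulate(std::vector<State>& a, std::vector<State>& b, int frames) {
    std::vector<State>* read = &a;
    std::vector<State>* write = &b;
    for (int f = 0; f < frames; ++f) {
        for (std::size_t i = 0; i < read->size(); ++i) {
            State s = (*read)[i];   // previous state only
            s.x += s.v;             // stand-in for the real flocking update
            (*write)[i] = s;
        }
        std::swap(read, write);     // write buffer becomes next frame's read buffer
    }
}
```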

Coding differences: CPU vs. GPU

The key difference between coding a single-threaded and a multi-threaded application is how the flocking computation is called. Remember, this calculation is called n times for each frame. The single-threaded application uses a regular for-loop, while the multi-threaded application uses the Parallel.For method.
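
Parallel.For is a .NET construct; a comparable chunked parallel loop can be sketched in C++ (illustrative only, not the project's code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Run body(i) for i in [0, n) across hardware threads, each thread taking a
// contiguous chunk -- roughly what Parallel.For does for the per-fish update.
template <typename Body>
void parallelFor(std::size_t n, Body body) {
    unsigned t = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (n + t - 1) / t;
    std::vector<std::thread> workers;
    for (unsigned k = 0; k < t; ++k) {
        std::size_t begin = k * chunk;
        std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();
}
```

Because each fish reads only the previous frame's buffer and writes only its own slot in the write buffer, the per-fish iterations are independent and safe to run concurrently.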

To get the most performance, responsibility for the physics calculation is shifted to the GPU. To do so, a "shader" is executed on the GPU. A shader is typically used to add graphical effects to a scene; for this project, a "compute shader" was used, coded in HLSL (High-Level Shader Language). The compute shader reproduces the behavior of the Calc function (e.g., speed, position, direction), but without the calculations for rotation.

The CPU, using the Parallel.For function, calls the UpdateStates function for each fish to calculate its rotation and create the TRS matrices before drawing each fish. The rotation of the fish is calculated using the Unity Quaternion.Slerp function.

Note, the accompanying article points out some additional points to consider when utilizing the GPU:

  • Random number generation on the GPU
  • Exchanging or sharing data between the CPU and GPU
  • Cases where the CPU outperformed the GPU

References

Jeremy Servoz, Integrated Computing Solutions, Inc., An Approach to Parallel Processing with Unity, https://software.intel.com/en-us/articles/an-approach-to-parallel-processing-with-unity, 2018

Updated Log

Created March 20, 2018

Creating Custom Benchmarks for IMB 2019

This article guides you through creation of new benchmarks and benchmark suites within the Intel® MPI Benchmarks 2019 infrastructure.

A benchmark suite is a logically connected group of benchmarks. For each suite, you can declare command line arguments and share data structures.

Initial Setup

To create a new benchmark suite:

  1. Choose a name for your new benchmark suite and create a new subdirectory in the src directory of the Intel® MPI Benchmarks directory using this name. For example, if the benchmark suite name is new_bench, the source code sub-tree will be the following:
    src/new_bench/
    src/new_bench/new_bench_1.cpp       
    
  2. Create a Makefile. A simple Makefile may look as follows:
    BECHMARK_SUITE_SRC += new_bench/new_bench_1.cpp
    CPPFLAGS += -Inew_bench
In this example, the Makefile rules add the new benchmark source code to the full list of files to build and add the new_bench subdirectory to the search list of included directories via the -I compiler flag.
  3. Save the Makefile with the .mk extension in the benchmark suite subdirectory:
    src/new_bench/Makefile.new_bench.mk

Examples

You can find a benchmark suite example in the example subdirectory of your Intel® MPI Benchmarks distribution.

example_benchmark1.cpp

This file contains the bare minimum required to introduce a new benchmark suite and a new benchmark to the benchmarking infrastructure. Two main entities must be correctly specified: a benchmark suite class and a benchmark class.

Custom benchmark suite class

In this example, the new benchmark suite class is specified by the DECLARE_BENCHMARK_SUITE_STUFF macro, which specializes the BenchmarkSuite<> template class with the BS_GENERIC enum value. Using the macro is recommended for simple cases like this.

Please note that there is a side effect of using the BenchmarkSuite<BS_GENERIC> template: multiple instantiations of this class in different parts of the source code tree cause linker errors. To avoid this, use a unique namespace for all custom benchmark suites, and custom benchmark data structures and functions. In the example, the example_suite1 namespace is used exactly for this purpose.

Custom benchmark class

A new benchmark class must be inherited from the Benchmark base class and must overload at least one virtual function: void run(const scope_item &item). This is the core of any benchmark. Two helper macros, DEFINE_INHERITED and DECLARE_INHERITED, define all the static variables needed for the automatic runtime registration of any benchmark in the source tree.
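
The exact macro definitions and signatures live in the Intel® MPI Benchmarks headers; conceptually, the inherit-and-register pattern looks like this self-contained sketch (all names below are illustrative stand-ins, not the real IMB API):

```cpp
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in: a Benchmark base class with a virtual run(), plus a
// static registry that DECLARE_INHERITED-style macros would populate.
struct Benchmark {
    virtual void run(int scope_item) = 0;   // the core of any benchmark
    virtual ~Benchmark() = default;
};

using Factory = std::function<std::unique_ptr<Benchmark>()>;

// Function-local static avoids static-initialization-order issues.
std::map<std::string, Factory>& registry() {
    static std::map<std::string, Factory> r;
    return r;
}

struct Register {
    Register(const std::string& name, Factory f) { registry()[name] = std::move(f); }
};

// A "new benchmark": inherit from the base class and overload run().
struct Example1 : Benchmark {
    void run(int scope_item) override {
        std::cout << "example1 with length " << scope_item << "\n";
    }
};

// A static object performs the automatic runtime registration at startup,
// which is the job the helper macros hide in the real infrastructure.
static Register reg_example1("example1",
    [] { return std::unique_ptr<Benchmark>(new Example1()); });
```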

You can check at runtime that the example1 benchmark appears in the benchmark list of the Intel® MPI Benchmarks by using the -list option. The option output also shows that it belongs to our example_suite1 suite.

example_benchmark2.cpp

The example_suite1 can be successfully integrated into the Intel® MPI Benchmarks infrastructure, but to actually run the benchmark, you need to define another main entity of the infrastructure for it.

Custom benchmarking scope

The Intel® MPI Benchmarks infrastructure automatically registers new benchmarks and benchmark suites in the source tree, but the infrastructure must also know how many times each benchmark should be run and which parameters it should be passed on each run. This is done by creating an object of the abstract class Scope. A smart pointer to this object is a member of each benchmark object and is supposed to be created by the benchmark's init() virtual function.

The example example_benchmark2.cpp introduces the void init() member function, which initializes the scope member of the base class Benchmark. The VarLenScope class, derived from the abstract base class Scope, creates a benchmarking scope of all message (problem) lengths from the set 2^0, 2^1, …, 2^22. The Intel® MPI Benchmarks infrastructure uses the scope initialized this way to run the benchmark by calling the void run(const scope_item &item) virtual function for each scope item. In this example, each scope item represents a single message length.
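
That set of lengths, 2^0 through 2^22, can be generated with a simple shift loop (an illustrative sketch, not the VarLenScope implementation):

```cpp
#include <vector>

// Build the benchmarking scope: message lengths 1, 2, 4, ..., 2^maxExp.
// The infrastructure would call run() once per length in this list.
std::vector<int> powerOfTwoLengths(int maxExp) {
    std::vector<int> lengths;
    for (int e = 0; e <= maxExp; ++e)
        lengths.push_back(1 << e);
    return lengths;
}
```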

example_benchmark3.cpp

The third example extends example_benchmark2.cpp with a simple, close-to-real-world MPI benchmark that implements the well-known ping-pong pattern. The void init() virtual function adds the allocation of receive and send buffers. The void finalize() virtual function implements the summary results output. The virtual destructor takes care of buffer deallocation.

example_benchmark4.cpp

The fourth example adds command line parameter handling to the previous ping-pong example. There are three command line parameters:

  • -len takes a comma-separated list of message lengths to run the benchmark with
  • -datatype allows you to select the datatype used in MPI messages: MPI_CHAR or MPI_INT
  • -ncycles defines the number of benchmark iterations to execute during each run() call

To set up the descriptions of expected command line arguments, the bool declare_args() function of the BenchmarkSuite<BS_GENERIC> template class is specialized. It uses the args_parser class API to declare the option names to be parsed and the arguments they expect. For example, the following API call:

parser.add<int>("ncycles", 1000);

instructs the command line parser to expect the -ncycles option with an integer argument, the default argument value being 1000. The call:

parser.add_vector<int>("len", "1,2,4,8").
     set_mode(args_parser::option::APPLY_DEFAULTS_ONLY_WHEN_MISSING);

instructs the command line parser to expect the -len option with a comma-separated list of integers as an argument. The number of integers in the list is arbitrary. The default list consists of four integers: 1, 2, 4, and 8, and the nested set_mode() call makes the parser apply the defaults only when the option is missing from the launch command line.
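
Turning such a comma-separated argument into a vector of integers can be sketched as follows (illustrative only; in the real infrastructure this work is done inside args_parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split an argument such as "1,2,4,8" (as passed to -len) into {1, 2, 4, 8}.
std::vector<int> parseLengths(const std::string& arg) {
    std::vector<int> lengths;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ','))   // read up to each comma
        lengths.push_back(std::stoi(item));
    return lengths;
}
```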

In this example, the bool prepare() function is used to handle the options and transfer the data given by the user on the command line to internal data structures with the corresponding parser.get<>() calls. In particular, the vector<int> len variable stores the list of desired message lengths received from the command line parser, MPI_Datatype datatype stores the chosen data type, and int ncycles stores the given number of iterations.

The get_parameter() function specialization implements an interface to pass pointers to data structures from the benchmark suite class to the benchmark class. Any benchmark in this suite may call the get_parameter() function to get a smart pointer to a particular variable. The benchmark suite passes the pointer to the variable via the type-erasure template class any. In this example, both the run() and init() virtual functions of the benchmark class use this interface to get pointers to the len, datatype, and ncycles values. The HANDLE_PARAMETER and GET_PARAMETER macros make passing these pointers more convenient.

Now the benchmark parameters can be controlled at runtime on the command line. When this example is compiled into the benchmark infrastructure, the command line option parser recognizes the -len, -datatype, and -ncycles options, and the help output automatically includes information on them.

example_benchmark5.cpp

This example implements the same functionality as example_benchmark1.cpp, but with minimal usage of the predefined macros and template classes.

Dynamically linked IMSL* Fortran Numerical Library can't work with Intel® Parallel Studio XE 2018 for Windows Update 2

Version : Intel® Parallel Studio XE 2018 for Windows Update 2, Intel® Math Kernel Library (Intel® MKL) 2018 Update 2

Operating System : Windows*

Architecture: Intel 64 Only


Problem Description :

An application built with Intel® Parallel Studio XE 2018 for Windows Update 2 and dynamically linked with the IMSL* Fortran Numerical Library will fail to start with an error message like:


"The procedure entry point mkl_lapack_ao_zgeqrf could not be located in the dynamic link library C:\Program Files(x86)\VNI\imsl\fnl701\Intel64\lib\imslmkl_dll.dll. "


The cause of the error is that the removal of symbols in Intel MKL 2018 Update 2 breaks backward compatibility with binaries dynamically linked against an older version of Intel MKL, such as the IMSL* Fortran Numerical Library.

Resolution Status:

It will be fixed in a future product update. When the fix is available, this article will be updated with the information.

There are three workarounds available to resolve the error:

  1. Link the IMSL Fortran Numerical Library statically
  2. Link the IMSL Fortran Numerical Library without making use of Intel® MKL, which may have some performance impact
  3. Use the DLLs from an older version of Intel MKL, such as Intel MKL 2018 Update 1, by putting their location into the PATH environment variable at runtime