Intel® Trace Analyzer and Collector 2017 Update 3 Readme


The Intel® Trace Analyzer and Collector for Linux* and Windows* is a low-overhead, scalable event-tracing library with graphical analysis that reduces the time it takes an application developer to enable maximum performance of cluster applications.  This package is for users who develop on and build for Intel® 64 architectures on Linux* and Windows*, as well as customers running on Intel® Xeon Phi™.  The package also includes an optional download on macOS* for analysis only.  You must have a valid license to download, install, and use this product.

The Intel® Trace Analyzer and Collector 2017 Update 3 for Linux* and Windows* packages are now ready for download.  The Intel® Trace Analyzer and Collector is only available as part of Intel® Parallel Studio XE Cluster Edition.

New in this release:

  • Various bug fixes for improved stability and usability

Refer to the Intel® Trace Analyzer and Collector Release Notes for more details.

Contents:

  • Intel® Trace Analyzer and Collector 2017 Update 3 for Linux*
    • l_itac_p_2017.3.030.tgz - A file containing the complete product installation for Linux* OS.
    • w_ita_p_2017.3.030.exe - A file containing the Graphical User Interface (GUI) installation for Windows* OS.
    • m_ita_p_2017.3.030.tgz - A file containing the Graphical User Interface (GUI) installation for macOS*.
  • Intel® Trace Analyzer and Collector 2017 Update 3 for Windows*
    • w_itac_p_2017.3.027.exe - A file containing the complete product installation for Windows* OS.
    • m_ita_p_2017.3.030.tgz - A file containing the Graphical User Interface (GUI) installation for macOS*.

Intel® MPI Library 2017 Update 3 Readme


The Intel® MPI Library is a high-performance interconnect-independent multi-fabric library implementation of the industry-standard Message Passing Interface, v3.1 (MPI-3.1) specification.  This package is for MPI users who develop on and build for Intel® 64 architectures on Linux* and Windows*, as well as customers running on the Intel® Xeon Phi™ product family.  You must have a valid license to download, install, and use this product.

The Intel® MPI Library 2017 Update 3 for Linux* and Windows* packages are now ready for download.  The Intel® MPI Library is available as a stand-alone product and as part of the Intel® Parallel Studio XE Cluster Edition.

New in this release:

  • Accelerated Intel® MPI Library startup for faster HPC application performance
  • Updated the default fabric list on systems with Intel® Omni-Path Architecture (Linux* only)
  • Performance tuning for latest Intel® Xeon® processors
  • Various bug fixes for improved stability and usability

Refer to the Intel® MPI Library Release Notes for more details.

Contents:

  • Intel® MPI Library 2017 Update 3 for Linux*
    • l_mpi_2017.3.196.tgz - A file containing the complete product installation for Linux* OS.
    • l_mpi-rt_2017.3.196.tgz - A file containing the free runtime environment installation for Linux* OS.
  • Intel® MPI Library 2017 Update 3 for Windows*
    • w_mpi_p_2017.3.210.exe - A file containing the complete product installation for Windows* OS.
    • w_mpi-rt_p_2017.3.210.exe - A file containing the free runtime environment installation for Windows* OS.

Known issue collecting FLOPS data with Intel® Advisor 2017 Update 3


 

Problem:

FLOPS and all related data, including Roofline data, are completely missing if a survey is collected with the --no-auto-finalize option.

Affected customers:

This mostly affects Intel® Xeon Phi™ processor (codename: Knights Landing) customers, because we recommend that they perform remote finalization to avoid significant overhead.

Steps to reproduce:

  1. Collect a survey with the --no-auto-finalize option

  2. Collect FLOPS data

  3. Finalize/open results and discover that FLOPS and all related data are missing (including Roofline)

Root cause:

There is an issue with the filtering of FLOPS data. The collector always looks for callstack information, even if this mode is disabled. However, this information is only available after survey finalization. Therefore, no data is collected if survey finalization was skipped.

Workarounds:

  1. Perform survey finalization before collecting FLOPS data. This can be done either remotely or locally on the Intel® Xeon Phi™ (the latter is not recommended because of the overhead).

  2. Same as above, but you can reuse <advisor_project_dir>\e000\callstacks.def for further collections of the same application (provided there are no changes in the application modules/callstacks; otherwise some data may be missing).

TensorFlow* Optimizations on Modern Intel® Architecture


Intel: Elmoustapha Ould-Ahmed-Vall, Mahmoud Abuzaina, Md Faijul Amin, Jayaram Bobba, Roman S Dubtsov, Evarist M Fomenko, Mukesh Gangadhar, Niranjan Hasabnis, Jing Huang, Deepthi Karkada, Young Jin Kim, Srihari Makineni, Dmitri Mishura, Karthik Raman, AG Ramesh, Vivek V Rane, Michael Riera, Dmitry Sergeev, Vamsi Sripathi, Bhavani Subramanian, Lakshay Tokas, Antonio C Valles

Google: Andy Davis, Toby Boyd, Megan Kacholia, Rasmus Larsen, Rajat Monga, Thiru Palanisamy, Vijay Vasudevan, Yao Zhang

TensorFlow* is a leading deep learning and machine learning framework, which makes it important for Intel and Google to ensure that it is able to extract maximum performance from Intel’s hardware offering. This paper introduces the Artificial Intelligence (AI) community to TensorFlow optimizations on Intel® Xeon® and Intel® Xeon Phi™ processor-based platforms. These optimizations are the fruit of a close collaboration between Intel and Google engineers announced last year by Intel’s Diane Bryant and Google’s Diane Green at the first Intel AI Day.

We describe the various performance challenges that we encountered during this optimization exercise and the solutions adopted. We also report performance improvements on a sample of common neural network models. These optimizations can result in orders of magnitude higher performance. For example, our measurements show up to 70x higher performance for training and up to 85x higher performance for inference on the Intel® Xeon Phi™ processor 7250 (KNL). While these optimizations were measured on Intel® Xeon® processor E5 v4 (BDW) and Intel Xeon Phi processor 7250 based platforms, they lay the foundation for next generation products from Intel. In particular, users are expected to see improved performance on Intel Xeon (code named Skylake) and Intel Xeon Phi (code named Knights Mill) coming out later this year.

Optimizing the performance of deep learning models on modern CPUs presents a number of challenges not very different from those seen when optimizing other performance-sensitive applications in High Performance Computing (HPC):

  1. Code refactoring needed to take advantage of modern vector instructions. This means ensuring that all the key primitives, such as convolution, matrix multiplication, and batch normalization are vectorized to the latest SIMD instructions (AVX2 for Intel Xeon processors and AVX512 for Intel Xeon Phi processors).
  2. Maximum performance requires paying special attention to using all the available cores efficiently. Again this means looking at parallelization within a given layer or operation as well as parallelization across layers.
  3. As much as possible, data has to be available when the execution units need it. This means balanced use of prefetching, cache blocking techniques and data formats that promote spatial and temporal locality.

To meet these requirements, Intel developed a number of optimized deep learning primitives that can be used inside the different deep learning frameworks to ensure that we implement common building blocks efficiently. In addition to matrix multiplication and convolution, these building blocks include:

  • Direct batched convolution
  • Inner product
  • Pooling: maximum, minimum, average
  • Normalization: local response normalization across channels (LRN), batch normalization
  • Activation: rectified linear unit (ReLU)
  • Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.

Refer to this article for more details on these Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimized primitives.

In TensorFlow, we implemented Intel optimized versions of operations to make sure that these operations can leverage Intel MKL-DNN primitives wherever possible. While this is a necessary step to enable scalable performance on Intel® architecture, we also had to implement a number of other optimizations. In particular, Intel MKL uses a different layout than the default layout in TensorFlow for performance reasons. We needed to ensure that the overhead of conversion between the two formats is kept to a minimum. We also wanted to ensure that data scientists and other TensorFlow users don’t have to change their existing neural network models to take advantage of these optimizations.

Graph Optimizations

We introduced a number of graph optimization passes to:

  1. Replace default TensorFlow operations with Intel optimized versions when running on CPU. This ensures that users can run their existing Python programs and realize the performance gains without changes to their neural network model.
  2. Eliminate unnecessary and costly data layout conversions.
  3. Fuse multiple operations together to enable efficient cache reuse on CPU.
  4. Handle intermediate states that allow for faster backpropagation.

These graph optimizations enable greater performance without introducing any additional burden on TensorFlow programmers. Data layout optimization is a key performance optimization. Oftentimes, the native TensorFlow data format is not the most efficient data layout for certain tensor operations on CPUs. In such cases, we insert a data layout conversion operation from TensorFlow’s native format to an internal format, perform the operation on the CPU, and convert the operation’s output back to the TensorFlow format. However, these conversions introduce a performance overhead and should be minimized. Our data layout optimization identifies sub-graphs that can be entirely executed using Intel MKL optimized operations and eliminates the conversions within the operations in the sub-graph. Automatically inserted conversion nodes take care of data layout conversions at the boundaries of the sub-graph. Another key optimization is the fusion pass that automatically fuses operations that can be run efficiently as a single Intel MKL operation.
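
To make the layout optimization concrete, here is a toy Python sketch. It is purely illustrative and is not TensorFlow's implementation: the op names, the "Convert" node names, and the simplification to a linear graph are assumptions. It shows why converting only at the boundaries of an MKL-capable sub-graph is cheaper than converting around every operation.

# Conceptual sketch only: place layout-conversion nodes at the boundaries of
# maximal runs of MKL-capable operations instead of around each operation.
MKL_OPS = {"Conv2D", "Relu", "MaxPool"}   # assumed set of ops with MKL-optimized kernels

def insert_layout_conversions(ops):
    """ops is an ordered list of op names in a linear graph; returns the list
    with conversion nodes only at the boundaries of MKL-capable runs."""
    out, in_mkl_layout = [], False
    for op in ops:
        if op in MKL_OPS and not in_mkl_layout:
            out.append("ConvertToMklLayout")
            in_mkl_layout = True
        elif op not in MKL_OPS and in_mkl_layout:
            out.append("ConvertToTfLayout")
            in_mkl_layout = False
        out.append(op)
    if in_mkl_layout:
        out.append("ConvertToTfLayout")
    return out

# A Conv2D -> Relu -> MaxPool chain needs one conversion in and one out,
# rather than a pair of conversions around every operation.
print(insert_layout_conversions(["Conv2D", "Relu", "MaxPool", "Reshape"]))
# ['ConvertToMklLayout', 'Conv2D', 'Relu', 'MaxPool', 'ConvertToTfLayout', 'Reshape']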

Other Optimizations

We have also tweaked a number of TensorFlow framework components to enable the highest CPU performance for various deep learning models. We developed a custom pool allocator using the existing pool allocator in TensorFlow. Our custom pool allocator ensures that both TensorFlow and Intel MKL share the same memory pools (using the Intel MKL imalloc functionality) and that we don’t return memory prematurely to the operating system, thus avoiding costly page misses and page clears. In addition, we carefully tuned multiple threading libraries (pthreads used by TensorFlow and OpenMP used by Intel MKL) to coexist and not compete against each other for CPU resources.

Performance Experiments

Optimizations such as the ones discussed above resulted in dramatic performance improvements on both Intel Xeon and Intel Xeon Phi platforms. To illustrate the performance gains, we report below our best known methods (BKMs) together with baseline and optimized performance numbers for three common ConvNet benchmarks.

  1. The following parameters are important for performance on Intel Xeon (codename Broadwell) and Intel Xeon Phi (codename Knights Landing) processors, and we recommend tuning them for your specific neural network model and platform. We have carefully tuned these parameters to gain maximum performance for convnet-benchmarks on both Intel Xeon and Intel Xeon Phi processors; an illustrative code sketch follows this list.
    1. Data format: we suggest that users specify the NCHW format for their specific neural network model to get maximum performance. TensorFlow's default NHWC format is not the most efficient data layout for the CPU and results in some additional conversion overhead.
    2. Inter-op / intra-op: we also suggest that data scientists and users experiment with the intra-op and inter-op parameters in TensorFlow to find the optimal setting for each model and CPU platform. These settings impact parallelism within one layer as well as across layers.
    3. Batch size: batch size is another important parameter that impacts both the available parallelism to utilize all the cores and the working set size and memory performance in general.
    4. OMP_NUM_THREADS: maximum performance requires using all the available cores efficiently. This setting is especially important for performance on Intel Xeon Phi processors since it controls the level of hyperthreading (1 to 4).
    5. Transpose in matrix multiplication: for some matrix sizes, transposing the second input matrix b provides better performance (better cache reuse) in the MatMul layer. This is the case for all the MatMul operations used in the three models below. Users should experiment with this setting for other matrix sizes.
    6. KMP_BLOCKTIME: users should experiment with various settings for how much time, in milliseconds, each thread should wait after completing the execution of a parallel region.
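
As an illustration only (the thread counts, block time, and tensor shapes below are assumptions to be tuned per model and platform, not values taken from the benchmarks in this article), these parameters might be applied from Python with the TensorFlow 1.x API that was current at the time of writing:

import os

# Placeholder values; tune per model and platform as described above.
os.environ["OMP_NUM_THREADS"] = "68"   # e.g., one thread per physical core on Knights Landing
os.environ["KMP_BLOCKTIME"] = "30"     # milliseconds a thread waits after finishing a parallel region

import tensorflow as tf

# Inter-op / intra-op parallelism settings (illustrative values)
config = tf.ConfigProto(inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=68)

# NCHW (channels-first) layout for a convolution layer
images = tf.placeholder(tf.float32, shape=[None, 3, 224, 224])
conv = tf.layers.conv2d(images, filters=64, kernel_size=3,
                        data_format="channels_first")

# Transposing the second operand of a matrix multiplication, which helps
# cache reuse for some matrix sizes
a = tf.placeholder(tf.float32, shape=[128, 1024])
b = tf.placeholder(tf.float32, shape=[1024, 1024])
prod = tf.matmul(a, b, transpose_b=True)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())

The environment variables are set before importing TensorFlow so that the OpenMP runtime used by Intel MKL sees them when it initializes.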

Example settings on Intel® Xeon® processor (codename Broadwell - 2 Sockets - 22 Cores)

Example settings on Intel® Xeon Phi™ processor (codename Knights Landing - 68 Cores)

  1. Performance results on Intel® Xeon® processor (codename Broadwell – 2 Sockets – 22 Cores)

  2. Performance results on Intel® Xeon Phi™ processor (codename Knights Landing – 68 cores)

  3. Performance results with different batch sizes on Intel® Xeon® processor (codename Broadwell) and Intel® Xeon Phi™ processor (codename Knights Landing) - Training

Building and Installing TensorFlow with CPU Optimizations

  1. Run "./configure" from the TensorFlow source directory, and it will download latest Intel MKL for machine learning automatically in tensorflow/third_party/mkl/mklml if you select the options to use Intel MKL.
  2. Execute the following commands to create a pip package that can be used to install the optimized TensorFlow build.
    • PATH can be changed to point to a specific version of GCC compiler:
      export PATH=/PATH/gcc/bin:$PATH
    • LD_LIBRARY_PATH can also be changed to point to the new GLIBC:
      export LD_LIBRARY_PATH=/PATH/gcc/lib64:$LD_LIBRARY_PATH
    • Build for best performance on Intel Xeon and Intel Xeon Phi processors:
      bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
  3. Install the optimized TensorFlow wheel
    1. bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
      pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl

System Configuration

What It Means for AI

Optimizing TensorFlow means deep learning applications built using this widely available and widely applied framework can now run much faster on Intel processors to increase flexibility, accessibility, and scale. The Intel Xeon Phi processor, for example, is designed to scale out in a near-linear fashion across cores and nodes to dramatically reduce the time to train machine learning models. And TensorFlow can now scale with future performance advancements as we continue enhancing the performance of Intel processors to handle even bigger and more challenging AI workloads.

The collaboration between Intel and Google to optimize TensorFlow is part of ongoing efforts to make AI more accessible to developers and data scientists, and to enable AI applications to run wherever they’re needed on any kind of device, from the edge to the cloud. Intel believes this is the key to creating the next generation of AI algorithms and models to solve the most pressing problems in business, science, engineering, medicine, and society.

This collaboration has already resulted in dramatic performance improvements on leading Intel Xeon and Intel Xeon Phi processor-based platforms. These improvements are now readily available through Google’s TensorFlow GitHub repository. We are asking the AI community to give these optimizations a try and are looking forward to feedback and contributions that build on them.

Use Intel SGX Templates for the GNU* Autoconf* Build System


GNU* Autoconf* is a popular build system that sees extensive use for Linux* source code packages. It produces a consistent, easy-to-use, and well-understood configuration script that allows end users and systems integrators to tailor software packages for their installation environments, almost always without any manual intervention. To create a configure script, the software developer creates a template file consisting of a series of macros that define the software package configuration needs, and then processes it with the Autoconf utility. GNU Autoconf provides convenient automation and standardization for common, and often tedious, tasks such as building Makefiles and configurable header files.

One of the key features of the Autoconf system is that it is extensible. Software developers can create macros that expand its functionality in order to support customized build and configuration needs. In this article, we introduce a set of macros and Makefile templates that do exactly this: Extend the functionality of Autoconf to simplify the process of building software that makes use of Intel® Software Guard Extensions (Intel® SGX). The templates themselves, along with a sample application source tree that makes use of them, are provided as a download.

Overview

The Intel SGX templates for the GNU Autoconf package contain four files:

  • README
  • aclocal.m4
  • sgx-app.mk.in
  • sgx-enclave.mk.in

README

The README file has detailed information on the Autoconf macros and Makefile rules and variables that make up the templates. It is a reference document, while this article functions more as a “how to” guide.

aclocal.m4

This is where the macros for extending Autoconf are defined. This file can be used as-is, appended to an existing aclocal.m4, or renamed for integration with GNU Automake*.

sgx-app.mk.in

This file builds to “sgx-app.mk” and contains Makefile rules and definitions for building Intel SGX applications. It is intended to be included (via an “include” directive) from the Makefile(s) that produce an executable object that includes one or more Intel SGX enclaves.

sgx-enclave.mk.in

This file builds to “sgx-enclave.mk” and contains Makefile rules and definitions for building Intel SGX enclaves. It must be included (via an “include” directive) from Makefiles that produce an Intel SGX enclave object (*.signed.so file in Linux).

Because this file contains build targets, you should place the include directive after the default build target in the enclave’s Makefile.in.

Creating configure.ac

Start by including the macro SGX_INIT in your configure.ac. This macro is required in order to set up the build system for Intel SGX, and it does the following:

  • Adds several options to the final configure script that let the user control aspects of the build.
  • Attempts to discover the location of the Intel SGX SDK.
  • Creates sgx_app.mk from sgx_app.mk.in.

SGX_INIT also defines a number of Makefile substitution variables. The ones most likely to be needed by external Makefiles are:

  • enclave_libdir - Installation path for enclave libraries/objects. Defaults to $EPREFIX/lib.
  • SGX_URTS_LIB - The untrusted runtime library name. When the project is built in simulation mode it automatically includes the _sim suffix.
  • SGX_UAE_SERVICE_LIB - The untrusted AE service library name. When the project is built in simulation mode it automatically includes the _sim suffix.
  • SGXSDK - The location of the Intel® SGX SDK.
  • SGXSDK_BINDIR - The directory containing Intel SGX SDK utilities.
  • SGXSDK_INCDIR - The location of Intel SGX SDK header files.
  • SGXSDK_LIBDIR - The directory containing the Intel SGX SDK libraries needed during linking.

The SGX_INIT macro does not take any arguments.

AC_INIT(sgxautosample, 1.0, john.p.mechalas@intel.com)

AC_PROG_CC()
AC_PROG_CXX()
AC_PROG_INSTALL()

AC_CONFIG_HEADERS([config.h])

SGX_INIT()

AC_CONFIG_FILES([Makefile])

AC_OUTPUT()

Next, define the enclaves. Each enclave is expected to have a unique name, and should be located in a subdirectory that is named after it. Specify the enclaves using the SGX_ADD_ENCLAVES macro. It takes one or two arguments:

  1. (required) The list of enclave names.
  2. (optional) The parent directory where the enclave subdirectories can be found. This defaults to “.”, the current working directory, if omitted.

Note that you can invoke this macro multiple times if your project has multiple enclaves and they do not share a common parent directory. Enclave names should not include spaces or slashes.

AC_INIT(sgxautosample, 1.0, john.p.mechalas@intel.com)

AC_PROG_CC()
AC_PROG_CXX()
AC_PROG_INSTALL()

AC_CONFIG_HEADERS([config.h])

SGX_INIT()

# Add enclave named “EnclaveHash” in the EnclaveHash/ directory
SGX_ADD_ENCLAVES([EnclaveHash])

AC_CONFIG_FILES([Makefile])

AC_OUTPUT()

In addition to defining the enclaves, this macro does the following:

  • Builds sgx_enclave.mk from sgx_enclave.mk.in.
  • Builds the Makefiles in each enclave subdirectory from their respective Makefile.in sources.

Enclave Makefiles

Each enclave’s Makefile needs to include the global sgx_enclave.mk rules file in order to inherit the rules, targets, and variables that automate enclave builds. Each Enclave must abide by the following rules:

  • The enclave must be in its own subdirectory.
  • The name of the subdirectory must match the name of the enclave (for example, an enclave named EnclaveCrypto must be placed in a subdirectory named EnclaveCrypto).
  • The EDL file for the enclave must also match the enclave name (for example, EnclaveCrypto.edl).
  • The Makefile must define the name of the enclave in a variable named ENCLAVE (for example, ENCLAVE=EnclaveCrypto).

The sgx_enclave.mk file defines a number of variables for you to use in the enclave’s Makefile:

  • ENCLAVE_CLEAN - A list of files that should be removed during 'make clean'.
  • ENCLAVE_CPPFLAGS - C preprocessor flags.
  • ENCLAVE_CXXFLAGS - C++ compiler flags necessary for building an enclave.
  • ENCLAVE_DISTCLEAN - A list of files that should be removed during 'make distclean'.
  • ENCLAVE_LDFLAGS - Linker flags for generating the enclave .so.
  • ENCLAVE_TOBJ - The trusted object file $(ENCLAVE)_t.o that is auto-generated by the sgx_edger8r tool. Include this in your enclave link line and the enclave build dependencies.

Here’s the Makefile.in for the enclave in the sample application included with the templates:

CC=@CC@
CFLAGS=@CFLAGS@
CPPFLAGS=@CPPFLAGS@
LDFLAGS=@LDFLAGS@

INSTALL=@INSTALL@
prefix=@prefix@
exec_prefix=@exec_prefix@
bindir=@bindir@
libdir=@libdir@
enclave_libdir=@enclave_libdir@

ENCLAVE=EnclaveHash

OBJS=$(ENCLAVE).o

%.o: %.c
        $(CC) $(CPPFLAGS) $(ENCLAVE_CPPFLAGS) $(CFLAGS) $(ENCLAVE_CFLAGS) -c $<

all: $(ENCLAVE).so

install: all
        $(INSTALL) -d $(enclave_libdir)
        $(INSTALL) -t $(enclave_libdir) $(ENCLAVE_SIGNED)

include ../sgx_enclave.mk

$(ENCLAVE).so: $(ENCLAVE_TOBJ) $(OBJS)
        $(CC) $(CFLAGS) -o $@ $(ENCLAVE_TOBJ) $(OBJS) $(LDFLAGS) $(ENCLAVE_LDFLAGS)

clean:
        rm -f $(OBJS) $(ENCLAVE_CLEAN)

distclean: clean
        rm -f Makefile $(ENCLAVE_DISTCLEAN)

Application Makefiles

Application components that reference enclaves need to include sgx_app.mk in their Makefile. It defines a number of rules, targets, and variables to assist with the build.

To get a list of all the enclaves in the project, the Makefile must define a list variable from the @SGX_ENCLAVES@ substitution variable that is set by Autoconf:

SGX_ENCLAVES:=@SGX_ENCLAVES@

This should be included as a build target as well, to ensure that all enclaves are built along with the application.

all: enclavetest $(SGX_ENCLAVES)

The variables most likely to be needed by the application’s Makefile are:

  • ENCLAVE_CLEAN - A list of files that should be removed during 'make clean'.
  • ENCLAVE_UOBJS - The untrusted object files $(ENCLAVE)_u.o that are auto-generated by the sgx_edger8r tool. Include these in your application link line and the enclave build dependencies.
  • ENCLAVE_UDEPS - The untrusted source and header files that are auto-generated by the sgx_edger8r tool. Include these in your compilation dependencies when building your application.

Here’s the Makefile for the sample application that is bundled with the templates:

SGX_ENCLAVES:=@SGX_ENCLAVES@

CC=@CC@
CFLAGS=@CFLAGS@ -fno-builtin-memset
CPPFLAGS=@CPPFLAGS@
LDFLAGS=@LDFLAGS@ -L$(SGXSDK_LIBDIR)
LIBS=@LIBS@

INSTALL=@INSTALL@
prefix=@prefix@
exec_prefix=@exec_prefix@
bindir=@bindir@
libdir=@libdir@
enclave_libdir=@enclave_libdir@

APP_OBJS=main.o

%.o: %.c
        $(CC) -c $(CPPFLAGS) $(CFLAGS) -I$(SGXSDK_INCDIR) $<

all: enclavetest $(SGX_ENCLAVES)

install: install-program install-enclaves

install-program: all
        $(INSTALL) -d $(bindir)
        $(INSTALL) -t $(bindir) enclavetest

install-enclaves:
        for dir in $(SGX_ENCLAVES); do \
                $(MAKE) -C $$dir install; \
        done

include sgx_app.mk

enclavetest: $(ENCLAVE_UOBJS) $(APP_OBJS)
        $(CC) -o $@ $(LDFLAGS) $(APP_OBJS) $(ENCLAVE_UOBJS) $(LIBS) -l$(SGX_URTS_LIB)

clean: clean_enclaves
        rm -f enclavetest $(APP_OBJS) $(ENCLAVE_CLEAN)

distclean: clean distclean_enclaves
        rm -rf Makefile config.log config.status config.h autom4te.cache
        rm -rf sgx_app.mk sgx_enclave.mk

Note that the link line for the application references the sgx_urts library via the Makefile variable $(SGX_URTS_LIB). This is to support builds made in simulation mode: The variable will automatically append the _sim suffix to the library names so that the Makefile doesn’t have to define multiple build targets. Always use the variables $(SGX_URTS_LIB) and $(SGX_UAE_SERVICE_LIB) in your Makefile instead of the actual library names.

Running the Configure Script

When the configure.ac file is processed by Autoconf, the resulting configure script will have some additional command-line options. These are added by the SGX_INIT macro:

--enable-sgx-simulation

Build the project in simulation mode. This is for running and testing Intel SGX applications on hardware that does not support Intel SGX instructions.

--with-enclave-libdir-path=path

Specify where enclave libraries should be installed, and set the enclave_libdir substitution variable in Makefiles. The default is $EPREFIX/lib.

--with-sgx-build=debug|prerelease|release

Specify whether to build the Intel SGX application in debug, prerelease, or release mode. The default is to build in debug mode.

See the Intel SGX SDK for information on the various build modes. Note that you cannot mix release or prerelease modes with the --enable-sgx-simulation option.

--with-sgxsdk=path

Specify the Intel SGX SDK installation directory. This overrides the auto-detection procedure.

Summary and Future Work

These templates simplify the process of integrating the GNU build system with Intel SGX projects. They eliminate tedious, redundant coding, relieve the developer of the burden of remembering and entering the numerous libraries and compiler and linker flags needed to build Intel SGX enclaves, and automate the execution of supporting tools such as sgx_edger8r and sgx_sign.

While this automation and integration is valuable, there is still a non-trivial amount of effort required to set up the project environment. Further automation might be possible through the use of GNU Automake, which is designed to generate the Makefile templates that are in turn processed by Autoconf.

The build environment for Intel SGX applications can be complicated. Integration with build systems such as GNU Autoconf, and potentially Automake, can save the developer considerable time and make their projects less prone to errors.

Wind River Helix* Device Cloud Application Deployment: POC Retail Vending Machine


Intro

Securely and easily deploying an IoT software solution to multiple gateways across the world can be a challenge. However, for gateways running Wind River Helix* Device Cloud there is a clear path to follow that diminishes the challenge. The Wind River Helix* Device Cloud allows for complete device lifecycle management, from deploying, to monitoring, to updating, to decommissioning. It has telemetry capabilities as well, allowing it to receive and store data in the cloud, as well as act on it using rules and alerts. This article will explore a proof of concept that deploys software to vending machine gateways using the Helix Device Cloud (HDC).

To learn more about the Helix Device Cloud:

https://www.helixdevicecloud.com

Figure 1: High level component diagram with Arduino 101* (branded Genuino 101* outside the U.S.) and Intel® NUC

Set-up

This article assumes that chocolate bar vending machines have been deployed in various locations, that they’re controlled by a gateway with the HDC agent installed, and that they are properly configured. The POC uses the Intel® NUC (NUC5I3MYHE) running Ubuntu* 16.04 as the gateway with HDC 2.2.1 installed and an Arduino 101 on a USB port with Grove* sensors from Seeed* Studio acting as the vending machine sensors. The Arduino 101 has a touch sensor to indicate a purchase of the product; a green LED turns on when a purchase is successful and a red LED turns on when the product is out of stock. A temperature sensor monitors the vending machine’s temperature to see if the chocolate bars are in danger of melting. In addition, it has a motion sensor to count traffic passing by the vending machine, which turns on a blue LED when motion is detected. The software for the vending machine is written in Python* and uses the HDC iot_python module.

For instructions on how to install and configure the HDC Agent on Ubuntu, refer to this guide in the Wind River Knowledge Library:

http://knowledge.windriver.com/en-us/000_Products/040/050/020/000_Wind_River_Helix_Device_Cloud_Getting_Started/060

 

To interface the Arduino 101 board’s sensors with the gateway, MRAA needs to be installed on the gateway:

sudo add-apt-repository ppa:mraa/mraa
sudo apt-get update
sudo apt-get install libmraa1 libmraa-dev mraa-tools python-mraa python3-mraa

Code 1: commands to install MRAA on Ubuntu

The Arduino 101 must also be running the StandardFirmata sketch. That sketch comes with the Arduino IDE under Examples → Firmata → StandardFirmata.

 

Vending Machine Telemetry

The data collected from the vending machine is where the real value comes into play. The gateway application will collect motion, temperature, and inventory data, and send it to the Helix Device Cloud. The application is a Python script, ‘VendingMachine.py’, that will be turned into a service. Then in HDC, a variety of rules and alerts can be set up to handle the values coming in. For example, if inventory runs out, a rule can trigger more inventory to be sent out to the machine.

The Arduino 101’s sensors will supply the data to upload. To interface with it through the USB port, add the line below to the code to tie into MRAA and Firmata*. Firmata allows the board to talk to the gateway, and MRAA handles the I/O pin communications. Note that root access is required to access the USB port by default, so when running the Python script locally, it must be run as ‘sudo python VendingMachine.py’.

# Interface with Arduino 101 board
mraa.addSubplatform(mraa.GENERIC_FIRMATA, "/dev/ttyACM0")

Code 2: line to have MRAA use Firmata

Using Firmata will shift all the pin numbers by 512, so pin A3 for the temperature sensor is really pin 512 + 3. 

Arduino 101 pins:

Temperature sensor: A3

Touch sensor: D3

Motion sensor: D7

Blue motion indicator LED: D2

Red out of stock indicator LED: D5

Green purchase indicator LED: D6

temperature_sensor = mraa.Aio(512 + 3)
touch_sensor = mraa.Gpio(512 + 3)
touch_sensor.dir(mraa.DIR_IN)
motion_sensor = mraa.Gpio(512 + 7)
motion_sensor.dir(mraa.DIR_IN)
blue_motion_led = mraa.Gpio(512 + 2)
blue_motion_led.dir(mraa.DIR_OUT)
red_stock_led = mraa.Gpio(512 + 5)
red_stock_led.dir(mraa.DIR_OUT)
green_stock_led = mraa.Gpio(512 + 6)
green_stock_led.dir(mraa.DIR_OUT)

Code 3: Arduino 101 sensor initialization code

The program’s loop will compile the sensor data, handle items being purchased, and then send that data to HDC every minute.

        count = 0
        while ( running ):
            #motion sensor
            current_motion = motion_sensor.read()
            if (current_motion):
                print "Detecting moving object"
                blue_motion_led.write(1)
                motion += 1
            else:
                blue_motion_led.write(0)
    
            #temperature sensor
            fahrenheit = 0
            raw_Temp = temperature_sensor.read()
            if raw_Temp> 0 :
                resistance = (1023-raw_Temp)*10000.0/raw_Temp
                celsius = 1/(math.log(resistance/10000.0)/B+1/298.15)-273.15
                fahrenheit = (1.8 * celsius) + 32
            if fahrenheit > temperature:
                temperature = fahrenheit 
            #purchase flow
            green_stock_led.write(0)
            customer_purchase = touch_sensor.read()
            if (num_chocobars > 0):
                red_stock_led.write(0)
                if (customer_purchase):
                    print "Customer purchasing item"
                    green_stock_led.write(1)
                    num_chocobars -= 1
            else:
                red_stock_led.write(1)

            #send telemetry every POLL_INTERVAL_SEC (60) seconds
            if (count%POLL_INTERVAL_SEC==0):
                send_telemetry_sample()
            count += 1
            sleep(1)

Code 4: The main loop of the program

To send telemetry to HDC, there are three required steps in the code for each metric: local memory needs to be allocated for it, the metric must be registered with the HDC agent, and the data needs to be sent. Refer to the condensed code below. In the actual program, the code will allocate and initialize all the sensors in the initialize() method and submit the telemetry data in the send_telemetry_sample() method. Refer to the end of the article for the full code. Following HDC’s recommendations for sending telemetry, data is only sent once every minute and only if the value has changed. This will also prevent alerts from being triggered multiple times unnecessarily.

telemetry_motion = None
telemetry_motion = iot_telemetry_allocate( iot_lib_hdl, "motion", IOT_TYPE_INT64 )

iot_telemetry_register( telemetry_motion, None, 0 )

iot_telemetry_publish( telemetry_motion, None, 0, IOT_TYPE_INT64, motion )

Code 5: code needed for each telemetry metric

Registered telemetry items can be seen in the device’s dashboard in the Helix Device Cloud and can be viewed in graph form by expanding each metric.

Figure 2: Helix Device Cloud’s device dashboard

Expanding each telemetry item, the data can be viewed in graph form.

Figure 3: Temperature data graph in Helix Device Cloud

Rules and Alerts

Now that the data is being sent to the cloud, the rules and alerts feature can be leveraged. These will help monitor the vending machine when data is received for conditions that require attention.

The vending machine needs to send out an alert if the temperature gets too high, as the chocolate inside might melt. To create a new rule, go to the ‘Rules’ tab and click ‘CREATE NEW RULE’.

Figure 4: Create a new rule

1) Name the rule and select the device or devices to deploy the rule on. To deploy to a large group of devices at once, say all the vending machine gateways, use the more generic device variables on the left.

Figure 5: select devices for the rule

2) From there select the ‘temp’ telemetry item. Note that the telemetry gathering program must be running at the time of rule creation, otherwise the telemetry choices will be blank.

Figure 6: select telemetry metric to monitor

3) Once selected, set the conditions to greater than or equal to 90, as chocolate melts at 90 degrees Fahrenheit.

Figure 7: set conditions for the telemetry metric

4) Then set up the rule response; in this case it creates a priority one alert that the chocolate is melting.

Figure 8: set up an alert

Now when the temperature gets to 90 degrees, an alert will be created in the ‘Alerts’ tab.

Figure 9: alerts in Helix Device Cloud

The condition also gives the option of sending an email or forwarding the data to a specified MQTT topic. Additionally it could trigger a device action which will be used in the next example alert for low inventory.

While the other rule responses can be completely managed in HDC, a device action requires additional code on the device side. The gateway application code initiates and receives the action sent from HDC. The action in this case will be a simple integer called ‘action_restock’, however HDC can also handle triggering a script or other system command.

1) To begin, allocate and register the action_restock:

#  Allocate action
restock_cmd = iot_action_allocate( iot_lib_hdl, "action_restock" )

 # Restock action
iot_action_parameter_add( restock_cmd,
PARAM_STOCK_NAME, IOT_PARAMETER_IN, IOT_TYPE_INT32, 0 )

Code 6: code for an HDC action

2) Next, add a callback defining what to do when the action is received from HDC. Here it is mimicking restocking the chocolate bars, so the sent number will be added to the current stock value.

def on_action_restock( request ):
    '''Callback function for testing parameters'''
    result = IOT_STATUS_SUCCESS
    status = IOT_STATUS_FAILURE
    global num_chocobars
    chocobarShipment = 0

    IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "on_action_restock invoked\n\n")

    # int
    if ( result == IOT_STATUS_SUCCESS ):
        ( status, chocobarShipment ) = iot_action_parameter_get( request,
                PARAM_STOCK_NAME, False, IOT_TYPE_INT32 )
        if ( status != IOT_STATUS_SUCCESS ):
            result = IOT_STATUS_BAD_PARAMETER
            IOT_LOG( iot_lib_hdl, IOT_LOG_ERROR,
                    "get param failed for {}\n".format( PARAM_STOCK_NAME ) )
        else:
            IOT_LOG( iot_lib_hdl, IOT_LOG_INFO,
                    "{} success, value = {}\n".format(
                    PARAM_STOCK_NAME, chocobarShipment ) )
            num_chocobars = num_chocobars + chocobarShipment

    return result

Code 7: code to handle action received from HDC

3) Next, the action needs to be configured in HDC as well. Follow the above telemetry steps but choose ‘Add a Device Action’ instead. And like telemetry, actions are only available when the program is running on the device.

Figure 10: action conditions and responses

Now the gateway has two active rules applied to it: Restock Machine and ChocolateMeltingPoint.

Figure 11: Rules in HDC

Deployment

With the code complete, it can be turned into a service running continuously on the gateway, which can be deployed using the Helix Device Cloud to all the vending machine gateways. To start, create a new package under the ‘Updates’ tab.

Figure 12: Create an update package

1) Name the package, enter the version number, and select the device compatibility criteria to narrow down the list of devices to deploy to. The files for the program need to be uploaded and attached to the package as well: VendingMachine.py and HDC_VendingMachine.service.

The HDC_VendingMachine.service file should look like the code below. The service should start after iot.service starts, as that is the HDC agent on the gateway and it will start the Python code. Note that the Python file will need to be moved out of the initial download location, as it will be erased by any subsequent package deployments. In addition, the first line of the VendingMachine.py file needs to be ‘#!/usr/bin/python’ for the service to be able to ExecStart it.

[Unit]
Description=HDC POC
After=iot.service
 
[Service]
ExecStart=/home/whitney/Desktop/VendingMachineApp/VendingMachine.py
Restart=always
User=root
TimeoutStartSec=240
TimeoutStopSec=15
KillMode=process
KillSignal=SIGINT
 
[Install]
WantedBy=multi-user.target

Code 8: HDC_VendingMachine.service file

Figure 13: Parameters of the update package

The cloud package can also execute commands at various parts of the install.

2) For pre-install, the directory to store the code needs to be created.

sudo mkdir /usr/bin/VendingMachineApp

Code 9: Pre-install command in HDC

3) During the install, make the Python file executable and move all the files to their final destination. Note that HDC runs commands as user ‘iot’; however, the Python script needs to run as root to have access to USB. The HDC_VendingMachine.service file already sets the user to root. To avoid permission conflicts, the chmod must be done as user iot while the file is still owned by user iot. Then, after the sudo cp takes place, the owner changes to root.

chmod +x /var/lib/iot/update/download/VendingMachine.py
sudo cp /var/lib/iot/update/download/VendingMachine.py /usr/bin/VendingMachineApp/
sudo cp /var/lib/iot/update/download/HDC_VendingMachine.service /lib/systemd/system/

Code 10: Install commands in HDC

4) Post-install commands will enable and start the service.

sudo systemctl enable HDC_VendingMachine.service
sudo systemctl start HDC_VendingMachine.service

Code 11: Post-install commands in HDC

Figure 14: Install commands in HDC

5) Save the package and wait for it to finish. Then it is ready to be deployed by clicking on ‘Deploy’.

Figure 15: Saved update package in HDC

6) The Compatible Device list is pre-populated based on the device conditions specified in the package. Select and add the desired devices for the deployment, then click ‘Deploy’.

Figure 16: Select devices to deploy package to

7) The status will show as ‘In Progress’ and then change to ‘Completed’.

Figure 17: Completed deployment

8) On the gateway, check the status of the service with the command below. Refer to the syslog in ‘/var/log/syslog’ for any errors starting the service and to ‘/var/lib/iot/update/iot_install_updates.log’ for errors with the install itself.

systemctl status HDC_VendingMachine

Code 12: Check HDC_VendingMachine service status

Full Code

#!/usr/bin/python

import os
import sys
import signal
import inspect
import math
import mraa

from time import sleep
sys.path.append( "../lib" )
from iot_python import *

B=3975

# Interface with Arduino 101 board
mraa.addSubplatform(mraa.GENERIC_FIRMATA, "/dev/ttyACM0")

temperature_sensor = mraa.Aio(512 + 3)
touch_sensor = mraa.Gpio(512 + 3)
touch_sensor.dir(mraa.DIR_IN)
motion_sensor = mraa.Gpio(512 + 7)
motion_sensor.dir(mraa.DIR_IN)
blue_motion_led = mraa.Gpio(512 + 2)
blue_motion_led.dir(mraa.DIR_OUT)
red_stock_led = mraa.Gpio(512 + 5)
red_stock_led.dir(mraa.DIR_OUT)
green_stock_led = mraa.Gpio(512 + 6)
green_stock_led.dir(mraa.DIR_OUT)

POLL_INTERVAL_SEC = 60
MAX_LOOP_ITERATIONS = 360
TAG_MAX_LEN = 128

# Set up named parameters for a sample action to validate actions with
# parameters
PARAM_STOCK_NAME = "# Chocobars to Ship"

#  telemetry data
telemetry_motion = None
telemetry_temp = None
telemetry_stock_chocobars = None

previous_numchocobars= 0
previous_motion = 1000
previous_temperature= 0

running = True
iot_lib_hdl = None
restock_cmd = None

def debug_log( log_level, source, msg ):
    '''Debug log wrapper for printing, used for callbacks'''
    i = 0
    prefix = ["FATAL","ALERT","CRITICAL","ERROR","WARNING",
            "NOTICE","INFO","DEBUG","TRACE"]
    # ensure log level is a valid enumeration value
    if ( log_level <= IOT_LOG_TRACE ):
        i = log_level
    print( "{}: {}".format( prefix[i], msg ) )


def IOT_LOG( handle, level, msg ):
    '''Logging function with support for call location'''
    # previous function call
    callerframerecord = inspect.stack()[1]
    # callerframerecord: 1 = file, 3 = function, 2 = line
    iot_log( handle, level, callerframerecord[1], callerframerecord[3],
            callerframerecord[2], msg )


def initialize():
    '''Connects to the agent and registers all actions and telemetry'''
    global telemetry_motion, telemetry_temp, telemetry_stock_chocobars
    global iot_lib_hdl
    global restock_cmd
    result = False
    status = IOT_STATUS_FAILURE

    iot_lib_hdl = iot_initialize( "complete-app-py", None, 0 )
    iot_log_callback_set( iot_lib_hdl, debug_log )
    status = iot_connect( iot_lib_hdl, 0 )
    if ( status == IOT_STATUS_SUCCESS ):
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Connected" )

        # Allocate telemetry items
        telemetry_motion = iot_telemetry_allocate( iot_lib_hdl,
                "motion", IOT_TYPE_INT64 )
        telemetry_temp = iot_telemetry_allocate( iot_lib_hdl,
                "temp", IOT_TYPE_FLOAT64 )
        iot_telemetry_attribute_set( telemetry_temp,
                "udmp:units", IOT_TYPE_STRING, "Fahrenheit" )
        telemetry_stock_chocobars = iot_telemetry_allocate( iot_lib_hdl,
                "stock chocobars", IOT_TYPE_INT64 )

        # Register telemetry items
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Registering telemetry: {}".format(
                "motion" ) )
        iot_telemetry_register( telemetry_motion, None, 0 )
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Registering telemetry: {}".format(
                "temp" ) )
        iot_telemetry_register( telemetry_temp, None, 0 )
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Registering telemetry: {}".format(
                "stock chocobars" ) )
        iot_telemetry_register( telemetry_stock_chocobars, None, 0 )
  

        #  Allocate action
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO,
                "Registering action test_parameters\n" )
        restock_cmd = iot_action_allocate( iot_lib_hdl, "action_restock" )

        # Restock action
        iot_action_parameter_add( restock_cmd,
            PARAM_STOCK_NAME, IOT_PARAMETER_IN, IOT_TYPE_INT32, 0 )

        #validate action registration
        status = iot_action_register_callback(restock_cmd,
                on_action_restock, None, 0 )
        if ( status != IOT_STATUS_SUCCESS ):
            IOT_LOG( iot_lib_hdl, IOT_LOG_ERROR,
                    "Failed to register command. Reason: {}".format(
                    iot_error( status ) ) )
    else:
        IOT_LOG( iot_lib_hdl, IOT_LOG_ERROR, "Failed to connect" )
    if ( status == IOT_STATUS_SUCCESS ):
        result = True
    return result

def on_action_restock( request ):
    '''Callback function for testing parameters'''
    result = IOT_STATUS_SUCCESS
    status = IOT_STATUS_FAILURE
    global num_chocobars
    chocobarShipment = 0

    IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "on_action_restock invoked\n\n")

    # int
    if ( result == IOT_STATUS_SUCCESS ):
        ( status, chocobarShipment ) = iot_action_parameter_get( request,
                PARAM_STOCK_NAME, False, IOT_TYPE_INT32 )
        if ( status != IOT_STATUS_SUCCESS ):
            result = IOT_STATUS_BAD_PARAMETER
            IOT_LOG( iot_lib_hdl, IOT_LOG_ERROR,
                    "get param failed for {}\n".format( PARAM_STOCK_NAME ) )
        else:
            IOT_LOG( iot_lib_hdl, IOT_LOG_INFO,
                    "{} success, value = {}\n".format(
                    PARAM_STOCK_NAME, chocobarShipment ) )
            num_chocobars = num_chocobars + chocobarShipment

    return result


def send_telemetry_sample():
    '''Send telemetry data to the agent'''
    global num_chocobars, motion, temperature
    global previous_numchocobars, previous_motion, previous_temperature

    IOT_LOG( iot_lib_hdl, IOT_LOG_INFO,
        "{}\n".format("+--------------------------------------------------------+"))

    if previous_motion != motion:
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Sending motion  : {}".format(motion) );
        iot_telemetry_publish( telemetry_motion, None, 0, IOT_TYPE_INT64, motion )
        previous_motion = motion
    motion = 0

    if previous_temperature != temperature:
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Sending temp  : {}".format(temperature) );
        iot_telemetry_publish( telemetry_temp, None, 0, IOT_TYPE_FLOAT64, temperature )
        previous_temperature = temperature
    temperature = 0

    if previous_numchocobars != num_chocobars:
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Sending chocobars stock : {}".format(num_chocobars) );
        iot_telemetry_publish( telemetry_stock_chocobars, None, 0, IOT_TYPE_INT64, num_chocobars )
        previous_numchocobars = num_chocobars
 

def sig_handler( signo, frame ):
    '''Handles termination signal and tears down gracefully'''
    global running
    if ( signo == signal.SIGINT ):
        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Received termination signal...\n" )
        running = False

if ( __name__ == '__main__' ):
    global motion, num_chocobars, temperature
    motion = 0
    num_chocobars = 2
    temperature = 0
    if ( initialize() == IOT_TRUE ):
        signal.signal( signal.SIGINT, sig_handler )

        IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Sending telemetry..." )

        count = 0
        while ( running ):
            #motion sensor
            current_motion = motion_sensor.read()
            if (current_motion):
                print "Detecting moving object"
                blue_motion_led.write(1)
                motion += 1
            else:
                blue_motion_led.write(0)
    
            #temperature sensor
            fahrenheit = 0
            raw_Temp = temperature_sensor.read()
            if raw_Temp> 0 :
                resistance = (1023-raw_Temp)*10000.0/raw_Temp
                celsius = 1/(math.log(resistance/10000.0)/B+1/298.15)-273.15
                fahrenheit = (1.8 * celsius) + 32
            if fahrenheit > temperature:
                temperature = fahrenheit 
            #purchase flow
            green_stock_led.write(0)
            customer_purchase = touch_sensor.read()
            if (num_chocobars > 0):
                red_stock_led.write(0)
                if (customer_purchase):
                    print "Customer purchasing item"
                    green_stock_led.write(1)
                    num_chocobars -= 1
            else:
                red_stock_led.write(1)

            #send telemetry every POLL_INTERVAL_SEC (60) seconds
            if (count%POLL_INTERVAL_SEC==0):
                send_telemetry_sample()
            count += 1
            sleep(1)

    #  Terminate
    IOT_LOG( iot_lib_hdl, IOT_LOG_INFO, "Exiting..." )
    iot_terminate( iot_lib_hdl, 0 )
    exit( 0 )

Code 13: VendingMachine.py file

Summary

Our vending machine code has now been successfully deployed using HDC. Temperature and stock data are being monitored with automated rules. Motion data can be referenced as time goes on to monitor foot traffic around the vending machine. Any future updates to the program can be deployed, and overall gateway health monitored, using HDC.

To purchase HDC visit https://www.windriver.com/company/contact/index.html or email sales@windriver.com

References

https://www.windriver.com/products/helix/device-cloud/

http://knowledge.windriver.com/en-us/000_Products/040/050/020/000_Wind_River_Helix_Device_Cloud_Getting_Started/060

 

 

About the author

Whitney Foster is a software engineer at Intel in the Software Solutions Group working on scale enabling projects for Internet of Things.

 

Notices

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, Intel RealSense, Intel Edison, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

© 2017 Intel Corporation.

Intel® Distribution for Python 2017 Update 3 Readme


Intel® Distribution for Python powered by Anaconda gives you ready access to tools and techniques for high performance to supercharge all your Python applications on modern Intel platforms. Whether you are a seasoned high-performance developer or a data scientist looking to speed up your workflows, the Intel Distribution for Python powered by Anaconda delivers an easy-to-install, performance-optimized Python experience to meet even your most demanding requirements.

The Intel® Distribution for Python 2017 Update 3 for Linux*, Windows*, and macOS* packages are now ready for download. The Intel® Distribution for Python is available as a stand-alone product and as part of the Intel® Parallel Studio XE.

New in this release:

  • Updates to several modules for improved stability and performance

Refer to the Intel® Distribution for Python Release Notes for more details.

Contents:

  • Intel® Distribution for Python 2017 Update 3 for Linux*
    • File: l_python2_pu3_2017.3.053.tgz

      A File containing the complete product installation for Python 2.7 on Linux (x86-64bit/Intel® Xeon Phi™ coprocessor development)

    • File: l_python3_pu3_2017.3.052.tgz

      A File containing the complete product installation for Python 3.5 on Linux (x86-64bit/Intel® Xeon Phi™ coprocessor development)

  • Intel® Distribution for Python 2017 Update 3 for Windows*
    • File: w_python27_pu3_2017.3.052.exe

      A File containing the complete product installation for Python 2.7 on Windows (x86-64bit development)

    • File: w_python35_pu3_2017.3.052.exe

      A file containing the complete product installation for Python 3.5 on Windows (x86-64bit development)

  • Intel® Distribution for Python 2017 Update 3 for macOS*
    • File: intelpython27-2017.3.053.tgz

      A File containing the complete product installation for Python 2.7 on macOS (x86-64bit development)

    • File: intelpython35-2017.3.053.tgz

      A file containing the complete product installation for Python 3.5 on macOS (x86-64bit development)

Parallel STL: Parallel Algorithms in Standard Template Library


The C++17 standard enables multi-threading and vectorization for STL algorithms. Intel® C++ Compiler 18.0 Beta and above supports Parallel STL. The beauty of STL is that the data storage (STL containers) is abstracted from the operations performed on the data (STL algorithms) by a concept called STL iterators. Irrespective of which container a developer chooses for their application, most operations, such as traversing the container or sorting it, are expressed the same way. For instance, let’s consider two different STL containers:

std::vector<int> a(N);
std::unordered_map<int, int> b(N);

One STL algorithm which can traverse both of the above containers is:

std::for_each(a.begin(), a.end(), [&](auto &c){ std::cout << c << "\n"; });
std::for_each(b.begin(), b.end(), [&](auto &c){ std::cout << c.first << " " << c.second << "\n"; });

But the above STL algorithm is single threaded, while modern processors are multi-core with SIMD units in each core. For more efficient use of the silicon, the operation done by the STL algorithm needs to be multi-threaded and vectorized. The Parallel STL (PSTL) feature in the C++17 standard provides different execution policies which control how the algorithm runs. The implementation of Parallel STL in the Intel Compiler is under the pstl namespace. The four execution policies which the Intel Compiler implements for PSTL are:

Execution Policy              Description
pstl::execution::seq          Single-threaded, scalar
pstl::execution::unseq        Single-threaded, vectorized
pstl::execution::par          Multi-threaded, scalar
pstl::execution::par_unseq    Multi-threaded, vectorized
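
A minimal usage sketch (an assumed example, not from this article): the same std::transform call is issued with each of the four execution policies, and only the first argument changes.

#include<pstl/execution>
#include<pstl/algorithm>
#include<vector>
int main(){
        std::vector<int> v(1000, 1), out(1000);
        auto square = [](int x){ return x * x; };
        //Same algorithm, four different execution policies
        std::transform(pstl::execution::seq,       v.begin(), v.end(), out.begin(), square); //single threaded, scalar
        std::transform(pstl::execution::unseq,     v.begin(), v.end(), out.begin(), square); //single threaded, vectorized
        std::transform(pstl::execution::par,       v.begin(), v.end(), out.begin(), square); //multi-threaded, scalar
        std::transform(pstl::execution::par_unseq, v.begin(), v.end(), out.begin(), square); //multi-threaded, vectorized
        return 0;
}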

To evaluate the Intel Compiler’s PSTL implementation, please refer to the Getting Started article. Parallel STL is also explained with simple examples in an Intel Parallel Universe Magazine article.

Does traditional STL coexist with the Parallel STL implementation?

Yes, they coexist. The traditional STL implementation is in the std namespace, while the Parallel STL implementation is in the pstl namespace.

Does the Intel Compiler’s PSTL implementation coexist with Microsoft’s or GNU’s PSTL implementation?

Yes, they coexist. Microsoft’s PSTL implementation is under the std::experimental::parallel namespace, and GNU’s PSTL implementation is under the __gnu_parallel namespace.

Which threading model is used for Parallelism?

The Intel Compiler’s PSTL implementation uses the Intel® Threading Building Blocks (Intel® TBB) runtime, GNU’s PSTL implementation uses the OpenMP runtime, and Microsoft’s PSTL implementation uses native threads.

PSTL Threading:

To start with, consider a simple sorting example using std::sort():

#include<iostream>
#include<algorithm>
#include<vector>
#include<chrono>
#include<cstdlib>
#include<ctime>
#define N 99999999
int main(){
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_stop;
        srand(time(NULL));
        std::vector<int> myvec1(N), myvec2(N);
        for(int i = 0; i < N; i++)
              myvec1[i] = myvec2[i] = rand();
        //Sorting the content of the vector using the STL algorithm
        timer_start = std::chrono::system_clock::now();
        std::sort(myvec1.begin(), myvec1.end(), [&](int j, int k){ return(j>k); });
        timer_stop = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_stop - timer_start;
        std::cout<<"Standard STL: Time taken in seconds is "<<elapsed_seconds.count()<<" seconds \n";
        return 0;
}

Enabling multi-threading using Parallel STL:

#include<iostream>
#include<pstl/execution>
#include<pstl/algorithm>
#define TBB_PREVIEW_GLOBAL_CONTROL 1
#include<tbb/global_control.h>
#include<vector>
#include<chrono>
#include<cstdlib>
#include<ctime>
#define N 99999999
int main(){
        tbb::global_control c(tbb::global_control::max_allowed_parallelism, 2);
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_stop;
        srand(time(NULL));
        std::vector<int> myvec1(N), myvec2(N);
        for(int i = 0; i < N; i++)
              myvec1[i] = myvec2[i] = rand();
        //Sorting the content of the vector using the PSTL algorithm with the par policy
        timer_start = std::chrono::system_clock::now();
        std::sort(pstl::execution::par, myvec1.begin(), myvec1.end(), [&](int j, int k){ return(j>k); });
        timer_stop = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_stop - timer_start;
        std::cout<<"Parallel STL: Time taken in seconds is "<<elapsed_seconds.count()<<" seconds \n";
        return 0;
}

Does just enabling multi-threading in STL algorithm work in every case?

Not necessarily. In the above case, the vector is essentially broken down into mutually exclusive chunks that are given to individual threads for sorting, so the chance of two threads accessing the same location is zero. But that need not be the case with every algorithm. For instance, consider the Naïve Bayes supervised classification algorithm, which is based on Bayes' theorem. The example involves training the program with a census dataset from the UCI Machine Learning Repository. The dataset has 14 attributes for each person (such as age, sex, capital gain, etc.) and his/her annual salary tracked as either <=50K or >50K. Each attribute can have multiple values. The data structure used to hold the learnt model is:

std::vector<std::vector<std::unordered_map<std::string, int> > >

The outermost vector is of size two, one for each annual salary category (<=50K and >50K). The inner vector holds the 14 attributes, and for each attribute the unordered_map stores the attribute value and its number of occurrences as a <key, value> pair. The main computation happens in the loop below:

	for_each(dataset.begin(), dataset.end(), [&](std::string &s) {
		size_t start = 0, end = 0, ques = 0, index;
		char line[300];
		for (size_t num_of_cols = 0; num_of_cols < 15; num_of_cols++) {
			end = s.substr(start, s.length()).find(',');
			std::string newString = s.substr(start, end);
			ques = newString.find('?');
			if (ques != std::string::npos)
				continue;
			//Windowing for certain numeric fields with continuous values
			switch (num_of_cols) {
			case 0:
				if (newString.find("<=50K") != std::string::npos)
					index = 0;
				else
					index = 1;
				break;
			case 1:
				newString = std::move(string(itoa((atoi(newString.c_str()) / 10) * 10, line, 10)));
				break;
			case 3:
				newString = std::move(string(itoa((atoi(newString.c_str()) / 10000) * 10000, line, 10)));
				break;
			case 11:
				newString = std::move(string(itoa((atoi(newString.c_str()) / 1000) * 1000, line, 10)));
				break;
			case 12:
				newString = std::move(string(itoa((atoi(newString.c_str()) / 1000) * 1000, line, 10)));
				break;
			case 13:
				newString = std::move(string(itoa((atoi(newString.c_str()) / 10) * 10, line, 10)));
				break;
			default: break;
			}
			std::pair<std::unordered_map<std::string, int>::iterator, bool> p = (iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
			if (!p.second) {
				p.first->second++;
			}
			start = start + end + 1;
		}
		});

To enable parallelism using Parallel STL, try adding the execution policy pstl::execution::par for the above for_each algorithm as shown below:

for_each(pstl::execution::par, dataset.begin(), dataset.end(), [&](std::string &s) {
….
});

When executing this program in multi-threaded mode, it crashes. Debugging the program points straight to the insert() of the STL unordered_map:

 

The insert() of the STL unordered_map is not thread safe, so when two threads try to concurrently insert values into the unordered_map, the program fails. Intel® TBB offers a thread-safe equivalent of the STL unordered_map, tbb::concurrent_unordered_map, which supports the same interfaces the STL container offers. Modify the data structure in the program to replace unordered_map with concurrent_unordered_map as shown below:

From:

std::vector<std::vector<std::unordered_map<std::string, int> > >

To:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, int> > >

With the above modification, multiple threads can concurrently insert into the map, and the program demonstrates a 2x improvement in performance with 2 TBB threads. But checking the output file reveals that, although the program ran successfully and faster, the frequencies of occurrence of the individual attribute values are registered wrongly. This is because incrementing the frequency of an attribute value when the record already exists in the map is not atomic, which results in a data race condition. This can be fixed by changing the data structure as follows:

From:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, int> > >
.
.
std::pair<tbb::concurrent_unordered_map<std::string, int>::iterator, bool> p = (iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
if (!p.second) {
	p.first->second++;
}

To:

std::vector<std::vector<tbb::concurrent_unordered_map<std::string, tbb::atomic<int> > > >
.
.
std::pair<tbb::concurrent_unordered_map<std::string, tbb::atomic<int> >::iterator, bool> p = (iter[index].begin())[num_of_cols].insert(std::pair<std::string, int>(newString, 1));
if (!p.second) {
	p.first->second.fetch_and_increment();
}

By making these changes, the code runs faster with multiple threads without compromising accuracy. The key lesson from this exercise is to watch out for the need for concurrent containers and for potential data race conditions.
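
As a minimal, self-contained sketch of the same idea (hypothetical word-counting data rather than the census example), the snippet below inserts into a tbb::concurrent_unordered_map from a parallel for_each and uses tbb::atomic counters for the increments:

#include<iostream>
#include<string>
#include<vector>
#include<pstl/execution>
#include<pstl/algorithm>
#include<tbb/concurrent_unordered_map.h>
#include<tbb/atomic.h>
int main(){
        std::vector<std::string> words = {"a", "b", "a", "c", "a", "b"};
        tbb::concurrent_unordered_map<std::string, tbb::atomic<int> > freq;
        std::for_each(pstl::execution::par, words.begin(), words.end(),
                      [&](const std::string &w){
                              //Thread-safe insertion; a newly inserted key starts with the count 1
                              auto p = freq.insert(std::pair<std::string, int>(w, 1));
                              if (!p.second)
                                      p.first->second.fetch_and_increment(); //atomic increment for an existing key
                      });
        for (const auto &kv : freq)
                std::cout << kv.first << ": " << kv.second << "\n";
        return 0;
}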

PSTL Vectorization:

Consider the example of searching for an integer in a std::vector:

#include<vector>
#include<algorithm>
#include<iostream>
#include<chrono>
#include<stdlib.h>
#define N1 999999999
#ifdef PSTL
#include"pstl/algorithm"
#include"pstl/execution"
#endif
#ifdef GNU_PSTL
#include"parallel/algorithm"
#include<omp.h>
#endif
using namespace std;
std::vector<long long>::iterator mysearch(long long n1, std::vector<long long> &n2) {
#ifdef PSTL
        return find(pstl::execution::unseq, n2.begin(), n2.end(), n1);
#elif defined(GNU_PSTL)
        return __gnu_parallel::find(n2.begin(), n2.end(), n1);
#else
        return find(n2.begin(), n2.end(), n1);
#endif
}
int main(int argc, char *argv[]){
        long long num_to_search;
        if(argc < 2)
        {
                std::cout<<"Enter the number to searched as command line argument range [0 - 999999999]\n";
                return 0;
        }
        else
                num_to_search = atoi(argv[1]);
        static long long p;
        std::vector<long long> myvec(N1);
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_end;
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                generate(pstl::execution::unseq, myvec.begin(), myvec.end(), [&]() { return p++; });
        #elif defined(GNU_PSTL)
                omp_set_num_threads(1);
                __gnu_parallel::generate(myvec.begin(), myvec.end(), [&]() { return p++; });
        #else
                generate(myvec.begin(), myvec.end(), [&]() { return p++; });
        #endif
        timer_end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by generate algorithm is "<<elapsed_seconds.count()<<"\n";
        timer_start = std::chrono::system_clock::now();
        auto result = mysearch(num_to_search, myvec);
        timer_end = std::chrono::system_clock::now();
        elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by find algorithm is "<<elapsed_seconds.count()<<"\n";
        if(result != myvec.end())
                std::cout<<"Found the element "<<*result<<", p = "<<p<<"\n";
        else
                std::cout<<"Element not found, p = "<<p<<"\n";
        return 0;
}

The Intel Compiler auto-vectorizes this code, and the vectorized code performs better than the GCC 5.1 generated binary. Please download the attached code samples, then evaluate and compare the PSTL implementations provided by different compiler vendors. The PSTL-specific code path and the auto-vectorized code path perform the same in this case. In general, the Intel Compiler has very good vectorization heuristics to identify different code patterns and vectorize them when it is safe to do so. For instance, consider the histogram loop pattern shown below:

#include<vector>
#include<algorithm>
#include<iostream>
#include<chrono>
#include<stdlib.h>
#define N1 9999999
#ifdef PSTL
#include"pstl/algorithm"
#include"pstl/execution"
#endif
#ifdef GNU_PSTL
#include<parallel/algorithm>
#include<omp.h>
#endif
using namespace std;
int main(int argc, char *argv[]){
        std::vector<long long> hist(10);
        fill(hist.begin(), hist.end(), 0);
        std::vector<long long> myvec(N1);
        std::cout<<"---------------------\n";
        std::chrono::time_point<std::chrono::system_clock> timer_start, timer_end;
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                generate(pstl::execution::unseq, myvec.begin(), myvec.end(), std::rand);
        #elif defined(GNU_PSTL)
                omp_set_num_threads(1);
                __gnu_parallel::generate(myvec.begin(), myvec.end(), std::rand);
        #else
                generate(myvec.begin(), myvec.end(), std::rand);
        #endif
        timer_end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by generate algorithm is "<<elapsed_seconds.count()<<"\n";
        timer_start = std::chrono::system_clock::now();
        #ifdef PSTL
                for_each(pstl::execution::unseq, myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #elif defined(GNU_PSTL)
                __gnu_parallel::for_each(myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #else
                for_each(myvec.begin(), myvec.end(), [&](long long &p){ hist[(p%4)]++; });
        #endif
        timer_end = std::chrono::system_clock::now();
        elapsed_seconds = timer_end - timer_start;
        std::cout<<"Time taken by for_each algorithm is "<<elapsed_seconds.count()<<"\n";
        for_each(hist.begin(), hist.end(), [&](const long long &q){ std::cout<<q<<"\n"; });
        return 0;
}

When the pstl::execution::unseq execution policy is used with the Intel Compiler, vectorization is forced using #pragma omp simd (the SIMD pragma from OpenMP 4.0). One important point to remember is that with #pragma omp simd the compiler's vectorization heuristics do not perform the routine data dependency and data flow analysis; they simply follow the user's directive and vectorize. So always exercise caution when using it. For instance, in the above program (p%4) results in the values 0,1,2,3 in the SIMD register when targeting SSE (no duplicate values), but when targeting AVX the SIMD register will hold 0,1,2,3,0,1,2,3. When hist[p%4] is executed in vectorized mode, there is a data race condition. The Intel® AVX-512 instruction set supports conflict detection instructions (vpconflictd/vpconflictq) which look for conflicts (duplicates, if any) in the SIMD register. Please try the above example with the attached build script and see the performance difference between the Intel Compiler generated code and the GNU compiler generated code.
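
As a minimal sketch (an assumed example, not part of the attached samples) of why this needs care, the raw loops below show the same histogram update with and without the forced-vectorization pragma; the comments mark where the lane conflicts arise.

#include<vector>
#include<cstddef>
//Forcing vectorization of a histogram update. With #pragma omp simd the compiler skips
//its usual dependence analysis, so the programmer owns the correctness of hist[data[i]%4]:
//several SIMD lanes may target the same bin within one vector step.
void histogram_forced_simd(const std::vector<long long> &data, std::vector<long long> &hist){
#pragma omp simd
        for (std::size_t i = 0; i < data.size(); ++i)
                hist[data[i] % 4]++;   //potential lane conflict when indices repeat
}
//Leaving the loop scalar is always safe; alternatively, on AVX-512 hardware the compiler
//can use the conflict detection instructions to vectorize such updates safely.
void histogram_scalar(const std::vector<long long> &data, std::vector<long long> &hist){
        for (std::size_t i = 0; i < data.size(); ++i)
                hist[data[i] % 4]++;
}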


BigDL: Bring Deep Learning to the Fingertips of Big Data Users and Data Scientists


Big data and analytics play a central role in today’s smart and connected world, and are continuously driving the convergence of big data, analytics, and machine learning/deep learning. We open sourced BigDL, a distributed deep learning library for Apache Spark*, in December 2016, for the very purpose of uniting the deep learning community and the big data community. The rest of this article provides an overview of recent enhancements available in the BigDL 0.1.0 release (as well as in the upcoming 0.1.1 release).

  • Python* Support
     

    Python* is one of the most widely used languages in the big data and data science community, and BigDL provides full support for Python APIs (using Python 2.7), based on PySpark since its 0.1.0 release; this allows users to use deep learning models in BigDL together with existing Python libraries (for example, NumPy and Pandas), which automatically run in a distributed fashion to process large volumes of data across Hadoop*/Spark clusters. For instance, we can create the LeNet-5 model, a classic convolutional neural network, using the BigDL Python API as follows:

    def build_model(class_num):
        model = Sequential()
        model.add(Reshape([1, 28, 28]))
        model.add(SpatialConvolution(1, 6, 5, 5).set_name('conv1'))
        model.add(Tanh())
        model.add(SpatialMaxPooling(2, 2, 2, 2).set_name('pool1'))
        model.add(Tanh())
        model.add(SpatialConvolution(6, 12, 5, 5).set_name('conv2'))
        model.add(SpatialMaxPooling(2, 2, 2, 2).set_name('pool2'))
        model.add(Reshape([12 * 4 * 4]))
        model.add(Linear(12 * 4 * 4, 100).set_name('fc1'))
        model.add(Tanh())
        model.add(Linear(100, class_num).set_name('score'))
        model.add(LogSoftMax())
        return model

    In addition, we continue to improve Python support in BigDL; the upcoming BigDL 0.1.1 release will add Python 3.5 support, as well as support for users to automatically deploy their customized Python dependencies across YARN* clusters.

  • Notebook Integration
     

    With full Python API support in BigDL, data scientists and analysts can now explore their data using powerful notebooks (such as the Jupyter Notebook) in a distributed fashion across the cluster, combining Python libraries, Spark SQL / DataFrames and MLlib, deep learning models in BigDL, as well as interactive visualization tools. For instance, the Jupyter Notebook tutorial contained in BigDL 0.1.0 demonstrates how we can evaluate the prediction result of a text classification model (using both accuracy and confusion matrix) as follows:

    # assumed imports for this notebook excerpt (not shown in the original snippet):
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sn
    from sklearn.metrics import accuracy_score, confusion_matrix

    predictions = trained_model.predict(val_rdd).collect()
    
    def map_predict_label(l):
        return np.array(l).argmax()
    def map_groundtruth_label(l):
        return l[0] - 1
    
    y_pred = np.array([ map_predict_label(s) for s in predictions])
    
    y_true = np.array([map_groundtruth_label(s.label) for s in val_rdd.collect()])
    acc = accuracy_score(y_true, y_pred)
    print "The prediction accuracy is %.2f%%"%(acc*100)
    
    cm = confusion_matrix(y_true, y_pred)
    cm.shape
    df_cm = pd.DataFrame(cm)
    plt.figure(figsize = (10,8))
    sn.heatmap(df_cm, annot=True,fmt='d');

    Figure 1

  • TensorBoard* Support
     

    TensorBoard* is a suite of web applications for inspecting and understanding deep learning program runs and graphs, and BigDL 0.1.0 provides support for visualizations using TensorBoard (as well as inline plotting libs such as Matplotlib* within the notebook). First, a BigDL program can be configured to generate summary information for training and/or validation, as illustrated below (using Python APIs):

    optimizer = Optimizer(...)
    ...
    log_dir = 'mylogdir'
    app_name = 'myapp'
    train_summary = TrainSummary(log_dir=log_dir, app_name=app_name)
    val_summary = ValidationSummary(log_dir=log_dir, app_name=app_name)
    optimizer.set_train_summary(train_summary)
    optimizer.set_val_summary(val_summary)
    ...
    trainedModel = optimizer.optimize()

    After we start to run the BigDL program, the training and validation summaries are saved under the specified log_dir and app_name; after that, we can use TensorBoard to visualize the behaviors of the BigDL program, including the Loss and Throughput curves under the SCALARS tab (as illustrated below).

    Figure 2

    We can also use TensorBoard to visualize weights, bias, gradientWeights, and gradientBias under the DISTRIBUTIONS and HISTOGRAMS tabs (as illustrated below). 

    Figure 3

    Figure 4

  • Better RNN Support
     

    Recurrent neural networks (RNN) are powerful models for analyzing speech, text, time series, sensor data, and so on. The BigDL 0.1.0 release provides comprehensive support for RNN, including different variants of long short-term memory, such as gated recurrent unit (GRU), LSTM with peephole, and dropout in recurrent neural networks. For instance, we can create a simple LSTM model (using the Python API) as follows:

    model = Sequential()
    model.add(Recurrent()
                 .add(LSTM(embedding_dim, 128)))
    model.add(Select(2, -1))
    model.add(Linear(128, 100))
    model.add(Linear(100, class_num))

We have seen major advancements in deep learning in recent years; while the deep learning community continues to push the technology envelope, BigDL helps make these breakthroughs more accessible and convenient to use for data scientists and data engineers (who are not necessarily experts in deep learning technologies). We continue to work on enhancements in BigDL beyond the 0.1 release (for example, support for reading/writing TensorFlow models, Convolutional Neural Network (CNN) implementations for 3D images, recursive nets, and so on), so that big data users can continue using familiar tools and infrastructure to build their deep learning-powered analytics applications.

Deploying BigDL on Microsoft’s Azure* Data Science Virtual Machine


Automated Installation of BigDL Using Deploy to Azure*

To make it easier to deploy BigDL, we created a “Deploy to Azure” button on top of the Linux* (Ubuntu*) edition of the Data Science Virtual Machine (DSVM). This button encapsulates all the necessary installation steps to create a new Azure* DSVM and installs BigDL after the virtual machine (VM) is provisioned.

Azure Virtual Machines provide a mechanism to automatically run a script during post provisioning when using Azure Resource Manager (ARM) templates. On Github*, we have published the Azure Resource Manager (ARM) template and the script to install BigDL on the DSVM for Linux (Ubuntu) when creating the VM on Azure. 

Deploy to Azure

Clicking the Deploy to Azure button takes the user to the Azure portal wizard, leads them through the VM creation process, and automatically executes the necessary script to install/configure BigDL so that it is ready for use once the VM is successfully provisioned. The user can directly run /opt/BigDL/run_notebooks.sh to start a Jupyter* notebook server to execute the samples.

Note: It may take as long as 10 minutes to fully provision DSVM—perfect time for a coffee break!

Please note: For ease of use, we suggest selecting the password option rather than the SSH option in the DSVM provisioning prompt.

Figure 1

For completeness, we also provide below the manual, step-by-step installation procedure, in case you already have a DSVM (Ubuntu) instance or just want to understand the details of what the automated steps above do.

Manual Installation of BigDL on the DSVM

Provisioning DSVM

Before you start, you need to provision the Microsoft Data Science Virtual Machine for Linux (Ubuntu) by visiting the Azure product detail page and following the directions in the VM creation wizard.

Figure 2

Figure 3

When the DSVM is configured, make a note of its public IP address or DNS name; you will need it to connect to the DSVM via your connection tool of choice. The recommended tool for a text interface is SSH or PuTTY. For the graphical interface, Microsoft* recommends an X client called X2GO*.

Note: You may need to configure your proxy server correctly if your network administrators require all connections to go through your network proxy. The only session type supported by default on DSVM is Xfce*.

Building Intel’s BigDL

Change to root and clone BigDL from GitHub:

sudo -s

     cd /opt

     git clone https://github.com/intel-analytics/BigDL.git

Switch to the released branch-0.1 and build BigDL with Spark* 2.0:

     $ cd BigDL
       $ git checkout branch-0.1
       $ bash make-dist.sh -P spark_2.0

If successful, you should see the following messages:

Figure 4

Examples of DSVM Configuration Steps to Run BigDL

Switch to Python* 2.7.

     $ source /anaconda/bin/activate root

Confirm Python* version.

     $ python --version

Figure 5

Install Python Packages

     $ /anaconda/bin/pip install wordcloud
     $ /anaconda/bin/pip install tensorboard

Creating Script Files to Run Jupyter* Notebook and TensorBoard*

In the directory where you cloned the BigDL library (/opt/BigDL), create a script named run_notebook.sh with the following content:

#begin run_notebook.sh
#!/bin/bash
#setup paths
BigDL_HOME=/opt/BigDL

#this is needed for MSFT DSVM
export PYTHONPATH=${BigDL_HOME}/pyspark/dl:${PYTHONPATH}
#end MSFT DSVM-specific config

#use local mode or cluster mode
#MASTER=spark://xxxx:7077
MASTER="local[4]"
PYTHON_API_ZIP_PATH=${BigDL_HOME}/dist/lib/bigdl-0.1.0-python-api.zip
BigDL_JAR_PATH=${BigDL_HOME}/dist/lib/bigdl-0.1.0-jar-with-dependencies.jar
export PYTHONPATH=${PYTHON_API_ZIP_PATH}:${PYTHONPATH}
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=~/notebooks  --ip=* "

source ${BigDL_HOME}/dist/bin/bigdl.sh

${SPARK_HOME}/bin/pyspark \
    --master ${MASTER} \
    --driver-cores 5  \
    --driver-memory 10g  \
    --total-executor-cores 8  \
    --executor-cores 1  \
    --executor-memory 10g \
    --conf spark.akka.frameSize=64 \
  --properties-file ${BigDL_HOME}/dist/conf/spark-bigdl.conf \
    --py-files ${PYTHON_API_ZIP_PATH} \
    --jars ${BigDL_JAR_PATH} \
    --conf spark.driver.extraClassPath=${BigDL_JAR_PATH} \
    --conf spark.executor.extraClassPath=bigdl-0.1.0--jar-with-dependencies.jar
# end of run_notebook.sh
-----

chmod +x run_notebook.sh

In the same BigDL directory, create start_tensorboard.sh with the following content:

#begin start_tensorboard.sh
PYTHONPATH=/anaconda/lib/python2.7/site-packages:$PYTHONPATH
/anaconda/lib/python2.7/site-packages/tensorboard/tensorboard --logdir=/tmp/bigdl_summaries
#end start_tensorboard.sh

Please note that ‘/anaconda/lib/python2.7/site-packages/’ is installation-dependent and may change in future releases of DSVM. Thus, if these instructions do not work for you out of the box, you may need to update this path.

Figure 6

Note the URL at the end of the log, for example http://10.0.2.4:6006. Open it in the DSVM browser to see the TensorBoard pane.

Launching a Text Classification Example

Execute run_notebook.sh and start_tensorboard.sh via bash commands from different terminals:

       $bash run_notebook.sh
       $bash start_tensorboard.sh

Open two browser tabs, one for text_classification.ipynb and another for TensorBoard.

Navigate to the text_classification example:

http://localhost:YOUR_PORT_NUMBER/notebooks/pyspark/dl/example/tutorial/simple_text_classification/text_classfication.ipynb# (check the location of the sample on your system).

Run the notebook. This will take a few minutes. In the end, you will see a loss graph like this one:

Figure 7

Your TensorBoard may look like this for the Text Classification example.

Figure 8

Automating the Installation of BigDL on DSVM

Azure Virtual Machines provide a mechanism to automatically run a script during post provisioning when using Azure Resource Manager (ARM) templates. On Github, we published the ARM template and the script to install BigDL on the DSVM for Linux (Ubuntu) when creating the VM on Azure.  On the same Github directory there is also a Deploy to Azure button that takes the user to the Azure portal wizard, leads them through the VM creation, and automatically executes the above script to install/configure BigDL so that it is ready for use once the VM is successfully provisioned. The user can directly run /opt/BigDL/run_notebooks.sh to start a Jupyter notebook server to execute the samples.

Conclusion

In this blog post, we demonstrated that in just a few small steps one can take advantage of the Intel BigDL library running on Apache Spark* to execute deep learning jobs on Microsoft’s Data Science Virtual Machine. BigDL continues to evolve and enjoys solid support from the open-source community as well as from Intel’s dedicated software engineering team.

Resources

Appendix

Installing and configuring Spark 1.6 for legacy code implementation:

Installing Spark 1.6.1 (alongside Spark 2.0):

          Install Spark 1.6.1: http://spark.apache.org/downloads.html
          Select 1.6.1 and download the package.
          cd Downloads
          tar -xzf spark-1.6.1-bin-hadoop2.6.tgz

Move the directory from the download location to where Spark is stored on the system.

Figure 9

To switch back to the Python 3.5 environment:

     $source activate py35 (for Python 3.5)

To install Python packages in the Python 3.5 environment:

     $sudo /anaconda/envs/py35/bin/conda install xxxx (for Python 3.5 env)

(Do the same for pip installs.)

Installing BigDL on the Data Science Virtual Machine for Linux (CentOS*):

To run BigDL on the DSVM CentOS* edition, you need to install Maven* on the DSVM before compiling BigDL.

Installing Maven. Note that on CentOS-based Linux, instead of Ubuntu's apt-get, you need to use yum to install new packages:

Figure 10

DSVM’s default JAVA_HOME environment variable points to an empty directory, "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64". You need to change it to an existing one that contains the Java* 8 installation:

   export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64"

Check that Maven is installed correctly:

   $ mvn -v

Figure 11

After this, you should be able to run a build on BigDL following the steps in the main section above. 

Introduction to pyDAAL


This paper shows how the Python* API of the Intel® Data Analytics Acceleration Library (Intel® DAAL) tool works. First, we explain how to manipulate data using the pyDAAL programming interface and then show how to integrate it with Python data manipulation/math APIs. Finally, we demonstrate how to use pyDAAL to implement a simple linear regression solution for a prediction problem.

Data science is a relatively new field that brings together concepts from several other areas, such as data mining, data analysis, data modeling, data prediction, and data visualization. The need to perform such tasks as quickly as possible has become the main issue in today's data solutions. With that in mind, Intel DAAL is a highly optimized library whose goal is to provide a full solution for data analytics targeting today's highly parallel systems, such as Intel® Xeon Phi™ processors.


Intel DAAL delivers solutions for many steps of a data analytics pipeline, such as pre-processing, data transformations, dimensionality reduction, data modeling, and prediction, as well as several drivers for reading and writing most of the common data formats. A summary of all features inside the library can be seen in Figure 1.

Figure 1. Main algorithms delivered by Intel® Data Analytics Acceleration Library

As can be seen in Figure 1, all APIs are compatible with C++, Java*, and Python* (a recent addition available from version 2017 beta). Many of the algorithms implemented inside the tool can be executed in 3 main modes:

  • Batch: in this mode, the processing occurs in a serial way, e.g., the training algorithm is executed in a single node sequentially;
  • Distributed: as the name suggests, in this processing mode, the dataset must be split and distributed among the computing nodes. The algorithm then calculates partial solutions and, at the last step, unifies them; and
  • Online: in this processing mode, the data is treated as a continuous stream. The processing occurs by building incremental models and, at the end, building a full model from the partial models.

More on the processing modes, together with additional details on data management and on how to use pyDAAL to implement a simple linear regression solution for a prediction problem, is covered in this whitepaper.

Source available on GitHub

 

Vector (SIMD) Function ABI


Vector Function Application Binary Interface

adapted from version of November 2015 by

Xinmin Tian, Hideki Saito, Sergey Kozhukhov, Kevin B. Smith,
Robert Geva, Milind Girkar and Serguei V. Preis
Intel® Mobile Computing and Compilers

Please see attachment.

 

Why & When Deep Learning Works: Looking Inside Deep Learning


Ronny Ronen
The Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI)1

In recent years, Deep Learning has emerged as the leading technology for accomplishing a broad range of artificial intelligence tasks (LeCun et al. (2015); Goodfellow et al. (2016)). Deep learning is the state-of-the-art approach across many domains, including object recognition and identification, text understanding and translation, question answering, and more. In addition, it is expected to play a key role in many new usages deemed almost impossible before, such as fully autonomous driving.

While the ability of Deep Learning to solve complex problems has been demonstrated again and again, there is still a lot of mystery as to why it works, what it is really capable of accomplishing, and when it works (and when it does not). Such an understanding is important for both theoreticians and practitioners, in order to know how such methods can be utilized safely and in the best possible manner. An emerging body of work has sought to develop some insights in this direction, but much remains unknown. The general feeling is that Deep Learning is still by and large "black magic": we know it works, but we do not truly understand why. This lack of knowledge disturbs scientists and is a cause for concern for developers: would you let an autonomous car be driven by a system whose mechanisms and weak spots are not fully understood?

The Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) has been heavily supporting Machine Learning and Deep Learning research from its foundation in 2012. We have asked six leading ICRI-CI Deep Learning researchers to address the challenge of “Why & When Deep Learning works”, with the goal of looking inside Deep Learning, providing insights on how deep networks function, and uncovering key observations on their expressiveness, limitations, and potential.

The output of this challenge call was quite impressive, resulting in five papers that address different facets of deep learning. These papers summarize the researchers' ongoing recent work published in leading conferences and journals, as well as new research results made especially for this compilation. These facets include a high-level understanding of why and when deep networks work (and do not work), the impact of geometry on the expressiveness of deep networks, and making deep networks interpretable.

Understanding of why and when deep networks work (and do not work)

  1. Naftali Tishby and Ravid Schwartz-Ziv in Opening the Black Box of Deep Neural Networks via Information study Deep Networks by analyzing their information-theoretic properties, looking at what information on the input and output each layer preserves, and suggest that the network implicitly attempts to optimize the Information-Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. Moreover, they show that the stochastic gradient descent (SGD) epochs used to train such networks have two distinct phases for each layer: fast empirical error minimization, followed by slow representation compression. They then present a new theoretical argument for the computational benefit of the hidden layers.
     
  2. Shai Shalev-Shwartz, Ohad Shamir and Shaked Shamma in Failures of Gradient-Based Deep Learning attempt to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. They describe four families of problems for which some of the commonly used existing algorithms fail or suffer significant difficulty, illustrate the failures through practical experiments, and provide theoretical insights explaining their source and suggest remedies to overcome the failures that lead to performance improvements.
     
  3. Amnon Shashua, Nadav Cohen, Or Sharir, Ronen Tamari, David Yakira and Yoav Levine in Analysis and Design of Convolutional Networks via Hierarchical Tensor Decompositions analyze the expressive properties of deep convolutional networks. Through an equivalence to hierarchical tensor decompositions, they study the expressive efficiency and inductive bias of various architectural features in convolutional networks (depth, width, pooling geometry, inter-connectivity, overlapping operations etc.). Their results shed light on the demonstrated effectiveness of convolutional networks, and in addition, provide new tools for network design.

    The impact of geometry on the expressiveness of deep networks
     
  4. Nathan Srebro, Behnam Neyshabur, Ryota Tomioka and Ruslan Salakhutdinov in Geometry of Optimization and Implicit Regularization in Deep Learning argue that the optimization methods used for training neural networks play a crucial role in generalization ability of deep learning models, through implicit regularization. They demonstrate that generalization ability is not controlled simply by network size, but rather by some other implicit control. Then, by studying the geometry of the parameter space of deep networks and devising an optimization algorithm attuned to this geometry, they demonstrate how changing the empirical optimization procedure can improve generalization performance.

    Interpretability of deep networks
     
  5. Shie Mannor, Tom Zahavy and Nir Baram in Graying the black box: Understanding DQNs present a methodology and tools to analyze Deep Q-networks (DQNs) in a non-blind matter. They propose a new model, the Semi Aggregated Markov Decision Process (SAMDP), and an algorithm that learns it automatically. Using these tools they reveal that the features learned by DQNs aggregate the state space in a hierarchical fashion, explaining its success. Moreover, they are able to look into the network to understand and describe the policies learned by DQNs for three different Atari2600 games and suggest ways to interpret, debug and optimize deep neural networks in reinforcement learning.

References

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553): 436–444, 2015.

1 This work was done with the support of the Intel Collaborative Research institute for Computational Intelligence (ICRI-CI). This paper is the preface part of the ’Why & When Deep Learning works looking inside Deep Learning’ ICRI-CI paper bundle.

Designing Scalable IoT Architectures


Designing for the Internet of Things is challenging. The technology is rapidly changing, and architecting for these situations can be complex. This article will discuss both design considerations for IoT and new methods in creating a robust network using Intel® processors.

Latency, Bandwidth and Reliability

Design practices for Internet of Things (IoT) devices are changing. It used to be that developers just watched processes from afar, but now we control them in real-time. A result of this change has been an increase in IoT network complexity. For IoT devices that depend on Internet access, this can result in several challenges when it comes to network paths to cloud servers: high latencies, low bandwidths, and decreased reliability.

These trends have led to new topologies in IoT networks, such as Fog Computing (a network layer below the cloud). Deploying cloud elements closer to the edge of the network (or even onsite) reduces latencies while also preventing bandwidth bottlenecks. In order to achieve these goals, edge networks and Fog Computing require high-performance computing resources, as well as high-speed storage and networking.

Scalable Design

The challenge for IoT is twofold:

1) Design scalable and reliable devices

2) Architect flexible cloud elements with the lowest possible latency, highest bandwidth, and best reliability possible.

IoT Design with Intel® Processors

Intel supplies a wide range of processor products that allow IoT designers to scale both hardware and software to meet these design goals (flexible, scalable, and reliable). Many of these processor families also have integrated GPUs, which offer extra processing resources.

There are four main product families:

  1. Intel® Quark™ processor
  2. Intel® Core™ processor
  3. Intel Atom® processor
  4. Intel® Xeon® processor


Figure 1 Designing to Scale

Early Big Data and Current IoT Architectures

Early big data architectures were based on sensors with networking capabilities. These accessed the Internet and transmitted data into cloud applications for later retrieval and analysis.

Current IoT architectures evolved into networks that either forward data in near real-time (to generate event-based responses) or function as sensor-actor networks.

About Sensors

Sensors are devices that detect or measure a physical property (temperature, humidity, light, etc.). Controllers receive input from sensors and initiate actions. These actions usually include using an actor or actuator to adjust or maintain the desired output of a specific process. Let's consider, for example, a plant watering system based on sensors: the moisture sensor measures the water saturation of the soil, and if that level falls below a certain threshold, a controller initiates an action to open a water valve.
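
A minimal sketch of that control loop (hypothetical threshold and stubbed hardware functions, not from this article):

#include <chrono>
#include <thread>
#include <iostream>

// Stubbed hardware access; on a real device these would talk to the ADC/GPIO drivers.
double read_moisture_sensor() { return 0.25; }  // hypothetical reading in the range 0.0-1.0
void set_water_valve(bool open) { std::cout << (open ? "valve open\n" : "valve closed\n"); }

int main() {
    const double threshold = 0.30;  // hypothetical saturation threshold
    for (int i = 0; i < 3; ++i) {   // a few iterations for illustration
        set_water_valve(read_moisture_sensor() < threshold);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return 0;
}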

Evolution of IoT Networks
Figure 2 Evolution of IoT Networks

Evolving IoT Networks

Figure 2 illustrates how latency becomes a significant issue when the IoT network becomes more complex (specifically, the Sensor-Actor Real-Time Data Model). IoT designs must take into account two things: 1) a rapidly progressing network of sensors and 2) systems acting upon the network.  Integrating fog architectures into existing IoT networks helps to reduce latency issues, bringing cloud elements closer to the edge!

As IoT evolves and more sophisticated applications are designed, the entire end-to-end IoT chain will need even more computing resources, while still requiring power consumption optimization. This need for processing power is constantly growing, meaning that designers need to account for some extra headroom for future software upgrades.

Migrating the cloud elements to the edge network or to the LAN (Figure 3) reduces network latencies accordingly. The real-time data path to the ON-PREMISES DATACENTER bypasses the access network and gains the bandwidth and reliability benefits of the LAN.

Comparison of different types of IoT Networks
Figure 3 Comparison of different types of IoT Networks

IoT Network Stacks

An increase in network complexity, along with growing demand for IoT, has resulted in the exponential growth of complex network stacks. Network stacks now not only need to handle IoT protocols; they must also account for security, encryption, and independent processors that handle additional tasks.

IoT Network Architecture

When planning architecture for an IoT network, it’s important to consider the downstream processing of the network. Let's consider, for example, a smart building where a sensor is linked to a lighting appliance. The appliance may be part of a larger building application. The smart building may also be part of a smart city network. In this case, you would want to consider that data is not only being passed locally, but also being transferred to a larger building network, and ultimately to a much larger city network. 

Application Demands

As sensors grow in complexity and their deployment becomes widespread, it’s important to ensure processors account for additional demands (i.e., not only network connectivity). Increasingly large data sets now flow from sensors: digital sensors that use GPIO or analog connections produce large volumes of data that must be processed and managed in real time. It’s important to scale independent microcontroller and bus interfaces in system designs to meet application demands. For example, Fog Node or edge computing will be needed as LIDAR, radar, ultrasound, and video (vision) sensors are added, in order to keep up with real-time computing applications.

Autonomous Systems

Autonomous control and adaptive learning control systems should be accounted for in current or future IoT system designs. Implementation of autonomous systems is becoming more widespread. Being able to scale a design for future use is just as advantageous as offering the capability in your design today, as emerging technologies continue to progress. Smart homes, connected cars, artificial intelligence, and embedded deep learning are coming soon to the marketplace.

IoT Power and Performance with an Intel® Processor Family

Intel offers four families of processors that make achieving low latency, high bandwidth, and increased reliability possible, all without increasing power consumption or affecting performance. The Internet of Things is a fast-growing and complex system with many design considerations, such as latency issues or ISP bottlenecks; both can be addressed with Intel® processors. Moving big data computing to the edge (and within LAN Fog Nodes) increases onsite computing resources and sensor capability, frees up bandwidth, and increases reliability in IoT networks.

More on Scaling Processors at the Edge   Edge-to-Cloud Integration   Sensors

The Evil within the Comparison Functions


Perhaps readers remember my article titled "Last line effect". It describes a pattern I once noticed: in most cases programmers make an error in the last line of similar text blocks. Now I want to tell you about a new interesting observation. It turns out that programmers tend to make mistakes in functions that compare two objects. This statement looks implausible; however, I'll show you a great number of examples of errors that may be shocking to a reader. So, here is new research; it will be quite amusing and scary.

Problematics

Here is my statement: programmers quite often make mistakes in rather simple functions that are meant to compare two objects. This claim is based on the experience of our team in checking a large number of open source projects in C, C++ and C#.

The functions we are going to consider here are IsEqual, Equals, Compare, AreEqual and so on, or overloaded operators such as == and !=.

I noticed that when writing articles, I very often come across errors related to comparison functions. I decided to explore this question in detail and examined the base of errors we had found. I searched throughout the base for functions containing the words Cmp, Equal, Compare and such. The result was very impressive and shocking.

In fact, this story is similar to the one we had when writing the article "Last line effect". Similarly, I noticed an anomaly and decided to explore it more carefully. Unfortunately, unlike the aforementioned article, I don't know how to bring statistics here and which figures to provide. Perhaps later I'll come up with a solution for the statistics. At this point I am guided by intuition and can only share my feelings: there are a lot of errors in comparison functions, and I am sure you will get the same feeling when you see the huge number of truly impressive examples.

Psychology

For a moment let's go back to the article "Last line effect". By the way, if you haven't read it, I suggest taking a break and looking at it. There is a more detailed analysis of this topic: "The last line effect explained"

In general, we can conclude that the cause of errors in the last lines is related to the fact that the developer has already mentally moved on to the next lines/tasks instead of focusing on the completion of the current fragment. As a result, when writing similar blocks of text, there is a higher probability that a programmer will make an error in the last one.

I believe that in the case of writing a comparison function, a developer often doesn't focus on it, considering it to be too trivial. In other words, he writes the code automatically, without thinking it over. Otherwise, it is not clear how one can make an error like this:

bool IsLuidsEqual(LUID luid1, LUID luid2)
{
  return (luid1.LowPart == luid2.LowPart) &&
         (luid2.HighPart == luid2.HighPart);
}

PVS-Studio analyzer detected this error in the code of RunAsAdmin Explorer Shim (C++) project: V501 There are identical sub-expressions to the left and to the right of the '==' operator: luid2.HighPart == luid2.HighPart RAACommon raacommonfuncs.cpp 1511

A typo. In the second line it should be: luid1.HighPart == luid2.HighPart.

The code is very simple. Apparently, the simplicity of the code spoils everything. A programmer immediately regards the task of writing such a function as standard and uninteresting. He instantly thinks of the way to write the function and merely has to implement the code. This is a routine, but unfortunately inevitable, process before starting to write more important, complex and interesting code. He is already thinking about the new task... and, as a result, makes an error.

In addition, programmers rarely write unit tests for such functions. Again, the simplicity of these functions prevents it. It seems that it would be too much to test them, as these functions are simple and repetitive. A person has written hundreds of such functions in his life; can he make an error in another one? Yes, he can, and he does.

I would also like to note that we aren't talking about code of students who are just learning to program. We are talking about bugs in the code of such projects as GCC, Qt, GDB, LibreOffice, Unreal Engine 4, CryEngine V, Chromium, MongoDB, Oracle VM VirtualBox, FreeBSD, WinMerge, the CoreCLR, MySQL, Mono, CoreFX, Roslyn, MSBuild, etc. It's all very serious.

We are going to have a look at so many diverse examples that it would be scary to sleep at night.

Erroneous Patterns in Comparison Functions

All errors in comparison functions will be divided into several patterns. In the article we'll be talking about errors in projects in C, C++ and C#, but it makes no sense to separate these languages, as most of the patterns are similar for different languages.

Pattern: A < B, B > A

Very often in the comparison functions there is a need to make such checks:

  • A < B
  • A > B

Sometimes programmers think that it is more elegant to use the same operator <, but to switch the variables:

  • A < B
  • B < A

However, due to inattentiveness, we get checks like these:

  • A < B
  • B > A

In fact, one and the same comparison is done twice here. Perhaps it's not yet clear what this is about, but we'll get to the practical examples and it'll all become clearer.
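
As a minimal sketch (a hypothetical Point type, not taken from any of the checked projects), the correct idiom keeps the same operator < and only flips the operands:

struct Point {
  int x, y;
};

// Lexicographic "less than": compare x first, then fall back to y.
bool operator<(const Point &a, const Point &b)
{
  if (a.x < b.x)
    return true;
  if (b.x < a.x)   // operands flipped, same operator
    return false;
  return a.y < b.y;
}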

string _server;
....
bool operator<( const ServerAndQuery& other ) const {
  if ( ! _orderObject.isEmpty() )
    return _orderObject.woCompare( other._orderObject ) < 0;

  if ( _server < other._server )
    return true;
  if ( other._server > _server )
    return false;
  return _extra.woCompare( other._extra ) < 0;
}

PVS-Studio analyzer detected this error in the code of MongoDB (C++): V581 The conditional expressions of the 'if' operators situated alongside each other are identical. Check lines: 44, 46. parallel.h 46

This condition:

if ( other._server > _server )

will always be false, as the same check was done two lines before. The correct code variant is:

if ( _server < other._server )
  return true;
if ( other._server < _server )
  return false;

This error was detected in the code of Chromium project (C++):

enum ContentSettingsType;
struct EntryMapKey {
  ContentSettingsType content_type;
  ...
};

bool OriginIdentifierValueMap::EntryMapKey::operator<(
    const OriginIdentifierValueMap::EntryMapKey& other) const {
  if (content_type < other.content_type)
    return true;
  else if (other.content_type > content_type)
    return false;
  return (resource_identifier < other.resource_identifier);
}

PVS-Studio warning: V517 The use of 'if (A) {...} else if (A) {...}' pattern was detected. There is a probability of logical error presence. Check lines: 61, 63. browser content_settings_origin_identifier_value_map.cc 61

That was a C++ example; now it's C#'s turn. The next error was found in the code of IronPython and IronRuby (C#).

public static int Compare(SourceLocation left,
                          SourceLocation right) {
  if (left < right) return -1;
  if (right > left) return 1;
  return 0;
}

PVS-Studio warning (C#): V3021 There are two 'if' statements with identical conditional expressions. The first 'if' statement contains method return. This means that the second 'if' statement is senseless. SourceLocation.cs 156

I think there is no need for an explanation.

Note. For C# there was just one example of an error, but for C++ there were two. In general, there will be fewer bugs in C# code than in C/C++ code. But I do not recommend rushing to the conclusion that C# is much safer. The thing is that the PVS-Studio analyzer has learned to check C# code relatively recently, and we have simply checked fewer projects written in C# than in C and C++.

Pattern: a Member of the Class is Compared with itself

The comparison functions usually consist of successive comparisons of structure/class members. This code tends to be more erroneous when a member of the class starts being compared with itself. I can specify two subtypes of errors.

In the first case, a programmer forgets to specify the name of the object and writes in the following way:

return m_x == foo.m_x &&
       m_y == m_y &&            // <=
       m_z == foo.m_z;

In the second case, the same name of the object is written:

return zzz.m_x == foo.m_x &&
       zzz.m_y == zzz.m_y &&    // <=
       zzz.m_z == foo.m_z;
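
For reference, a minimal sketch of the intended form (a hypothetical Foo type, not taken from any of the checked projects): every member of the left-hand object is compared with the corresponding member of the other object, and nothing is compared with itself.

struct Foo {
  int m_x, m_y, m_z;

  bool IsEqual(const Foo &foo) const
  {
    return m_x == foo.m_x &&
           m_y == foo.m_y &&   // each member paired with the other object's member
           m_z == foo.m_z;
  }
};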

Let's take a closer look at practical examples of this pattern. Note that the incorrect comparison often occurs in the last of several similar code blocks, which reminds us of the "last line effect" again.

The error is found in the code of Unreal Engine 4 (C++) project:

bool
Compare(const FPooledRenderTargetDesc& rhs, bool bExact) const
{
  ....
  return Extent == rhs.Extent
      && Depth == rhs.Depth
      && bIsArray == rhs.bIsArray
      && ArraySize == rhs.ArraySize
      && NumMips == rhs.NumMips
      && NumSamples == rhs.NumSamples
      && Format == rhs.Format
      && LhsFlags == RhsFlags
      && TargetableFlags == rhs.TargetableFlags
      && bForceSeparateTargetAndShaderResource == rhs.bForceSeparateTargetAndShaderResource
      && ClearValue == rhs.ClearValue
      && AutoWritable == AutoWritable;           // <=
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '==' operator: AutoWritable == AutoWritable rendererinterface.h 180

The code of Samba (C) project:

static int compare_procids(const void *p1, const void *p2)
{
  const struct server_id *i1 = (struct server_id *)p1;
  const struct server_id *i2 = (struct server_id *)p2;

  if (i1->pid < i2->pid) return -1;
  if (i2->pid > i2->pid) return 1;
  return 0;
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '>' operator: i2->pid > i2->pid brlock.c 1901

The code of MongoDB (C++) project:

bool operator==(const MemberCfg& r) const {
  ....
  return _id==r._id && votes == r.votes &&
         h == r.h && priority == r.priority &&
         arbiterOnly == r.arbiterOnly &&
         slaveDelay == r.slaveDelay &&
         hidden == r.hidden &&
         buildIndexes == buildIndexes;        // <=
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '==' operator: buildIndexes == buildIndexes rs_config.h 101

The code of Geant4 Software (C++) project:

inline G4bool G4FermiIntegerPartition::
operator==(const G4FermiIntegerPartition& right)
{
  return (total == right.total &&
          enableNull == enableNull &&          // <=
          partition == right.partition);
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '==' operator: enableNull == enableNull G4hadronic_deex_fermi_breakup g4fermiintegerpartition.icc 58

The code of LibreOffice (C++) project:

class SvgGradientEntry
{
  ....
  bool operator==(const SvgGradientEntry& rCompare) const
  {
    return (getOffset() == rCompare.getOffset()
         && getColor() == getColor()           // <=
         && getOpacity() == getOpacity());     // <=
  }
  ....
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '==' operator: getColor() == getColor() svggradientprimitive2d.hxx 61

The code of Chromium (C++) project:

bool FileIOTest::MatchesResult(const TestStep& a,
                               const TestStep& b) {
  ....
  return (a.data_size == a.data_size &&             // <=
          std::equal(a.data, a.data + a.data_size, b.data));
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '==' operator: a.data_size == a.data_size cdm_file_io_test.cc 367

The code of FreeCAD (C++) project:

bool FaceTypedBSpline::isEqual(const TopoDS_Face &faceOne,
                               const TopoDS_Face &faceTwo) const
{
  ....
  if (surfaceOne->IsURational() !=
      surfaceTwo->IsURational())
    return false;
  if (surfaceTwo->IsVRational() !=         // <=
      surfaceTwo->IsVRational())           // <=
    return false;
  if (surfaceOne->IsUPeriodic() !=
      surfaceTwo->IsUPeriodic())
    return false;
  if (surfaceOne->IsVPeriodic() !=
      surfaceTwo->IsVPeriodic())
    return false;
  if (surfaceOne->IsUClosed() !=
      surfaceTwo->IsUClosed())
    return false;
  if (surfaceOne->IsVClosed() !=
      surfaceTwo->IsVClosed())
    return false;
  if (surfaceOne->UDegree() !=
      surfaceTwo->UDegree())
    return false;
  if (surfaceOne->VDegree() !=
      surfaceTwo->VDegree())
    return false;
  ....
}

PVS-Studio warning: V501 There are identical sub-expressions 'surfaceTwo->IsVRational()' to the left and to the right of the '!=' operator. modelrefine.cpp 780

The code of Serious Engine (C++) project:

class CTexParams {
public:

  inline BOOL IsEqual( CTexParams tp) {
    return tp_iFilter     == tp.tp_iFilter &&
           tp_iAnisotropy == tp_iAnisotropy &&             // <=
           tp_eWrapU      == tp.tp_eWrapU &&
           tp_eWrapV      == tp.tp_eWrapV; };
  ....
};

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '==' operator: tp_iAnisotropy == tp_iAnisotropy gfx_wrapper.h 180

The code of Qt (C++) project:

inline bool qCompare(QImage const &t1, QImage const &t2, ....)
{
  ....
  if (t1.width() != t2.width() || t2.height() != t2.height()) {
  ....
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '!=' operator: t2.height() != t2.height() qtest_gui.h 101

The code of FreeBSD (C) project:

static int
compare_sh(const void *_a, const void *_b)
{
  const struct ipfw_sopt_handler *a, *b;

  a = (const struct ipfw_sopt_handler *)_a;
  b = (const struct ipfw_sopt_handler *)_b;
  ....
  if ((uintptr_t)a->handler < (uintptr_t)b->handler)
    return (-1);
  else if ((uintptr_t)b->handler > (uintptr_t)b->handler) // <=
    return (1);

  return (0);
}

PVS-Studio warning: V501 There are identical sub-expressions '(uintptr_t) b->handler' to the left and to the right of the '>' operator. ip_fw_sockopt.c 2893

The code of Mono (C#) project:

static bool AreEqual (VisualStyleElement value1,
                      VisualStyleElement value2)
{
  return
    value1.ClassName == value1.ClassName && // <=
    value1.Part == value2.Part &&
    value1.State == value2.State;
}

PVS-Studio warning: V3001 There are identical sub-expressions 'value1.ClassName' to the left and to the right of the '==' operator. ThemeVisualStyles.cs 2141

The code of Mono (C#) project:

public int ExactInference (TypeSpec u, TypeSpec v)
{
  ....
  var ac_u = (ArrayContainer) u;
  var ac_v = (ArrayContainer) v;
  ....
  var ga_u = u.TypeArguments;
  var ga_v = v.TypeArguments;
  ....
  if (u.TypeArguments.Length != u.TypeArguments.Length) // <=
    return 0;

  ....
}

PVS-Studio warning: V3001 There are identical sub-expressions 'u.TypeArguments.Length' to the left and to the right of the '!=' operator. generic.cs 3135

The code of MonoDevelop (C#) project:

Accessibility DeclaredAccessibility { get; }
bool IsStatic { get; }

private bool MembersMatch(ISymbol member1, ISymbol member2)
{
  if (member1.Kind != member2.Kind)
  {
    return false;
  }

  if (member1.DeclaredAccessibility !=          // <=1
      member1.DeclaredAccessibility             // <=1
   || member1.IsStatic != member1.IsStatic)     // <=2
  {
    return false;
  }

  if (member1.ExplicitInterfaceImplementations().Any() ||
      member2.ExplicitInterfaceImplementations().Any())
  {
    return false;
  }

  return SignatureComparer
    .HaveSameSignatureAndConstraintsAndReturnTypeAndAccessors(
       member1, member2, this.IsCaseSensitive);
}

PVS-Studio warning: V3001 There are identical sub-expressions 'member1.IsStatic' to the left and to the right of the '!=' operator. CSharpBinding AbstractImplementInterfaceService.CodeAction.cs 545

The code of Haiku (C++) project:

int __CORTEX_NAMESPACE__ compareTypeAndID(....)
{
  int retValue = 0;
  ....
  if (lJack && rJack)
  {
    if (lJack->m_jackType < lJack->m_jackType)           // <=
    {
      return -1;
    }
    if (lJack->m_jackType == lJack->m_jackType)          // <=
    {
      if (lJack->m_index < rJack->m_index)
      {
        return -1;
      }
      else
      {
        return 1;
      }
    }
    else if (lJack->m_jackType > rJack->m_jackType)
    {
      retValue = 1;
    }
  }
  return retValue;
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '<' operator: lJack->m_jackType < lJack->m_jackType MediaJack.cpp 783

Just below there is exactly the same error. As I understand, in both cases a programmer forgot to replace lJack with rJack.

The code of CryEngine V (C++) project:

bool
CompareRotation(const Quat& q1, const Quat& q2, float epsilon)
{
  return (fabs_tpl(q1.v.x - q2.v.x) <= epsilon)
      && (fabs_tpl(q1.v.y - q2.v.y) <= epsilon)
      && (fabs_tpl(q2.v.z - q2.v.z) <= epsilon)     // <=
      && (fabs_tpl(q1.w - q2.w) <= epsilon);
}

PVS-Studio warning: V501 There are identical sub-expressions to the left and to the right of the '-' operator: q2.v.z - q2.v.z entitynode.cpp 93

Pattern: Evaluating the Size of a Pointer Instead of the Size of the Structure/Class

This type of error occurs in programs written in C and C++ and is caused by incorrect use of the sizeof operator. The mistake is evaluating not the size of the object, but the size of the pointer. Example:

T *a = foo1();
T *b = foo2();
x = memcmp(a, b, sizeof(a));

Instead of the size of the T structure, the size of the pointer gets evaluated. The pointer size depends on the data model in use, but it is usually 4 or 8 bytes. As a result, more or fewer bytes of memory are compared than the structure actually occupies.

Correct variant of the code:

x = memcmp(a, b, sizeof(T));

or

x = memcmp(a, b, sizeof(*a));

Now let's move on to the practical part. Here is how such a bug looks in the code of CryEngine V (C++) code:

bool
operator==(const SComputePipelineStateDescription& other) const
{
  return 0 == memcmp(this, &other, sizeof(this));
}

PVS-Studio warning: V579 The memcmp function receives the pointer and its size as arguments. It is possibly a mistake. Inspect the third argument. graphicspipelinestateset.h 58

The code of Unreal Engine 4 project (C++):

bool FRecastQueryFilter::IsEqual(
  const INavigationQueryFilterInterface* Other) const
{
  // @NOTE: not type safe, should be changed when
  // another filter type is introduced
  return FMemory::Memcmp(this, Other, sizeof(this)) == 0;

}

PVS-Studio warning: V579 The Memcmp function receives the pointer and its size as arguments. It is possibly a mistake. Inspect the third argument. pimplrecastnavmesh.cpp 172

Pattern: Repetitive Arguments of Cmp(A, A) Type

Comparison functions usually call other comparison functions. At the same time one of the possible errors is that the reference/pointer is passed to the same object twice. Example:

x = memcmp(A, A, sizeof(T));

Here the object A is compared with itself, which, of course, makes no sense.

We'll start with an error, found in the debugger GDB (C):

static int
psymbol_compare (const void *addr1, const void *addr2,
                 int length)
{
  struct partial_symbol *sym1 = (struct partial_symbol *) addr1;
  struct partial_symbol *sym2 = (struct partial_symbol *) addr2;

  return (memcmp (&sym1->ginfo.value, &sym1->ginfo.value,    // <=
                  sizeof (sym1->ginfo.value)) == 0
          && sym1->ginfo.language == sym2->ginfo.language
          && PSYMBOL_DOMAIN (sym1) == PSYMBOL_DOMAIN (sym2)
          && PSYMBOL_CLASS (sym1) == PSYMBOL_CLASS (sym2)
          && sym1->ginfo.name == sym2->ginfo.name);
}

PVS-Studio warning: V549 The first argument of 'memcmp' function is equal to the second argument. psymtab.c 1580

The code of CryEngineSDK project (C++):

inline bool operator != (const SEfResTexture &m) const
{
  if (stricmp(m_Name.c_str(), m_Name.c_str()) != 0 ||   // <=
      m_TexFlags != m.m_TexFlags ||
      m_bUTile != m.m_bUTile ||
      m_bVTile != m.m_bVTile ||
      m_Filter != m.m_Filter ||
      m_Ext != m.m_Ext ||
      m_Sampler != m.m_Sampler)
    return true;
  return false;
}

PVS-Studio warning: V549 The first argument of 'stricmp' function is equal to the second argument. ishader.h 2089

The code of PascalABC.NET (C#):

private List<string> enum_consts = new List<string>();
public override bool IsEqual(SymScope ts)
{
  EnumScope es = ts as EnumScope;
  if (es == null) return false;
  if (enum_consts.Count != es.enum_consts.Count) return false;
  for (int i = 0; i < es.enum_consts.Count; i++)
    if (string.Compare(enum_consts[i],
                       this.enum_consts[i], true) != 0)
      return false;
  return true;
}

PVS-Studio warning: V3038 The 'enum_consts[i]' argument was passed to 'Compare' method several times. It is possible that other argument should be passed instead. CodeCompletion SymTable.cs 2206

I'll give some explanation here. The error is in the actual arguments of the Compare function:

string.Compare(enum_consts[i], this.enum_consts[i], true)

The thing is that enum_consts[i] and this.enum_consts[i] are the same thing. As I understand, a correct call should be like this:

string.Compare(es.enum_consts[i], this.enum_consts[i], true)

or

string.Compare(enum_consts[i], es.enum_consts[i], true)

Pattern: Repetitive Checks A==B && A==B

Quite a common error in programming is when the same check is done twice. Example:

return A == B &&
       C == D &&   // <=
       C == D &&   // <=
       E == F;

Two variants are possible in this case. The first is quite harmless: one comparison is redundant and can be simply removed. The second is worse: some other variables were to be compared, but a programmer made a typo.

In any case, such code deserves close attention. Let me scare you a little more, and show that this error can be found even in the code of GCC compiler (C):

static bool
dw_val_equal_p (dw_val_node *a, dw_val_node *b)
{
  ....
  case dw_val_class_vms_delta:
    return (!strcmp (a->v.val_vms_delta.lbl1,
                     b->v.val_vms_delta.lbl1)
            && !strcmp (a->v.val_vms_delta.lbl1,
                        b->v.val_vms_delta.lbl1));
  ....
}

PVS-Studio warning: V501 There are identical sub-expressions '!strcmp(a->v.val_vms_delta.lbl1, b->v.val_vms_delta.lbl1)' to the left and to the right of the '&&' operator. dwarf2out.c 1428

The function strcmp is called twice with the same set of arguments.

The code of Unreal Engine 4 project (C++):

FORCEINLINE
bool operator==(const FShapedGlyphEntryKey& Other) const
{
  return FontFace == Other.FontFace
      && GlyphIndex == Other.GlyphIndex   // <=
      && FontSize == Other.FontSize
      && FontScale == Other.FontScale
      && GlyphIndex == Other.GlyphIndex;  // <=
}

PVS-Studio warning: V501 There are identical sub-expressions 'GlyphIndex == Other.GlyphIndex' to the left and to the right of the '&&' operator. fontcache.h 139

The code of Serious Engine project (C++):

inline BOOL CValuesForPrimitive::operator==(....)
{
  return (
 (....) &&
 (vfp_ptPrimitiveType == vfpToCompare.vfp_ptPrimitiveType) &&
 ....
 (vfp_ptPrimitiveType == vfpToCompare.vfp_ptPrimitiveType) &&
 ....
);
}

PVS-Studio warning: V501 There are identical sub-expressions '(vfp_ptPrimitiveType == vfpToCompare.vfp_ptPrimitiveType)' to the left and to the right of the '&&' operator. worldeditor.h 580

The code of Oracle VM Virtual Box project (C++):

typedef struct SCMDIFFSTATE
{
  ....
  bool  fIgnoreTrailingWhite;
  bool  fIgnoreLeadingWhite;
  ....
} SCMDIFFSTATE;
/* Pointer to a diff state. */

typedef SCMDIFFSTATE *PSCMDIFFSTATE;

/* Compare two lines */
DECLINLINE(bool) scmDiffCompare(PSCMDIFFSTATE pState, ....)
{
  ....
  if (pState->fIgnoreTrailingWhite    // <=
   || pState->fIgnoreTrailingWhite)   // <=
    return scmDiffCompareSlow(....);
  ....
}

PVS-Studio warning: V501 There are identical sub-expressions 'pState->fIgnoreTrailingWhite' to the left and to the right of the '||' operator. scmdiff.cpp 238

Pattern: Incorrect Use of the Value Returned by the memcmp Function

The memcmp function returns the following values of int type:

  • < 0 - buf1 less than buf2;
  • 0 - buf1 identical to buf2;
  • > 0 - buf1 greater than buf2;

Please note that '> 0' can be any number, not only 1: it can be 2, 3, 100, 256, 1024, 5555, 65536 and so on. This means that the result must not be stored in a variable of the char or short type. The high bits could be lost, which might break the logic of program execution.

It also means that the result must not be compared with the constants 1 or -1. In other words, it is wrong to write this:

if (memcmp(a, b, sizeof(T)) == 1)
if (memcmp(x, y, sizeof(T)) == -1)

Correct comparisons:

if (memcmp(a, b, sizeof(T)) > 0)
if (memcmp(a, b, sizeof(T)) < 0)

The danger of this code is that it may successfully work for a long time. The errors may start showing up when moving to a new platform or with the change of the compiler version.

The code of ReactOS project (C++):

HRESULT WINAPI CRecycleBin::CompareIDs(....)
{
  ....
  return MAKE_HRESULT(SEVERITY_SUCCESS, 0,
   (unsigned short)memcmp(pidl1->mkid.abID,
                          pidl2->mkid.abID,
                          pidl1->mkid.cb));
}

PVS-Studio warning: V642 Saving the 'memcmp' function result inside the 'unsigned short' type variable is inappropriate. The significant bits could be lost breaking the program's logic. recyclebin.cpp 542

The code of Firebird project (C++):

SSHORT TextType::compare(ULONG len1, const UCHAR* str1,
ULONG len2, const UCHAR* str2)
{
  ....
  SSHORT cmp = memcmp(str1, str2, MIN(len1, len2));

  if (cmp == 0)
    cmp = (len1 < len2 ? -1 : (len1 > len2 ? 1 : 0));
  return cmp;
}

PVS-Studio warning: V642 Saving the 'memcmp' function result inside the 'short' type variable is inappropriate. The significant bits could be lost breaking the program's logic. texttype.cpp 338

The code of CoreCLR project (C++):

bool operator( )(const GUID& _Key1, const GUID& _Key2) const
  { return memcmp(&_Key1, &_Key2, sizeof(GUID)) == -1; }

PVS-Studio warning: V698 Expression 'memcmp(....) == -1' is incorrect. This function can return not only the value '-1', but any negative value. Consider using 'memcmp(....) < 0' instead. sos util.cpp 142

The code of OpenToonz project (C++):

bool TFilePath::operator<(const TFilePath &fp) const
{
  ....
  char differ;
  differ = _wcsicmp(iName.c_str(), jName.c_str());
  if (differ != 0)
    return differ < 0 ? true : false;
  ....
}

PVS-Studio warning: V642 Saving the '_wcsicmp' function result inside the 'char' type variable is inappropriate. The significant bits could be lost, breaking the program's logic. tfilepath.cpp 328

Pattern: Incorrect Check of Null References

This error pattern is typical for C# programs. In comparison functions, programmers sometimes perform the type cast with the help of the as operator. The error is that a programmer inadvertently checks against null not the new reference, but the original one. Let's take a look at a synthetic example:

ChildT foo = obj as ChildT;
if (obj == null)
  return false;
if (foo.zzz()) {}

The check if (obj == null) protects against the situation where the obj variable contains a null reference. However, there is no protection for the case where the as operator returns a null reference. The correct code should be like this:

ChildT foo = obj as ChildT;
if (foo == null)
  return false;
if (foo.zzz()) {}

Typically, this error occurs due to the programmer's negligence. Similar bugs are possible in C and C++ programs, but I haven't found such a case in our error database.

The code of MonoDevelop project (C#):

public override bool Equals (object o)
{
  SolutionItemReference sr = o as SolutionItemReference;
  if (o == null)
    return false;
  return (path == sr.path) && (id == sr.id);
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'o', 'sr'. MonoDevelop.Core SolutionItemReference.cs 81

The code of CoreFX (C#):

public override bool Equals(object comparand)
{
  CredentialHostKey comparedCredentialKey =
                                  comparand as CredentialHostKey;

  if (comparand == null)
  {
    // This covers also the compared == null case
    return false;
  }

  bool equals = string.Equals(AuthenticationType,
        comparedCredentialKey.AuthenticationType, ....
  ....
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'comparand', 'comparedCredentialKey'. CredentialCache.cs 4007

The code of Roslyn project (C#):

public override bool Equals(object obj)
{
  var d = obj as DiagnosticDescription;

  if (obj == null)
    return false;

  if (!_code.Equals(d._code))
    return false;
  ....
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'obj', 'd'. DiagnosticDescription.cs 201

The code of Roslyn (C#):

protected override bool AreEqual(object other)
{
  var otherResourceString = other as LocalizableResourceString;
  return
    other != null &&
    _nameOfLocalizableResource ==
      otherResourceString._nameOfLocalizableResource &&
    _resourceManager == otherResourceString._resourceManager &&
    _resourceSource == otherResourceString._resourceSource &&
    ....
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'other', 'otherResourceString'. LocalizableResourceString.cs 121

The code of MSBuild project (C#):

public override bool Equals(object obj)
{
   AssemblyNameExtension name = obj as AssemblyNameExtension;
   if (obj == null)  // <=
   {
     return false;
   }
   ....
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'obj', 'name'. AssemblyRemapping.cs 64

The code of Mono project (C#):

public override bool Equals (object o)
{
  UrlMembershipCondition umc = (o as UrlMembershipCondition);
  if (o == null)                                      // <=
    return false;

  ....

  return (String.Compare (u, 0, umc.Url, ....) == 0); // <=
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'o', 'umc'. UrlMembershipCondition.cs 111

The code of Media Portal 2 project (C#):

public override bool Equals(object obj)
{
  EpisodeInfo other = obj as EpisodeInfo;
  if (obj == null) return false;
  if (TvdbId > 0 && other.TvdbId > 0)
    return TvdbId == other.TvdbId;
  ....
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'obj', 'other'. EpisodeInfo.cs 560

The code of NASA World Wind project (C#):

public int CompareTo(object obj)
{
  RenderableObject robj = obj as RenderableObject;
  if(obj == null)                                 // <=
    return 1;
  return this.m_renderPriority.CompareTo(robj.RenderPriority);
}

PVS-Studio warning: V3019 Possibly an incorrect variable is compared to null after type conversion using 'as' keyword. Check variables 'obj', 'robj'. RenderableObject.cs 199

Pattern: Incorrect Loops

In some functions, collections of items are compared. Naturally, different variants of loops are used for this comparison. If a programmer writes the code inattentively, it's easy to mix something up, just as with the comparison functions themselves. Let's look at a few of these situations.

The code of Trans-Proteomic Pipeline (C++):

bool Peptide::operator==(Peptide& p) {
  ....
  for (i = 0, j = 0;
       i < this->stripped.length(), j < p.stripped.length();
       i++, j++) {
  ....
}

PVS-Studio warning: V521 Such expressions using the ',' operator are dangerous. Make sure the expression is correct. tpplib peptide.cpp 191

Note that the comma operator is used in the condition. The code is clearly incorrect, because the condition written to the left of the comma is ignored: it is evaluated, but its result is not used in any way.
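
For illustration, here is a small self-contained sketch (hypothetical strings, not the original Trans-Proteomic Pipeline code) showing why the comma operator breaks the bounds check, and the form the condition presumably should have had:

#include <iostream>
#include <string>

int main()
{
  std::string a = "abc";       // shorter string
  std::string b = "abcdef";    // longer string

  std::size_t pastEnd = 0;
  for (std::size_t i = 0, j = 0;
       i < a.length(), j < b.length();   // BUG: the left operand is evaluated and discarded
       ++i, ++j)
  {
    if (i >= a.length())
      ++pastEnd;                         // i keeps running past the end of 'a'
  }
  std::cout << "iterations with i out of range: " << pastEnd << "\n";  // prints 3

  for (std::size_t i = 0, j = 0;
       i < a.length() && j < b.length(); // correct: both bounds are checked
       ++i, ++j)
  {
    // both indices stay in range here
  }
  return 0;
}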

The code of Qt project (C++):

bool equals( class1* val1, class2* val2 ) const
{
  ...
  size_t size = val1->size();
  ...
  while ( --size >= 0 ){
    if ( !comp(*itr1,*itr2) )
      return false;
    itr1++;
    itr2++;
  }
  ...
}

PVS-Studio warning: V547 Expression '-- size >= 0' is always true. Unsigned type value is always >= 0. QtCLucene arrays.h 154

The code of CLucene project (C++):

class Arrays
{
  ....
   bool equals( class1* val1, class2* val2 ) const{
     static _comparator comp;
     if ( val1 == val2 )
       return true;
     size_t size = val1->size();
     if ( size != val2->size() )
       return false;
     _itr1 itr1 = val1->begin();
     _itr2 itr2 = val2->begin();
     while ( --size >= 0 ){
       if ( !comp(*itr1,*itr2) )
         return false;
       itr1++;
       itr2++;
     }
   return true;
  }
  ....
}

PVS-Studio warning: V547 Expression '-- size >= 0' is always true. Unsigned type value is always >= 0. arrays.h 154
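
A tiny sketch (not the original code) makes it clear why the analyzer flags this: for an unsigned counter the condition '--size >= 0' can never become false, because decrementing 0 wraps around to the maximum value instead of going negative.

#include <cstddef>
#include <cstdio>

int main()
{
  std::size_t size = 3;
  std::size_t iterations = 0;
  while (--size >= 0)          // always true for an unsigned type
  {
    if (++iterations > 10)     // artificial guard so the demo terminates
      break;
  }
  std::printf("the guard had to stop the loop after %zu iterations\n", iterations);

  std::size_t n = 3;
  while (n-- > 0)              // correct: test first, then decrement
  {
    // the body runs exactly 3 times
  }
  return 0;
}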

The code of Mono project (C#):

public override bool Equals (object obj)
{
  ....
  for (int i=0; i < list.Count; i++) {
    bool found = false;
    for (int j=0; i < ps.list.Count; j++) {     // <=
      if (list [i].Equals (ps.list [j])) {
        found = true;
        break;
      }
    }
    if (!found)
      return false;
  }
  return true;
}

PVS-Studio warning: V3015 It is likely that a wrong variable is being compared inside the 'for' operator. Consider reviewing 'i' corlib-net_4_x PermissionSet.cs 607

Apparently, there is a typo here, and the variable j instead of i should be used in the nested loop:

for (int j=0; j < ps.list.Count; j++)

Pattern: A = getA(), B = GetA()

Quite often in the comparison functions a programmer has to write code of this kind:

if (GetA().x == GetB().x && GetA().y == GetB().y)

Intermediate variables are used to reduce the size of the conditions or for optimization:

Type A = GetA();
Type B = GetB();
if (A.x == B.x && A.y == B.y)

But inadvertently, a person sometimes makes a mistake and initializes temporary variables with the same value:

Type A = GetA();
Type B = GetA();

Now let's take a look at these errors in the code of real applications.

The code of LibreOffice project (C++):

bool CmpAttr(
  const SfxPoolItem& rItem1, const SfxPoolItem& rItem2)
{
  ....
  bool bNumOffsetEqual = false;
  ::boost::optional<sal_uInt16> oNumOffset1 =
        static_cast<const SwFmtPageDesc&>(rItem1).GetNumOffset();
  ::boost::optional<sal_uInt16> oNumOffset2 =
        static_cast<const SwFmtPageDesc&>(rItem1).GetNumOffset();

  if (!oNumOffset1 && !oNumOffset2)
  {
    bNumOffsetEqual = true;
  }
  else if (oNumOffset1 && oNumOffset2)
  {
    bNumOffsetEqual = oNumOffset1.get() == oNumOffset2.get();
  }
  else
  {
    bNumOffsetEqual = false;
  }
  ....
}

PVS-Studio warning: V656 Variables 'oNumOffset1', 'oNumOffset2' are initialized through the call to the same function. It's probably an error or un-optimized code. Check lines: 68, 69. findattr.cxx 69

The code of Qt project (C++):

AtomicComparator::ComparisonResult
IntegerComparator::compare(const Item &o1,
                           const AtomicComparator::Operator,
                           const Item &o2) const
{
  const Numeric *const num1 = o1.as<Numeric>();
  const Numeric *const num2 = o1.as<Numeric>();

  if(num1->isSigned() || num2->isSigned())
  ....
}

PVS-Studio warning: V656 Variables 'num1', 'num2' are initialized through the call to the same function. It's probably an error or un-optimized code. Consider inspecting the 'o1.as < Numeric > ()' expression. Check lines: 220, 221. qatomiccomparators.cpp 221

Pattern: Sloppy Copying of the Code

A large number of the errors cited previously can be called consequences of sloppy copy-paste. They fell under certain categories of erroneous patterns, and I decided it would be logical to describe them in the corresponding sections. However, I have several errors that clearly appeared because of sloppy code copying, but I have no idea how to classify them, so I collected them here.

The code of CoreCLR project (C++):

int __cdecl Compiler::RefCntCmp(const void* op1, const void* op2)
{
  ....
  if (weight1)
  {
    ....
    if (varTypeIsGC(dsc1->TypeGet()))
    {
      weight1 += BB_UNITY_WEIGHT / 2;
    }
    if (dsc1->lvRegister)
    {
      weight1 += BB_UNITY_WEIGHT / 2;
    }
  }

  if (weight1)
  {
    ....
    if (varTypeIsGC(dsc2->TypeGet()))
    {
      weight1 += BB_UNITY_WEIGHT / 2;       // <=
    }
    if (dsc2->lvRegister)
    {
      weight2 += BB_UNITY_WEIGHT / 2;
    }
  }
  ....
}

PVS-Studio warning: V778 Two similar code fragments were found. Perhaps, this is a typo and 'weight2' variable should be used instead of 'weight1'. clrjit lclvars.cpp 2702

The function was long, so it is shortened for the article. If we examine the code of the function, we'll see that part of it was copied, but in one fragment the programmer forgot to replace the variable weight1 with weight2.

The code of WPF samples by Microsoft project (C#):

public int Compare(GlyphRun a, GlyphRun b)
{
  ....
  if (aPoint.Y > bPoint.Y)      // <=
  {
    return -1;
  }
  else if (aPoint.Y > bPoint.Y) // <=
  {
    result = 1;
  }
  else if (aPoint.X < bPoint.X)
  {
    result = -1;
  }
  else if (aPoint.X > bPoint.X)
  {
    result = 1;
  }
  ....
}

PVS-Studio warning: V3003 The use of 'if (A) {...} else if (A) {...}' pattern was detected. There is a probability of logical error presence. Check lines: 418, 422. txtserializerwriter.cs 418

The code of PascalABC.NET project (C#):

public void CompareInternal(....)
{
  ....
  else if (left is int64_const)
    CompareInternal(left as int64_const, right as int64_const);
  ....
  else if (left is int64_const)
    CompareInternal(left as int64_const, right as int64_const);
  ....
}

PVS-Studio warning: V3003 The use of 'if (A) {...} else if (A) {...}' pattern was detected. There is a probability of logical error presence. Check lines: 597, 631. ParserTools SyntaxTreeComparer.cs 597

The code of SharpDevelop project (C#):

public int Compare(SharpTreeNode x, SharpTreeNode y)
{
  ....
  if (typeNameComparison == 0) {
    if (x.Text.ToString().Length < y.Text.ToString().Length)
      return -1;
    if (x.Text.ToString().Length < y.Text.ToString().Length)
      return 1;
  }
  ....
}

PVS-Studio warning: V3021 There are two 'if' statements with identical conditional expressions. The first 'if' statement contains method return. This means that the second 'if' statement is senseless NamespaceTreeNode.cs 87

The code of Coin3D (C++):

int
SbProfilingData::operator == (const SbProfilingData & rhs) const
{
  if (this->actionType != rhs.actionType) return FALSE;
  if (this->actionStartTime != rhs.actionStopTime) return FALSE;
  if (this->actionStartTime != rhs.actionStopTime) return FALSE;
  ....
}

PVS-Studio warning: V649 There are two 'if' statements with identical conditional expressions. The first 'if' statement contains function return. This means that the second 'if' statement is senseless. Check lines: 1205, 1206. sbprofilingdata.cpp 1206

The code of Spring (C++):

bool operator < (const aiFloatKey& o) const
  {return mTime < o.mTime;}
bool operator > (const aiFloatKey& o) const
  {return mTime < o.mTime;}

PVS-Studio warning: V524 It is odd that the body of '>' function is fully equivalent to the body of '<' function. assimp 3dshelper.h 470

And here is the last, particularly interesting code fragment that PVS-Studio analyzer found in MySQL project (C++).

static int rr_cmp(uchar *a,uchar *b)
{
  if (a[0] != b[0])
    return (int) a[0] - (int) b[0];
  if (a[1] != b[1])
    return (int) a[1] - (int) b[1];
  if (a[2] != b[2])
    return (int) a[2] - (int) b[2];
  if (a[3] != b[3])
    return (int) a[3] - (int) b[3];
  if (a[4] != b[4])
    return (int) a[4] - (int) b[4];
  if (a[5] != b[5])
    return (int) a[1] - (int) b[5]; // <=
  if (a[6] != b[6])
    return (int) a[6] - (int) b[6];
  return (int) a[7] - (int) b[7];
}

PVS-Studio warning: V525 The code containing the collection of similar blocks. Check items '0', '1', '2', '3', '4', '1', '6' in lines 680, 682, 684, 689, 691, 693, 695. sql records.cc 680

Most likely, the programmer wrote the first comparison, then the second, and got bored. So he copied a block of text to the clipboard:

if (a[1] != b[1])
  return (int) a[1] - (int) b[1];

And pasted it into the text of the program as many times as he needed. Then he changed the indexes, but made a mistake in one place and got an incorrect comparison:

if (a[5] != b[5])
  return (int) a[1] - (int) b[5];

Note. I discuss this error in more detail in my mini-book "The Ultimate Question of Programming, Refactoring, and Everything" (see a chapter "Don't do the compiler's job").

Pattern: Equals Method Incorrectly Processes a Null Reference

In C#, the accepted practice is to implement Equals methods in such a way that they correctly handle the situation where a null reference is passed as an argument. Unfortunately, not all methods are implemented according to this rule.

The code of GitExtensions (C#):

public override bool Equals(object obj)
{
  return GetHashCode() == obj.GetHashCode(); // <=
}

PVS-Studio warning: V3115 Passing 'null' to 'Equals(object obj)' method should not result in 'NullReferenceException'. Git.hub Organization.cs 14

The code of PascalABC.NET project (C#):

public override bool Equals(object obj)
{
  var rhs = obj as ServiceReferenceMapFile;
  return FileName == rhs.FileName;
}

PVS-Studio warning: V3115 Passing 'null' to 'Equals' method should not result in 'NullReferenceException'. ICSharpCode.SharpDevelop ServiceReferenceMapFile.cs 31

Miscellaneous Errors

The code of G3D Content Pak project (C++):

bool Matrix4::operator==(const Matrix4& other) const {
  if (memcmp(this, &other, sizeof(Matrix4) == 0)) {
    return true;
  }
  ...
}

PVS-Studio warning: V575 The 'memcmp' function processes '0' elements. Inspect the 'third' argument. graphics3D matrix4.cpp 269

One closing parenthesis is placed incorrectly. As a result, the number of bytes to compare is evaluated by the expression sizeof(Matrix4) == 0. The size of any class is greater than 0, which means the result of the expression is 0. Thus, 0 bytes get compared.

Correct variant:

if (memcmp(this, &other, sizeof(Matrix4)) == 0) {

The code of Wolfenstein 3D project (C++):

inline int operator!=( quat_t a, quat_t b )
{
  return ( ( a.x != b.x ) || ( a.y != b.y ) ||
           ( a.z != b.z ) && ( a.w != b.w ) );
}

PVS-Studio warning: V648 Priority of the '&&' operation is higher than that of the '||' operation. math_quaternion.h 167

Apparently, in one fragment the && operator was accidentally written instead of ||.

The code of FlightGear project (C):

static int tokMatch(struct Token* a, struct Token* b)
{
  int i, l = a->strlen;
  if(!a || !b) return 0;
  ....
}

PVS-Studio warning: V595 The 'a' pointer was utilized before it was verified against nullptr. Check lines: 478, 479. codegen.c 478

If we pass NULL as the first argument to the function, we'll get a null pointer dereference, although the programmer wanted the function to return 0.

The code of WinMerge project (C++):

int TimeSizeCompare::CompareFiles(int compMethod,
                                  const DIFFITEM &di)
{
  UINT code = DIFFCODE::SAME;
  ...
  if (di.left.size != di.right.size)
  {
    code &= ~DIFFCODE::SAME;
    code = DIFFCODE::DIFF;
  }
  ...
}

PVS-Studio warning: V519 The 'code' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 79, 80. Merge timesizecompare.cpp 80

The code of ReactOS project (C++):

#define IsEqualGUID(rguid1, rguid2) \
  (!memcmp(&(rguid1), &(rguid2), sizeof(GUID)))

static int ctl2_find_guid(....)
{
  MSFT_GuidEntry *guidentry;
  ...
  if (IsEqualGUID(guidentry, guid)) return offset;
  ...
}

PVS-Studio warning: V512 A call of the 'memcmp' function will lead to underflow of the buffer 'guidentry'. oleaut32 typelib2.c 320

A pointer is passed here as the first argument. As a result, the macro takes the address of the pointer variable itself and compares sizeof(GUID) bytes starting from it, which makes no sense.

Correct variant:

if (IsEqualGUID(*guidentry, guid)) return offset;

The code of IronPython and IronRuby project (C#):

public static bool Equals(float x, float y) {
  if (x == y) {
    return !Single.IsNaN(x);
  }
  return x == y;
}

PVS-Studio warning: V3024 An odd precise comparison: x == y. Consider using a comparison with defined precision: Math.Abs(A - B) < Epsilon. FloatOps.cs 1048

It's not clear what the point of the special check against NaN is here. If the condition (x == y) is true, it means that both x and y are different from NaN, because NaN isn't equal to any value, including itself. It seems that the check against NaN is simply unnecessary, and the code can be shortened to:

public static bool Equals(float x, float y) {
  return x == y;
}

The code of Mono project (C#):

public bool Equals (CounterSample other)
{
  return
    rawValue         == other.rawValue         &&
    baseValue        == other.counterFrequency &&   // <=
    counterFrequency == other.counterFrequency &&   // <=
    systemFrequency  == other.systemFrequency  &&
    timeStamp        == other.timeStamp        &&
    timeStamp100nSec == other.timeStamp100nSec &&
    counterTimeStamp == other.counterTimeStamp &&
    counterType      == other.counterType;
}

PVS-Studio warning: V3112 An abnormality within similar comparisons. It is possible that a typo is present inside the expression 'baseValue == other.counterFrequency'. System-net_4_x CounterSample.cs 139

How Do these Programs Work at all?

Looking through all these errors, it seems miraculous that these programs work at all. Indeed, comparison functions perform a very important and responsible task in a program.

There are several explanations of why these programs work despite these errors:

  1. In a lot of functions, only part of the object is compared incorrectly. Partial comparison is sufficient for most of the tasks in the program.
  2. There are no situations (yet) in which the function works incorrectly. For example, this applies to the functions that aren't protected from null pointers, or those where the result of a memcmp call is placed into a variable of char type. The program is simply lucky.
  3. The reviewed comparison function is used very rarely or not used at all.
  4. Who said that the program is working? A lot of programs really do something wrong!

Recommendations

I have demonstrated how many errors can be found in comparison functions. It follows that the correctness of these functions should by all means be checked with unit tests.

It is really necessary to write unit-tests for the comparison operators, for Equals functions and so on.
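
As a minimal illustration of what such tests can look like (a hypothetical Point type, not code from any of the projects above), even a handful of property checks (reflexivity, symmetry, and sensitivity to every field) would catch most of the patterns shown in this article:

#include <cassert>

// Hypothetical type with a handwritten comparison operator.
struct Point
{
  int x;
  int y;
  bool operator==(const Point& r) const { return x == r.x && y == r.y; }
};

int main()
{
  Point a = {1, 2};
  Point b = {1, 2};
  Point c = {1, 3};

  assert(a == a);             // reflexivity
  assert(a == b && b == a);   // symmetry
  assert(!(a == c));          // differs only in 'y': this is exactly the check that
                              // fails when one member is forgotten or compared to itself
  return 0;
}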

I am quite sure that, before reading this article, many programmers believed that unit tests for such functions are extra work and won't detect any errors anyway: comparison functions just look so simple at first glance... Well, now I have shown the horror that can hide in them.

Code reviews and using static analysis tools would also be a great help.

Conclusion

In this article we looked at a large number of big-name projects that are developed by highly qualified experts. These projects are thoroughly tested using different methodologies. Still, that didn't stop PVS-Studio from finding errors in them. This shows that PVS-Studio can become a nice complement to other methodologies used to improve the quality and reliability of code.


Nightdive turns games of the past into a bright future…virtually


Many game companies open up an office space, get a development team together to work in that office, grind away for a couple of years to create a new intellectual property (IP), then put the product up for sale through retail outlets and digital-distribution sites, such as Steam. Hopefully, profit follows, so they can do it all over again.

Nightdive Studios, on the other hand, took a drastically different path, and its website reveals that core mission: “Bringing lost and forgotten gaming treasures back from the depths…”

By acquiring the rights to already-released games, updating them to work on contemporary platforms, and offering the revamped games through direct-distribution outlets, Nightdive can avoid having to lease office space, and it doesn’t need to employ dozens of local employees to facilitate the work. The development company operates a virtual office environment, which means the people involved in updating and coding the games don’t need to move from their respective countries, or even their homes. All of that contributes to Nightdive’s profits, which the studio uses to, indeed, do it all over again…and again…and again.

A shocking trip

Nightdive was founded in late 2012 by Stephen Kick, now Nightdive’s CEO. Back then, Kick was a character artist with Sony Online Entertainment, but was getting a little tired of making games for others. He decided to embark on a world trip to find new inspiration, and, like many travelers, he brought some games with him — in this case, some classics from his youth.

Stephen Kick. Image Credit: Nightdive Studios.

"One night, I was playing — or attempting to play — System Shock 2, and I couldn’t get the game running,” Kick explains. “I went online, attempted to purchase the game (on GOG.com), and I discovered there was no legal way to commercially buy the product. So, I did some digging, and discovered that the IP had been transferred to an insurance company after Looking Glass Studios had gone out of business. I approached [the insurance company] about digitally re-releasing the game on GOG, Steam, and other digital platforms, and that was pretty much the birth of Nightdive Studios."

Kick says the success with the System Shock 2 re-release was the first step for the newborn company, but it quickly led to “finding other games that were lost to time,” and following the same procedure to bring them back to market. As the classic song goes, “Everything old is new again,” and Nightdive is proving that to be quite true with its retro games. The studio has over 100 products in its catalog, available on Steam, GOG, and Humble Bundle's Humble Store, including The 7th Guest, Shadow Man, Space Rogue, and the Wizardry series.

"Our inspiration really lies in all the games that we grew up with and that we remember fondly," Kick says, "And our desire to replay those games, preserve them for future generations to enjoy, and just to continue, I guess, the stewardship of making sure these games are available for everybody to play again."

Out of the fog

In March 2017, Nightdive brought out its latest release: Turok 2: Seeds of Evil. This first-person shooter debuted in 1998 on the Nintendo 64 console, courtesy of Acclaim Entertainment, and was ported to Windows a year later. Nightdive has already released its Turok 2 update on PC, and is also working on a port to the Xbox One console.

Split-screen multiplayer action in Turok 2. Image Credit: Nightdive Studios

One of Nightdive's goals was to make Turok 2 playable on almost any PC, so that players on a wide variety of systems can still enjoy a stable game with high visual fidelity.

"It’s interesting…we worked in cooperation with Intel, using their toolsets; Intel provides a variety of different software tools to optimize your game performance," says Larry Kuperman, Nightdive’s director of business development. "One of the things we found with the Intel set, we were able to make sure that [Turok 2] would play on the widest spectrum of computers available, so that if you wanted to fire up Turok on your laptop on the way home, it would play smoothly."

Another change Kuperman points out has to do with the game’s viewing distance. Because of the constraints of the processors in the late ’90s, the original game developers used fog to limit the distance the player could see ahead, which enabled them to provide highly detailed graphics at a relatively short distance. However, nearly 20 years on, with the increase in CPU and graphics-card power, the distance-limiting fog is no longer needed.

Larry Kuperman. Image Credit: Nightdive Studios

"We were able to roll back the fog, and give the game a whole new visual treatment,” Kuperman explains. “These are not games that are intended to compete with the highest-end, highest-requirement games out there, but, visually, they’re certainly appealing."

Another Nightdive development team is working on a reboot of System Shock. Nightdive has managed to acquire full rights to the game, so the studio is rebuilding it from the ground up using the Unreal Engine.

"The ultimate goal for us acquiring the license,” Kick says, “is to be able to reintroduce the franchise to the current generation of gamers. That really kicked off around the end of June [2016], when we launched our Kickstarter. We were able to raise 150% of our goal for a total of $1.35 million in order to faithfully reboot the first game in the series."

Their virtual reality

Nightdive’s virtual office environment means that the studio has people all around the world working on projects. As Kick explains, this means that development happens on pretty much a 24/7 basis, with tools (such as GitHub, JIRA, and Slack) enabling collaboration and communication across the team. Software enables managers to track each person’s contribution to make sure everyone is generating what they need to for the project. Kick bemoans some of the tradeoffs — such as the lack of in-office socializing and camaraderie — but Kuperman counters that the distributed office means there are no complaints that a co-worker cracks his knuckles or plays her music too loudly.

Kuperman feels that this is a great time to be in game development, with changes to the creation process enabling end-to-end benefits. With crowdfunding platforms, such as Kickstarter and Fig, it’s easier for a studio to work on a project without needing to make a deal (and share future revenue) with a publisher. Game engines, such as Unity and Unreal, are incredibly powerful, but also free to use until you start selling the product you’ve created. And there are a bunch of digital-sales platforms on which to retail a product, so a developer can self-publish quite easily. Even if the developer opts to work with a publisher to bring a product to market, Kuperman says there are still benefits from those tools.

"A developer can be relatively self-sufficient and come to the publisher, saying 'Look at what I’ve produced so far. Is this something that you’d be interested in?' So you have all those things out there — you have a very robust ecosystem for games development now."

How F5 Networks Profiles for Success


When Seattle-based F5 Networks, Inc. needed to amp up its BIG-IP DNS* solution for developers, it got help from Intel.

Business users expect their applications to be fast, secure, and always available. Anything less is unacceptable. That’s why F5 gives the developers who build those applications the tools they need to deliver maximum speed, security, and availability.

The company’s BIG-IP DNS improves the performance and availability of applications by sending users to the closest or best-performing physical, virtual, or cloud environment. It also hyperscales and secures developers’ domain name service (DNS) infrastructure from distributed denial of service (DDoS) attacks and delivers a real-time domain name system security extensions (DNSSEC) solution that protects against hijacking.

“Intel® VTune™ Amplifier helped us identify potential performance bottlenecks in the design and engineering of our high-performance networking systems,” explained James Hendergart, strategic initiatives director for F5 Networks. “We worked with the Intel VTune Amplifier team for about a month. They were very responsive to our needs, adding the capability to run Intel VTune Amplifier remotely and in headless environments. It was a great collaboration between Intel and F5.”

Get the whole story in our new case study.

Tall-and-Skinny and Short-and-Wide Optimizations for QR and LQ Decompositions


Intel® Math Kernel Library (Intel® MKL) 2017 update 3 and later versions provide optimized functionality for calculating QR decompositions of tall-and-skinny (TS) matrices and LQ decompositions of short-and-wide (SW) matrices.

New routines have been added to Intel MKL to allow QR and LQ factorizations to be calculated using the TS/SW modifications described above for appropriate matrix sizes. These routines are generalized for all sizes (i.e., they will also work on matrices that are not TS/SW, as they include paths that fall back to the generic routines when the matrix size is not sufficiently TS/SW). Details of the new routines and parameter specifications can be found in the Intel MKL Developer Reference (https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation). The routines to reference are listed below:

New TS/SW Routines

QR Decomposition
  • ?geqr
  • ?gemqr

LQ Decomposition
  • ?gelq
  • ?gemlq

Generic Routines

QR Decomposition
  • ?geqrf
  • ?ormqr (real)
  • ?unmqr (complex)

LQ Decomposition
  • ?gelqf
  • ?ormlq (real)
  • ?unmlq (complex)


A general overview of the TSQR algorithm is provided in the attached TSKB_QRLQ.pdf file. In addition, this PDF provides example code that calls the QR decomposition of a matrix using the new TSQR routines.
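
The attached PDF contains the authoritative example; purely for orientation, a minimal sketch of the call sequence might look like the following. It assumes column-major storage and the lower-case, underscore-decorated Fortran symbols (dgeqr_ and dgemqr_) that Intel MKL commonly exports on Linux; the exact name decoration and link line depend on your platform and compiler.

#include <cstddef>
#include <cstdio>
#include <vector>

// Assumed Fortran-interface prototypes; the decoration may differ on your system.
extern "C" {
void dgeqr_(const int* m, const int* n, double* a, const int* lda,
            double* t, const int* tsize, double* work, const int* lwork, int* info);
void dgemqr_(const char* side, const char* trans, const int* m, const int* n,
             const int* k, const double* a, const int* lda, const double* t,
             const int* tsize, double* c, const int* ldc,
             double* work, const int* lwork, int* info);
}

int main()
{
  const int m = 10000, n = 16;                                  // tall-and-skinny: m >> n
  std::vector<double> A(static_cast<std::size_t>(m) * n, 1.0);  // column-major input matrix
  std::vector<double> rhs(m, 1.0);                              // vector to apply Q^T to
  int info = 0, minus1 = -1;

  // Workspace query: TSIZE = LWORK = -1 returns the required sizes in t[0] and work[0].
  double tQuery[5] = {0}, wQuery = 0;
  dgeqr_(&m, &n, A.data(), &m, tQuery, &minus1, &wQuery, &minus1, &info);
  int tsize = static_cast<int>(tQuery[0]);
  int lwork = static_cast<int>(wQuery);
  std::vector<double> T(tsize), work(lwork);

  // Factor A = Q*R: R is returned in the upper triangle of A, Q is held implicitly in A and T.
  dgeqr_(&m, &n, A.data(), &m, T.data(), &tsize, work.data(), &lwork, &info);
  if (info != 0) { std::printf("dgeqr failed, info = %d\n", info); return 1; }

  // Apply Q^T to the right-hand side, e.g. as the first step of a least-squares solve.
  const char side = 'L', trans = 'T';
  const int nrhs = 1;
  dgemqr_(&side, &trans, &m, &nrhs, &n, A.data(), &m, T.data(), &tsize,
          rhs.data(), &m, &wQuery, &minus1, &info);             // workspace query
  lwork = static_cast<int>(wQuery);
  work.resize(lwork);
  dgemqr_(&side, &trans, &m, &nrhs, &n, A.data(), &m, T.data(), &tsize,
          rhs.data(), &m, work.data(), &lwork, &info);
  if (info != 0) { std::printf("dgemqr failed, info = %d\n", info); return 1; }

  std::printf("TS QR factorization and Q^T*b application completed\n");
  return 0;
}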

The following charts show the speedup of DGEQR compared to DGEQRF. Performance results of ?GELQ compared to ?GELQF show similar speedups and are therefore not displayed here.

The first chart shows these speedups on an Intel® Xeon® CPU E5-2699 v4 processor, and the second on an Intel® Xeon Phi™ 7250 processor.

Tutorial: Unlock Intel® GPU capabilities with Intel OpenCL™ Extensions


Download tutorial code here.

Based on an IWOCL 2017 tutorial Unlock Intel GPUs for High Performance Compute, Media and Computer Vision.  

 

Introduction

Intel provides many extensions to the Khronos OpenCL™ standard to help you utilize the full range of hardware capabilities.

  • Subgroups
  • Video Motion Estimation (VME)
  • VEBox

These extensions are not standalone.  They build upon each other.

 

The tutorial code focuses on subgroups, VME, and VEBox. Image processing and sharing extensions are also used in the tutorial code as solution components.

For more information on Intel extensions: https://software.intel.com/en-us/articles/opencl-intel-graphics-extensions

 

Subgroups

Intel subgroups

  • are a subset of a work group
  • are equal in size to the SIMD width (8, 16, or 32)
  • run in the same hardware thread of the EU
  • share thread resources (including register space)
  • execute together

Intel subgroup functions add

  • barrier, broadcast, reduce, scan 
  • shuffle
  • block read/write

More info: Spec

 

Video Motion Estimation (VME)

Intel Gen GPUs accelerate the search for motion in video.  This is a core codec component but can also be used in a wide range of applications from custom bitrate control to computer vision.

 

VEBox

Intel GPUs contain a specialized IP block designed for video enhancement operations.

 

For more info:

 

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

 

Accelerating Deep Learning Inference with Intel® Processor Graphics


Introduction

This paper introduces the tools recently made available to accelerate AI inference in edge devices on Intel® Processor Graphics solutions across the spectrum of Intel SOCs. In particular, the paper covers Intel’s Deep Learning Deployment Toolkit and how it helps to increase the performance, and perhaps even more importantly the performance per watt, of AI inference in your product. The paper also introduces the underlying Compute Library for Deep Neural Networks (clDNN), a library of neural network kernel optimizations written in OpenCL and available in open source.

Target audience: Software developers, platform architects, and academics seeking to maximize deep learning performance on Intel® Processor Graphics.

Note: Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) are used interchangeably in this paper. The larger field is artificial intelligence. This article focuses on the Machine Learning piece of AI, or more specifically on the multi-layered neural network form of Machine Learning called Deep Learning.

Background on AI and the Move to the Edge

Artificial Intelligence, or AI, has been a domain of research with fits and starts over the last 60 years. AI has really taken off in the last 5 years with the availability of large data sources, growth in compute engines, and modern algorithm development based on neural networks. Machine learning, and the many layers of deep learning, are propelling AI into all parts of modern life as they are applied to varied usages, from computer vision, identification, and classification to natural language processing and forecasting. These base-level tasks support decision making in many areas of life.

As data scientist Andrew Ng noted, AI is the next electricity: “Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.”

This wave of AI work began in the cloud, running on servers. While AI usage in the cloud continues to grow quickly, there is a trend toward performing AI inference on the edge. This trend toward devices performing machine learning locally, rather than relying solely on the cloud, is driven by the need for lower latency, persistent availability, and lower cost, as well as by privacy concerns. We are moving toward the day when devices from phones and PCs to cars, robots, and drones, to embedded devices like refrigerators and washing machines, will all have AI embedded in them. As Andrew Ng pointed out, companies in all industries are figuring out their AI strategy. Additionally, the field of AI is changing rapidly, with novel topologies being introduced on a weekly basis. This requires product developers to design for flexibility so that the AI software in their products can be modified frequently.

Intel® Processor Graphics as a Solution for AI Inference on the Edge

Intel Processor Graphics (Intel® HD Graphics, Intel® Iris® Graphics and Intel® Iris® Pro Graphics) provides a good balance of fixed function acceleration with programmability to deliver good performance/power across the emerging AI workloads with the flexibility to allow customers to adopt the latest AI topologies. Specifically, Intel® Processor Graphics provides the characteristics of:

Ubiquity – Intel Processor Graphics, as part of Intel’s SOCs, has already shipped in over a billion devices ranging from servers to PCs to embedded devices. This makes it a widely available engine for running machine learning algorithms.

Scalability – As AI becomes embedded in every product, the design points of power and performance will vary greatly. Intel Processor Graphics is available in a broad set of power/performance offerings, from Intel® Atom™ processors to Intel® Core™ processors and Intel® Xeon® processors.

Leadership in Media – More than 70% of internet traffic is video. One of the top usages for AI in devices will be computer vision. Along with compute for AI, encoding, decoding, and processing of video will be employed concurrently. Intel® Quick Sync Video technology is based on the dedicated media capabilities of Intel® Processor Graphics to improve the performance and power efficiency of media applications, specifically speeding up functions like decode, encode, and video processing. See Intel’s Quick Sync Video page to learn more. This is paired with the Intel® Media SDK and Intel® Media Server Studio, APIs that provide access to hardware-accelerated codecs on Windows* and Linux*.

Powerful and Flexible Instruction Set Architecture (ISA) - The Instruction Set Architecture (ISA) of the Processor Graphics SIMD execution units is well suited to Deep Learning. This ISA offers rich data type support for 32bitFP, 16bitFP, 32bitInteger, 16bitInteger with SIMD multiply-accumulate instructions. At theoretical peak, these operations can complete on every clock for every execution unit. Additionally, the ISA offers rich sub register region addressing to enable efficient cross lane sharing for optimized convolution implementations, or efficient horizontal scan-reduce operations. Finally, the ISA provides efficient memory block loads to quickly load data tiles for optimized convolution or optimized generalized matrix multiply implementations.

Memory architecture – When using discrete graphics acceleration for deep learning, input and output data have to be transferred from system memory to discrete graphics memory on every execution, which has the double cost of increased latency and power. Intel® Processor Graphics is integrated on-die with the CPU. This integration enables the CPU and Processor Graphics to share system memory, the memory controller, and portions of the cache hierarchy. Such a shared memory architecture enables efficient input/output data transfer and even “zero copy” buffer sharing. Additionally, Intel has SKU offerings with additional package-integrated eDRAM.

Intel’s Deep Learning Deployment Toolkit

To utilize the hardware resources of Intel® Processor Graphics easily and effectively, Intel provides the Deep Learning Deployment Toolkit. This toolkit takes a trained model and tailors it to run optimally for specific endpoint device characteristics. In addition, it delivers a unified API to integrate inference with application logic.

The Deep Learning Deployment Toolkit comprises two main components: the Model Optimizer and the Inference Engine (Figure 1).  

Figure 1: Model flow through the Deep Learning Deployment Toolkit

Model Optimizer is a cross-platform command line tool that performs static model analysis and adjusts deep learning models for optimal execution on end-point target devices. In detail, the Model Optimizer:

  • Takes as input a trained network in a framework specific format (for example from the Caffe* framework)
  • Performs horizontal and vertical fusion of the network layers
  • Prunes unused branches in the network
  • Quantizes weights
  • Produces as output an Internal Representation (IR) of the network - a pair of files that describe the whole model:
    • Topology file - an XML file that describes the network topology
    • Trained data file - a .bin file that contains the weights and biases binary data

The produced IR is used as an input for the Inference Engine.

Inference Engine is a runtime that delivers a unified API to integrate the inference with application logic. Specifically it:

  • Takes as input an IR produced by the Model Optimizer
  • Optimizes inference execution for target hardware
  • Delivers inference solution with reduced footprint on embedded inference platforms.

The Deep Learning Deployment Toolkit can optimize inference for running on different hardware units such as the CPU and GPU, and will support FPGAs in the future. For acceleration on the CPU it uses the MKL-DNN plugin, based on the deep neural network domain of the Intel® Math Kernel Library, which includes the functions necessary to accelerate the most popular image recognition topologies. FPGA support is planned through a plugin for the Intel® Deep Learning Inference Accelerator. For the GPU, the Deep Learning Deployment Toolkit has clDNN, a library of OpenCL kernels. The next section explains how clDNN helps to improve inference performance.

Compute Library for Deep Neural Networks (clDNN)

clDNN is a library of kernels to accelerate deep learning on Intel® Processor Graphics. Based on OpenCL, these kernels accelerate many of the common function calls in the popular topologies (AlexNet*, VGG*, GoogleNet*, ResNet*, Faster-RCNN*, SqueezeNet* and FCN* are supported today with more being added). To give developers the greatest flexibility and highest achievable performance Intel is delivering:

1) The full library as open source so developers and customers can use existing kernels as models to build upon or create their own hardware specific kernels running deep learning. 

2) Compute extensions to expose the full hardware capabilities to developers.

During network compilation, clDNN breaks the workflow optimizations into three stages, described below.

Figure 2: Model flow from topology creation to execution

Network Compilation and the 3 Stages of clDNN

Stage 1:  Network Level

Fusing is one of the most efficient ways to optimize graphs in DL. In clDNN, we have created two ways to perform fusing: a more automated one for running on a single accelerator (naive inference client), and a second one that lets a more experienced data scientist tune the work to run across multiple accelerators (set of fused primitives). In more detail:

  • Naive inference client – you have a workload and want it to run on one accelerator. In this case the user can ask clDNN to perform fusing during network compilation.
  • Set of fused primitives – in this approach, a user who is experienced in tuning models does the graph compilation with pattern matching in the application to balance the work across various accelerators. For this approach we expose already-fused primitives.

Currently clDNN supports three fused primitives: convolution with activation, fully connected with activation, and deconvolution with activation. Additional fusions are in development. The sketch below illustrates why fusing pays off.
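To see why fusing helps, consider convolution followed by ReLU. The following plain C++ sketch is conceptual only - it is not clDNN source, and the real kernels implement the same idea in OpenCL - but it contrasts a separate activation pass with an activation applied while the result is still in a register:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Unfused: the convolution writes its output to memory, then a second pass
// reads it back just to apply ReLU - two full trips through memory.
// `out` must hold at least in.size() - w.size() + 1 elements.
void conv_then_relu(const std::vector<float>& in, const std::vector<float>& w,
                    std::vector<float>& out) {
    const size_t k = w.size();
    for (size_t i = 0; i + k <= in.size(); ++i) {
        float acc = 0.f;
        for (size_t j = 0; j < k; ++j) acc += in[i + j] * w[j];
        out[i] = acc;                                   // write #1
    }
    for (float& v : out) v = std::max(v, 0.f);          // read + write #2
}

// Fused: ReLU is applied while the accumulator is still in a register,
// so the output is written exactly once and never re-read.
void conv_relu_fused(const std::vector<float>& in, const std::vector<float>& w,
                     std::vector<float>& out) {
    const size_t k = w.size();
    for (size_t i = 0; i + k <= in.size(); ++i) {
        float acc = 0.f;
        for (size_t j = 0; j < k; ++j) acc += in[i + j] * w[j];
        out[i] = std::max(acc, 0.f);                    // single write
    }
}
```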

Another part of the network-level optimizations is the padding implementation. Using OpenCL buffers as data storage requires padding to be handled either by adding boundary conditions inside the kernels or by providing a buffer with a frame around the input data. The first approach would consume the full register budget, constraining the registers available to the convolution kernels and negatively impacting performance.

Experiments have shown that adding a properly aligned frame around the buffers gives better performance, when it is done as follows:

Consider a network with two primitives, A and B, where B requires padding equal to 2:

Figure 3: Padding Example

This requires adding a frame of size 2x2 around the data consumed by B. To add the frame, clDNN inserts a reorder primitive ahead of B and then fuses that reorder with the A primitive, so the frame is produced without an extra pass over memory.
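A minimal sketch of the frame idea in plain C++ (conceptual only, not the clDNN implementation): the data is copied once into a larger, zero-filled buffer, so the consuming kernel can read past the nominal borders without any boundary checks.

```cpp
#include <vector>

// Copy a width x height image into a buffer with a zero frame of `pad`
// elements on every side; the consuming kernel can then read positions
// in [-pad, width + pad) and [-pad, height + pad) unconditionally.
std::vector<float> add_frame(const std::vector<float>& src,
                             int width, int height, int pad) {
    const int pw = width + 2 * pad;
    const int ph = height + 2 * pad;
    std::vector<float> dst(static_cast<size_t>(pw) * ph, 0.f);  // frame stays zero
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[(y + pad) * pw + (x + pad)] = src[y * width + x];
    return dst;
}
```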

Stage 2: Memory Level

As soon as the topology is defined and data is provided, the network is ready to compile. The first step of network compilation is determining the activation layout. In DNNs, the data stored in hidden layers is defined as 4D memory chunks. In clDNN, the layout description is defined with four letters:

  • B - number of images in the batch
  • F - number of feature maps or channels
  • X - spatial dimension (width)
  • Y - spatial dimension (height)

Figure 4: Example of a memory chunk

Figure 5: For most cases the optimal layout is BFYX

If the data type is half precision (fp16), the batch size is greater than or equal to 32, and the convolutions use the split parameter (depth split, as in AlexNet* convolutions), then clDNN uses the YXFB layout.

Figure 6: YXFB layout
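The two layouts differ only in which dimension varies fastest in memory. Below is a small sketch of the corresponding linear offsets, assuming dense, unpadded buffers:

```cpp
#include <cstddef>

// Offset of element (b, f, y, x) in a dense BFYX buffer:
// x varies fastest, then y, then f, then b.
inline size_t offset_bfyx(size_t b, size_t f, size_t y, size_t x,
                          size_t F, size_t Y, size_t X) {
    return ((b * F + f) * Y + y) * X + x;
}

// Offset of the same element in a dense YXFB buffer:
// b varies fastest, which keeps the whole batch of one activation contiguous -
// useful when fp16 kernels process many images of the batch together.
inline size_t offset_yxfb(size_t b, size_t f, size_t y, size_t x,
                          size_t B, size_t F, size_t X) {
    return ((y * X + x) * F + f) * B + b;
}
```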

During memory-level optimization, after kernels for every primitive have been chosen, clDNN runs weight optimizations, which transform the user-provided weights into a form suitable for the chosen kernel. Weights for convolutions are stored in:

Figure 7: Weights for convolutions in IS_IYX_OSV16

For fully connected layers, depending on the data type (fp16/fp32), the weights can be transformed into one of the following:

Figure 8: Memory layouts for optimized fully connected primitives

Stage 3: Kernel Level

To enable modern topologies efficiently on Intel® Processor Graphics, the convolution implementation needs particular attention. clDNN uses output blocks that let each thread on Intel® Processor Graphics compute more than one output at a time. The size of the block depends on the convolution stride. If the block size is greater than the stride, clDNN uses shuffle technology to reuse weights and inputs within the neighborhood. This approach yields 85% of peak performance on AlexNet* convolution kernels. All reads and writes use the more efficient block_read/block_write functions. A similar approach is applied to achieve high efficiency in the deconvolution and pooling primitives.
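A simplified illustration of the output-block approach: each work item produces a small block of neighboring outputs, so every loaded weight (and most loaded inputs) is reused several times instead of once. The real clDNN kernels do this in OpenCL with block_read/block_write and sub-group shuffles; this plain C++ sketch only shows the reuse pattern.

```cpp
#include <vector>

constexpr int BLOCK = 4;  // outputs computed per "work item" in this sketch

// 1D convolution (stride 1) where each iteration of the outer loop plays the
// role of one GPU thread producing BLOCK adjacent outputs. Every weight w[j]
// is loaded once and reused for all BLOCK accumulators.
// Requires in.size() >= out.size() + w.size() - 1; tail outputs that do not
// fill a whole block are omitted for brevity.
void conv1d_blocked(const std::vector<float>& in, const std::vector<float>& w,
                    std::vector<float>& out) {
    const int k = static_cast<int>(w.size());
    const int n = static_cast<int>(out.size());
    for (int o = 0; o + BLOCK <= n; o += BLOCK) {
        float acc[BLOCK] = {};
        for (int j = 0; j < k; ++j) {
            const float wj = w[j];               // one weight load ...
            for (int b = 0; b < BLOCK; ++b)
                acc[b] += in[o + b + j] * wj;    // ... reused BLOCK times
        }
        for (int b = 0; b < BLOCK; ++b)
            out[o + b] = acc[b];                 // block write of the results
    }
}
```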

Performance Numbers:

Intel® Iris® Pro Graphics provides higher peak performance, while Intel® HD Graphics provides better performance per watt.

Details:

Batch1 FP16

Intel® HD Graphics 530 (blue) configuration: Intel® Core™ i5-6500 CPU @ 3.20GHz, Intel® HD Graphics 530, fixed frequency - 1000 MHz, CentOS 7.2 kernel 4.2, OpenCL driver: Intel SRB 4.1, Memory: 2x8GB DDR4 2133

Intel® Iris® Pro Graphics 580 (orange) configuration: Intel® Core™ i7-6770HQ CPU @ 2.60GHz, Intel® Iris® Pro Graphics 580, fixed frequency - 950 MHz, CentOS 7.2 kernel 4.2, OpenCL driver: Intel SRB 4.1, Memory: 2x4GB DDR4 2133

Topologies: AlexNet*, VGG16-FACE*

Memory Bandwidth vs Compute

In topologies with memory-bound sequences (like AlexNet*), we can increase the batch size, reusing the weights across the batch to achieve higher images/second throughput. But for topologies that are compute bound (like VGG16-FACE*), which keep the compute units busy even with a single input image, we see little benefit from larger batch sizes.

The systems used for these measurements are configured in the same way as in the previous pair of benchmarks.
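The difference can be reasoned about through arithmetic intensity: in a fully connected layer the weights are streamed once per batch, so the FLOPs performed per weight byte grow with the batch size. Below is a rough back-of-the-envelope sketch; the layer size is illustrative and not taken from the measured topologies.

```cpp
#include <cstdio>

int main() {
    // Hypothetical fully connected layer: 4096 inputs -> 4096 outputs, fp16 weights.
    const double in_features  = 4096;
    const double out_features = 4096;
    const double weight_bytes = in_features * out_features * 2.0;  // fp16 = 2 bytes

    const int batches[] = {1, 8, 32};
    for (int batch : batches) {
        // One multiply-accumulate counts as 2 FLOPs; weights are read once per batch.
        const double flops = 2.0 * in_features * out_features * batch;
        std::printf("batch %2d: %.1f FLOPs per weight byte\n", batch, flops / weight_bytes);
    }
    // Layers with low intensity at batch 1 are memory bound and benefit from batching;
    // compute-bound layers are already limited by FLOPs, so batching helps little.
    return 0;
}
```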

Power efficiency

In some power-constrained workloads, it can be more important to maximize performance per watt than absolute performance. Dynamic power scales roughly linearly with frequency but quadratically with voltage, and lowering the frequency also allows the voltage to be lowered, so GPU performance per watt improves as the frequency is reduced. Intel® HD Graphics can therefore show a better FPS/watt ratio when running at lower frequency in lower power states. Different Intel processor products also offer different leakage and power behavior. For example, the 6th and 7th generation Intel "Y" SKUs, such as the Intel® Core™ m7-6Y75 processor with Intel® HD Graphics 515, provide lower peak performance but better performance per watt. Through the combination of selecting the right Intel SoC across a wide range of power and performance points and choosing the appropriate frequency, the developer can tune for a broad range of workloads and power envelopes.
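A toy calculation with the simple dynamic-power model (power roughly proportional to capacitance x voltage squared x frequency, performance roughly proportional to frequency). The two operating points are purely illustrative and not taken from any datasheet, as noted in the code:

```cpp
#include <cstdio>

int main() {
    // Illustrative DVFS operating points (frequency in MHz, voltage in volts).
    // These values are hypothetical and not taken from any Intel datasheet.
    struct Point { double mhz; double volt; };
    const Point high = {1000.0, 1.00};
    const Point low  = { 600.0, 0.80};

    // Simple dynamic-power model: P ~ C * V^2 * f; performance ~ f.
    auto rel_power = [](const Point& p) { return p.volt * p.volt * p.mhz; };

    const double perf_ratio  = low.mhz / high.mhz;
    const double power_ratio = rel_power(low) / rel_power(high);
    std::printf("performance: %.2fx, power: %.2fx, perf/watt: %.2fx\n",
                perf_ratio, power_ratio, perf_ratio / power_ratio);
    return 0;
}
```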

Conclusion:

AI is becoming pervasive, driven by the huge advancements in machine learning, and particularly deep learning, over the last few years. Devices at the edge are moving toward implementing some form of AI, increasingly performed locally due to cost, latency, and privacy concerns. Intel® Processor Graphics provides a good solution for accelerating deep learning workloads. This paper described the Deep Learning Deployment Toolkit's Model Optimizer and Inference Engine and the clDNN library of optimized CNN kernels, which are available to help developers deliver AI-enabled products to market. For more information or to get started, download the tools and libraries from the links below.

Appendix A: List of Primitives in the clDNN Library

Compute Library for Deep Neural Networks (clDNN) is middleware for accelerating DNN inference on Intel® HD and Iris® Pro Graphics. The project includes CNN primitive implementations for Intel GPUs with C and C++ interfaces.

The clDNN library implements the following set of primitives:

  • Compute Primitives
    • Convolution
    • Deconvolution
    • Fully connected (inner product)
    • Element-Wise
  • Pooling
    • average
    • maximum
    • ROI pooling
  • Normalization
    • LRN across/within channel
    • Normalize
    • Batch-Normalization
  • Activation
    • rectified linear unit (ReLU)
  • Auxiliary
    • Crop
    • Concatenation
    • Simpler NMS
    • Prior box
    • Detection output
    • Reorder
  • Softmax

With this primitive set, a user can build and execute the most common image recognition, semantic segmentation, and object detection topologies, such as those listed below (a minimal sketch of assembling a topology from these primitives follows the list):

  • AlexNet*
  • GoogleNet*
  • ResNet*
  • VGG16-FACE*
  • Faster-RCNN*
  • FCN*
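The following is a minimal sketch of how such a topology is assembled and executed with the clDNN C++ interface. Header paths, constructor arguments, weight shapes, and the tensor dimension order differ between clDNN releases, so every identifier below should be checked against the version you build with; what this sketch illustrates is the flow topology -> engine -> network -> execute.

```cpp
// Hedged sketch only - verify headers and signatures against your clDNN release.
#include <api/CPP/engine.hpp>
#include <api/CPP/topology.hpp>
#include <api/CPP/network.hpp>
#include <api/CPP/memory.hpp>
#include <api/CPP/input_layout.hpp>
#include <api/CPP/data.hpp>
#include <api/CPP/fully_connected.hpp>
#include <api/CPP/softmax.hpp>

int main() {
    cldnn::engine engine;  // selects the available Intel GPU

    // Describe the input: 1 image, 1 channel, 32x32, fp32, BFYX layout.
    // The tensor dimension order here is assumed; check it for your version.
    cldnn::layout in_layout(cldnn::data_types::f32, cldnn::format::bfyx, {1, 1, 32, 32});
    auto input   = cldnn::memory::allocate(engine, in_layout);
    auto weights = cldnn::memory::allocate(engine,
        cldnn::layout(cldnn::data_types::f32, cldnn::format::bfyx, {10, 1, 32, 32}));
    auto bias    = cldnn::memory::allocate(engine,
        cldnn::layout(cldnn::data_types::f32, cldnn::format::bfyx, {1, 1, 10, 1}));

    // Build the graph from primitives, then compile it for the engine.
    cldnn::topology topology(
        cldnn::input_layout("input", in_layout),
        cldnn::data("fc_w", weights),
        cldnn::data("fc_b", bias),
        cldnn::fully_connected("fc", "input", "fc_w", "fc_b"),
        cldnn::softmax("prob", "fc"));
    cldnn::network network(engine, topology);

    // Bind real input data and run inference; results are fetched by primitive id.
    network.set_input_data("input", input);
    auto outputs = network.execute();
    auto prob = outputs.at("prob").get_memory();
    (void)prob;
    return 0;
}
```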