
Introduction to Smart Video Technologies from Intel


This paper addresses how Smart Video (SV) system architectures are increasing in complexity and evolving into new industries and use cases. It articulates the advantages of using Intel® software tools such as the Intel® Media SDK, the Intel® Computer Vision SDK, software libraries, Intel® AMT, Intel® vPro™ technology, and Intel® hardware to enable these smart video systems. The main areas this article covers are smart cameras that hardware-accelerate encoding, transcoding, and decoding; smart video gateway hardware and software; video analytics and deep learning; and integration with the Internet of Things.

Introduction to Intel® Smart Video Technologies

The next generation of Smart Video Systems has complex, integrated computer architectures that are rapidly evolving. Technologies from fields such as HD graphics, real-time video encoding and transcoding, scalable video streaming and storage, artificial intelligence, and the Internet of Things are expanding the abilities of modern Smart Video Systems.

Smart Video Systems are also expanding into new market segments such as autonomous vehicles, smart lighting, industrial machine vision, and drones. They are increasingly used in city surveillance, retail interactions, banking and finance surveillance, traffic monitoring, railway supervision, education management, airport security, and many other settings (some of these are industry markets, such as retail, while others are applications, such as airports). For example, an automotive dealership may use a smart video system to ensure the security of property on its auto lot. A factory may use a smart video system to monitor equipment and determine whether any malfunctions need to be serviced. A hospital or a government building may use an SV system to monitor the people entering restricted areas. And retail stores often use them to monitor customer transactions and provide reporting to police and authorities in the case of a robbery.

Each of these scenarios not only changes how the SV system will be architected and installed, but also affects the policies and procedures enforced during the use of the system.

Intel focuses on enabling technologies that are flexible, horizontal and scalable so that businesses are able to implement their own regulations and procedures when deciding how to use the recorded video. 

Overview of Intel® Smart Video Technologies

Current video systems use camera sources to create video and stream it to local or cloud-based storage, where it is analyzed and any necessary actions are taken. The scalability of the system is limited by the amount of available bandwidth and storage, as well as the ability to rapidly analyze the video and determine whether alerts or actions need to be taken.

Intel® technologies address customer challenges by focusing on each area of the technology stack and providing solutions that can be mixed and integrated to build next-generation systems.

Let’s focus on four areas where Smart Video System technology is rapidly evolving: 

  1. Smart cameras that hardware accelerate encoding, transcoding and decoding  
  2. Smart Video gateway and software
  3. Video analytics and Deep Learning
  4. Integration with the Internet of Things

Smart Cameras

Smart cameras powered by Intel technology will continue to increase resolutions to HD and 4K/UHD. Intel® processors based on next-generation Intel® architecture, such as Skylake, provide hardware acceleration for media operations in HEVC, H.264 up to 4K30, VP9, and JPEG through Intel® Quick Sync Video technology on the GPU. Higher resolutions are achieved with lower power and bandwidth requirements.

The Intel® Media SDK delivers a higher-level set of libraries, tools, and sample code that defines cross-platform APIs. The SDK provides a unified abstraction layer for media decode and encode that is capable of accessing hardware acceleration, and it includes an OpenCL interface to allow custom media processing.
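
To give a flavor of the API, below is a minimal sketch that opens a Media SDK session and reports which implementation (hardware or software) was loaded. It is illustrative only, not the SDK's own sample code; exact header names, link flags, and error handling depend on the Media SDK version installed on your system.

/* Illustrative sketch only: open an Intel(R) Media SDK session and query
 * the implementation that was loaded. Error handling is minimal. */
#include <stdio.h>
#include "mfxvideo.h"

int main(void)
{
   mfxSession session;
   mfxVersion ver = { {0, 1} };           /* request API version 1.0 or later */
   mfxIMPL impl = MFX_IMPL_HARDWARE_ANY;  /* prefer a GPU-accelerated implementation */

   if (MFXInit(impl, &ver, &session) != MFX_ERR_NONE)
   {
      printf("Could not open a Media SDK session\n");
      return -1;
   }

   MFXQueryIMPL(session, &impl);          /* which implementation was actually loaded */
   MFXQueryVersion(session, &ver);
   printf("Media SDK implementation 0x%x, API %d.%d\n",
          (unsigned)impl, ver.Major, ver.Minor);

   MFXClose(session);
   return 0;
}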

Higher resolutions and efficient video pipelines enable the creation of multiple video streams with lower bandwidth requirements, allowing more cameras to be deployed in a smart video system and ultimately increasing the number of simultaneous video feeds.

Smart Video Gateways

Smart Video Gateways serve as aggregation points for analytics services, network management, security management, and integration with automation services and the Internet of Things.

By moving high-end computation closer to the video sources, Intel based gateways will be able to analyze the video onsite without moving video over the Internet or to centralized cloud based data centers. This reduces Smart Video System bandwidth requirements and increases scalability from tens of video streams to hundreds or thousands of simultaneous video streams.

Intel offers smart video systems based on a range of processors so the computational needs of your Smart Video System can grow with your business needs. Systems based on the Intel Atom® platform, the Intel® Core™ processor family and the Intel® Xeon® platform span the range of video computational needs.    

Here are profiles of a potential entry-level gateway and a high-performance gateway.

Entry-level NVR, DVR, Transcoder (8/16 HDD, 16-20ch decode)

  • Intel® Celeron® processor family: +24ch 1080p decoding
  • APL (Apollo Lake): 4K H.265, 16-20ch 1080p decode capability and video analytics (VA)

Intel Advantages:

  • Intel® graphics with Intel® Quick Sync Video
  • Headroom for apps, video analytics, software RAID
  • Ability to scale
  • IoT portfolio: silicon, software, security, manageability, ecosystem

High-Performance Intelligent Video Server, Transcoder (Data Center / High-Performance NVR; >16 HDD, >20ch decode [more encode], and Deep Learning)

  • Intel® Xeon® processor E3-1225 v5: high-end NVR/transcoder/VA server
  • Intel® Core™ i3/i5/i7 family-based NVR with VA

Intel Advantages:

  • Intel® Quick Sync Video – HW Accelerated De/Trans-code w/ Codecs
  • Heterogeneous Compute Resources – CPU/GPU/FPGA
  • Analytics Performance from Processor and Memory (Intel® SSD)
  • Big Data SW Ecosystem, Open Standards, Libraries & Frameworks
  • Server-class Reliability, Optimized on Various Workloads 

Smart Video Gateway Software

Intel provides a large number of libraries that system integrators and vendors can use to create smart video services. The algorithms span a range of applications, from providing statistical information about the video to delivering higher-level descriptions and actionable information.

Hardware Accelerated Analytics

These smart video libraries make it possible to build custom analytics algorithms that are hardware-optimized for image processing, cryptography, math and neural network routines, data analytics, and threading (see the sketch after this list):

•    Intel® Integrated Performance Primitives (IPP)
•    Intel® Math Kernel Library (MKL)
•    Intel® Data Analytics Acceleration Library (DAAL)
•    Intel® Threading Building Blocks (TBB)
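
As a small illustration of how these libraries are called, the sketch below multiplies two tiny matrices through the standard CBLAS interface exposed by Intel MKL. It is a stand-in for the kind of dense math kernel an analytics pipeline might use, and it assumes the MKL headers and link line are configured for your compiler.

/* Illustrative only: a 2x2 single-precision matrix multiply (C = A * B)
 * through the CBLAS interface provided by Intel(R) MKL. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
   float a[4] = { 1.0f, 2.0f,
                  3.0f, 4.0f };
   float b[4] = { 5.0f, 6.0f,
                  7.0f, 8.0f };
   float c[4] = { 0.0f };

   /* C = 1.0 * A * B + 0.0 * C, all matrices stored row-major with leading dimension 2 */
   cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
               2, 2, 2, 1.0f, a, 2, b, 2, 0.0f, c, 2);

   printf("%6.1f %6.1f\n%6.1f %6.1f\n", c[0], c[1], c[2], c[3]);
   return 0;
}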

Classification, Analytics and Computer Vision

Higher-level analytics can be customized for different situations. For example, if several smart cameras are monitoring an area restricted to authorized personnel, artificial intelligence and deep learning can be used to train the system to identify authorized people and sound an alarm for unauthorized people.

Intel provides the Intel® Computer Vision SDK, based on OpenVX*, OpenCL software technology, and OpenCV. The Intel® Computer Vision SDK is a specialized set of libraries for enabling computer vision processing.

Reducing Latency

Computational power near the video source is also important in reducing latency in timing-sensitive applications. In autonomous vehicle applications, pedestrian recognition needs to happen immediately and must avoid the latencies associated with the cloud.

Enabling analysis of video nearer the smart camera reduces network transit for high-bandwidth video streams; low-bandwidth alerts and automatic actions can be triggered instead.

Higher Level Software Libraries

Intel® Distribution for Caffe* is a set of machine learning libraries that enables deep learning applications to be deployed on a Smart Video System. By using advanced smart cameras and video gateways, multiple video streams can be encoded in an efficient manner.

Deep learning techniques are used in tasks such as object recognition, object classification, object or person tracking, intrusion detection, scene analysis, people counting, facial detection, facial recognition, recognition of age, gender, or behavior, license plate recognition, vehicle detection, 3D depth analysis, navigation, and many other useful applications of video analysis.

Smart Video Manageability 

Once you have the correct hardware and software layers, you also need a manageability layer. Intel provides Intel® AMT, enabled by Intel® vPro™ technology, which allows you to remotely administer your entire smart video system. This includes upgrading firmware, managing deployment and decommissioning lifecycle steps, and monitoring in real time the status of each of the components in your SV system.

Intel silicon and software are enabling the next-generation architecture of smart video systems and security and surveillance systems.


Using Intel® MPI Library on Intel® Xeon Phi™ Product Family


Introduction

The Message Passing Interface (MPI) standard defines a message-passing library: a collection of routines used in distributed-memory parallel programming. This document is designed to help users get started writing code and running MPI applications with the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor or coprocessor. The Intel MPI Library is a multi-fabric message-passing library that implements the MPI-3.1 specification (see Table 1).

In this document, the Intel MPI Library 2017 and 2018 Beta for Linux* OS are used.

Table 1. Intel® MPI Library at a glance

Processors

Intel® processors, coprocessors, and compatibles

Languages

Natively supports C, C++, and Fortran development

Development Environments

Microsoft Visual Studio* (Windows*), Eclipse*/CDT* (Linux*)

Operating Systems

Linux and Windows

Interconnect Fabric Support

Shared memory
RDMA-capable network fabrics through DAPL* (for example, InfiniBand*, Myrinet*)
Intel® Omni-Path Architecture
Sockets (for example, TCP/IP over Ethernet, Gigabit Ethernet*) and others.

This document summarizes the steps to build and run an MPI application on an Intel® Xeon Phi™ processor x200, and natively or symmetrically on an Intel® Xeon Phi™ coprocessor x200 or an Intel® Xeon Phi™ coprocessor x100. First, we introduce the Intel Xeon Phi processor x200 product family, the Intel Xeon Phi processor x100 product family, and the MPI programming models.

Intel® Xeon Phi™ Processor Architecture

Intel Xeon Phi processor x200 product family architecture: There are two versions of this product. The processor version is the host processor and the coprocessor version requires an Intel® Xeon® processor host. Both versions share the architecture below (see Figure 1):

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Up to 72 cores with 2D mesh architecture
  • Each core has two 512-bit vector processing units (VPUs) and four hardware threads
  • Each pair of cores (tile) shares 1 MB L2 cache
  • 8 or 16 GB high-bandwidth on package memory (MCDRAM)
  • 6 channels DDR4, up to 384 GB (available in the processor version only)
  • For the coprocessor, the third-generation PCIe* is connected to the host


Figure 1. Intel® Xeon Phi™ processor x200 architecture.

To enable the functionalities of the Intel Xeon Phi processor x200, you need to download and install the Intel Xeon Phi processor software available here.

The Intel Xeon Phi coprocessor x200 attaches to an Intel Xeon processor-based host via a third-generation PCIe interface. The coprocessor runs on a standard Linux OS. It can be used as an extension to the host (so the host can offload the workload) or as an independent compute node. The first step to bring an Intel Xeon Phi coprocessor x200 into service is to install the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x on the host, which is available here. The Intel MPSS is a collection of software including device drivers, coprocessor management utilities, and the Linux OS for the coprocessor.

Intel Xeon Phi coprocessor x100 architecture: the Intel Xeon Phi coprocessor x100 is the first-generation of the Intel Xeon Phi product family. The coprocessor attaches to an Intel Xeon processor-based host via a second-generation PCIe interface. It runs on an OS separate from the host and has the following architecture (see Figure 2):

  • Intel® Initial Many Core Instructions
  • Up to 61 cores with high-bandwidth, bidirectional ring interconnect architecture
  • Each core has a 512-bit wide VPU and four hardware threads
  • Each core has a private 512-KB L2 cache
  • 16 GB GDDR5 memory
  • The second-generation PCIe is connected to the host


Figure 2. Intel® Xeon Phi™ processor x100 architecture.

To bring the Intel Xeon Phi coprocessor x100 into service, you must install the Intel MPSS 3.x on the host, which can be downloaded here.

MPI Programming Models

The Intel MPI Library supports the following MPI programming models (see Figure 3):

  • Host-only model (Intel Xeon processor or Intel Xeon Phi processor): In this mode, all MPI ranks reside and execute the workload on the host CPU only (or Intel Xeon Phi processor only).
  • Offload model: In this mode, the MPI ranks reside solely on the Intel Xeon processor host. The MPI ranks use the offload capabilities of the Intel® C/C++ Compiler or Intel® Fortran Compiler to offload some workloads to the coprocessors. Typically, one MPI rank is used per host, and the MPI rank offloads to the coprocessor(s) (a minimal sketch follows this list).
  • Coprocessor-only model: In this native mode, the MPI ranks reside solely inside the coprocessor. The application can be launched from the coprocessor.
  • Symmetric model: In this mode, the MPI ranks reside on the host and the coprocessors. The application can be launched from the host.
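
To make the offload model concrete, here is a minimal sketch, for illustration only, in which each MPI rank stays on the host and pushes a small loop to the coprocessor using the Intel compiler's offload pragma. It is not one of the samples used later in this article, and it assumes an attached coprocessor and a compiler build with offload support.

/* Illustrative sketch of the offload model: each MPI rank stays on the host
 * and offloads a small computation to the coprocessor. Requires the Intel
 * compiler with offload support and an Intel Xeon Phi coprocessor. */
#include <stdio.h>
#include "mpi.h"

#define N 1024

int main(int argc, char **argv)
{
   double x[N], sum = 0.0;
   int rank, i;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   for (i = 0; i < N; i++)
      x[i] = rank + i * 0.001;

   /* The loop below executes on the coprocessor: x is copied in, sum is copied in and out. */
   #pragma offload target(mic) in(x) inout(sum)
   for (i = 0; i < N; i++)
      sum += x[i] * x[i];

   printf("rank %d: partial sum = %f\n", rank, sum);

   MPI_Finalize();
   return 0;
}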


Figure 3. MPI programming models.

Using the Intel® MPI Library

This section shows how to build and run an MPI application in the following configurations: on an Intel Xeon Phi processor x200, on a system with one or more Intel Xeon Phi coprocessor x200, and on a system with one or more Intel Xeon Phi coprocessor x100 (see Figure 4).


Figure 4. Different configurations: (a) standalone Intel® Xeon Phi™ processor x200, (b) Intel Xeon Phi coprocessor x200 connected to a system with an Intel® Xeon® processor, and (c) Intel® Xeon Phi™ coprocessor x100 connected to a system with an Intel Xeon processor.

Installing the Intel® MPI Library

The Intel MPI Library is packaged as a standalone product or as a part of the Intel® Parallel Studio XE Cluster Edition.

By default, the Intel MPI Library will be installed in the path /opt/intel/impi on the host or the Intel Xeon Phi processor. To start, follow the appropriate directions to install the latest versions of the Intel C/C++ Compiler and the Intel Fortran Compiler.

You can purchase or try the free 30-day evaluation of the Intel Parallel Studio XE from https://software.intel.com/en-us/intel-parallel-studio-xe. These instructions assume that you have the Intel MPI Library tar file - l_mpi_<version>.<package_num>.tgz. This is the latest stable release of the library at the time of writing this article. To check if a newer version exists, log into the Intel® Registration Center. The instructions below are valid for all current and subsequent releases.

As root user, untar the tar file l_mpi_<version>.<package_num>.tgz:

# tar -xzvf l_mpi_<version>.<package_num>.tgz
# cd l_mpi_<version>.<package_num>

Execute the install script on the host and follow the instructions. The installation will be placed in the default installation directory /opt/intel/impi/<version>.<package_num> assuming you are installing the library with root permission.

# ./install.sh

Compiling an MPI program

To compile an MPI program on the host or on an Intel Xeon Phi processor x200:

Before compiling an MPI program, you need to establish the proper environment settings for the compiler and for the Intel MPI Library:

$ source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
$ source /opt/intel/impi/<version>.<package_num>/bin64/mpivars.sh

or if you installed the Intel® Parallel Studio XE Cluster Edition, you can simply source the configuration script:

$ source /opt/intel/parallel_studio_xe_<version>/psxevars.sh intel64

Next, compile and link your MPI program with the appropriate compiler wrapper from Table 2.

Table 2. MPI compilation Linux* commands.

Programming Language    MPI Compilation Linux* Command
C                       mpiicc
C++                     mpiicpc
Fortran 77 / 95         mpiifort

For example, to compile a C program (such as the sample in Appendix A) for the host, you can use the wrapper mpiicc:

$ mpiicc ./myprogram.c -o myprogram

To compile the program for the Intel Xeon Phi processor x200 and the Intel Xeon Phi coprocessor x200, add the flag -xMIC-AVX512 to take advantage of the Intel AVX-512 instruction set architecture (ISA) available on this architecture. For example, the following command compiles a C program for the Intel Xeon Phi product family x200 using the Intel AVX-512 ISA:

$ mpiicc -xMIC-AVX512 ./myprogram.c -o myprogram.knl

To compile the program for the Intel Xeon Phi coprocessor x100, add the flag -mmic. The following command shows how to compile a C program for the Intel Xeon Phi coprocessor x100:

$ mpiicc -mmic ./myprogram.c -o myprogram.knc

Running an MPI program on the Intel Xeon Phi processor x200

To run the application on the Intel Xeon Phi processor x200, use the script mpirun:

$ mpirun -n <# of processes> ./myprogram.knl

where n is the number of MPI processes to launch on the processor.

Running an MPI program on the Intel Xeon Phi coprocessor x200 and Intel Xeon Phi coprocessor x100

To run an application on the coprocessors, the following steps are needed:

  • Start the MPSS service if it was stopped previously:

    $ sudo systemctl start mpss

  • Transfer the MPI executable from the host to the coprocessor. For example, use the scp utility to transfer the executable built for the Intel Xeon Phi coprocessor x100 to the coprocessor named mic0:

    $ scp myprogram.knc mic0:~/myprogram.knc

  • Transfer the MPI libraries and compiler libraries to the coprocessors: before the first run of an MPI application on the Intel Xeon Phi coprocessors, you need to copy the appropriate MPI and compiler libraries to each coprocessor installed in the system. For the coprocessor x200, the libraries under the /lib64 directory are transferred; for the coprocessor x100, the libraries under the /mic directory are transferred.

For example, we issue the copy to the first coprocessor x100, called mic0, which is accessible via the IP address 172.31.1.1. Note that all coprocessors have unique IP addresses because they are treated as uniquely addressable machines. You can refer to the first coprocessor as mic0 or by its IP address.

# sudo scp /opt/intel/impi/2017.3.196/mic/bin/* mic0:/bin/
# sudo scp /opt/intel/impi/2017.3.196/mic/lib/* mic0:/lib64/
# sudo scp /opt/intel/composer_xe_2017.3.196/compiler/lib/mic/* mic0:/lib64/

Instead of copying the MPI and compiler libraries manually, you can also run the script shown below to transfer them to the two coprocessors mic0 and mic1:

#!/bin/sh

export COPROCESSORS="mic0 mic1"
export BINDIR="/opt/intel/impi/2017.3.196/mic/bin"
export LIBDIR="/opt/intel/impi/2017.3.196/mic/lib"
export COMPILERLIB="/opt/intel/compilers_and_libraries_2017/linux/lib/mic"

for coprocessor in `echo $COPROCESSORS`
do
   for prog in mpiexec mpiexec.hydra pmi_proxy mpirun
   do
      sudo scp $BINDIR/$prog $coprocessor:/bin/$prog
   done

   for lib in libmpi.so.12 libmpifort.so.12 libmpicxx.so.12
   do
      sudo scp $LIBDIR/$lib $coprocessor:/lib64/$lib
   done

   for lib in libimf.so libsvml.so libintlc.so.5
   do
      sudo scp $COMPILERLIB/$lib $coprocessor:/lib64/$lib
   done
done

Script used for transferring MPI libraries to two coprocessors.

Another approach is to NFS-mount a host directory on the coprocessors so that the coprocessors can access the MPI libraries from there. One advantage of using NFS mounts is that it saves RAM space on the coprocessors. The details on how to set up NFS mounts can be found in the first example in this document.

To run the application natively on the coprocessor, log in to the coprocessor and then run the mpirun script:

$ ssh mic0
$ mpirun -n <# of processes> ./myprogram.knc

where n is the number of MPI processes to launch on the coprocessor.

Finally, to run an MPI program from the host (symmetrically), additional steps are needed:

Set the Intel MPI environment variable I_MPI_MIC to let the Intel MPI Library recognize the coprocessors:

$ export I_MPI_MIC=enable

Disable the firewall in the host:

$ systemctl status firewalld
$ sudo systemctl stop firewalld

For multi-card use, configure Intel MPSS peer-to-peer so that each card can ping others:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

If you want to get debug information, include the flags -verbose and -genv I_MPI_DEBUG=n when running the application.

The following sections include sample MPI programs written in C. The first example shows how to compile and run a program for Intel Xeon Phi processor x200 and for Intel Xeon Phi coprocessor x200. The second example shows how to compile and run a program for Intel Xeon Phi coprocessor x100.

Example 1

For illustration purposes, this example shows how to build and run an Intel MPI application in symmetric mode on a host that connects to two Intel Xeon Phi coprocessors x200. Note that the driver Intel MPSS 4.x should be installed on the host to enable the Intel Xeon Phi coprocessor x200.

In this example, we use the integral representation below to calculate Pi (π):

\pi = \int_{0}^{1} \frac{4}{1 + x^{2}}\, dx

Appendix A includes the implementation program. The workload is divided among the MPI ranks. Each rank spawns a team of OpenMP* threads, and each thread works on a chunk of the workload to take advantage of vectorization. First, compile and run this application on the Intel Xeon processor host. Since this program uses OpenMP, you need to compile the program with OpenMP libraries. Note that the Intel Parallel Studio XE 2018 is used in this example.

Set the environment variables, compile the application for the host, and then generate the optimization report on vectorization and OpenMP:

$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o mpitest

To run two ranks on the host:

$ mpirun -host localhost -n 2 ./mpitest
Hello world: rank 0 of 2 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 2 running on knl-lb0.jf.intel.com
FROM RANK 1 - numthreads = 32
FROM RANK 0 - numthreads = 32

Elapsed time from rank 0:    8246.90 (usec)
Elapsed time from rank 1:    8423.09 (usec)
rank 0 pi=   3.141613006592

Next, compile the application for the Intel Xeon Phi coprocessor x200 and transfer the executable to the coprocessors mic0 and mic1 (this assumes you have already set up passwordless SSH access to the coprocessors).

$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -xMIC-AVX512 -o mpitest.knl
$ scp mpitest.knl mic0:~/.
$ scp mpitest.knl mic1:~/.

Enable MPI for the coprocessors and disable the firewall in the host:

$ export I_MPI_MIC=enable
$ sudo systemctl stop firewalld

This example also shows how to mount a shared directory using the Network File System (NFS). As root, mount the /opt/intel directory, where the Intel C++ Compiler and the Intel MPI Library are installed. First, add descriptors in the /etc/exports configuration file on the host to share the directory /opt/intel with the coprocessors, whose IP addresses are 172.31.1.1 and 172.31.2.1, with read-only (ro) privileges.

[host~]# cat /etc/exports
/opt/intel 172.31.1.1(ro,async,no_root_squash)
/opt/intel 172.31.2.1(ro,async,no_root_squash)

Update the NFS export table and restart the NFS server in the host:

[host~]# exportfs -a
[host~]# service nfs restart

Next, log in on the coprocessors and create the mount point /opt/intel:

[host~]# ssh mic0
mic0:~# mkdir /opt
mic0:~# mkdir /opt/intel

 

Insert the descriptor “172.31.1.254:/opt/intel /opt/intel nfs defaults 1 1” into the /etc/fstab file in mic0:

mic0:~# cat /etc/fstab
/dev/root            /                    auto       defaults              1  1
proc                 /proc                proc       defaults              0  0
devpts               /dev/pts             devpts     mode=0620,gid=5       0  0
tmpfs                /run                 tmpfs      mode=0755,nodev,nosuid,strictatime 0  0
tmpfs                /var/volatile        tmpfs      defaults,size=85%     0  0
172.31.1.254:/opt/intel /opt/intel nfs defaults                            1  1

Finally, mount the shared directory /opt/intel on the coprocessor:

mic0:~# mount -a

Repeat this procedure for mic1 with this descriptor “172.31.2.254:/opt/intel /opt/intel nfs defaults 1 1” added to the /etc/fstab file in mic1.

Make sure that mic0 and mic1 are included in the /etc/hosts file:

$ cat /etc/hosts
127.0.0.1       localhost
::1             localhost
172.31.1.1      mic0
172.31.2.1      mic1

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 2 - numthreads = 272
FROM RANK 1 - numthreads = 272
Elapsed time from rank 0:   12114.05 (usec)
Elapsed time from rank 1:  136089.09 (usec)
Elapsed time from rank 2:  125049.11 (usec)
rank 0 pi=   3.141597270966

By default, the maximum number of hardware threads available on each compute node is used. However, you can change this default behavior by passing the local option -env for that compute node. For example, to set the number of OpenMP threads on mic0 to 68 and to set compact affinity, you can use the command:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 -env OMP_NUM_THREADS=68 -env KMP_AFFINITY=compact ~/mpitest : -host mic1 -n 1 ~/mpitest
Hello world: rank 0 of 3 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 68
FROM RANK 2 - numthreads = 272
Elapsed time from rank 0:   11068.11 (usec)
Elapsed time from rank 1:   57780.98 (usec)
Elapsed time from rank 2:  133417.13 (usec)
rank 0 pi=   3.141597270966

To simplify the launch process, define a file with all machine names, give all the executables the same name, and move them to a predefined directory. For example, here all executables are named mpitest and are located in the user home directories:

$ cat hosts_file
knl-lb0:1
mic0:2
mic1:2

$ mpirun -machinefile hosts_file -n 5 ~/mpitest
Hello world: rank 0 of 5 running on knl-lb0
Hello world: rank 1 of 5 running on mic0
Hello world: rank 2 of 5 running on mic0
Hello world: rank 3 of 5 running on mic1
Hello world: rank 4 of 5 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 136
FROM RANK 3 - numthreads = 136
FROM RANK 2 - numthreads = 136
FROM RANK 4 - numthreads = 136
Elapsed time from rank 0:   11260.03 (usec)
Elapsed time from rank 1:   71480.04 (usec)
Elapsed time from rank 2:   69352.15 (usec)
Elapsed time from rank 3:   74187.99 (usec)
Elapsed time from rank 4:   67718.98 (usec)
rank 0 pi=   3.141598224640

 

Example 2

Example 2 shows how to build and run an MPI application in the symmetric model on a host that connects to two Intel Xeon Phi coprocessors x100. Note that the Intel MPSS 3.x driver must be installed for the Intel Xeon Phi coprocessor x100.

The sample program estimates Pi (π) using a Monte Carlo method. Consider a sphere centered at the origin and circumscribed by a cube. The sphere’s radius is r and the cube edge length is 2r. The volumes of the sphere and the cube are given by:

V_{sphere} = \frac{4}{3}\pi r^{3}, \qquad V_{cube} = (2r)^{3} = 8r^{3}

The first octant of the coordinate system contains one eighth of the volumes of both the sphere and the cube; the volumes in that octant are given by:

V_{sphere}^{oct} = \frac{1}{8} \cdot \frac{4}{3}\pi r^{3} = \frac{\pi r^{3}}{6}, \qquad V_{cube}^{oct} = \frac{1}{8} \cdot 8r^{3} = r^{3}

If we generate Nc points uniformly and randomly in the cube within this octant, we expect that about Ns points will be inside the sphere’s volume according to the following ratio:

\frac{N_{s}}{N_{c}} \approx \frac{V_{sphere}^{oct}}{V_{cube}^{oct}} = \frac{\pi r^{3}/6}{r^{3}} = \frac{\pi}{6}

Therefore, the estimated Pi (π) is calculated by

\pi \approx \frac{6 N_{s}}{N_{c}}

where Nc is the number of points generated in the portion of the cube residing in the first octant, and Ns is the total number of points found inside the portion of the sphere residing in the first octant.

In the implementation, rank 0 is responsible for dividing the work among the n ranks. Each rank is assigned a chunk of work, and a summation is used to estimate Pi. Rank 0 divides the x-axis into n equal segments. Each rank generates (Nc/n) points in its assigned segment and then computes the number of points falling inside the first octant of the sphere (see Figure 5).


Figure 5. Each MPI rank handles a different portion in the first octant.

The pseudo code is shown below:

Rank 0 generates n random seeds
Rank 0 broadcasts the seeds to the n ranks
For each rank i in [0, n-1]:
    receive the corresponding seed
    set num_inside = 0
    for j = 0 to Nc/n:
        generate a point with coordinates
            x in [i/n, (i+1)/n]
            y in [0, 1]
            z in [0, 1]
        compute the distance d = x^2 + y^2 + z^2
        if d <= 1, increment num_inside
    send num_inside back to rank 0
Rank 0 sets Ns to the sum of all num_inside values
Rank 0 computes Pi = 6 * Ns / Nc

To build the application montecarlo.knc for the Intel Xeon Phi coprocessor x100, the Intel C++ Compiler 2017 is used. Appendix B includes the implementation program. Note that this example simply shows how to run the code on an Intel Xeon Phi coprocessor x100; the sample code is not tuned and can be optimized further.

$ source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
$ mpiicc -mmic montecarlo.c -o montecarlo.knc

Build the application for the host:

$ mpiicc montecarlo.c -o montecarlo

Transfer the application montecarlo.knc to the /tmp directory on the coprocessors using the scp utility. In this example, we copy it to two Intel Xeon Phi coprocessors x100.

$ scp ./montecarlo.knc mic0:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00
$ scp ./montecarlo.knc mic1:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00 

Transfer the MPI libraries and compiler libraries to the coprocessors using the script shown earlier. Enable MPI communication between the host and the Intel Xeon Phi coprocessors x100:

$ export I_MPI_MIC=enable

Run the mpirun script to start the application. The flag -n specifies the number of MPI processes, and the flag -host specifies the machine name:

$ mpirun -n <# of processes> -host <hostname> <application>

We can run the application on multiple hosts by separating them with “:”. The first MPI rank (rank 0) always starts on the first part of the command:

$ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>

This starts rank 0 on hostname1 and the other ranks on hostname2.

Now run the application from the host. The mpirun command shown below starts the application with 2 ranks on the host, 3 ranks on the coprocessor mic0, and 5 ranks on the coprocessor mic1:

$ mpirun -n 2 -host localhost ./montecarlo : -n 3 -host mic0 /tmp/montecarlo.knc \
: -n 5 -host mic1 /tmp/montecarlo.knc

Hello world: rank 0 of 10 running on knc0
Hello world: rank 1 of 10 running on knc0
Hello world: rank 2 of 10 running on knc0-mic0
Hello world: rank 3 of 10 running on knc0-mic0
Hello world: rank 4 of 10 running on knc0-mic0
Hello world: rank 5 of 10 running on knc0-mic1
Hello world: rank 6 of 10 running on knc0-mic1
Hello world: rank 7 of 10 running on knc0-mic1
Hello world: rank 8 of 10 running on knc0-mic1
Hello world: rank 9 of 10 running on knc0-mic1
Elapsed time from rank 0:      13.87 (sec)
Elapsed time from rank 1:      14.01 (sec)
Elapsed time from rank 2:     195.16 (sec)
Elapsed time from rank 3:     195.17 (sec)
Elapsed time from rank 4:     195.39 (sec)
Elapsed time from rank 5:     195.07 (sec)
Elapsed time from rank 6:     194.98 (sec)
Elapsed time from rank 7:     223.32 (sec)
Elapsed time from rank 8:     194.22 (sec)
Elapsed time from rank 9:     193.70 (sec)
Out of 4294967295 points, there are 2248849344 points inside the sphere => pi=  3.141606330872
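
As a quick sanity check, the printed estimate follows directly from the formula derived above: \pi \approx 6 \times 2248849344 / 4294967295 \approx 3.141606.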

A shorthand way of doing this in symmetric mode is to use the -machinefile option of the mpirun command in coordination with the I_MPI_MIC_POSTFIX environment variable. In this case, make sure all executables are in the same location on the host and on the mic0 and mic1 cards.

The I_MPI_MIC_POSTFIX environment variable simply tells the library to append the .knc postfix when running on the cards (since the executables there are called montecarlo.knc).

$ export I_MPI_MIC_POSTFIX=.knc

Now set the rank mapping in your hosts file (by using the <host>:<#_ranks> format):

$ cat hosts_file
localhost:2
mic0:3
mic1:5

And run your executable:

$ mpirun -machinefile hosts_file /tmp/montecarlo

The nice thing about this syntax is that you only have to edit hosts_file when you decide to change the number of ranks or add more cards.

As an alternative, you can ssh to a coprocessor and launch the application from there:

$ ssh mic0
$ mpirun -n 3 /tmp/montecarlo.knc
Hello world: rank 0 of 3 running on knc0-mic0
Hello world: rank 1 of 3 running on knc0-mic0
Hello world: rank 2 of 3 running on knc0-mic0
Elapsed time from rank 0:     650.47 (sec)
Elapsed time from rank 1:     650.61 (sec)
Elapsed time from rank 2:     648.01 (sec)
Out of 4294967295 points, there are 2248795855 points inside the sphere => pi=  3.141531467438

 

Summary

This document showed you how to compile and run simple MPI applications in the symmetric model. In a heterogeneous computing system, the performance of each computational unit is different, and this can lead to load imbalance. The Intel® Trace Analyzer and Collector can be used to analyze and understand the behavior of a complex MPI program running on a heterogeneous system. Using the Intel Trace Analyzer and Collector, you can quickly identify bottlenecks, evaluate load balancing, analyze performance, and identify communication hotspots. This powerful tool is essential for debugging and improving the performance of an MPI program running on a cluster with multiple computational units. For more details on using the Intel Trace Analyzer and Collector, read the whitepaper “Understanding MPI Load Imbalance with Intel® Trace Analyzer and Collector” available on /mic-developer. For more details, tips and tricks, and known workarounds, visit the Intel® Cluster Tools and the Intel® Xeon Phi™ Coprocessors page.


Appendix A

The code of the first sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number PI using its integral representation.
//
//******************************************************************************
#include <stdio.h>
#include <omp.h>   /* needed for omp_get_num_threads() */
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 1
#define TAG_TIME 2

const long ITER = 1024 * 1024;
const long SCALE = 16;
const long NUM_STEP = ITER * SCALE;

float calculate_partialPI(int n, int num) {
   unsigned long i;
   int  numthreads;
   float x, dx, pi = 0.0f;

   #pragma omp parallel
   #pragma omp master
   {
      numthreads = omp_get_num_threads();
      printf("FROM RANK %d - numthreads = %d\n", n, numthreads);
   }

   dx = 1.0 / NUM_STEP;

   unsigned long NUM_STEP1 = NUM_STEP / num;
   unsigned long begin = n * NUM_STEP1;
   unsigned long end = (n + 1) * NUM_STEP1;
   #pragma omp parallel for reduction(+:pi)
   for (i = begin; i < end; i++)
   {
      x = (i + 0.5f) / NUM_STEP;
      pi += (4.0f * dx) / (1.0f + x*x);
   }

   return pi;
}

int main(int argc, char **argv)
{
   float pi1, total_pi;
   double startprocess;
   int i, id, remote_id, num_procs, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Status stat;

   if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
   {
      printf ("Failed to initialize MPI\n");
      return (-1);
   }

   // Create the communicator, and retrieve the number of processes.
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

   // Determine the rank of the process.
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   // Get machine name
   MPI_Get_processor_name (name, &namelen);

   if (id == MASTER)
   {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
      {
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
      }
   }
   else
   {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
   }

   startprocess = MPI_Wtime();

   pi1 = calculate_partialPI(id, num_procs);

   double elapsed = MPI_Wtime() - startprocess;

   MPI_Reduce (&pi1, &total_pi, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
   if (id == MASTER)
   {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (usec)\n", MASTER, 1000000 * timeprocess[MASTER]);

      for (i = 1; i < num_procs; i++)
      {
         // Rank 0 waits for elapsed time value
         MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
         printf("Elapsed time from rank %d: %10.2f (usec)\n", i, 1000000 *timeprocess[i]);
      }

      printf("rank %d pi= %16.12f\n", id, total_pi);
   }
   else
   {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
   }

   // Terminate MPI.
   MPI_Finalize();
   return 0;
}

 

Appendix B

The code of the second sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 0.5)
//      Based on a Monte Carlo method, this MPI sample code uses volumes to
//      estimate the number PI.
//
//******************************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;

  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
    // Get machine name
  MPI_Get_processor_name (name, &namelen);

  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
	{
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
    }
  else
    {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }

  // Rank 0 distributes seeds randomly to all processes.
  double startprocess, endprocess;

  int distributed_seed = 0;
  int *buff;

  buff = (int *)malloc(num_procs * sizeof(int));

  unsigned int MAX_NUM_POINTS = pow (2,32) - 1;
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;

  if (id == MASTER)
    {
      srand (time(NULL));

      for (i=0; i<num_procs; i++)
	{
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }

  // Broadcast the seed to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);

  // At this point, every process (including rank 0) has a different seed. Using its seed,
  // each process generates its points in the box [i/n, (i+1)/n] x [0, 1] x [0, 1].
  startprocess = MPI_Wtime();

  srand (buff[id]);

  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
    for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;

      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;

      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;

      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;

      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;

      if (result <= 1)
	  {
		inside++;
	  }
    }

  double elapsed = MPI_Wtime() - startprocess;

  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);

#if DEBUG
  printf ("rank %d counts %u points inside the sphere\n", id, inside);
#endif

  if (id == MASTER)
    {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);

      for (i=1; i<num_procs; i++)
	{
	  // Rank 0 waits for elapsed time value
	  MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
	  printf("Elapsed time from rank %d: %10.2f (sec) \n", i, timeprocess[i]);
	}

      temp = 6 * (float)total_inside;
      pi = temp / MAX_NUM_POINTS;
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
  else
    {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
    }

  free(buff);

  // Terminate MPI.
  MPI_Finalize();

  return 0;
}

Intel® Manycore Platform Software Stack Archive for the Intel® Xeon Phi™ Coprocessor x200 Product Family


On this page you will find the past releases of the Intel® Manycore Platform Software Stack (Intel® MPSS) for the Intel® Xeon Phi™ coprocessor x200 product family. The most recent release is found here: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. We recommend customers use the latest release wherever possible.

  • N-1 release for Intel® MPSS 4.4.x

Intel MPSS 4.4.0 HotFix 1 release for Linux*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017)

Downloads Available                          Size (range)    MD5 Checksum
RHEL 7.3                                     214MB           8a015c38379b8be42c8045d3ceb44545
RHEL 7.2                                     214MB           694b7b908c12061543d2982695985d8b
SLES 12.2                                    213MB           506ab12af774f78fa8e107fd7a4f96fd
SLES 12.1                                    213MB           b8520888954e846e8ac8604d62a9ba96
SLES 12.0                                    213MB           88a3a4415afae1238453ced7a0df28ea
Card installer file (mpss-4.4.0-card.tar)    761MB           d26e26868297cea5fd4ffafe8d78b66e
Source file (mpss-4.4.0-card-source.tar)     514MB           127713d06496090821b5bb3613c95b30

File                      Description                                                                               Last Updated On    Size (approx.)
releaseNotes-linux.txt    Release notes (English)                                                                   May 2017           15KB
readme.txt                Readme (includes installation instructions) for Linux (English)                          May 2017           17KB
mpss_user_guide.pdf       Intel MPSS user guide                                                                     May 2017           3MB
eula.txt                  End User License Agreement (Important: Read before downloading, installing, or using)    May 2017           33KB

 

Intel MPSS 4.4.0 HotFix 1 release for Windows*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017)

Downloads Available       Size      MD5 Checksum
mpss-4.4.0-windows.zip    1091MB    204a65b36858842f472a37c77129eb53

File                        Description                                                                               Last Updated On    Size (approx.)
releasenotes-windows.txt    Release notes for Windows (English)                                                       May 2017           7KB
readme-windows.pdf          Readme for Windows (English)                                                              May 2017           399KB
mpss_users_guide_windows    Intel MPSS user guide for Windows                                                         May 2017           3MB
eula.txt                    End User License Agreement (Important: Read before downloading, installing, or using)    May 2017           33KB

 

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available for discussing enhancements or issues with the Intel MPSS.

Thread pool behavior for Apollo Lake Intel® SDK for OpenCL™ applications


This article describes internal driver optimizations for developers using Intel Atom® processors, Intel® Celeron® processors, and Intel® Pentium® processors based on the "Apollo Lake" platform (Broxton graphics). The intent is to clarify existing documentation. The optimizations described are completely transparent; the only change needed from a developer perspective is to be aware that, for this special case, applications should be designed for the thread pool configuration instead of the underlying hardware.

Driver thread pool optimizations maximize Apollo Lake EU performance 

For  Intel® Core™ and Intel® Xeon® processors with integrated graphics, the number of EUs and EUs per subslice is large enough that mapping thread pools directly to subslices is efficient.  Tying thread pool implementation to hardware means that application behavior and hardware details can be described together in a way that is easy to visualize and remember.  This approach was used by many reference documents such as The Compute Architecture of Intel Processor Graphics Gen9.  

However, for the relatively smaller GPUs in the embedded processors listed above, this approach could sometimes result in a non-optimal mapping. For these processors, EUs are now pooled across subslices, creating "virtual subslices" that do not match the hardware. In this case it helps to understand where behavior is driven by thread pools instead of the hardware layout.

Thread pools and physical subslices for HD graphics 505

There are two GPU configurations for Apollo Lake:

  • Intel® HD Graphics 505: 18 EUs, 3x6 physical, now using 2x9 thread pools
  • Intel® HD Graphics 500: 12 EUs, 2x6 physical, now using 1x12 thread pools

The thread pools, not the physical hardware, determine how you should write your application. For example, if you have HD Graphics 505, your application should be written as if there were two subslices with 9 EUs each, not three subslices with 6 EUs each.

Extensive testing showed that the worst case was simply matching the performance of the legacy configuration, while the performance boost from switching to 2x9/1x12 often approaches 2X. Since no scenarios were found that benefit from the legacy configuration, there are no plans to add extensions to modify MEDIA_POOL_STATE.

 

Thread pool size vs. physical hardware configuration

There are 4 main areas to consider:

  • Optimal work group size is determined by the thread pool configuration, not the physical hardware. The driver automatically handles the thread launch details to maximize thread occupancy. State tracking (such as branch masking) is handled at the pool and subslice level by the driver, but for the most part these are implementation details that can be ignored by applications (see the query sketch after this list).
  • Local memory is shared by threads in the same pool. The number of bytes reported by CL_DEVICE_LOCAL_MEM_SIZE is physically located in the lowest-level cache, not in the subslice. For Apollo Lake this means either one (for 1x12 HD Graphics 500) or two (for 2x9 HD Graphics 505) regions are reserved to be shared by all threads in the same workgroup.

  • Workgroup Barriers: again, behavior is tied to the work group, which is defined by the thread pool. There are now two types of internal barrier implementations -- "local barriers" within a physical subslice and "linked barriers" spanning subslices.  This behavior happens automatically and cannot be changed by the application.  There are no additional knobs provided to optimize.  
  • Subgroup extensions:  subgroups are "between" work groups and work items, so their mapping to hardware remains unchanged.   Work items in a subgroup execute on the same EU thread.  For more info see Ben Ashbaugh's excellent section on subgroups in our extension tutorial.
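
As a minimal illustration of designing against the reported device limits rather than the physical layout, the sketch below queries the compute-unit count, maximum work-group size, and local memory size through standard OpenCL device queries. It is a simplified example (it takes the first GPU device it finds and omits error handling), not code from the driver or the SDK.

/* Illustrative sketch: query the device limits an application should design
 * against (compute units, max work-group size, local memory) instead of
 * assuming a particular physical subslice layout. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
   cl_platform_id platform;
   cl_device_id device;
   cl_uint cu;
   size_t wg;
   cl_ulong lmem;

   /* Simplified: take the first platform and its first GPU device. */
   clGetPlatformIDs(1, &platform, NULL);
   clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

   clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cu), &cu, NULL);
   clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wg), &wg, NULL);
   clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, NULL);

   printf("compute units: %u, max work-group size: %zu, local mem: %lu bytes\n",
          cu, wg, (unsigned long)lmem);
   return 0;
}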

 

Conclusion

In the past, thread pools were always configured to match the physical hardware. Now there is a notable exception, due to optimizations that increase performance on Apollo Lake processors. You won't need to make many changes to use these optimizations. The most important takeaway is that Intel has done the work behind the scenes to make efficient use of Apollo Lake capabilities easy. The details in this article are provided as conceptual background, but everything happens under the hood and the changes are completely transparent. To your application, HD Graphics 500 has 1 subslice with 12 EUs and HD Graphics 505 has 2 subslices with 9 EUs, even though the underlying hardware is 2x6 and 3x6. Extensive internal testing has shown that this internal driver optimization provides big improvements, and we have not yet seen a case of performance regression. However, we are always open to feedback: if you find a scenario where the legacy thread pool configuration may be a better fit, please let us know.

For more information, please see the Broxton Graphics Programmer's Reference Manual.

An Example of a Convolutional Neural Network for Image Super-Resolution


Convolutional neural networks (CNN) are becoming mainstream in computer vision. In particular, CNNs are widely used for high-level vision tasks, like image classification (AlexNet*, for example). This article (and associated tutorial) describes an example of a CNN for image super-resolution (SR), which is a low-level vision task, and its implementation using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This CNN is based on the work described by1 and2, proposing a new approach to performing single-image SR using CNNs.

Introduction

Some modern camera sensors, present in everyday electronic devices like digital cameras, phones, and tablets, are able to produce reasonably high-resolution (HR) images and videos. The resolution in the images and videos produced by these devices is in many cases acceptable for general use.

However, there are situations where the image or video is considered low resolution (LR). Examples include the following situations:

  1. Device does not produce HR images or video (as in some surveillance systems).
  2. The objects of interest in the image or video are small compared to the size of the image or video frame; for example, faces of people or vehicle plates located far away from the camera.
  3. Blurred or noisy images.
  4. Application using the images or videos demands higher resolution than that present in the camera.
  5. Improving the resolution as a pre-processing step improves the performance of other algorithms that use the images; face detection, for example.

Super-resolution is a technique to obtain an HR image from one or several LR images. SR can be based on a single image or on several frames in a video sequence.

Single-image (or single-frame) SR uses pairs of LR and HR images to learn the mapping between them. For this purpose, image databases containing LR and HR pairs are created3 and used as a training set. The learned mapping can be used to predict HR details in a new image.

On the other hand, multiple-frame SR is based on several images taken from the same scene, but from slightly different conditions (such as angle, illumination, and position). This technique uses the non-redundant information present in multiple images (or frames in an image sequence) to increase the SR performance.

In this article, we will focus on a single-image SR method.

Single-Image Super-Resolution Using Convolutional Neural Networks

In this method, a training set is used to train a neural network (NN) to learn the mapping between the LR and HR images in the training set. There are many references in the literature about SR. Many different techniques have been proposed and used for about 30 years. Methods using deep CNNs have been developed in the last few years. One of the first methods was created by1, who described a three-layer CNN and named it Super-Resolution Convolutional Neural Network (SRCNN). Their pioneering work in this area is important because, besides demonstrating that the mapping from LR to HR can be cast as a CNN, they created a model often used as a reference. New methods compare its performance to the SRCNN results. The same authors have recently developed a modified version of their original SRCNN, which they named Fast Super-Resolution Convolutional Neural Network (FSRCNN), that offers better restoration quality and runs faster2.

In this article, we describe both the SRCNN and the FSRCNN, and, in a separate tutorial, we show an implementation of the improved FSRCNN. Both the SRCNN and the FSRCNN can be used as a basis for further experimentation with other published network architectures, as well as others that the readers might want to try. Although the FSRCNN (and other recent network architectures for SR) show clear improvement over the SRCNN, the original SRCNN is also described here to show how this pioneer network has evolved from its inception to newer networks that use different topologies to achieve better results. In the tutorial, we will implement the FSRCNN network using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which will let us take advantage of Intel® Xeon® processors and Intel® Xeon Phi™ processors, as well as Intel® libraries to accelerate training and testing of this network.

Super-Resolution Convolutional Neural Network (SRCNN) Structure

The authors of the SRCNN describe their network, pointing out the equivalence of their method to the sparse-coding method [4], which is a widely used learning method for image SR. This is an important and educational aspect of their work, because it shows how example-based learning methods can be adapted and generalized to CNN models.

The SRCNN consists of the following operations [1]:

  1. Preprocessing: Up-scales LR image to desired HR size.
  2. Feature extraction: Extracts a set of feature maps from the up-scaled LR image.
  3. Non-linear mapping: Maps the feature maps representing LR to HR patches.
  4. Reconstruction: Produces the HR image from HR patches.

Operations 2–4 above can be cast as a convolutional layer in a CNN that accepts as input the preprocessed images from step 1 above, and outputs the HR image. The structure of this SRCNN consists of three convolutional layers:

  • Input Image: LR image up-sampled to the desired higher resolution, with c channels (the color components of the image)
  • Conv. Layer 1: Patch extraction
    • n1 filters of size c × f1 × f1
    • Activation function: ReLU (rectified linear unit)
    • Output: n1 feature maps
    • Parameters to optimize: c × f1 × f1 × n1 weights and n1 biases
  • Conv. Layer 2: Non-linear mapping
    • n2 filters of size n1 × f2 × f2
    • Activation function: ReLU
    • Output: n2 feature maps
    • Parameters to optimize: n1 × f2 × f2 × n2 weights and n2 biases
  • Conv. Layer 3: Reconstruction
    • c filters (one per output channel) of size n2 × f3 × f3
    • Activation function: Identity
    • Output: HR image
    • Parameters to optimize: n2 × f3 × f3 × c weights and c biases
  • Loss Function: Mean squared error (MSE) between the N reconstructed HR images and the N original true HR images in the training set (N is the number of images in the training set).

In their paper, the authors implement and test the SRCNN using several settings, varying the number of filters. They get better SR performance when they increase the number of filters, at the expense of increasing the number of parameters (weights and biases) to optimize, which in turn increases the computational cost. Next is their reference model, which shows good overall results in terms of accuracy/performance (Figure 1):

  • Input Image: LR single-channel image up-sampled to desired higher resolution
  • Conv. Layer 1: Patch extraction
    • 64 filters of size 1 x 9 x 9
    • Activation function: ReLU
    • Output: 64 feature maps
    • Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases
  • Conv. Layer 2: Non-linear mapping
    • 32 filters of size 64 x 1 x 1
    • Activation function: ReLU
    • Output: 32 feature maps
    • Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases
  • Conv. Layer 3: Reconstruction
    • 1 filter of size 32 x 5 x 5
    • Activation function: Identity
    • Output: HR image
    • Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias

Figure 1. Structure of SRCNN showing parameters for reference model.
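For readers who prefer to generate such a network definition programmatically, the reference SRCNN (9-1-5) above can be sketched with the pycaffe NetSpec interface. This is only a minimal illustrative sketch, not the authors' released definition; the HDF5 source path, batch size, filler values, and the use of padding are assumptions made here for convenience:

import caffe
from caffe import layers as L

def srcnn(hdf5_source, batch_size=128):
    n = caffe.NetSpec()
    # HDF5 file list providing 'data' (up-sampled LR) and 'label' (HR) patches
    n.data, n.label = L.HDF5Data(batch_size=batch_size, source=hdf5_source, ntop=2)
    # Patch extraction: 64 filters of size 9x9 (padding keeps the output size equal
    # to the label size; the paper uses no padding and crops the labels instead)
    n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=9, pad=4,
                            weight_filler=dict(type='gaussian', std=0.001))
    n.relu1 = L.ReLU(n.conv1, in_place=True)
    # Non-linear mapping: 32 filters of size 1x1
    n.conv2 = L.Convolution(n.relu1, num_output=32, kernel_size=1,
                            weight_filler=dict(type='gaussian', std=0.001))
    n.relu2 = L.ReLU(n.conv2, in_place=True)
    # Reconstruction: 1 filter of size 5x5
    n.conv3 = L.Convolution(n.relu2, num_output=1, kernel_size=5, pad=2,
                            weight_filler=dict(type='gaussian', std=0.001))
    # MSE loss between the reconstruction and the ground-truth HR patch
    n.loss = L.EuclideanLoss(n.conv3, n.label)
    return n.to_proto()

with open('SRCNN_train.prototxt', 'w') as f:
    f.write(str(srcnn('examples/SRCNN/train.txt')))

Running this writes a prototxt equivalent to the three-layer structure listed above, which can then be referenced from a Caffe solver file.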

Fast Super-Resolution Convolutional Neural Network (FSRCNN) Structure

The authors of the SRCNN recently created a new CNN which accelerates the training and prediction tasks, while achieving comparable or better performance compared to SRCNN. The new FSRCNN consists of the following operations [2]:

  1. Feature extraction: Extracts a set of feature maps directly from the LR image.
  2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
  3. Non-linear mapping: Maps the feature maps representing LR to HR patches. This step is performed using several mapping layers with a filter size smaller than the one used in the SRCNN.
  4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers, in order to more accurately produce the HR image.
  5. Deconvolution: Produces the HR image from HR features.

The authors explain in detail the differences between the SRCNN and the FSRCNN, but the points particularly relevant for quick implementation and experimentation (the scope of this article and the associated tutorial) are the following:

  1. FSRCNN uses multiple convolution layers for the non-linear mapping operation (instead of a single layer in SRCNN). The number of layers can be changed (compared to the authors’ version) in order to experiment. Performance and accuracy of reconstruction will vary with those changes. Also, this is a good example for fine-tuning a CNN by keeping the portion of FSRCNN fixed up to the non-linear mapping layers, and then adding or changing those layers to experiment with different lengths for the non-linear LR-HR mapping operation.
  2. The input image is directly the LR image. It does not need to be up-sampled to the size of the expected HR image, as in the SRCNN. This is part of why this network is faster; the feature extraction stage uses a smaller number of parameters compared to the SRCNN.

As seen in Figure 2, the five operations shown above can be cast as a CNN using convolutional layers for operations 1–4, and a deconvolution layer for operation 5. Non-linearities are introduced via parametric rectified linear unit (PReLU) layers (described in [5]), which the authors chose for this particular model because of their better and more stable performance compared to rectified linear unit (ReLU) layers. See Appendix 1 for a brief description of ReLUs and PReLUs.

Figure 2. Structure of FSRCNN(56, 12, 4).

The overall best performing model reported by the authors is the FSRCNN (56, 12, 4) (Figure 2), which refers to a network with a LR feature dimension of 56 (number of filters both in the first convolution and in the deconvolution layer), 12 shrinking filters (the number of filters in the layers in the middle of the network, performing the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and the HR feature space). This is the reason why this network looks like an hourglass; it is thick (more parameters) at the edges and thin (fewer parameters) in the middle. The overall shape of this reference model is symmetrical and its structure is as follows:

  • Input Image: LR single channel.
  • Conv. Layer 1: Feature extraction
    • 56 filters of size 1 x 5 x 5
    • Activation function: PReLU
    • Output: 56 feature maps
    • Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases
  • Conv. Layer 2: Shrinking
    • 12 filters of size 56 x 1 x 1
    • Activation function: PReLU
    • Output: 12 feature maps
    • Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases
  • Conv. Layers 3–6: Mapping
    • 4 x 12 filters of size 12 x 3 x 3
    • Activation function: PReLU
    • Output: HR feature maps
    • Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases
  • Conv. Layer 7: Expanding
    • 56 filters of size 12 x 1 x 1
    • Activation function: PReLU
    • Output: 56 feature maps
    • Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases
  • DeConv Layer 8: Deconvolution
    • One filter of size 56 x 9 x 9
    • Activation function: PReLU
    • Output: HR image (one channel)
    • Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias

Total number of weights: 12464 (plus a very small number of parameters in PReLU layers)
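
As a quick sanity check of these totals, the weight and bias counts of an FSRCNN(d, s, m) network can be tallied directly from the layer sizes listed above. This is a small illustrative Python sketch, not part of the authors' code:

def fsrcnn_params(d=56, s=12, m=4):
    """Count weights and biases of FSRCNN(d, s, m) from the layer sizes above."""
    weights = (1 * 5 * 5 * d          # feature extraction, 5x5 filters
               + d * 1 * 1 * s        # shrinking, 1x1 filters
               + m * (s * 3 * 3 * s)  # m mapping layers, 3x3 filters
               + s * 1 * 1 * d        # expanding, 1x1 filters
               + d * 9 * 9 * 1)       # deconvolution, 9x9 filter
    biases = d + s + m * s + d + 1
    return weights, biases

print(fsrcnn_params())  # (12464, 173)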

Figure 3 shows an example of using the trained FSRCNN on one of the test images. The protobuf file describing this network, as well as training and testing data preparation and implementation details, will be covered in the associated tutorial.

Figure 3. An example of inference using a trained FSRCNN. The left image is the original. In the center, the original image was down-sampled and blurred. The image on the right is the reconstructed HR image using this network.

Summary

This article presented an overview of two recent CNNs for single-image super-resolution. The networks we chose are representative of state-of-the-art methods for SR and, as some of the first published CNN-based methods, offer interesting insight into how a non-CNN method (sparse coding) inspired a CNN-based approach. In the tutorial, an implementation of the FSRCNN is shown using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This reference implementation can be used to experiment with variations of this network, as a base for implementing the newer super-resolution networks that have been published recently, and as a good example of fine-tuning a network. These newer architectures show improvements in reconstruction quality or training/inference speed, and some of them attempt to solve the multi-frame SR problem. The reader is encouraged to experiment with them.

Appendix 1: Rectified Linear Units (Rectifiers)

Rectified activation units (rectifiers) in neural networks are one way to introduce non-linearities in the network. A non-linear layer (also called an activation layer) is necessary in an NN to prevent it from becoming a pure linear model with limited learning capabilities. Other possible activation layers include the sigmoid and hyperbolic tangent (tanh) functions. However, rectifiers have better computational efficiency, improving the overall training of the CNN.

The most commonly used rectifier is the traditional rectified linear unit (ReLU), which performs an operation defined mathematically as:

f(xi) = max(0, xi)

where xi is the input on the i-th channel.

Another rectifier, introduced more recently [5], is the parametric rectified linear unit (PReLU), defined as:

f(xi) = max(0, xi) + pi × min(0, xi)

which includes parameters pi controlling the slope of the line representing the negative inputs. These parameters are learned jointly with the model during the training phase. To reduce the number of parameters, the pi parameters can be collapsed into a single learnable parameter shared by all channels.

A particular case of the PReLU is the leaky ReLU (LReLU), which is a PReLU with pi defined as a small constant k for all input channels.
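
In plain NumPy terms, the three variants just described can be written as follows (a small illustrative sketch; p may be a scalar or a per-channel array):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def prelu(x, p):
    # p is the learned slope applied to negative inputs
    return np.maximum(0.0, x) + p * np.minimum(0.0, x)

def lrelu(x, k=0.1):
    # a PReLU whose negative slope is a fixed constant k
    return prelu(x, k)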

In Caffe, a PReLU layer can be defined (in a protobuf file) as

layer {
 name: "reluXX"
 type: "PReLU"
 bottom: "convXX"
 top: "convXX"
 prelu_param {
  channel_shared: true
 }
}

Where, in this case, the negative slope is shared across channels (a single learnable parameter). To use an LReLU with a truly fixed slope, a standard ReLU layer with the negative_slope parameter can be used instead:

layer {
 name: "reluXX"
 type: "ReLU"
 bottom: "convXX"
 top: "convXX"
 relu_param {
  negative_slope: 0.1
 }
}

References

1. C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.

2. C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.

3. P. B. Chopade and P. M. Patil, "Single and Multi Frame Image Super-Resolution and its Performance Analysis: A Comprehensive Survey," February 2015.

4. J. Yang, J. Wright, T. Huang and Y. Ma, "Image Super-Resolution via Sparse Representation,"IEEE Transactions on Image Processing, pp. 2861-2873, 2010.

5. K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,"arxiv.org, 2015.

6. A. Greaves and H. Winter, "Multi-Frame Video Super-Resolution Using Convolutional Neural Networks," 2016.

7. J. Kim, J. K. Lee and K. M. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," 2016.

Getting Started with Intel® SDK for OpenCL™ Applications (Linux SRB4.1)


This article is a step-by-step guide to quickly getting started developing applications using the Intel® SDK for OpenCL™ Applications on Linux for SRB4.1. This is now a legacy release. For instructions to install the latest release, please see https://software.intel.com/articles/sdk-for-opencl-gsg.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

 

Step 1: Install the driver

To run applications using OpenCL kernels on the Intel Processor Graphics GPU device with the latest features for the newest processors, you will need a driver package from here: https://software.intel.com/en-us/articles/opencl-drivers.

(If your target processor does not include Intel Processor Graphics, install the latest runtime package instead.)

This script covers the steps needed to install the SRB4 driver package on Ubuntu 14.04.

To use it:

$ tar -xvf install_OCL_driver_ubuntu.tgz
$ sudo su
$ ./install_OCL_driver_ubuntu.sh

 

This script automates downloading prerequisites, installing the user-mode components, patching the 4.7 kernel, and building it. 

 

You can check your progress with the System Analyzer Utility. If successful, you should see smoke test results like the following at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

 

Step 2: Install the SDK

This script sets up all prerequisites for a successful SDK install on Ubuntu. After running it, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               
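
Before moving to the offline compiler, you can optionally exercise this kernel end-to-end from Python using the third-party PyOpenCL package. This is not part of the Intel SDK; it is only an illustrative smoke-test sketch and assumes the kernel above is saved as simpleAdd.cl:

import numpy as np
import pyopencl as cl

# Build the kernel source shown above
src = open('simpleAdd.cl').read()
ctx = cl.create_some_context()      # select the Intel platform/device when prompted
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

n = 1024
a = np.arange(n, dtype=np.int32)
b = np.arange(n, dtype=np.int32)
c = np.empty_like(a)

mf = cl.mem_flags
d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
d_c = cl.Buffer(ctx, mf.WRITE_ONLY, c.nbytes)

# Run simpleAdd over n work-items and read back the result
prg.simpleAdd(queue, (n,), None, d_a, d_b, d_c)
cl.enqueue_copy(queue, c, d_c)
assert (c == a + b).all()
print("simpleAdd OK")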

Check that the command line compiler ioc64 is installed with

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!

 

Step 3: Set up Eclipse

Intel SDK for OpenCL applications works with Eclipse Mars and Neon.

After installing, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

GPU Debugging: Challenges and Opportunities


From GPU Debugging: Challenges and Opportunities presented at the International Workshop on OpenCL (IWOCL) 2017.

GPU debugging support matches the OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit) from OpenCL™ Drivers and Runtimes for Intel® Architecture with the notable exception of processors based on the Broadwell architecture. 

Basic concepts to keep in mind for GPU debugging

  • There are "host" and "target" components.  Host = where you interact with the debugger, target = where the application is run 
  • There are 3 components: gdbserver, the application to be debugged, and the gdb session.
  • The gdb session and application can be on the same or different machines
  • Breakpoints in the graphics driver can affect screen rendering.  You cannot debug on the same system that is rendering your screen.
  • However, non-graphical connections (such as with SSH) are unaffected. The "host" can be connected to remotely as well for gdb's text interface.

Abbreviations used:

KMD - Kernel Mode Driver

RT = OpenCL Runtime

DCD – Debug Companion Driver

  • Ring-0 driver, provides low-level gfx access
  • Run control flow, breakpoints, etc

DSL – Debug Support Library

  • Ring-3 debugger driver (shared library)
  • Loaded into the gdbserver process

DSL <--> DCD

  • Communicate via IOCTLs

How to set up a debugging session

The simplest option is to use ssh for steps 1, 2, and 3. However, gdb can be run locally as well. The target steps (1 and 2) should be run remotely because GPU breakpoints can cause rendering hangs.

1. launch gdbserver

/usr/bin/gdbserver-igfx :1234 --attach 123

2. launch the application

export IGFXDBG_OVERRIDE_CLIENT_PID=123
./gemm

Note: there is an automatic breakpoint at the first kernel launch

3. launch GDB

source /opt/intel/opencl-sdk/gt_debugger_2016.0/bin/debuggervars.sh

/opt/intel/opencl-sdk/gt_debugger_2016.0/bin/launch_gdb.sh -tui

In GDB

target remote :1234
continue
x/i $pc

GDB should now be able to step through the kernel code.

OpenCL™ Drivers and Runtimes for Intel® Architecture


What to Download

By downloading a package from this page, you accept the End User License Agreement.

Installation has two parts:

  1. Intel® SDK for OpenCL™ Applications Package
  2. Driver and library(runtime) packages

The SDK includes components to develop applications: IDE integration, offline compiler, debugger, and other tools.  Usually on a development machine the driver/runtime package is also installed for testing.  For deployment you can pick the package that best matches the target environment.

The illustration below shows some example install configurations. 

 

SDK Packages

Please note: a GPU/CPU driver package or CPU-only runtime package is required in addition to the SDK to execute applications.

Standalone:

Suite: (also includes driver and Intel® Media SDK)

 

Driver/Runtime Packages Available

GPU/CPU Driver Packages

CPU-only Runtime Packages  

 


Intel® SDK for OpenCL™ Applications 2016 R3 for Linux (64-bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio. It provides components to develop OpenCL applications for Intel processors. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.

Intel® SDK for OpenCL™ Applications 2016 R3 for Windows* (64-bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio. The standard Windows graphics driver packages contains the driver and runtime library components necessary to run OpenCL applications. This package provides components for OpenCL development. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.


OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit)

 

The intel-opencl-r5.0 (SRB5.0) Linux driver package enables OpenCL 1.2 or 2.0 on the GPU/CPU for the following Intel® processors:

  • Intel® 5th, 6th or 7th generation Core™ processor
  • Intel® Celeron® Processor J3000 Series with Intel® HD Graphics 500 (J3455, J3355), Intel® Pentium® Processor J4000 Series with Intel® HD Graphics 505 (J4205), Intel® Celeron® Processor N3000 Series with Intel® HD Graphics 500 (N3350, N3450), Intel® Pentium Processor N4000 Series with Intel® HD Graphics 505 (N4200)
  • Intel® Xeon® v4, or Intel® Xeon® v5 Processors with Intel® Graphics Technology (if enabled by OEM in BIOS and motherboard)

Installation Instructions.  Scripts to automate install and additional install documentation available here.

Intel validates the intel-opencl-r5.0 driver on CentOS 7.2 and 7.3 when running the following 64-bit kernels:

  • Linux 4.7 kernel patched for OpenCL
  • Linux 4.4 kernel patched for  Intel® Media Server Studio 2017 R3

Although Intel validates and provides technical support only for the above Linux kernels on CentOS 7.2 and 7.3, other distributions may be adapted by utilizing our generic operating system installation steps as well as MSS 2017 R3 installation steps.  

In addition: Intel also validates Ubuntu 16.04.2 when running the following 64-bit kernel:

  • Ubuntu 16.04.2 default 4.8 kernel

Ubuntu 16.04 with the default kernel works fairly well, but some core features (e.g., device enqueue, SVM memory coherency, VTune support) won’t work without kernel patches. This configuration has been minimally validated, enough to show it is viable for experimental use, but it is not fully supported or certified.

Supported OpenCL devices:

  • Intel® graphics (GPU)
  • CPU

For detailed information please see the driver package Release Notes. 

Previous Linux driver packages:

Intel intel-opencl-r4.1 (SRB4.1) Linux driver package | Installation instructions | Release Notes
Intel intel-opencl-r4.0 (SRB4) Linux driver package | Installation instructions | Release Notes
SRB3.1 Linux driver package | Installation instructions | Release Notes

For Linux drivers covering earlier platforms such as 4th generation Intel Core processor please see the versions of Media Server Studio in the Driver Support Matrix.


OpenCL™ Driver for Iris™ graphics and Intel® HD Graphics for Windows* OS (64-bit and 32-bit)

The standard Intel graphics drivers for Windows* include components needed to run OpenCL* and Intel® Media SDK applications on processors with Intel® Iris™ Graphics or Intel® HD Graphics on Windows* OS.

You can use the Intel Driver Update Utility to automatically detect and update your drivers and software.  Using the latest available graphics driver for your processor is usually recommended.

 

Supported OpenCL devices:

  • Intel graphics (GPU)
  • CPU

For the full list of Intel® Architecture processors with OpenCL support on Intel Graphics under Windows*, refer to the Release Notes.

 


OpenCL™ Runtime for Intel® Core™ and Intel® Xeon® Processors

This runtime software package adds OpenCL CPU device support on systems with Intel Core and Intel Xeon processors.

Supported OpenCL devices:

  • CPU

Latest release (16.1.1)

Previous Runtimes (16.1)

Previous Runtimes (15.1):

For the full list of supported Intel® architecture processors, refer to the OpenCL™ Runtime Release Notes.

 


Deprecated Releases

Note: These releases are no longer maintained or supported by Intel

OpenCL™ Runtime 14.2 for Intel® CPU and Intel® Xeon Phi™ Coprocessors

This runtime software package adds OpenCL support to Intel Core and Xeon processors and Intel Xeon Phi coprocessors.

Supported OpenCL devices:

  • Intel Xeon Phi coprocessor
  • CPU

Available Runtimes

For the full list of supported Intel architecture processors, refer to the OpenCL™ Runtime Release Notes.


An Example of a Convolutional Neural Network for Image Super-Resolution—Tutorial


This tutorial describes one way to implement a CNN (convolutional neural network) for single-image super-resolution using the Intel® Distribution for Caffe* deep learning framework, optimized for Intel® architecture, and the Intel® Distribution for Python*, which let us take advantage of Intel processors and Intel libraries to accelerate training and testing of this CNN.

The CNN we use in this tutorial is the Fast Super-Resolution Convolutional Neural Network (FSRCNN), based on the work described in [1] and [2], whose authors proposed a new approach to performing single-image SR using CNNs. We describe this network and its predecessor, the Super-Resolution Convolutional Neural Network (SRCNN), in more detail in an associated article (“An Example of a Convolutional Neural Network for Image Super-Resolution”).

FSRCNN Structure

As described in the associated article and in [2], the FSRCNN consists of the following operations:

  1. Feature extraction: Extracts a set of feature maps directly from the low-resolution (LR) image.
  2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
  3. Non-linear mapping: Maps feature maps representing LR patches to high-resolution (HR) ones. This step is performed using several mapping layers with a filter size smaller than the one used in the SRCNN.
  4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers in order to more accurately produce the HR image.
  5. Deconvolution: Produces the HR image from HR features.

The structure of the FSRCNN (56, 12, 4) model (which is the best performing model reported in [2], and described in the associated article) is shown in Figure 1. It has a LR feature dimension of 56 (number of filters both in the first convolution and in the deconvolution layer), 12 shrinking filters (the number of filters in the layers in the middle of the network, performing the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and the HR feature space).

Figure 1: Structure of the FSRCNN (56, 12, 4).

Training and Testing Data Preparation

Datasets to train and test this implementation are available from the authors’ [2]  website. The train dataset consists of 91 images of different sizes. There are two test datasets: Set 5 (containing 5 images) and Set 14 (containing 14 images). In this tutorial, both train and test datasets will be packed into an HDF5* file (https://support.hdfgroup.org/), which can be efficiently used from the Caffe framework. For more information about Caffe optimized for Intel® architecture, visit Manage Deep Learning Networks with Caffe* Optimized for Intel® Architecture and Recipe: Optimized Caffe* for Deep Learning on Intel® Xeon Phi™ Processor x200.

Both train and test datasets need some preprocessing, as follows:

  • Train dataset: First, the images are converted to YCrCb color space (https://en.wikipedia.org/wiki/YCbCr), and only the luminance channel Y is used in this tutorial. Each of the 91 images in the train dataset is downsampled by a factor k, where k is the scaling factor desired for super-resolution, obtaining in this way a pair of corresponding LR and HR images. Next, each image pair (LR/HR) is cropped into a subset of small subimages, using stride s, so we end up with N pairs of LR/HR subimages for each of the 91 original train images. The reason for cropping the images for training is that we want to train the model using both LR and HR local features located in a small area. The number of subimages, N, depends on the size of the subimages and the stride s. For their experiments, the authors of [2] define a 7x7-pixel size for the LR subimages and a 21x21-pixel size for the HR subimages, which corresponds to a scaling factor k=3.
  • Test dataset: Each image in the test dataset is processed in the same way as the training dataset, with the exception that the stride s can be larger than the one used for training, to accelerate the testing procedure.

The following Python code snippets show one possible way to generate the train and test datasets. We use OpenCV* (http://opencv.org/) to handle and preprocess the images. The first snippet shows how to generate the HR and LR subimage pair set from one of the original images in the 91-image train dataset for the specific case where scaling factor k=3 and stride = 19:

import os
import sys
import numpy as np
import h5py

sys.path.append('$CAFFE_HOME/opencv-2.4.13/release/lib/')
import cv2

# Parameters
scale = 3
stride = 19
size_ground = 19
size_input = 7
size_pad = 2

#Read image to process
image = cv2.imread('<PATH TO FILES>/Train/t1.bmp')

#Change color-space to YCR_CB and keep only the luminance (Y) channel
#(note: cv2.imread returns BGR data, so cv2.COLOR_BGR2YCR_CB may be the more accurate conversion flag)
image_ycrcb = cv2.cvtColor(image, cv2.COLOR_RGB2YCR_CB)
image_ycrcb = image_ycrcb[:,:,0]
image_ycrcb = image_ycrcb.reshape((image_ycrcb.shape[0], image_ycrcb.shape[1], 1))

#Compute size of LR images and resize HR images to a multiple of scale
height, width = image_ycrcb.shape[:2]
height_small = int(height/scale)
width_small  = int(width/scale)

image_pair_HR = cv2.resize(image_ycrcb, (width_small*scale, height_small*scale) )
image_pair_LR = cv2.resize(image_ycrcb, (width_small, height_small) )

# Declare tensors to hold 1024 LR-HR subimage pairs
input_HR = np.zeros((size_ground, size_ground, 1, 1024))
input_LR = np.zeros((size_input + 2*size_pad, size_input + 2*size_pad, 1, 1024))

height, width = image_pair_HR.shape[:2]

#Iterate over the train image using the specified stride and create LR-HR subimage pairs
count = 0
for i in range(0, height-size_ground+1, stride):
    for j in range(0, width-size_ground+1, stride):
       subimage_HR = image_pair_HR[i:i+size_ground, j:j+size_ground]
       count = count + 1
       height_small = size_input
       width_small  = size_input
       subimage_LR = cv2.resize(subimage_HR, (width_small, height_small) )

       # Zero-pad the LR subimage by size_pad on all sides before storing it
       input_HR[:,:,0,count-1] = subimage_HR
       input_LR[:,:,0,count-1] = np.lib.pad(subimage_LR, ((size_pad, size_pad), (size_pad, size_pad)), 'constant', constant_values=(0.0))

The next snippet shows how to use the Python h5py module to create an HDF5 file that contains the HR and LR subimage pair set created in the previous snippet:

(…)
#Create an hdf5 file
with h5py.File('train1.h5','w') as H5:
    H5.create_dataset( 'Input', data=input_LR )
    H5.create_dataset( 'Ground', data=input_HR )
(…)

The previous two snippets can be used to create the HDF5 files containing the entire 91-image training set to be used for training in Caffe.
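
For example, once a train*.h5 file has been written for each of the 91 training images, Caffe's HDF5Data layer only needs a plain text file listing those paths, one per line. A minimal sketch (the file names are assumptions; note also that the dataset names stored in the HDF5 files must match the top blob names declared in the network definition):

import glob

# List the per-image HDF5 files produced above in train.txt,
# the 'source' file read by Caffe's HDF5Data layer
with open('train.txt', 'w') as f:
    for name in sorted(glob.glob('train*.h5')):
        f.write(name + '\n')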

FSRCNN Training

The reference model (described in the previous section) is implemented using Intel® Distribution for Caffe, which has been optimized to run on Intel CPUs. An introduction to the basics of this framework and directions to install it can be found at the Intel® Nervana AI Academy.

In Caffe, models are defined using protobuf files. The FSRCNN model can be downloaded from the authors’ [2] website. The code snippet below shows the input layer and the first convolutional layer of the FSRCNN (56, 12, 4) model defined by its authors [2]. The input layer reads the train/test data from the files whose filenames are listed in the source files (train.txt and test.txt) located in the $CAFFE_ROOT/examples/FSRCNN directory. The batch size for training is 128.

name: "SR_test"
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "examples/FSRCNN/train.txt"
    batch_size: 128
  }
  include: { phase: TRAIN }
}
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "examples/FSRCNN/test.txt"
    batch_size: 2
  }
  include: { phase: TEST }
}

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 0.1
  }
  convolution_param {
    num_output: 56
    kernel_size: 5
    stride: 1
    pad: 0
    weight_filler {
      type: "gaussian"
      std: 0.0378
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
(...)
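
Once the full protobuf definition is in place, the blob and parameter shapes can be inspected from Python before starting a long training run. This is a minimal pycaffe sketch; it assumes the HDF5 files listed in examples/FSRCNN/train.txt already exist, because the HDF5Data layer opens them when the net is constructed:

import caffe

caffe.set_mode_cpu()
net = caffe.Net('examples/FSRCNN/FSRCNN.prototxt', caffe.TRAIN)

# Output shape of every blob: (batch, channels, height, width)
for name, blob in net.blobs.items():
    print(name, blob.data.shape)

# Weight shapes of every learnable layer, e.g. conv1: (56, 1, 5, 5)
for name, params in net.params.items():
    print(name, [p.data.shape for p in params])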

To train the above model, the authors of [2] provide on their website a solver protobuf file containing the training parameters and the location of the protobuf network definition file:

# The train/test net protocol buffer definition
net: "examples/FSRCNN/FSRCNN.prototxt"
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 5000
# The base learning rate, momentum and the weight decay of the network.
#base_lr: 0.005
base_lr: 0.001
momentum: 0.9
weight_decay: 0
# Learning rate policy
lr_policy: "fixed"
# Display results every 100 iterations
display: 1000
# Maximum number of iterations
max_iter: 1000000
# write intermediate results (snapshots)
snapshot: 5000
snapshot_prefix: "examples/FSRCNN/RESULTS/FSRCNN-56_12_4"
# solver mode: CPU or GPU
solver_mode: CPU

The solver shown above will train the network defined in the model definition file FSRCNN.prototxt using the following parameters:

  • The test interval will be every 5000 iterations, and 100 is the number of forward passes the test should perform.
  • The base learning rate will be 0.001 (the authors' original value of 0.005 is commented out), and the learning rate policy is fixed, which means the learning rate will not change over time. Momentum is 0.9 (a common choice) and weight_decay is zero (no regularization to penalize large weights).
  • Intermediate results (snapshots) will be written to disk every 5000 iterations, and the maximum number of iterations (when the training will stop) is 1000000.
  • Snapshot results will be written to the examples/FSRCNN/RESULTS directory (assuming we run Caffe from the install directory $CAFFE_ROOT). Model files (containing the trained weights) will be prefixed by the string ‘FSRCNN-56_12_4’.

The reader is encouraged to experiment with different parameters. One useful option is to define a small maximum number of iterations and explore how the test error decreases, and compare this rate between different sets of parameters.

Once the network definition and solver files are ready, start training by running the caffe command located in the build/tools directory:

export CAFFE_ROOT=< Path to caffe >
$CAFFE_ROOT/build/tools/caffe train -engine "MKL2017" -solver \
    $CAFFE_ROOT/examples/FSRCNN/FSRCNN_solver.prototxt 2> $CAFFE_ROOT/examples/FSRCNN/output.log

Resume Training Using Saved Snapshots

After training the CNN, the network parameters (weights) will be written to disk according to the frequency specified by the snapshot parameter. Caffe will create two files at each snapshot:

FSRCNN-56_12_4_iter_1000000.caffemodel
FSRCNN-56_12_4_iter_1000000.solverstate

The model file contains the learned model parameters corresponding to the indicated iteration, serialized as binary protocol buffer files. The solver state file is the state snapshot containing all the necessary information to recover the solver state at the time of the snapshot. This file will let us resume training from the snapshot instead of restarting from scratch. For example, let us assume we ran training for 1 million iterations, and after that we realize that we need to run it for an extra 500K iterations to further reduce the testing error. We can restart the training using the snapshot taken after 1 million iterations:

$CAFFE_ROOT/build/tools/caffe train -engine "MKL2017" -solver \
    $CAFFE_ROOT/examples/FSRCNN/FSRCNN_solver.prototxt -snapshot \
    $CAFFE_ROOT/examples/FSRCNN/RESULTS/FSRCNN-56_12_4_iter_1000000.solverstate \
    2> $CAFFE_ROOT/examples/FSRCNN/output_resume.log

The resumed training will run until the maximum number of iterations specified in the solver file is reached; in this case, max_iter must first be increased to 1500000.

FSRCNN Testing Using Pre-Trained Parameters

Once we have a trained model, we can use it to perform super-resolution on an input LR image. We can test the network at any moment during the training as long as we have model snapshots already generated.

In practice, we can use the super-resolution model we trained to increase the resolution of any image or video. However, for the purposes of this tutorial, we want to test our trained model on an LR image for which we have an HR image to compare against. To this effect, we will use a sample image from the test dataset that is used in [1] and [2] (from the Set5 dataset, which is also commonly used to test SR models in other publications).

To perform the test, we will use a sample image (butterfly) as the ground truth. To create the input LR image, we will blur and downsample the ground truth image and feed it to the trained network. Once we forward-run the network with the input image, obtaining a super-resolved image as output, we will compare the three images (ground truth, LR, and super-resolved) to visually evaluate the performance of the SR network we trained.

The test procedure described above can be implemented in several ways. As an example, the following Python script implements the testing procedure using the OpenCV library for image handling:

import os
import sys
import numpy as np

#Set up caffe root directory and add to path
caffe_root = '$APPS/caffe/'
sys.path.insert(0, caffe_root + 'python')
sys.path.append('opencv-2.4.13/release/lib/')

import cv2
import caffe

# Parameters
scale = 3

#Create Caffe model using pretrained model
net = caffe.Net(caffe_root + 'FSRCNN_predict.prototxt',
                caffe_root + 'examples/FSRCNN/RESULTS/FSRCNN-56_12_4_iter_300000.caffemodel',
                caffe.TRAIN)

#Input directories
input_dir = caffe_root + 'examples/SRCNN/DATA/Set5/'

#Input ground truth image
im_raw = cv2.imread(caffe_root + '/examples/SRCNN/DATA/Set5/butterfly.bmp')

#Change format to YCR_CB and keep the luminance (Y) channel
ycrcb = cv2.cvtColor(im_raw, cv2.COLOR_RGB2YCR_CB)
im_raw = ycrcb[:,:,0]
im_raw = im_raw.reshape((im_raw.shape[0], im_raw.shape[1], 1))

#Blur image and resize to create input for network
im_blur = cv2.blur(im_raw, (4,4))
im_small = cv2.resize(im_blur, (int(im_raw.shape[0]/scale), int(im_raw.shape[1]/scale)))

im_raw = im_raw.reshape((1, 1, im_raw.shape[0], im_raw.shape[1]))
im_blur = im_blur.reshape((1, 1, im_blur.shape[0], im_blur.shape[1]))
im_small = im_small.reshape((1, 1, im_small.shape[0], im_small.shape[1]))

im_comp = im_blur
im_input = im_small

#Set mode to run on CPU
caffe.set_mode_cpu()

#Copy input image data to net structure
c1,c2,h,w = im_input.shape
net.blobs['data'].reshape(c1, c2, h, w)   # make the input blob match the LR image size
net.blobs['data'].data[...] = im_input

#Run forward pass
out = net.forward()

#Extract output image from net, change format to uint8 and reshape
mat = out['conv3'][0]
mat = (mat[0,:,:]).astype('uint8')

im_raw = im_raw.reshape((im_raw.shape[2], im_raw.shape[3]))
im_blur = im_blur.reshape((im_blur.shape[2], im_blur.shape[3]))
im_comp = im_blur.reshape((im_comp.shape[2], im_comp.shape[3]))

#Display original (ground truth), blurred and restored images
cv2.imshow("image",im_raw)
cv2.imshow("image2",im_comp)
cv2.imshow("image3",mat)
cv2.waitKey()

cv2.destroyAllWindows()

Running the above script on the test image displays the output shown in Figure 2. Readers are encouraged to try this network and refine the parameters to obtain better super-resolution results.

Figure 2: Testing the trained FSRCNN. The left image is the ground truth. The image in the center is the ground truth after being blurred and downsampled. The image on the right is the super-resolved image using a model snapshot after 300000 iterations.
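
Beyond the visual comparison in Figure 2, reconstruction quality can be quantified with a peak signal-to-noise ratio (PSNR) measurement, the metric reported in [1] and [2]. The psnr helper below is not part of the original script; it is a small illustrative sketch, and both images must have the same dimensions before they are compared:

import numpy as np

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio between two single-channel uint8 images."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Example, appended to the test script above after cropping both
# images to the same height and width:
# print('PSNR: %.2f dB' % psnr(im_raw, mat))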

Summary

In this short tutorial, we have shown how to train and test a CNN for super-resolution. The CNN we described is the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [2], which is described in more detail in an associated article (“An Example of a Convolutional Neural Network for Image Super-Resolution”). This particular CNN was chosen for this tutorial because of its relative simplicity, good performance, and the importance of the authors’ work in the area of CNNs for super-resolution. Several new CNN architectures for super-resolution have been described in the literature recently, and several of them compare their performance to the FSRCNN or its predecessor, created by the same authors: the SRCNN [1].

The training and testing in this tutorial was performed using Intel® Xeon® processors and Intel® Xeon Phi™ processors, using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which are optimized to run on Intel Xeon processors and Intel Xeon Phi processors.

Deep learning-based image/video super-resolution is an exciting development in the field of computer vision. Readers are encouraged to experiment with this network, as well as newer architectures, and test with their own images and videos. To start using Intel’s optimized tools for machine learning and deep learning, visit Intel® Developer Zone (Intel® DZ).

Bibliography

[1] C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.

[2] C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.

Setting Up Intel® Ethernet Flow Director


Introduction

Intel® Ethernet Flow Director (Intel® Ethernet FD) directs Ethernet packets to the core where the packet consuming process, application, container, or microservice is running. It is a step beyond receive side scaling (RSS) in which packets are sent to different cores for interrupt processing, and then subsequently forwarded to cores on which the consuming process is running.

Intel Ethernet FD supports advanced filters that direct received packets to different queues, and enables tight control on flow in the platform. It matches flows and CPU cores where the processing application is running for flow affinity, and supports multiple parameters for flexible flow classification and load balancing. When operating in Application Targeting Routing (ATR) mode, Intel Ethernet FD is essentially the hardware offloaded version of Receive Flow Steering available on Linux* systems, and when running in this mode, Receive Packet Steering and Receive Flow Steering are disabled.

It provides the most benefit on Linux bare-metal usages (that is, not using virtual machines (VMs)) where packets are small and traffic is heavy. And because the packet processing is offloaded to the network interface card (NIC), Intel Ethernet FD could be used to avert denial-of-service attacks.

Supported Devices

Intel Ethernet FD is supported on devices that use the ixgbe driver, including the following:

  • Intel® Ethernet Converged Network Adapter X520
  • Intel® Ethernet Converged Network Adapter X540
  • Intel® Ethernet Controller 10 Gigabit 82599 family

It is also supported on devices that use the i40e driver:

  • Intel® Ethernet Controller X710 family
  • Intel® Ethernet Controller XL710 family

DPDK includes support for Intel Ethernet FD on the devices listed above. See the DPDK documentation for how to use DPDK and testpmd with Intel Ethernet FD.

In order to determine whether your device supports Intel Ethernet FD, use the ethtool command with the --show-features or -k parameter on the network interface you want to use:

# ethtool --show-features <interface name> | grep ntuple

Screenshot of using the ethtool command to detect Intel Ethernet Flow Director support.

If the ntuple-filters feature is followed by off or on, Intel Ethernet FD is supported on your Ethernet adapter. However, if the ntuple-filters feature is followed by off [fixed], Intel Ethernet FD is not supported on your network interface.

Enabling Intel® Ethernet Flow Director

Driver Parameters for Devices Supported by the ixgbe Driver

On devices that are supported by the ixgbe driver, there are two parameters that can be passed-in when the driver is loaded into the kernel that will affect Intel Ethernet FD:

  • FdirPballoc
  • AtrSampleRate 

FdirPballoc

This driver parameter specifies the packet buffer size allocated to Intel Ethernet FD. The valid range is 1–3, where 1 specifies that 64k should be allocated for the packet buffer, 2 specifies a 128k packet buffer, and 3 specifies a 256k packet buffer. If this parameter is not explicitly passed to the driver when it is loaded into the kernel, the default value is 1 for a 64k packet buffer.

AtrSampleRate

The AtrSampleRate parameter indicates how many Tx packets will be skipped before a sample is taken. The valid range is from 0 to 255. If the parameter is not passed to the driver when it is loaded into the kernel, the default value is 20, meaning that every 20th packet will be sampled to determine if a new flow should be created. Passing a value of 0 will disable ATR mode, and no samples will be taken from the Tx queues.

The above driver parameters are not supported on devices that use the i40e driver.

To enable these parameters, first unload the ixgbe module from the kernel. Note, if you are connecting to the system over ssh, this may disconnect your session:

# rmmod ixgbe

Then re-load the ixgbe driver into the kernel with the desired parameters listed above:

# modprobe ixgbe FdirPballoc=3,2,2,3 AtrSampleRate=31,63,127,255

Note that, in this example, for each parameter there are four values. This is because on my test system, I have two network adapters that are using the ixgbe driver--an Intel Ethernet Controller 10 Gigabit 82599, and an Intel® Ethernet Controller 10 Gigabit X540--each of which has two ports. The order in which the parameters are applied is in PCI Bus/Device/Function order. To determine the PCI BDF order on your system, use the following command:

# lshw -c network -businfo

Screenshot of lshw command showing PCI Bus, Device Function information for NICs

Based on this system configuration, using the modprobe command above, the Intel Ethernet Controller 10 Gigabit X540-AT2 port at PCI address 00:03.0 is allocated the FdirPballoc and AtrSampleRate parameters of 3 and 31, respectively, and the Intel Ethernet Controller 10 Gigabit 82599 port at PCI address 81:00.1 is allocated the FdirPballoc and AtrSampleRate parameters of 3 and 255, respectively.

Once you have determined that your Intel branded server network adapter supports Intel Ethernet FD and you have loaded the desired parameters into the driver (on supported models), execute the following command to enable Intel Ethernet FD:

# ethtool --features enp4s0f0 ntuple on

Screenshot of using ethtool command to turn Intel Flow Director on

Because the commands below only indicate which Rx queue a matched packet should be sent to, ideally an additional step should be taken to pin both Rx queues and the process, application, or container that is consuming the network traffic to the same CPU. Pinning an application/process/container to a CPU is beyond the scope of this document, but it can be done using the taskset command. Pinning IRQs to a CPU can be done using the set_irq_affinity script that is included with the freely available sources of the i40e and ixgbe drivers. See Intel Support: Drivers and Software for the latest versions of these drivers. See also the IRQ Affinity section in this tuning guide for how to set IRQ affinity.

Using Intel Ethernet Flow Director

Intel Ethernet FD can run in one of two modes: externally programmed (EP) mode, and ATR mode. Once Intel Ethernet FD is enabled as shown above, ATR mode is the default mode, provided that the driver is in multiple Tx queue mode. When running in EP mode, the user or management/orchestration software can manually set how flows are handled. In either mode, fields are intelligently selected from the packets in the Rx queues to index into the Perfect-Match filter table. For more information on how Intel Ethernet FD works, see this whitepaper.

Application Targeting Routing

In ATR mode, Intel Ethernet FD uses fields from the outgoing packets in the Tx queues to populate the 8K-entry Perfect-Match filter table. The fields that are selected depend on the packet type; for example, fields to filter TCP traffic will be different than those used to filter user diagram protocol (UDP) traffic. Intel Ethernet FD then uses the Perfect-Match filter table to intelligently route incoming traffic to the Rx queues.

To disable ATR mode and switch to EP mode, simply use the ethtool command shown under Adding Filters to manually add a filter, and the driver will automatically enter EP mode. To automatically re-enable ATR mode, use the ethtool command under Removing Filters until the Perfect-Match filter table is empty.

Externally Programmed Mode

When Intel Ethernet FD runs in EP mode, flows are manually entered by an administrator or by management/orchestration software (for example, OpenFlow*). As mentioned above, once enabled, Intel Ethernet FD automatically enters EP mode when a flow is manually entered using the ethtool command listed under Adding Filters.

Adding Filters

The following commands illustrate how to add flows/filters to Intel Ethernet FD using the -U, -N, or --config-ntuple switch to ethtool.

To specify that all traffic from 10.23.4.6 to 10.23.4.18 be placed in queue 4, issue this command:

# ethtool --config-ntuple <interface name> flow-type tcp4 src-ip 10.23.4.6 dst-ip 10.23.4.18 action 4

Note: Without the ‘loc’ parameter, the rule is placed at position 1 of the Perfect-Match filter table. If a rule is already in that position, it is overwritten.

The following command forwards to queue 2 all IPv4 TCP traffic from 192.168.10.1:2000 that is going to 192.168.10.2:2001, placing the filter at position 33 of the Perfect-Match filter table (and overwriting any rule currently in that position):

# ethtool --config-ntuple <interface name> flow-type tcp4 src-ip 192.168.10.1 dst-ip 192.168.10.2 src-port 2000 dst-port 2001 action 2 loc 33

The following command drops all UDP packets from 10.4.82.2:

# ethtool --config-ntuple <interface name> flow-type udp4 src-ip 10.4.82.2 action -1

Note: The VLAN field is not a supported filter with the i40e driver (Intel Ethernet Controller XL710 and Intel Ethernet Controller X710 NICs).

For more information and options, see the ethtool man page documentation on the -U, -N, or --config-ntuple option.

Note: The Intel Ethernet Controller XL710 and the Intel Ethernet Controller X710, of the Intel® Ethernet Adapter family, provide extended cloud filter flow support for more complex cloud networks. For more information on this feature, please see the Cloud Filter Support section in this ReadMe document, or in the ReadMe document in the root folder of the i40e driver sources.

Removing Filters

In EP mode, to remove a filter from the Perfect-Match filter table, execute the following command against the appropriate interface. ‘N’ in the rule below is the numeric location in the table that contains the rule you want to delete:

# ethtool --config-ntuple <interface name> delete N

Listing Filters

To list the filters that have been manually entered in EP mode, execute the following command against the desired interface:

# ethtool --show-ntuple <interface name>

Disabling Intel Ethernet Flow Director

Disabling Intel Ethernet FD is done with this command:

# ethtool --features enp4s0f0 ntuple off

This flushes all entries from the Perfect-Match filter table.

Conclusion

Intel Ethernet FD directs Ethernet packets to the core where the packet consuming process, application, container, or microservice is running. This functionality is a step beyond RSS, in which packets are simply sent to different cores for interrupt processing, and then subsequently forwarded to cores on which the consuming process is running. It can be explicitly programmed by administrators and control plane management software, or it can intelligently sample outgoing traffic and automatically create Perfect-Match filters for incoming packets. When operating in automatic ATR mode, Intel Ethernet FD is essentially the hardware offloaded version of Receive Flow Steering available on Linux systems.

Intel Ethernet FD can provide additional performance benefit, particularly in workloads where packets are small and traffic is heavy (for example, in Telco environments). And because it can be used to filter and drop packets at the network interface card (NIC), it could be used to avert denial-of-service attacks.

 

Resources

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf

https://downloadmirror.intel.com/26556/eng/README.txt

https://downloadmirror.intel.com/26713/eng/Readme.txt

https://downloadmirror.intel.com/22919/eng/README.txt

http://dpdk.org/doc/guides/howto/flow_bifurcation.html

http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/xl710-sr-iov-config-guide-gbe-linux-brief.pdf

http://software.intel.com/en-us/videos/creating-virtual-functions-using-sr-iov

Also, view the ReadMe file found in the root directory of both the i40e and ixgbe driver sources.

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation

Cannot Connect to Intel® Flexlm* License Server Due to Firewall


Problem

License check-out on the client system received the following error:

 INTEL: Cannot connect to license server system. (-15,570:115 "Operation now in progress")

The user could telnet to the server on port 28518, and port 28518 was open in the firewall.


Root Cause

When the Intel(R) FLEXlm license server starts, there are two server daemons running:

One is the FlexNet Publisher* license server daemon, which uses the default port 28518 or the one set on the SERVER line of the license file.

The other is the Intel(R) Software License Manager vendor daemon, which uses the TCP/IP port number specified on the VENDOR line of the license file. Normally this number is omitted. You can find the actual number in the server log. Depending on your operating system, the server log file is located at <install drive>:\program files\common files\intel\flexlm\iflexlmlog.txt on Windows*, or at <install location of servers>/lmgrd.log on Linux* or OS X*. You may find lines like the following in the log file:

... (INTEL) (@INTEL-SLOG@) === Network Info ===
... (INTEL) (@INTEL-SLOG@) Socket interface: IPV4
... (INTEL) (@INTEL-SLOG@) Listening port: 49163
... (INTEL) (@INTEL-SLOG@) Daemon select timeout (in seconds):

In this example, 49163 is the listening port that was chosen for the Intel vendor daemon.

Because the latter listening port (49163) was not opened in the firewall, the client machines could not connect to the vendor daemon and received the error during license check-out.


Solution

To connect to the license server with a firewall enabled, you must add exceptions to open the listening ports of both the FlexNet Publisher* license server daemon and the Intel(R) Software License Manager vendor daemon. In this case, opening port 49163 in addition to 28518 resolved the problem.

Intel’s FLEXlm specifies two ports:

1. SERVER host_name host_id port1 -- This port is specified in the product license (28518 in this case).

2. VENDOR INTEL port=port2 -- Usually port1 is set to 28518 and port2 is omitted (the system then chooses one at random).

However, you may set port2 to a fixed value and open that port on the firewall as well.
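
For example, a license file that pins both port numbers might contain lines like the following (the host name, host ID, and port values here are placeholders):

SERVER my_host 0123456789ab 28518
VENDOR INTEL port=49163

After editing the license file, restart the license server so the change takes effect, and open both ports in the firewall.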

How Embree Delivers Uncompromising Photorealism


Introduction

Rendering is the process of generating final output from a collection of data that defines the geometry of objects, the materials they are made from, and the light sources in a 3D scene.

Rendering is a computationally demanding task, involving calculations for many millions of rays as they travel through the scene and interact with the materials on every surface. With full global illumination (GI), light bounces from surface to surface, changing intensity and color as it goes, and it may also be reflected and refracted, and even absorbed or scattered by the volume of an object rather than just interacting with its surface.

3D rendering of an expensive car

Rendering is used in industrial applications as diverse as architectural and product visualization, animations, visual effects in movies, automotive rendering, and more.

For all these applications, the content creator may vary from a lone freelancer to an employee of a multi-million dollar company with hundreds of staff. The hardware used can be anything from a single machine to many hundreds of machines, any of which may be brand new or a decade old.

No matter the size of the company or what hardware they use, 3D artists are expected to create photoreal output within tight and demanding deadlines.

Corona Renderer*, developed by Render Legion a.s., is a leading rendering solution that meets all these diverse needs.

Logo for Corona

The Challenges

3D rendering of an opulent room

End users require absolute realism in the final output, but they also need unrivaled speed and stability due to the strict deadlines involved in their work.

Handling the complex calculations needed to generate the final output is only half the battle. A modern render engine must also increase productivity through ease of use. In other words, the software must let users work faster, not just render faster. As a result, end users need real-time previews to assess changes in lighting and materials, adjustments to the point of view, or changes to the objects themselves.

Each industry also has its own specific needs, and every user has a set of preferred tools. A render engine must be compatible with as wide a range of tools and plug-ins as possible.

Development costs for the render engine have to be managed, so that the final product is affordable for both single users and large firms.

The render engine must also work on as wide a variety of hardware as possible and scale across setups, from single laptops to render farms consisting of hundreds of multiprocessor machines.

The Solution

3d rendering of a large modern airplane on a sunny runway

Corona Renderer uses Embree ray tracing kernels to carry out the intensive computations necessary in rendering. This ensures that end users get the best speed and performance and the highest level of realism in their final output.

Using Embree ray tracing kernels gives another benefit: the development team is freed from having to optimize these calculations themselves, which means they can use their time and talents to meet other demands such as:

  • Creating a simple and intuitive user interface
  • Ensuring compatibility with a wide range of tools and plug-ins
  • Meeting the specialized needs for each industry
  • Creating code that allows seamless scaling on setups of a single machine through to hundreds of machines

The developers of Corona Renderer use Intel processor-based machines for coding and testing, which helps ensure that the development and testing environments are similar to the hardware used by most end users (90 percent of them work and render on Intel processor-based machines), and gives the developers the most stable, reliable, and best-performing environment for their work.

Close collaboration with the Embree team at Intel means that the Corona Renderer developers get the best results from Intel technology, a benefit that is passed directly on to Corona Renderer end users.

Results – With Embree versus without Embree

Application: Path tracer renderer using Embree ray tracing kernels

Description: Path tracer renderer integrated into various 3D modeling software packages

Highlights: Corona Renderer is a production-quality path tracer renderer, used in industrial applications such as architectural and product visualization, animation for film and television, automotive rendering, and more. It uses Embree ray tracing kernels to accelerate geometry processing and rendering.

End-user benefits: Speed, reliability, and ease-of-use in creating production-quality images and animations. Interactive rendering provides the same capabilities for real-time viewing while working with a scene as featured in GPU-based renderers but with none of the drawbacks and limitations of a GPU-based solution.

Comparisons: The same scene was rendered with and without Embree, with results averaged from five machines with different processors (from the fastest rendering times on a dual Intel® Xeon® processor E5-2690 @ 2.9 GHz to the slowest rendering times on an Intel® Core™ i7-3930K processor @ 4.2 GHz).

graphics comparison

graphics comparison

Results – Comparing Intel® Xeon® Processor Generations

3d rendering of a tank

Application: Path tracer renderer benchmarking application, using Embree ray tracing kernels

Description: Path tracing app for use in benchmarking hardware performance

Highlights: Corona Renderer is a production-quality path tracer renderer, used in industrial applications such as architectural and product visualization, animation for film and television, automotive rendering, and more. It uses Embree ray tracing kernels to accelerate geometry processing and rendering.

End-user benefits: Ability to compare performance of different computer hardware configurations, in particular different brands and generations of CPUs.

The application allows end users to test their own system, while an online database of results allows the user to look up the performance of other configurations.

Comparisons:

graphics comparison
Each generation of Intel® Xeon® processors offers significant improvement in rendering performance, with the Intel® Xeon® processor E5 v4 family processing roughly twice as many rays per second as the Intel® Xeon® processor E5 v2 family.

graphics comparison
Each generation of Intel® Xeon® processors offers significant improvement in rendering time, with the Intel® Xeon® processor E5 v4 family being roughly twice as fast as the Intel® Xeon® processor E5 v2 family.

The GPU Question - Speed

There is an ongoing debate about whether GPUs or CPUs provide the best solution for rendering.

The many thousands of cores that a GPU-based rendering solution offers may sound like an advantage, but at best this is only true with relatively simple scenes. As scene complexity increases, the more sophisticated architecture of CPUs takes the lead, providing the best rendering performance.

Corona Renderer uses full GI for true realism—something that is often disabled in GPU renderers for previews and even for final renders. This lack of full GI is behind some of the claims of the speed of GPU renderers. While you can disable this true calculation of light bounces throughout a scene with CPU-based solutions, you don’t really need to, since CPUs don’t struggle with these calculations in the same way GPU solutions do.

3d rendering of a bedroom with textures

GPUs gain their benefits when each of their thousands of cores is performing a similar type of calculation and when “what needs to be calculated next” is well known. However, when handling the millions of rays bouncing through a 3D scene, each core may have to do a very different calculation, and there will be many logic branches that will need to be accounted for. This is where the sophisticated architecture of a CPU pulls ahead, thanks to the more flexible and adaptive scheduling of processes across its cores.

GPU Speed Comparison

Test Setup

An interior scene was created, illuminated only by environment lighting entering the scene through small windows. This is a particularly challenging situation, as most of the lighting in the scene is indirect, coming from light bouncing throughout the scene.

To standardize across the different renderers, only their default materials were used, and the environment lighting was a single color. The render engines were left at their default settings as much as possible, although where relevant the GPU render engines were changed from defaults to use full GI.

The Corona Renderer was set to run for 2 minutes and 30 seconds, which included scene parsing and render time. Since the cost of the single NVIDIA GTX* 1080 card in the test setup is roughly half the cost of the Intel® Core™ i7-6900K processor, the GPU engines were set to run for 5 minutes, approximating the effect of having two GTX 1080 cards to give a comparable measure of performance-per-cost.

Hardware

The same PC was used for each test:
Corona Renderer: Intel Core i7-6900K processor, 3.2 GHz base, 3.7 GHz turbo
GPU engines: NVIDIA GTX 1080, 1.6 GHz base, 1.7 GHz turbo

Results

3d rendering of a bedroom without textures

3d rendering of a bedroom without textures

3d rendering of a bedroom without textures

Despite running for twice as long, both GPU engines showed significant noise in the results and did not approach the results shown by the CPU-based Corona Renderer, which had very little noise remaining in half the time.

By using Corona Renderer’s denoising feature, the same 2 minutes and 30 seconds (the last 5 seconds used for denoising rather than rendering) gives an image that is almost completely free of noise.

3d rendering of a bedroom without textures

Speed Conclusion

At best, the speed benefits of a GPU-based solution only apply to simpler scenes. As the path that the lighting follows increases in complexity, the more sophisticated CPU architecture takes a clear lead.

Many of the claims of GPU rendering speed are based on artificial simplifications of a scene, such as disabling full light bouncing, clamping highlights, and so on. CPU-based solutions can also implement these simplifications, but performance is so good that they are not required.

Other CPU versus GPU Considerations

Stability

CPU-based solutions lead the way in terms of stability and reliability, factors that can be critical in many industries and are not reliant on the stability of frequently updated graphics card drivers.

Compatibility

Most 3D software has a wide range of plug-ins or shaders available that expand on the inbuilt functionality or materials. CPU rendering solutions offer the widest compatibility with these plug-ins, some of which are integral to the workflow of certain industries.

Also, many companies, and even freelancers, turn to commercial render farms to deliver content within the tight deadlines set by their clients. Render farms use many hundreds of machines to accelerate rendering. While there are many long-established render farms supporting CPU-based solutions like Corona Renderer, far fewer farms exist that support GPU-based renderers.

3D rendering of an elegant courtyard

Interactive Rendering

The ability to see changes in your scene without having to start and stop a render has become critical to the workflow of many artists. This kind of real-time rendering is not unique to GPU-based solutions, however. Since its release, Corona Renderer has included Interactive Rendering, which provides exactly this functionality, allowing a user to move an object, change the lighting, alter a material, move a camera, and so on, and see the results of that change immediately.

The result shown in the Interactive Renderer is identical to the final render, including full GI and any post-processing, and can be displayed in a separate window (the Corona VFB) or even in the viewport of the host 3D application as shown below:

3d rendering of a terrace and its 3d environment

Hardware - Networking

Even for freelancers, it is common practice to have a network of machines to use as a local render farm.

Building a multiple-GPU solution can take special hardware and knowledge, and many of the claims of the high performance of GPU-based solutions come from users who have specialized setups that support four or more graphics cards.

With a CPU-based solution, anyone can create a similar network without needing any specialized knowledge or hardware. Any computer from the last decade can be added to the rendering network, thanks to Corona Renderer’s inbuilt Distributed Rendering. If the machines are on the same network, you can use auto-discovery to add them without any manual setup at all, while those on a different network can be added by simply adding their IP addresses to a list. In both cases, those machines can then be used to assist in accelerating rendering beyond what a single machine can do.

This ability is also reflected in the availability of render farms and cloud-based rendering services, which allow users to submit their renders to many hundreds of machines for even faster processing. For CPU-based render engines, users can choose from many farms, while only a handful of farms offer similar services for GPU-based render engines.

This means that CPU-based renderers make it easy to create a farm of rendering machines, even for freelancers or hobby-level users, and of course each machine can still be used as a computer in its own right, while networked GPUs are only useful during rendering.

Hardware - Upgrading

When upgrading a CPU, the benefits are realized across all applications. Upgrading a GPU, on the other hand, only offers benefits for a few select applications and uses: money invested in GPU hardware almost exclusively benefits your render times, while money invested in an upgraded CPU benefits every aspect of your workflow.

Hardware - Memory

The maximum RAM directly available on a graphics card is limited by the current technology, with the most expensive cards at the time of writing (in the region of USD 2,500) offering a maximum of 32 GB of memory. CPU-based solutions, on the other hand, can easily and affordably have two, three, or four times as much directly accessible RAM as the “latest and greatest” graphics cards, at a fraction of the cost.

While GPU render engines may be able to access out-of-core memory (that is, memory not on the graphics card itself), this capability often results in a reduction of rendering performance. With a CPU-based solution like Corona Renderer, there is much less need to worry about optimizing a scene to reduce polygon counts, textures, and so on.

Also, upgrading a GPU is all-or-nothing: if more memory on the graphics card itself is required, to avoid using out-of-core memory for example, you must replace the entire card. With a CPU-based solution, adding more memory is simple and again benefits every application, not just rendering.

3D rendering of inner gears of a timepiece

GPU versus CPU Summary

Corona Renderer remains CPU-based, ensuring the widest compatibility, greatest stability, easiest hardware setup, and best performance as scenes scale in complexity. Ninety percent of Corona Renderer end users choose Intel processor-based solutions.

Thanks to the power of modern CPUs and the Embree ray tracing kernels, Corona Renderer can offer all the benefits claimed by GPU-based render engines, but with none of the drawbacks.

Conclusion

Corona Renderer continues to grow in market share thanks to its combination of rendering speed and simple user interface that allows users to take an artistic rather than technical approach to their work without any loss of speed, power, or performance.

By ensuring the best possible rendering performance, the Embree ray tracing kernels let the Corona Renderer team look beyond optimizing the rendering process and focus instead on ease of use and continued innovation, introducing unique features such as LightMix, Interactive Rendering, the standalone Corona Image Editor, inbuilt scattering tools, and more.

Despite recent developments in the field of GPUs, CPU-based solutions remain the best, and sometimes only, solution for a great many companies, freelancers, and individuals.

3d rendering of a elegant car in a wet rainy environment

Learn More

Corona Renderer home page:

https://corona-renderer.com/

Embree ray tracing kernels:

https://embree.github.io/

About Corona Renderer and Render Legion a.s.

Corona is a CPU-based rendering engine initially developed for 3ds Max*, and also available as a standalone command-line application. It is currently being ported to Cinema 4D*.

It began as a solo student project by Ondřej Karlík at the Czech Technical University in Prague, evolving into a full-time commercial project in 2014 after Ondřej established a company along with former CG artist Adam Hotový and Jaroslav Křivánek, an associate professor and researcher at Charles University in Prague.

It was first released commercially in February 2015, and since then Render Legion has grown to more than 15 members.

Corona Renderer’s focus has been mainly on polishing the ArchViz feature set, and future plans are for further specialization to meet the needs of the automotive, product design, and VFX industries.

Pixeldash Studios Diversifies to Succeed.


The original article is published by Intel Game Dev on VentureBeat*: Pixeldash Studios Diversifies to Succeed. Get more game dev news and related topics from Intel on VentureBeat.

Screenshot of fast motorcycle running into a car crash, in a winter snowy pass

The state of Louisiana—the Pelican State, apparently—was not really considered a hotbed of game development when Jason Tate and Evan Smith were working for the only studio in Baton Rouge. When that company folded, it became a crossroads moment: move to the Bay Area or any other gaming hotbed, or stay where they were and make the best of the situation in their home state.

“We opted to start our own company so we could make games, and stay in Louisiana,” says Tate, the co-founder and lead programmer, of the decision he made with Smith, who would act as Creative Director on the two-man team.

This decision in 2011 was assisted by the state government’s efforts to build the games industry, which had even seen representatives visit E3 to alert indie companies to the tax programs and other opportunities being offered to attract teams. “We’re part of an incubator called the Louisiana Technology Park where we now have a very cool facility, it’s very affordable, and six years later there are seven indie companies under one roof,” says Tate.

What makes the Pixel Dash story a little different is that despite just being two people, they worked on client projects to supplement funds raised for their game, Road Redemption, through Kickstarter, the Humble Store, and eventually Steam* Early Access. “In the beginning, it was basically two people doing the job of 10 people,” says Smith, “but it was about building something in the community that would have some longevity, so we didn’t do just one project and bet everything on that.”

Having seen and been a part of a team that had bet on one horse and ultimately fizzled, this team was determined to ensure it had legs. “It’s been an interesting journey over five years, bouncing around projects like corporate client work, e-learning, and taking our gaming skills over to training simulations,” says Smith.

Yes, Pixel Dash has crafted apps and other software for diverse topics like student preparedness at Louisiana State University, creating an app that gamified job searches with create-your-own-adventure elements where students could learn at the end where they did well or went wrong. The tool would even advise on outfits so that students would be prepared for the ‘business casual’ world.

While for a small team this has meant that work on its prime gaming project may have taken some time, it has kept the lights on and the process running. It hasn’t been a one-way street of gamifying corporate apps or programs, however.

“One of the benefits of working with clients is the game design expertise can travel over to the corporate world, but you can learn things from that side, too, that you can then channel back to your game side,” says Tate.

The purpose of all this is to make sure that Road Redemption sees a full release. For an indie game that started development back in 2011 and hit Kickstarter* in 2013, that’s a long cycle. But thanks to this diversified client list, the game will see a full release, and that’s the main goal of every developer. “The combination of Kickstarter and Early Access has been the reason we could fund a project of this size. All the funds flow straight back into development,” says Smith.

Screenshot of fast speed motorcycle rooftop chase with guns
Above: Bikes, speed, guns, and rooftop courses give Road Redemption real visual style.

Spirit of Road Rash

It may not surprise you to learn that a game titled Road Redemption is designed in the spirit of the classic road beat-em-up, Road Rash. It draws inspiration from games like Twisted Metal and even Skitchin’, which was released on the Sega Genesis, and community feedback and continued engagement through the fundraising campaigns suggest there is a passionate audience for this kind of road combat game.

“We wanted to add new stuff like projectile weapons and guns, and we were surprised how polarizing that was,” says Smith. “Some of the purists said ‘there are no guns in Road Rash, there should be none here’.”

“There’s a camp of people very vocal about staying true to the original and another camp excited about something different,” adds Tate.

Community engagement and interacting with YouTube* influencers and streamers has also kept Road Redemption in the minds of gamers. “It fits well for YouTubers because a lot of ridiculous things happen, so we’ve had PewDiePie cover us twice, and others. We’ve certainly been trying to build relationships…and have over 50 million combined YouTube video views,” says Smith.

It’s a vital marketing outlet for a team without the manpower or budget to follow more traditional advertising methods. Instead, they rely on the forums where gamers discuss their tactics and share reviews as simple as “that was awesome…Santa Claus smacking someone in the face with a shovel.”

Of course, a couple of handy tactics involved appealing to the egos of some prominent influencers. “One thing we did early on was put a lot of the YouTube personalities names in the game, so they were encouraged to go find themselves in the game,” says Smith. Crafty!

Screenshot of high speed motorcycle chase, at night. Biker with a pumpkin on his head attacks other with a baseball bat
Above: Beating rivals with a baseball bat from the back of a bike…don’t try this at home, kids.

“As internet comments go there’s always a lot of hate, so when you see that one person who seems enamored, it really pushes you forward to do well,” says Smith. Following this format also allowed them to balance certain aspects using community feedback and consider requested modes. “One mode was to have to beat the game with just one health. We think that’s going to take ages, and then within a few days on the forums, people are talking about having beaten it,” says Tate. Yeah, they do that, but with the time to make tweaks and follow the myriad comments, the result should be a game that has the longevity the company hopes to achieve.

It’s quite an achievement that may have fallen at the first hurdle if not for the combination of support from the state, the ease of entry through digital distribution, and emerging funding opportunities.

That’s something important for this team in the state of Louisiana. Pixel Dash has taken on interns in paid and course credit programs from LSU, and many of them have turned into full-time employees.

It’s a far cry from the opportunities Smith saw several years ago. “Growing up in Louisiana, it’s not something we thought would be here for us. Personally, it’s like ‘wow, this is a thing now’. We can develop games here…we have dev kits and are working for console, and it would never have happened years ago.”

It’s happening now.

How Yahoo! JAPAN Used Open vSwitch* with DPDK to Accelerate L7 Performance in Large-Scale Deployment Case Study


View PDF [783 KB]

As cloud architects and developers know, it can be incredibly challenging to keep up with the rapidly increasing cloud infrastructure demands of both users and services. Many cloud providers are looking for proven and effective ways to improve network performance. This case study discusses one such collaborative project undertaken between Yahoo! JAPAN and Intel in which Yahoo! JAPAN implemented Open vSwitch* (OvS) with Data Plane Development Kit (OvS with DPDK) to deliver up to 2x practical cloud application L7 performance improvement while successfully completing a more than 500-node, large-scale deployment.

Introduction to Yahoo! JAPAN

Yahoo! JAPAN is a Japanese Internet company that was originally formed as a joint venture between Yahoo! Inc. and Softbank. The Yahoo! JAPAN portal is one of the most frequently visited websites in Japan, and its many services have been running on OpenStack* Private Cloud since 2012. Yahoo! JAPAN receives over 69 billion monthly page views, of which more than 39 billion come from smartphones alone. Yahoo! JAPAN also has over 380 million total app downloads, and it currently runs more than 100 services.

Network Performance Challenges

As a result of rapid cloud expansion, Yahoo! JAPAN began observing network bottlenecks in its environment in 2015. At that time, both cloud resources and users were doubling year over year, causing a rapid increase in virtual machine (VM) density. Yahoo! JAPAN was also seeing huge spikes in network traffic and burst traffic whenever breaking news, weather updates, or public service announcements (related to an earthquake, for example) occurred. This dynamic put an additional burden on the network environment.

As these network performance challenges arose, Yahoo! JAPAN began experiencing some difficulties meeting service-level agreements (SLAs) for its many services. Engineers from the network infrastructure team at Yahoo! JAPAN noticed that noisy VMs (also known as “noisy neighbors”) were disrupting the network environment.

When that phenomenon occurs, a rogue VM may monopolize bandwidth, disk I/O, CPU, and other resources, which then impacts other VMs and applications in the environment.

Yahoo! JAPAN also noticed that the compute nodes were processing a large volume of short packets and that the network was handling a very heavy load (see Figure 1). Consequently, decreased network performance was affecting the SLAs.

Figure 1. A compute node showing a potential network bottleneck in a virtual switch.

Yahoo! JAPAN determined that its cloud infrastructure required a higher level of network performance in order to meet its application requirements and SLAs. In the course of its research, Yahoo! JAPAN had noticed that the Linux* Bridge overrun counter was increasing, which indicated that the cause of its network difficulties lay in the network kernel. As a result, the company decided it needed to find a new solution to meet its needs going forward.

About OvS with DPDK

OvS with DPDK is a potential solution to such network performance issues in cloud environments that are already using OpenStack, since such environments typically already use OvS as the virtual switch. Native OvS uses kernel space for packet forwarding, which imposes a performance overhead and can limit network performance. DPDK, however, accelerates packet forwarding by bypassing the kernel.

DPDK integration with OvS offers other beneficial performance enhancements as well. For example, DPDK’s Poll Mode Driver eliminates context switch overhead. DPDK also uses direct user memory access to and from the NIC to eliminate kernel-user memory copy overhead. Both optimizations can greatly boost network performance. Overall, DPDK maintains compatibility with OvS while accelerating packet forwarding performance. Refer to Intel Developer Zone’s article, Open vSwitch with DPDK Overview, for more information.

Collaboration between Intel and Yahoo! JAPAN

As Yahoo! JAPAN was encountering network performance issues, Intel suggested that the company consider OvS with DPDK since it was now possible to use the two technologies in combination with one another. Yahoo! JAPAN was already aware that DPDK offered network performance benefits for a variety of telecommunications use cases but, being a web-based company, the company thought that it would not be able to take advantage of that particular solution. After discussing the project with Intel and learning about ways in which the technologies could work for a cloud service provider, Yahoo! JAPAN decided to try OvS with DPDK in their OpenStack environment.

For optimal performance of its OvS with DPDK deployment, Yahoo! JAPAN enabled 1 GB hugepages. This step was important from a performance perspective because it enabled Yahoo! JAPAN to reduce Translation Lookaside Buffer (TLB) misses and prevent page faults. The company also paid special attention to its CPU affinity design, carefully identifying ideal resource settings for each function. Without that step, Yahoo! JAPAN would not have been able to ensure stable network performance.
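
As a rough illustration only (not Yahoo! JAPAN's exact settings), 1 GB hugepages are typically reserved on Linux through kernel boot parameters along these lines, where the page count is a placeholder that depends on how much memory is dedicated to OvS with DPDK and the VMs:

    default_hugepagesz=1G hugepagesz=1G hugepages=16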

OpenStack’s Mitaka release offered the features required for Yahoo! JAPAN’s OvS with DPDK implementation, so the company decided to build a Mitaka cluster running with the configurations mentioned above. The first cluster includes over 150 nodes and uses Open Compute Project (OCP) servers.

Benchmark Test Results

Yahoo! JAPAN achieved impressive performance results after implementing OvS with DPDK in its cloud environment. To demonstrate these gains, the engineers measured two benchmarks: the network layer (L2) and the application layer (L7).

Table 1. Benchmark test configuration.

Hardware

  • CPU: Intel® Xeon® processor E5-2683 v3 (2S)
  • Memory: 512 GB DDR4-2400 RDIMM
  • NIC: Intel® Ethernet Converged Network Adapter X520-DA2

Software

  • Host OS: CentOS* 7.2
  • Guest OS: CentOS 7.2
  • OpenStack*: Mitaka
  • QEMU*: 2.6.2
  • Open vSwitch: 2.5.90 + TSO patch (a6be657)
  • Data Plane Development Kit: 16.04

Figure 2. L2 network benchmark test.

L2 Network Benchmark Test Results

In the L2 benchmark test, Yahoo! JAPAN used Ixia IxNetwork* as a packet generator. Upon measuring L2 performance (see Figure 2), Yahoo! JAPAN observed a 10x network throughput improvement for short-packet traffic. The company also found that OvS with DPDK reduced latency to as little as one-twentieth (1/20) of its previous level. With these results, Yahoo! JAPAN confirmed that OvS with DPDK accelerates the L2 path to the VM. These results were roughly in line with what Yahoo! JAPAN expected to find, as telecommunications companies had achieved similar results in their benchmark tests.

L7 Network Benchmark Test Results

The L7 single-VM benchmark results for the application layer, however, exceeded Yahoo! JAPAN’s expectations. In this test, Yahoo! JAPAN instructed one VM to send a query and another VM to return a response. All applications (HTTP, MQ, DNS, RDB) demonstrated significant performance gains in this scenario (see Figure 3). In the MySQL* sysbench result in particular, Yahoo! JAPAN saw simultaneous improvement in two important metrics: 1.5x better throughput (transactions/sec) and latency (response time) reduced by a factor of 1.5.

Figure 3. Various application benchmark test results.

Application Benchmark Test Results

Why did network performance improve so dramatically? In the case of HTTP, for example, Yahoo! JAPAN saw a 2.0x improvement in OvS with DPDK when compared to Linux Bridge. Yahoo! JAPAN determined that this performance metric improved because OvS with DPDK reduces the number of context switches by 45 percent when compared with Linux Bridge.

The benchmark results for RabbitMQ* revealed another promising discovery. When Yahoo! JAPAN ran their first stress test on RabbitMQ under Linux Bridge, it observed degraded performance. When it ran the same stress test under OvS with DPDK, the application environment maintained a much more consistent and satisfactory level of performance (see Figure 4).

Figure 4. RabbitMQ stress test results.

RabbitMQ Stress Test Results

How was this possible? In both tests, noisy conditions created a high degree of context switching. In the Linux Bridge world, it’s necessary to pay a 50 percent tax to the kernel. But in the OvS with DPDK world, that tax is only 10 percent. This is because OvS with DPDK suppresses context switching, which prevents network performance from degrading even under challenging real world conditions. Yahoo! JAPAN found that CPU pinning relaxes interference between multiple noisy neighbor VMs and the critical OvS process, which also contributed to the performance improvements observed in this test. Which world would you want to live in: Linux Bridge or OvS with DPDK?

Ultimately, Yahoo! JAPAN found that OvS with DPDK delivers terrific network performance improvements for cloud environments. This finding was key to resolving Yahoo! JAPAN’s network performance issues and meeting the company’s SLA requirements.

Summary

Despite what you might think, deploying OvS with DPDK is actually not so difficult. Yahoo! JAPAN is already successfully using this technology in a production system with over 500 nodes. OvS with DPDK offers powerful performance benefits and provides a stable network environment, which enables Yahoo! JAPAN to meet its SLAs and easily support the demands placed on its cloud infrastructure. The impressive results that Yahoo! JAPAN has achieved through its implementation of OvS with DPDK can be enjoyed by other cloud service providers too.

When assessing whether OvS with DPDK will meet your requirements, it is important to carefully investigate what is causing the bottlenecks in your cloud environment. Once you fully understand the problem, you can identify which solution will best fit your specific needs.

To accomplish this task, Yahoo! JAPAN performed a thorough analysis of its network traffic before deciding how to proceed. The company learned that there was a high volume of short packets traveling throughout its network. This discovery indicated that OvS with DPDK might be a good solution for its problem, since OvS with DPDK is known to improve performance in network environments where a high volume of short packets is present. For this reason, Yahoo! JAPAN concluded that it is necessary to not only benchmark your results but also have a full understanding of your network’s characteristics in order to find the right solution.

Now that you’ve learned about the performance improvements that Yahoo! JAPAN achieved by implementing OvS with DPDK, have you considered deploying OvS with DPDK within your own cloud? To learn more about enabling OvS with DPDK on OpenStack, read these articles: Using Open vSwitch and DPDK with Neutron in DevStack, Using OpenSwitch with DPDK, and DPDK vHost User Ports.

Acknowledgment

Reflecting on this successful collaboration with Intel, Yusuke Tatsumi, network engineer for Yahoo! JAPAN’s infrastructure team, said: “We found out that the OvS and DPDK combination definitely improves application performance for cloud service providers. It strengthened our cloud architecture and made it more robust.” Yahoo! JAPAN is pleased to have demonstrated that OvS with DPDK is a valuable technology that can achieve impressive network performance results and meet the demanding daily traffic requirements of a leading Japanese Internet company.

About the Author

Rose de Fremery is a New York-based writer and technologist. She is the former Managing Editor of The Social Media Monthly, the world's first print magazine devoted to the social media revolution. Rose currently writes about a range of business IT topics including cloud infrastructure, VoIP, UC, CRM, business innovation, and teleworking.

Notices

Testing conducted on Yahoo! JAPAN systems. Testing done by Yahoo! JAPAN.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation.

What's New? - Intel® VTune™ Amplifier XE 2017 Update 4


Intel® VTune™ Amplifier XE 2017 performance profiler

A performance profiler for serial and parallel performance analysis. Overview, training, support.

New for the 2017 Update 4! (Optional update unless you need...)

As compared to 2017 Update 3:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2) 

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2017_update4.tar.gz

Installer for Intel® VTune™ Amplifier XE 2017 for Linux* Update 4

File: VTune_Amplifier_XE_2017_update4_setup.exe

Installer for Intel® VTune™ Amplifier XE 2017 for Windows* Update 4 

File: vtune_amplifier_xe_2017_update4.dmg

Installer for Intel® VTune™ Amplifier XE 2017 - OS X* host only Update 4 

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.


What is OpenCV?


OpenCV is a software toolkit for processing images and video in real time, and it also provides analytics and machine learning capabilities.

Development Benefits

Using OpenCV, a BSD licensed library, developers can access many advanced computer vision algorithms used for image and video processing in 2D and 3D as part of their programs. The algorithms are otherwise only found in high-end image and video processing software.

Powerful Built-In Video Analytics

Video analytics is much simpler to implement with OpenCV APIs for basic building blocks such as background removal, filters, pattern matching, and classification.

Real-time video analytics capabilities include classifying, recognizing, and tracking objects, animals, and people, as well as specific features such as vehicle number plates, animal species, and facial features (faces, eyes, lips, chin, and so on).
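
As a minimal sketch of how compact such a building block can be (the cascade file path and camera index here are assumptions, not a definitive implementation), the following Python example detects faces in frames from an attached camera using OpenCV's Haar cascade classifier:

    import cv2

    # Path to a Haar cascade bundled with OpenCV; adjust for your installation.
    CASCADE_PATH = "haarcascade_frontalface_default.xml"

    face_cascade = cv2.CascadeClassifier(CASCADE_PATH)
    cap = cv2.VideoCapture(0)  # first attached camera

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("faces", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()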

Hardware and Software Requirements

OpenCV is written in optimized C/C++, is cross-platform by design, and works on a wide variety of hardware platforms, including the Intel Atom® platform, Intel® Core™ processor family, and Intel® Xeon® processor family.

Developers can program OpenCV using C++, C, Python*, and Java* on Operating Systems such as Windows*, many Linux* distros, Mac OS*, iOS* and Android*.

Although some cameras work better due to better drivers, if a camera has a working driver for the Operating System in use, OpenCV will be able to use it.

Hardware Optimizations

OpenCV takes advantage of multi-core processing and OpenCL™, so it can also use hardware acceleration when integrated graphics is present.

OpenCV v3.2.0 release can use Intel optimized LAPACK/BLAS included in the Intel® Math Kernel Libraries (Intel® MKL) for acceleration. It can also use Intel® Threading Building Blocks (Intel® TBB) and Intel® Integrated Performance Primitives (Intel® IPP) for optimized performance on Intel platforms.

OpenCV uses the FFMPEG library and can use Intel® Quick Sync Video technology to accelerate encoding and decoding using hardware.
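
A short Python sketch like the following can be used to confirm which of these optimizations a particular OpenCV build has enabled; it only inspects OpenCV's own build information string and runtime optimization flag:

    import cv2

    print("Optimized code paths enabled:", cv2.useOptimized())

    # The build information string lists which backends (IPP, TBB,
    # LAPACK/BLAS, FFMPEG, OpenCL, ...) this binary was compiled with.
    info = cv2.getBuildInformation()
    for keyword in ("IPP", "TBB", "LAPACK", "FFMPEG", "OpenCL"):
        for line in info.splitlines():
            if keyword in line:
                print(line.strip())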

OpenCV and IoT

OpenCV has a wide range of applications in traditional computer vision applications such as optical character recognition or medical imaging.

For example, OpenCV can detect bone fractures1. OpenCV can also help classify skin lesions and aid in the early detection of skin melanomas2.

However, OpenCV coupled with the right processor and camera can become a powerful new class of computer vision enabled IoT sensor. This type of design can scale from simple sensors to multi-camera video analytics arrays. See Designing Scalable IoT Architectures for more information.3

IoT developers can use OpenCV to build embedded computer vision sensors for detecting IoT application events such as motion detection or people detection.
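
A hedged sketch of such a sensor follows (the area threshold and camera index are arbitrary illustrative values): it reports a motion event whenever background subtraction finds a sufficiently large moving region.

    import cv2

    cap = cv2.VideoCapture(0)                 # camera attached to the IoT node
    subtractor = cv2.createBackgroundSubtractorMOG2()
    MIN_AREA = 500                            # arbitrary threshold; tune per scene

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # Keep only confidently foreground pixels, then look for blobs.
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        found = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        contours = found[-2]                  # works across OpenCV 3.x and 4.x
        if any(cv2.contourArea(c) > MIN_AREA for c in contours):
            print("motion detected")          # an IoT app would publish an event here

    cap.release()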

Designers can also use OpenCV to build even more advanced sensor systems such as face recognition, gesture recognition or even sentiment analysis as part of the IoT application flow.

IoT applications can also deploy OpenCV on Fog nodes at the Edge as an analytics platform for a larger number of camera based sensors.

For example, IoT applications use camera sensors with OpenCV for road traffic analysis, Advanced Driver Assistance Systems (ADAS)4, video surveillance5, and advanced digital signage with analytics in visual retail applications6.

OpenCV Integration

Integrating OpenCV with a neural-network backend unleashes the true power of computer vision. Using this approach, OpenCV works with Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs), allowing developers to build innovative and powerful new vision applications.
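
As one hedged illustration (assuming an OpenCV build that includes the dnn module, and using placeholder file names for a pre-trained Caffe* classification model), running a CNN through OpenCV looks roughly like this:

    import cv2

    # Placeholder model files; substitute a real pre-trained Caffe network.
    net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")

    image = cv2.imread("input.jpg")
    # Resize/normalize the image into the 4D blob layout the network expects;
    # the 224x224 size and mean values are typical for ImageNet-style models.
    blob = cv2.dnn.blobFromImage(image, 1.0, (224, 224), (104, 117, 123))

    net.setInput(blob)
    scores = net.forward()              # class scores for the single input image
    print("top class index:", scores.argmax())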

To target multiple hardware platforms, these integrations need to be cross-platform by design, but hardware-specific optimization of deep learning algorithms can break that design goal. The OpenVX architecture standard addresses this by proposing resource and execution abstractions.

Hardware vendors can optimize implementations with a strong focus on specific platforms. This allows developers to write code that is portable across multiple vendors and platforms, as well as multiple hardware types.

Intel® Computer Vision SDK (Beta) is an integrated design framework and a powerful toolkit for developers to solve complex problems in computer vision. It includes Intel’s implementation of the OpenVX API as well as custom extensions. It supports OpenCL custom kernels and can integrate CNN or DNN.

The pre-built and included OpenCV binary has hooks for Intel® VTune™ Amplifier for profiling vision applications.

Getting Started:

Try this tutorial on basic people recognition.  Also, see OpenCV 3.2.0 Documentation for more tutorials.

Related Software:

Intel® Computer Vision SDK - Accelerated computer vision solutions based on OpenVX standard, integrating OpenCV and deep learning support using the included Deep Learning (DL) Deployment Toolkit.

Intel® Integrated Performance Primitives (IPP) - Programming toolkit for high-quality, production-ready, low-level building blocks for image processing, signal processing, and data processing (data compression/decompression and cryptography) applications.

Intel® Math Kernel Library (MKL) - Library with accelerated math processing routines to increase application performance.

Intel® Media SDK - A cross-platform API for developing media applications using Intel® Quick Sync Video technology.

Intel® SDK for OpenCL™ Applications - Accelerated and optimized application performance with Intel® Graphics Technology compute offload and high-performance media pipelines.

Intel® Distribution for Python* - Specially optimized Python distribution for High-Performance Computing (HPC) with accelerated compute-intensive Python computational packages like NumPy, SciPy, and scikit-learn.

Intel® Quick Sync Video - Leverage dedicated media processing capabilities of Intel® Graphics Technology to decode and encode fast, enabling the processor to complete other tasks and improving system responsiveness.

Intel® Threading Building Blocks (TBB) - Library for shared-memory parallel programming and intra-node distributed memory programming.

References:

  1. Bone fracture detection using OpenCV
  2. Mole Investigator: Detecting Cancerous Skin Moles Through Computer Vision
  3. Designing Scalable IoT Architectures
  4. Advanced Driver Assistance Systems (ADAS)
  5. Smarter Security Camera: A Proof of Concept (PoC) Using the Intel® IoT Gateway
  6. Introduction to Developing and Optimizing Display Technology

Intel® Xeon® Processor Scalable Family Technical Overview


Executive Summary

Intel uses a tick-tock model for its processor generations. The new generation, the Intel® Xeon® processor Scalable family (formerly code-named Skylake-SP), is a “tock” based on 14 nm process technology. Major architecture changes take place on a “tock,” while minor architecture changes and a die shrink occur on a “tick.”

Tick-Tock model
Figure 1. Tick-Tock model.

Intel Xeon processor Scalable family on the Purley platform is a new microarchitecture with many additional features compared to the previous-generation Intel® Xeon® processor E5-2600 v4 product family (formerly Broadwell microarchitecture). These features include increased processor cores, increased memory bandwidth, non-inclusive cache, Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Memory Protection Extensions (Intel® MPX), Intel® Ultra Path Interconnect (Intel® UPI), and sub-NUMA clusters.

In previous generations, two-socket and four-socket processor families were segregated into different product lines. One of the big changes with the Intel Xeon processor Scalable family is that it includes all the processor models associated with this new generation. The processors in the Intel Xeon processor Scalable family scale from a two-socket configuration to an eight-socket configuration. They are Intel’s platform of choice for the most scalable and reliable performance, with the greatest variety of features and integrations designed to meet the needs of the widest variety of workloads.

New branding for processor models
Figure 2. New branding for processor models.

A two-socket Intel Xeon processor Scalable family configuration can be found at all levels from bronze through platinum, while a four-socket configuration is only found at the gold through platinum levels, and the eight-socket configuration is only found at the platinum level. The bronze level has the fewest features, and more features are added as you move toward platinum. At the platinum level, all available features are supported across the entire range of processor socket counts (two through eight).

Introduction

This paper discusses the new features and enhancements available in Intel Xeon processor Scalable family and what developers need to do to take advantage of them.

Intel® Xeon® processor Scalable family Microarchitecture Overview

Block Diagram of the Intel® Xeon® processor scalable family microarchitecture
Figure 3. Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.

The Intel Xeon processor Scalable family on the Purley platform provides up to 28 cores, which bring additional computing power to the table compared to the 22 cores of its predecessor. Additional improvements include a non-inclusive last-level cache, a larger 1MB L2 cache, faster 2666 MHz DDR4 memory, an increase to six memory channels per CPU, new memory protection features, Intel® Speed Shift Technology, on-die PMAX detection, integrated Fabric via Intel® Omni-Path Architecture (Intel® OPA), Internet Wide Area RDMA Protocol (iWARP)*, memory bandwidth allocation, Intel® Virtual RAID on CPU (Intel® VROC), and more.

Table 1. Generational comparison of the Intel Xeon processor Scalable family to the Intel® Xeon® processor E5-2600 and E7-4600 product families.

Table 1 generational comparison

Intel Xeon processor Scalable family feature overview

The rest of this paper discusses the performance improvements, new capabilities, security enhancements, and virtualization enhancements in the Intel Xeon processor Scalable family.

Table 2. New features and technologies of the Intel Xeon processor Scalable family.

Table 2 New features and technologies

Skylake Mesh Architecture

On the previous generations of Intel® Xeon® processor families (formerly Haswell and Broadwell) on the Grantley platform, the cores, last-level cache (LLC), memory controller, IO controller, and inter-socket Intel® QuickPath Interconnect (Intel® QPI) ports are connected together using a ring architecture, which has been in place for the last several generations of Intel® multi-core CPUs. As the number of cores on the CPU increased with each generation, access latency increased and available bandwidth per core diminished. This trend was mitigated by dividing the chip into two halves and introducing a second ring to reduce distances and add additional bandwidth.

Platform ring architecture
Figure 4. Intel® Xeon® processor E5-2600 product family (formerly Broadwell-EP) on Grantley platform ring architecture.

With additional cores per processor and much higher memory and I/O bandwidth in the Intel® Xeon® processor Scalable family, the additional demands on the on-chip interconnect could become a performance limiter with the ring-based architecture. Therefore, the Intel Xeon processor Scalable family introduces a mesh architecture to mitigate the increased latencies and bandwidth constraints associated with previous ring-based architecture. The Intel Xeon processor Scalable family also integrates the caching agent, the home agent, and the IO subsystem on the mesh interconnect in a modular and distributed way to remove bottlenecks in accessing these functions. Each core and LLC slice has a combined Caching and Home Agent (CHA), which provides scalability of resources across the mesh for Intel® Ultra Path Interconnect (Intel® UPI) cache coherency functionality without any hotspots.

The Intel Xeon processor Scalable family mesh architecture encompasses an array of vertical and horizontal communication paths allowing traversal from one core to another through a shortest path (hop on vertical path to correct row, and hop across horizontal path to correct column). The CHA located at each of the LLC slices maps addresses being accessed to specific LLC bank, memory controller, or IO subsystem, and provides the routing information required to reach its destination using the mesh interconnect.

Intel Xeon processor Scalable family mesh architecture
Figure 5. Intel Xeon processor Scalable family mesh architecture.

In addition to the improvements expected in the overall core-to-cache and core-to-memory latency, we also expect to see improvements in latency for IO initiated accesses. In the previous generation of processors, in order to access data in LLC, memory or IO, a core or IO would need to go around the ring and arbitrate through the switch between the rings if the source and targets are not on the same ring. In Intel Xeon processor Scalable family, a core or IO can access the data in LLC, memory, or IO through the shortest path over the mesh.

Intel® Ultra Path Interconnect (Intel® UPI)

The previous generation of Intel® Xeon® processors utilized Intel QPI, which has been replaced on the Intel Xeon processor Scalable family with Intel UPI. Intel UPI is a coherent interconnect for scalable systems containing multiple processors in a single shared address space. Intel Xeon processors that support Intel UPI provide either two or three Intel UPI links for connecting to other Intel Xeon processors over a high-speed, low-latency path to the other CPU sockets. Intel UPI uses a directory-based home snoop coherency protocol. It provides an operational speed of up to 10.4 GT/s, improves power efficiency through the L0p low-power state, improves data transfer efficiency over the link using a new packetization format, and adds protocol-layer improvements such as removing the preallocation requirement that limited scalability with Intel QPI.

Typical two-socket configuration
Figure 6. Typical two-socket configuration.

Typical four-socket ring configuration
Figure 7. Typical four-socket ring configuration.

Typical four-socket crossbar configuration
Figure 8. Typical four-socket crossbar configuration.

Typical eight-socket configuration
Figure 9. Typical eight-socket configuration.

Intel® Ultra Path Interconnect Caching and Home Agent

Previous implementations of Intel Xeon processors provided a distributed Intel QPI caching agent located with each core and a centralized Intel QPI home agent located with each memory controller. Intel Xeon processor Scalable family processors implement a combined CHA that is distributed and located with each core and LLC bank, thus providing resources that scale with the number of cores and LLC banks. The CHA is responsible for tracking requests from the core, responding to snoops from local and remote agents, and resolving coherency across multiple processors.

Intel UPI removes the requirement on preallocation of resources at the home agent, which allows the home agent to be implemented in a distributed manner. The distributed home agents are still logically a single Intel UPI agent that is address-interleaved across different CHAs, so the number of visible Intel UPI nodes is always one, irrespective of the number of cores, memory controllers used, or the sub-NUMA clustering mode. Each CHA implements a slice of the aggregated CHA functionality responsible for a portion of the address space mapped to that slice.

Sub-NUMA Clustering

A sub-NUMA cluster (SNC) is similar to the cluster-on-die (COD) feature that was introduced with Haswell, though there are some differences between the two. An SNC creates two localization domains within a processor by mapping addresses from one of the local memory controllers into the half of the LLC slices closer to that memory controller, and addresses from the other memory controller into the LLC slices in the other half. Through this address-mapping mechanism, processes running on cores in one SNC domain that use memory from the memory controller in the same SNC domain observe lower LLC and memory latency compared to accesses mapped to locations outside that SNC domain.

Unlike a COD mechanism where a cache line could have copies in the LLC of each cluster, SNC has a unique location for every address in the LLC, and it is never duplicated within the LLC banks. Also, localization of addresses within the LLC for each SNC domain applies only to addresses mapped to the memory controllers in the same socket. All addresses mapped to memory on remote sockets are uniformly distributed across all LLC banks independent of the SNC mode. Therefore even in the SNC mode, the entire LLC capacity on the socket is available to each core, and the LLC capacity reported through the CPUID is not affected by the SNC mode.

Figure 10 represents a two-cluster configuration that consists of SNC domains 0 and 1 in addition to their associated cores, LLC, and memory controllers. Each SNC domain contains half of the cores on the socket, half of the LLC banks, and one of the memory controllers with three DDR4 channels. The affinity of cores, LLC, and memory within a domain is expressed using the usual NUMA affinity parameters to the OS, which can take SNC domains into account when scheduling tasks and allocating memory to a process for optimal performance.

SNC requires that memory is not interleaved in a fine-grain manner across memory controllers. In addition, SNC mode has to be enabled by BIOS to expose two SNC domains per socket and set up resource affinity and latency parameters for use with NUMA primitives.

Sub-NUMA cluster domains
Figure 10. Sub-NUMA cluster domains.

Directory-Based Coherency

Unlike the prior generation of Intel Xeon processors that supported four different snoop modes (no-snoop, early snoop, home snoop, and directory), the Intel Xeon processor Scalable family of processors only supports the directory mode. With the change in cache hierarchy to a non-inclusive LLC, the snoop resolution latency can be longer depending on where in the cache hierarchy a cache line is located. Also, with much higher memory bandwidth, the inter-socket Intel UPI bandwidth is a much more precious resource and could become a bottleneck in system performance if unnecessary snoops are sent to remote sockets. As a result, the optimization trade-offs for various snoop modes are different in Intel Xeon processor Scalable family compared to previous Intel Xeon processors, and therefore the complexity of supporting multiple snoop modes is not beneficial.

The Intel Xeon processor Scalable family carries forward some of the coherency optimizations from prior generations and introduces some new ones to reduce the effective memory latency. For example, some of the directory caching optimizations such as IO directory cache and HitME cache are still supported and further enhanced on the Intel Xeon processor Scalable family. The opportunistic broadcast feature is also supported, but it is used only with writes to local memory to avoid memory access due to directory lookup.

For the IO directory cache (IODC), the Intel Xeon processor Scalable family provides an eight-entry directory cache per CHA to cache the directory state of IO writes from remote sockets. IO writes usually require multiple transactions to invalidate a cache line from all caching agents, followed by a writeback to put the updated data in memory or in the home socket's LLC. With the directory information stored in memory, multiple accesses may be required to retrieve and update the directory state. IODC reduces the number of memory accesses needed to complete IO writes by keeping the directory information cached in the IODC structure.

HitME cache is another capability in the CHA that caches directory information to speed up cache-to-cache transfers. With the distributed home agent architecture of the CHA, the HitME cache resources scale with the number of CHAs.

Opportunistic Snoop Broadcast (OSB) is another feature carried over from previous generations into the Intel Xeon processor Scalable family. OSB broadcasts snoops when the Intel UPI link is lightly loaded, thus avoiding a directory lookup from memory and reducing memory bandwidth. In the Intel Xeon processor Scalable family, OSB is used only for local InvItoE (generated due to full-line writes from the core or IO) requests since data read is not required for this operation. Avoiding directory lookup has a direct impact on saving memory bandwidth.

Cache Hierarchy Changes

Generational cache comparison
Figure 11. Generational cache comparison.

In the previous generation the mid-level cache (MLC) was 256 KB per core and the last-level cache (LLC) was a shared inclusive cache with 2.5 MB per core. In the Intel Xeon processor Scalable family, the cache hierarchy has changed to provide a larger MLC of 1 MB per core and a smaller shared non-inclusive LLC of 1.375 MB per core. A larger MLC increases the hit rate into the MLC, resulting in lower effective memory latency, and also lowers demand on the mesh interconnect and LLC. The shift to a non-inclusive LLC allows for more effective utilization of the overall cache on the chip versus an inclusive cache.

If the core on the Intel Xeon processor Scalable family has a miss on all the levels of the cache, it fetches the line from memory and puts it directly into MLC of the requesting core, rather than putting a copy into both the MLC and LLC as was done on the previous generation. When the cache line is evicted from the MLC, it is placed into the LLC if it is expected to be reused.

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On the previous-generation CPUs, the shared LLC itself took care of this task.

Even with the changed cache hierarchy in the Intel Xeon processor Scalable family, the effective cache available per core is roughly the same as in the previous generation for a usage scenario where different applications are running on different cores. Because of the non-inclusive nature of the LLC, the effective cache capacity for an application running on a single core is a combination of the MLC size and a portion of the LLC size. For other usage scenarios, such as multithreaded applications running across multiple cores with some shared code and data, or a scenario where only a subset of the cores on the socket are used, the effective cache capacity seen by the applications may differ from that on previous-generation CPUs. In some cases, application developers may need to adapt their code to optimize it for the changed cache hierarchy on the Intel Xeon processor Scalable family of processors.

Page Protection Keys

Memory corruption caused by stray writes is an issue in complex multithreaded applications. For example, not every part of the code in a database application needs the same level of privilege. The log writer should have write privileges to the log buffer, but it should have only read privileges on other pages. Similarly, in an application with producer and consumer threads for some critical data structures, producer threads can be given additional rights over consumer threads on specific pages.

The page-based memory protection mechanism can be used to harden applications. However, page table changes are costly for performance because they require Translation Lookaside Buffer (TLB) shootdowns and subsequent TLB misses. Protection keys provide a user-level, page-granular way to grant and revoke access permission without changing page tables.

Protection keys provide 16 domains for user pages and use bits 62:59 of the page table leaf nodes (for example, the PTE) to identify the protection domain (PKEY). Each protection domain has two permission bits in a new thread-private register called PKRU. On a memory access, the page table lookup determines the protection domain (PKEY) of the access, and the corresponding domain-specific permission is read from the PKRU register to see whether access and write permission is granted. An access is allowed only if both the protection keys and the legacy page permissions allow it. Protection key violations are reported as page faults with a new page fault error code bit. Protection keys have no effect on supervisor pages, but supervisor accesses to user pages are subject to the same checks as user accesses.
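As a hedged illustration of the programming model, the following C sketch assumes a Linux kernel and glibc recent enough to expose the pkey_alloc/pkey_mprotect/pkey_set wrappers and a CPU with protection keys; it is not taken from the article itself.

	/* Minimal sketch; assumes Linux with pkey_* wrappers (glibc 2.27+). */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
	    /* Map one page of ordinary read/write memory. */
	    size_t len = 4096;
	    char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
	                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	    if (page == MAP_FAILED)
	        return 1;

	    /* Allocate a protection key and associate it with the page. */
	    int pkey = pkey_alloc(0, 0);
	    if (pkey < 0 || pkey_mprotect(page, len, PROT_READ | PROT_WRITE, pkey)) {
	        perror("pkey");
	        return 1;
	    }

	    page[0] = 'a';                      /* write allowed */

	    /* Revoke write permission for this thread via the PKRU register
	     * only; no page-table change or TLB shootdown is required. */
	    pkey_set(pkey, PKEY_DISABLE_WRITE);

	    printf("%c\n", page[0]);            /* reads are still allowed */
	    /* page[0] = 'b'; would now fault with the protection-key error code */

	    pkey_set(pkey, 0);                  /* restore write access */
	    pkey_free(pkey);
	    munmap(page, len);
	    return 0;
	}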

Diagram of memory data access with protection key
Figure 12. Diagram of memory data access with protection key.

In order to benefit from protection keys, support is required from the virtual machine manager, OS, and compiler. Utilizing this feature does not cause a performance impact because it is an extension of the memory management architecture.

Intel® Memory Protection Extensions (Intel® MPX)

C/C++ pointer arithmetic is a convenient language construct often used to step through an array of data structures. If an iterative write operation does not take into consideration the bounds of the destination, adjacent memory locations may get corrupted. Such unintended modification of adjacent data is referred to as a buffer overflow. Buffer overflows have been exploited to cause denial-of-service (DoS) attacks and system crashes. Similarly, uncontrolled reads could reveal cryptographic keys and passwords. More sinister attacks, which do not immediately draw the attention of the user or system administrator, alter the code execution path, for example by modifying the return address in the stack frame to execute malicious code or script.

Intel’s Execute Disable Bit and similar hardware features from other vendors have blocked buffer overflow attacks that redirected the execution to malicious code stored as data. Intel® MPX technology consists of new Intel® architecture instructions and registers that compilers can use to check the bounds of a pointer at runtime before it is used. This new hardware technology is supported by the compiler.
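For context, the sketch below shows the kind of off-by-one write Intel MPX is designed to catch; the compiler flags mentioned in the comment (for example, GCC's -mmpx -fcheck-pointer-bounds) are an assumption about the toolchain and are only available on compiler versions that still ship MPX support.

	/* Minimal sketch of an out-of-bounds write. With MPX instrumentation
	 * (e.g. building with -mmpx -fcheck-pointer-bounds on a toolchain that
	 * supports it), the bad write raises a bound-range (#BR) exception
	 * instead of silently corrupting adjacent data. */
	#include <stdio.h>

	static void fill(int *dst, int count)
	{
	    /* Off-by-one on purpose: the write at i == count exceeds the
	     * bounds that travel with 'dst' under MPX instrumentation. */
	    for (int i = 0; i <= count; i++)
	        dst[i] = i;
	}

	int main(void)
	{
	    int buffer[10];
	    int canary = 0x1234;

	    fill(buffer, 10);
	    printf("canary = 0x%x\n", canary); /* may print a corrupted value
	                                          when run without MPX checks */
	    return 0;
	}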

New Intel® Memory Protection Extensions instructions and example of their effect on memory
Figure 13. New Intel® Memory Protection Extensions instructions and example of their effect on memory.

For additional information see Intel® Memory Protection Extensions Enabling Guide.

Mode-Based Execute (MBE) Control

MBE provides finer grain control on execute permissions to help protect the integrity of the system code from malicious changes. It provides additional refinement within the Extended Page Tables (EPT) by turning the Execute Enable (X) permission bit into two options:

  • XU for user pages
  • XS for supervisor pages

The CPU selects one or the other based on permission of the guest page and maintains an invariant for every page that does not allow it to be writable and supervisor-executable at the same time. A benefit of this feature is that a hypervisor can more reliably verify and enforce the integrity of kernel-level code. The value of the XU/XS bits is delivered through the hypervisor, so hypervisor support is necessary.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

Generational overview of Intel® Advanced Vector Extensions technology
Figure 14. Generational overview of Intel® Advanced Vector Extensions technology.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512) was originally introduced with the Intel® Xeon Phi™ processor product line (formerly Knights Landing). There are certain Intel AVX-512 instruction groups (AVX512CD and AVX512F) that are common to the Intel® Xeon Phi™ processor product line and the Intel Xeon processor Scalable family. However the Intel Xeon processor Scalable family introduces new Intel AVX-512 instruction groups (AVX512BW and AVX512DQ) as well as a new capability (AVX512VL) to expand the benefits of the technology. The AVX512DQ instruction group is focused on new additions for benefiting high-performance computing (HPC) workloads such as oil and gas, seismic modeling, financial services industry, molecular dynamics, ray tracing, double-precision matrix multiplication, fast Fourier transform and convolutions, and RSA cryptography. The AVX512BW instruction group supports Byte/Word operations, which can benefit some enterprise applications, media applications, as well as HPC. AVX512VL is not an instruction group but a feature that is associated with vector length orthogonality.

Broadwell, the previous processor generation, has up to two floating-point FMAs (fused multiply-add units) per core, and this has not changed with the Intel Xeon processor Scalable family. However, the Intel Xeon processor Scalable family doubles the number of elements that can be processed compared to Broadwell, as the FMAs on the Intel Xeon processor Scalable family of processors have been expanded from 256 bits to 512 bits.

Generation feature comparison of Intel® Advanced Vector Extensions technology
Figure 15. Generation feature comparison of Intel® Advanced Vector Extensions technology.

Intel AVX-512 instructions offer the highest degree of support to software developers by including an unprecedented level of richness in the design of the instructions. This includes 512-bit operations on packed floating-point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, additional gather/scatter support, high-speed math instructions, and compact representation of large displacement value. The following sections cover some of the details of the new features of Intel AVX-512.

AVX512DQ

The doubleword and quadword instructions, indicated by the AVX512DQ CPUID flag, enhance integer and floating-point operations with additional instructions that operate on 512-bit vectors of sixteen 32-bit or eight 64-bit elements. Some of these instructions provide new functionality, such as the conversion of floating-point numbers to 64-bit integers. Others promote existing instructions, such as vxorps, to use 512-bit registers.
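As an illustrative sketch (assuming a compiler and CPU with AVX512F/AVX512DQ support, for example built with -mavx512f -mavx512dq), the snippet below uses the _mm512_cvtpd_epi64 intrinsic, one of the new direct double-to-64-bit-integer conversions:

	/* Minimal sketch; assumes AVX512F and AVX512DQ support. */
	#include <immintrin.h>
	#include <stdio.h>

	int main(void)
	{
	    /* Eight packed doubles converted directly to eight 64-bit
	     * integers, new functionality provided by the AVX512DQ group. */
	    __m512d d = _mm512_set_pd(8.9, 7.5, 6.2, 5.0, 4.9, 3.1, 2.7, 1.4);
	    __m512i q = _mm512_cvtpd_epi64(d);

	    long long out[8];
	    _mm512_storeu_si512((void *)out, q);

	    for (int i = 0; i < 8; i++)
	        printf("%lld ", out[i]);
	    printf("\n");
	    return 0;
	}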

AVX512BW

The byte and word instructions, indicated by the AVX512BW CPUID flag, enhance integer operations, extending write-masking and zero-masking to support smaller element sizes. The original Intel AVX-512 Foundation instructions supported such masking with vector element sizes of 32 or 64 bits, because a 512-bit vector register could hold at most 16 32-bit elements, so a write mask size of 16 bits was sufficient.

An instruction indicated by the AVX512BW CPUID flag requires a write mask size of up to 64 bits because a 512-bit vector register can hold 64 8-bit elements or 32 16-bit elements. Two new mask types (__mmask32 and __mmask64) along with additional maskable intrinsics have been introduced to support this operation.
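A minimal sketch, assuming AVX512BW support in both the compiler and the CPU, of a 64-bit opmask driving a per-byte masked operation (the function name and the greater-than-zero predicate are illustrative):

	/* Minimal sketch; assumes AVX512BW support (e.g. built with -mavx512bw). */
	#include <immintrin.h>

	__m512i add_where_positive(__m512i a, __m512i b)
	{
	    /* Build a 64-bit mask selecting the byte lanes of 'a' that are
	     * greater than zero, then do a saturating byte add only in those
	     * lanes, leaving the remaining lanes of 'a' unchanged (merge
	     * masking). */
	    __mmask64 k = _mm512_cmpgt_epi8_mask(a, _mm512_setzero_si512());
	    return _mm512_mask_adds_epi8(a, k, a, b);
	}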

AVX512VL

An additional orthogonal capability known as Vector Length Extensions provides for most Intel AVX-512 instructions to operate on 128 or 256 bits, instead of only 512. Vector Length Extensions can currently be applied to most Foundation Instructions and the Conflict Detection Instructions, as well as the new byte, word, doubleword, and quadword instructions. These Intel AVX-512 Vector Length Extensions are indicated by the AVX512VL CPUID flag. Vector Length Extensions extend most Intel AVX-512 operations to also operate on XMM (128-bit, SSE) registers and YMM (256-bit, AVX) registers, and allow the capabilities of EVEX encodings, including the use of mask registers and access to registers 16–31, to be applied to XMM and YMM registers instead of only to ZMM registers.

Mask Registers

In previous generations of Intel® Advanced Vector Extensions and Intel® Advanced Vector Extensions 2, the ability to mask bits was limited to load and store operations. In Intel AVX-512 this feature has been greatly expanded with eight new opmask registers used for conditional execution and efficient merging of destination operands. The width of each opmask register is 64 bits, and they are identified as k0–k7. Seven of the eight opmask registers (k1–k7) can be used in conjunction with EVEX-encoded Intel AVX-512 Foundation Instructions to provide conditional processing, such as with vectorized remainders that only partially fill the register, while opmask register k0 is typically treated as a “no mask” value when unconditional processing of all data elements is desired. Additionally, the opmask registers are used as vector flags/element-level vector sources to introduce novel SIMD functionality, as seen in new instructions such as VCOMPRESSPS. Support for the 512-bit SIMD registers and the opmask registers is managed by the operating system using the XSAVE/XRSTOR/XSAVEOPT instructions (see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A).
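The following sketch (assuming AVX512F support; the function and array names are illustrative) shows the typical remainder pattern, where a mask with one bit per valid lane lets the last partial vector be processed without touching memory past the end of the arrays:

	/* Minimal sketch; assumes AVX512F support. */
	#include <immintrin.h>

	void scale(float *dst, const float *src, float factor, int n)
	{
	    int i = 0;
	    __m512 f = _mm512_set1_ps(factor);

	    /* Full 16-element vectors. */
	    for (; i + 16 <= n; i += 16) {
	        __m512 v = _mm512_loadu_ps(src + i);
	        _mm512_storeu_ps(dst + i, _mm512_mul_ps(v, f));
	    }

	    /* Remainder: build a mask with one bit per valid lane so the
	     * masked load/store never touch memory beyond the arrays. */
	    if (i < n) {
	        __mmask16 k = (__mmask16)((1u << (n - i)) - 1);
	        __m512 v = _mm512_maskz_loadu_ps(k, src + i);
	        _mm512_mask_storeu_ps(dst + i, k, _mm512_mul_ps(v, f));
	    }
	}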

Example of opmask register k1
Figure 16. Example of opmask register k1.

Embedded Rounding

Embedded rounding provides additional support for math calculations by allowing the floating-point rounding mode to be explicitly specified for an individual operation, without having to modify the rounding controls in the MXCSR control register. In previous SIMD instruction extensions, rounding control is generally specified in the MXCSR control register, with a handful of instructions providing per-instruction rounding override via encoding fields within the imm8 operand. Intel AVX-512 offers a more flexible encoding attribute to override MXCSR-based rounding control for floating-point instructions with rounding semantics. This rounding attribute, embedded in the EVEX prefix, is called Static (per instruction) Rounding Mode or Rounding Mode override. Static rounding also implies exception suppression (SAE), as if all floating-point exceptions are disabled and no status flags are set. Static rounding enables better accuracy control in intermediate steps for division and square root operations for extra precision, while the default MXCSR rounding mode is used in the last step. It can also help in cases where precision is needed in the least significant bit, such as in range reduction for trigonometric functions.
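A small sketch, assuming AVX512F support, contrasting a per-instruction static rounding override with an ordinary add that follows the MXCSR setting:

	/* Minimal sketch; assumes AVX512F support. */
	#include <immintrin.h>

	__m512 add_round_down(__m512 a, __m512 b)
	{
	    /* Round toward negative infinity for this one operation only;
	     * SAE (suppress-all-exceptions) is implied with static rounding. */
	    return _mm512_add_round_ps(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
	}

	__m512 add_default(__m512 a, __m512 b)
	{
	    /* Ordinary add: uses whatever rounding mode MXCSR currently holds. */
	    return _mm512_add_ps(a, b);
	}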

Embedded Broadcast

Embedded broadcast provides a bit-field to encode data broadcast for some load-op instructions, that is, instructions that load data from memory and perform a computational or data movement operation. A source element from memory can be broadcast (repeated) across all elements of the effective source operand, without requiring an extra instruction. This is useful when the same scalar operand is reused for all operations in a vector instruction. Embedded broadcast is only enabled on instructions with an element size of 32 or 64 bits, not on byte and word instructions.
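As a hedged example (assuming AVX512F support), the loop below reuses one scalar across all 16 lanes; a compiler is free to encode the add with an embedded broadcast of the memory operand (for example, vaddps zmm, zmm, dword ptr [mem]{1to16}) rather than a separate broadcast instruction:

	/* Minimal sketch; assumes AVX512F support. */
	#include <immintrin.h>

	void add_scalar(float *dst, const float *src, const float *scalar, int n)
	{
	    /* The same 32-bit value is repeated across all 16 lanes. */
	    __m512 s = _mm512_set1_ps(*scalar);

	    for (int i = 0; i + 16 <= n; i += 16) {
	        __m512 v = _mm512_loadu_ps(src + i);
	        _mm512_storeu_ps(dst + i, _mm512_add_ps(v, s));
	    }
	}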

Quadword Integer Arithmetic

Quadword integer arithmetic removes the need for expensive software emulation sequences. These instructions include gather/scatter with dword/qword indices, and instructions that can partially execute, where a k-register mask is used as a completion mask.

Table 3. Quadword integer arithmetic instructions.

Table 3 Quadword integer arithmetic instructions

Math Support

Math Support is designed to aid with math library writing and to benefit financial applications. Available data types include PS, PD, SS, and SD. IEEE division/square root formats, DP transcendental primitives, and new transcendental support instructions are also included.

Table 4. Math support instructions.

Table 4 Math support instructions

New Permutation Primitives

Intel AVX-512 introduces new permutation primitives such as 2-source shuffles with 16/32-entry table lookups with transcendental support, matrix transpose, and a variable VALIGN emulation.

Table 5. 2-Source shuffles instructions.

Table 5 2-Source shuffles instructions

Example of a 2-source shuffles operation
Figure 17. Example of a 2-source shuffles operation.

Expand and Compress

Expand and Compress allow vectorization of conditional loops. Similar to the FORTRAN pack/unpack intrinsics, they provide memory fault suppression and can be faster than using gather/scatter, and compress provides the opposite operation to expand. The figure below shows an example of an expand operation.
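A minimal sketch, assuming AVX512F support, of a conditional copy vectorized with a compress store (the function name and the greater-than-zero predicate are illustrative):

	/* Minimal sketch; assumes AVX512F support. */
	#include <immintrin.h>

	/* Copies the positive elements of src[0..15] to dst and returns how
	 * many elements were written. */
	int copy_positive16(float *dst, const float *src)
	{
	    __m512 v = _mm512_loadu_ps(src);
	    __mmask16 k = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);

	    /* Only lanes whose mask bit is set are written, packed
	     * contiguously into the destination. */
	    _mm512_mask_compressstoreu_ps(dst, k, v);

	    /* Population count of the mask gives the number of stored elements. */
	    return (int)_mm_popcnt_u32((unsigned int)k);
	}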

Expand instruction and diagram
Figure 18. Expand instruction and diagram.

Bit Manipulation

Intel AVX-512 provides support for bit manipulation operations on mask and vector operands, including vector rotate. These operations can be used to manipulate mask registers, and they have some application in cryptography algorithms.

Table 6. Bit manipulation instructions.

Bit manipulation instructions

Universal Ternary Logical Operation

A universal ternary logical operation is another feature of Intel AVX-512 that provides a way to mimic an FPGA cell. The VPTERNLOGD and VPTERNLOGQ instructions operate on dword and qword elements and take the corresponding bits of the three input data elements to form a set of 32/64 3-bit indices, where each 3-bit value provides an index into an 8-bit lookup table represented by the imm8 byte of the instruction. The 256 possible values of the imm8 byte can be viewed as a 16x16 Boolean logic table, which can be filled with simple or compound Boolean logic expressions.
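As an illustration (assuming AVX512F support), the sketch below uses the VPTERNLOGD intrinsic with the commonly used truth-table value 0xCA to perform a bitwise select in a single instruction:

	/* Minimal sketch; assumes AVX512F support. The imm8 value is an
	 * 8-entry truth table indexed by the corresponding bits of the three
	 * sources (first operand supplies the MSB of the index); 0xCA encodes
	 * the bitwise select (mask & if_set) | (~mask & if_clear). */
	#include <immintrin.h>

	__m512i bitwise_select(__m512i mask, __m512i if_set, __m512i if_clear)
	{
	    return _mm512_ternarylogic_epi32(mask, if_set, if_clear, 0xCA);
	}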

Conflict Detection Instructions

Intel AVX-512 introduces new conflict detection instructions. This includes the VPCONFLICT instruction along with a subset of supporting instructions. The VPCONFLICT instruction allows for detection of elements with previous conflicts in a vector of indexes. It can generate a mask with a subset of elements that are guaranteed to be conflict free. The computation loop can be re-executed with the remaining elements until all the indexes have been operated on.

Table 7. Conflict detection instructions.

Conflict detection instructions

VPCONFLICT{D,Q} zmm1{k1}{z}, zmm2/B(mV): for every element in ZMM2, compare it against all the other elements and generate a mask identifying the matches, ignoring elements to the left of the current one, that is, “newer” elements.

Diagram of mask generation for VPCONFLICT
Figure 19. Diagram of mask generation for VPCONFLICT.

In order to benefit from CDI, use Intel compilers version 16.0 in Intel® C++ Composer XE 2016, which will recognize potential run-time conflicts and generate VPCONFLICT loops automatically.
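For reference, a loop of the kind the compiler can vectorize with conflict detection is sketched below; the histogram shape and the compiler flag mentioned in the comment are illustrative assumptions rather than content from the original article.

	/* Minimal sketch of a loop with potential index conflicts: if two
	 * iterations in the same vector share the same bin, the update order
	 * must be preserved. With a suitable compiler (for example, the Intel
	 * compiler targeting AVX-512 via -xCORE-AVX512), this loop can be
	 * auto-vectorized using VPCONFLICT. */
	void histogram(int *hist, const int *bins, int n)
	{
	    for (int i = 0; i < n; i++)
	        hist[bins[i]]++;   /* bins[i] may repeat within a vector of indices */
	}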

Transcendental Support

Additional 512-bit instruction extensions have been provided to accelerate certain transcendental mathematic computations and can be found in the instructions VEXP2PD, VEXP2PS, VRCP28xx, and VRSQRT28xx, also known as Intel AVX-512 Exponential and Reciprocal instructions. These can benefit some finance applications.

Compiler Support

Intel AVX-512 optimizations are included in Intel compilers version 16.0 in Intel C++ Composer XE 2016 and the GNU* Compiler Collection (GCC) 5.0 (NASM 2.11.08 and binutils 2.25). Table 8 summarizes compiler arguments for optimization on the Intel Xeon processor Scalable family microarchitecture with Intel AVX-512.

Table 8. Summary of Intel Xeon processor Scalable family compiler optimizations.

Table 8 Summary of Intel Xeon processor Scalable family compiler optimizations

For more information see the Intel® Architecture Instruction Set Extensions Programming Reference.

Time Stamp Counter (TSC) Enhancement for Virtualization

The Intel Xeon processor Scalable family introduces a new TSC scaling feature to assist with migration of a virtual machine across different systems. In previous Intel Xeon processors, the TSC of a VM cannot automatically adjust itself to compensate for a processor frequency difference as it migrates from one platform to another. The Intel Xeon processor Scalable family enhances TSC virtualization support by adding a scaling feature in addition to the offsetting feature available in prior-generation CPUs. For more details on this feature see Intel® 64 SDM (search for “TSC Scaling”, e.g., Vol 3A – Sec 24.6.5, Sec 25.3, Sec 36.5.2.6).

Intel® Resource Director Technology (Intel® RDT)

Intel® Resource Director Technology (Intel® RDT) is a set of technologies designed to help monitor and manage shared resources. See Optimize Resource Utilization with Intel® Resource Director Technology for an animation illustrating the key principles behind Intel RDT. Intel RDT already includes several features that provide benefit, such as Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), Memory Bandwidth Monitoring (MBM), and Code and Data Prioritization (CDP). The Intel Xeon processor Scalable family on the Purley platform introduces a new feature called Memory Bandwidth Allocation (MBA), which provides per-thread memory bandwidth control. Through software, the amount of memory bandwidth consumed by a thread or core can be limited. This feature can be used in conjunction with MBM to isolate a noisy neighbor. Section 17.16 in Volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) covers programming details of the Intel RDT features. Using this feature requires enabling at the OS or VMM level, and the Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x) feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
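As a hedged sketch of one way to exercise MBA from software, the C snippet below assumes a Linux kernel with the resctrl filesystem mounted at /sys/fs/resctrl; the group name "lowprio", the 20% cap, and the PID are placeholders, and error handling is abbreviated.

	/* Minimal sketch; assumes Linux resctrl support with MBA enabled and
	 * the resctrl filesystem mounted at /sys/fs/resctrl. */
	#include <stdio.h>
	#include <sys/stat.h>

	static int write_str(const char *path, const char *text)
	{
	    FILE *f = fopen(path, "w");
	    if (!f)
	        return -1;
	    fprintf(f, "%s\n", text);
	    return fclose(f);
	}

	int main(void)
	{
	    /* New resource group for the throttled ("noisy neighbor") workload. */
	    mkdir("/sys/fs/resctrl/lowprio", 0755);

	    /* Limit memory bandwidth on socket 0 to roughly 20% for this group. */
	    write_str("/sys/fs/resctrl/lowprio/schemata", "MB:0=20");

	    /* Move a task (PID 1234 here is a placeholder) into the group. */
	    write_str("/sys/fs/resctrl/lowprio/tasks", "1234");

	    return 0;
	}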

Memory Bandwidth Monitoring (MBM)

Memory Bandwidth Allocation (MBA)

Conceptual diagram of using Memory Bandwidth Monitoring
Figure 20. Conceptual diagram of using Memory Bandwidth Monitoring to identify noisy neighbor (core 0) and then using Memory Bandwidth Allocation to prioritize memory bandwidth.

Intel® Speed Shift Technology

Broadwell introduced Hardware Power Management (HWPM), an optional processor power management feature in the hardware that liberates the OS from making decisions about processor frequency. HWPM allows the platform to provide information on all available constraints, allowing the hardware to choose the optimal operating point. Operating independently, the hardware uses information that is not available to software and is able to make more optimized decisions about P-states and C-states. The Intel Xeon processor Scalable family on the Purley platform expands on this feature by providing a broader range of states that it can affect, as well as a finer level of granularity and microarchitecture observability via the Package Control Unit (PCU). On Broadwell, HWPM was autonomous, also known as Out-of-Band (OOB) mode, and oblivious to the operating system. The Intel Xeon processor Scalable family allows for this as well but also offers the option of collaboration between the HWPM and the operating system, known as native mode. The operating system can directly control the tuning of the performance and power profile when and where it is desired, while elsewhere the PCU can take autonomous control in the absence of constraints placed by the operating system. In native mode, the Intel Xeon processor Scalable family is able to optimize frequency control for legacy operating systems while providing new usage models for modern operating systems. The end user can set these options within the BIOS; see your OEM BIOS guide for more information. Modern operating systems that provide full integration with native mode include Linux* starting with kernel 4.10 and Windows Server* 2016.

PMax Detection

A processor-implemented detection circuit provides faster detection of and response to PMax-level load events. Previously, PMax detection circuits resided either in the power supply unit (PSU) or on the system board, while the new detection circuit on the Intel Xeon processor Scalable family resides primarily on the processor side. In general, the PMax detection circuit provided with the Intel Xeon processor Scalable family allows for faster PMax detection and response time compared to prior-generation PMax detection methods. PMax detection allows the processor to be throttled back when it detects that power limits are being hit. This can assist with PMax spikes associated with power-virus applications while in turbo mode, prior to the PSU reacting. A faster response time to PMax load events potentially allows for power cost savings. The end user can set PMax detection within the BIOS; see your OEM BIOS guide for more information.

Intel® Omni-Path Architecture (Intel® OPA)

Intel® Omni-Path Architecture (Intel® OPA), an element of Intel® Scalable System Framework, delivers the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens of thousands of nodes—and eventually more—at a price competitive with today’s fabrics. The Intel OPA 100 Series product line is an end-to-end solution of PCIe* adapters, silicon, switches, cables, and management software. As the successor to Intel® True Scale Fabric, this optimized HPC fabric is built upon a combination of enhanced IP and Intel® technology.

For software applications, Intel OPA will maintain consistency and compatibility with existing Intel True Scale Fabric and InfiniBand* APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux distribution releases. Intel True Scale Fabric customers will be able to migrate to Intel OPA through an upgrade program.

The Intel Xeon processor Scalable family on the Purley platform supports Intel OPA in one of two forms: through the use of an Intel® Omni-Path Host Fabric Interface 100 Series add-in card or through a specific processor model line (SKL-F) found within the Intel Xeon processor Scalable family that has a Host Fabric Interface integrated into the processor. The fabric integration on the processor has its own dedicated pathways on the motherboard and doesn’t impact the PCIe lanes available for add-in cards. The architecture is able to provide up to 100 Gb/s per processor socket.

Intel is working with the open source community to provide all host components, with changes being pushed upstream in conjunction with Delta Package releases. OSVs are working with Intel to incorporate support into future OS distributions. Existing Message Passing Interface (MPI) programs and MPI libraries for Intel True Scale Fabric that use PSM will work as-is with the Intel Omni-Path Host Fabric Interface without recompiling, although recompiling can expose additional benefits.

Software support can be found at the Intel Download Center, and compiler support can be found in Intel® Parallel Studio XE 2017.

Intel QuickAssist Technology

Intel® QuickAssist Technology (Intel® QAT) accelerates cryptographic and compression workloads by offloading the data to hardware capable of optimizing those functions. This makes it easier for developers to integrate built-in cryptographic accelerators into network, storage, and security applications. In the case of the Intel Xeon processor Scalable family on the Purley platform, Intel QAT is integrated into the hardware of the Intel® C620 series chipset (formerly Lewisburg) and offers capabilities including 100 Gb/s crypto, 100 Gb/s compression, and 100K ops/s RSA 2K decrypt. Segments that can benefit from the technology include the following:

  • Server: secure browsing, email, search, big-data analytics (Hadoop), secure multi-tenancy, IPsec, SSL/TLS, OpenSSL
  • Networking: firewall, IDS/IPS, VPN, secure routing, Web proxy, WAN optimization (IP Comp), 3G/4G authentication
  • Storage: real-time data compression, static data compression, secure storage.

Supported Algorithms include the following:

  • Cipher Algorithms: Null, ARC4, AES (key sizes 128, 192, 256), DES, 3DES, Kasumi, Snow 3G, and ZUC
  • Hash/Authentication Algorithms Supported: MD5, SHA-1, SHA-2 (output sizes 224, 256, 384, 512), SHA-3 (output size 256 only), Advanced Encryption Standard (key sizes 128, 192, 256), Kasumi, Snow 3G, and ZUC
  • Authenticated Encryption (AEAD) Algorithm: AES (key sizes 128, 192, 256)
  • Public Key Cryptography Algorithms: RSA, DSA, Diffie-Hellman (DH), Large Number Arithmetic, ECDSA, ECDH, EC, SM2, and EC25519

ZUC and SHA-3 are new algorithms that have been included in the third generation of Intel QuickAssist Technology found on the Intel® C620 series chipset.

Intel® Key Protection Technology (Intel® KPT) is a new supplemental feature of Intel QAT that can be found on the Intel Xeon processor Scalable family on the Purley platform with the Intel® C620 series chipset. Intel KPT has been developed to help secure cryptographic keys from platform level software and hardware attacks when the key is stored and used on the platform. This new feature focuses on protecting keys during runtime usage and is embodied within tools, techniques, and the API framework.

For a more detailed overview see Intel® QuickAssist Technology for Storage, Server, Networking and Cloud-Based Deployments. Programming and optimization guides can be found on the 01 Intel Open Source website.

Internet Wide Area RDMA Protocol (iWARP)

iWARP is a technology that allows network traffic managed by the NIC to bypass the kernel, which reduces the impact on the processor due to the absence of network-related interrupts. This is accomplished by the NICs communicating with each other via queue pairs to deliver traffic directly into the application user space. Large storage blocks and virtual machine migration tend to place more burden on the CPU due to the network traffic, which is where iWARP can be of benefit. Through the use of queue pairs, the destination of the data is already known, so it can be placed directly into the application user space. This eliminates the extra data copies between kernel space and user space that would normally occur without iWARP.

For more information, see the video Accelerating Ethernet with iWARP Technology.

iWARP comparison block diagram
Figure 21. iWARP comparison block diagram.

The Purley platform has an Integrated Intel Ethernet Connection X722 with up to 4x10 GbE/1 Gb connections that provide iWARP support. This new feature can benefit various segments including network function virtualization and software-defined infrastructure. It can also be combined with the Data Plane Development Kit to provide additional benefits with packet forwarding.

iWARP endpoints use verbs APIs to communicate with each other instead of traditional sockets. On Linux*, the OpenFabrics Alliance (OFA) OFED stack provides the verbs APIs, while Windows* uses Network Direct APIs. Check with your Linux distribution to see if it supports OFED verbs; on Windows, support is provided starting with Windows Server 2012 R2.
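As a small illustration of the verbs programming model (assuming the libibverbs development package is installed and the program is linked with -libverbs; this is not specific to the X722), the C sketch below simply enumerates the RDMA-capable devices exposed through the verbs API:

	/* Minimal sketch; assumes libibverbs is installed (link with -libverbs). */
	#include <stdio.h>
	#include <infiniband/verbs.h>

	int main(void)
	{
	    int num = 0;
	    struct ibv_device **list = ibv_get_device_list(&num);
	    if (!list) {
	        perror("ibv_get_device_list");
	        return 1;
	    }

	    /* An iWARP-capable port with the appropriate driver loaded would
	     * appear in this list alongside any InfiniBand devices. */
	    printf("Found %d RDMA device(s)\n", num);
	    for (int i = 0; i < num; i++)
	        printf("  %s\n", ibv_get_device_name(list[i]));

	    ibv_free_device_list(list);
	    return 0;
	}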

New and Enhanced RAS Features

The Intel Xeon processor Scalable family on the Purley platform provides several new features as well as enhancements of some existing features associated with the RAS (Reliability, Availability, and Serviceability) and Intel® Run Sure Technology. Two levels of support are provided with the Intel Xeon processor Scalable family: Standard RAS and Advanced RAS. Advanced RAS includes all of the Standard RAS features along with additional features.

In previous generations there could be limitations in RAS features based on the processor socket count (2–8). This has changed: all of the RAS features are available on two-socket or larger versions of the platform, depending on the level (Bronze through Platinum) of the processors. Listed below is a summary of the new and enhanced RAS features relative to the previous generation.

Table 9. RAS feature summary table.

Table 9 RAS feature summary table

Intel® Virtual RAID on CPU (Intel® VROC)

Intel® VROC replaces third-party RAID cards
Figure 22. Intel® VROC replaces third-party RAID cards.

Intel VROC is a software solution that integrates with a new hardware technology called Intel® Volume Management Device (Intel® VMD) to provide a compelling hybrid RAID solution for NVMe* (Non-Volatile Memory Express*) solid-state drives (SSDs). The CPU has onboard capabilities that work more closely with the chipset to provide quick access to the directly attached NVMe SSDs on the PCIe lanes of the platform. The major features that help to make this possible are Intel® Rapid Storage Technology enterprise (Intel® RSTe) version 5.0, Intel VMD, and the Intel provided NVMe driver.

Intel RSTe is a driver and application package that allows for administration of the RAID features. It has been updated (version 5.0) on the Purley platform to take advantage of all of the new features. The NVMe driver allows restrictions that might have been placed on it by an operating system to be bypassed. This means that features like hot insert could be available even if the OS doesn’t provide it, and the driver can also provide support for third-party vendor NVMe non-Intel SSDs.

Intel VMD is a new technology introduced with the Intel Xeon processor Scalable family primarily to improve the management of high-speed SSDs. Previously SSDs were attached to a SATA or other interface types and managing them through software was acceptable. When we move toward directly attaching the SSDs to a PCIe interface in order to improve bandwidth, software management of those SSDs adds more delays. Intel VMD uses hardware to mitigate these management issues rather than completely relying on software.

Some of the major RAID features provided by Intel VROC include a protected write-back cache, isolation of storage devices from the OS (error handling), and protection of RAID 5 data from the RAID write-hole issue through software logging, which can eliminate the need for a battery backup unit. Directly attached NVMe RAID volumes are bootable, support hot insert and surprise removal, provide LED management options and native 4K NVMe SSD support, and offer multiple management options, including remote access from a web page, interaction at the UEFI level for pre-OS tasks, and a GUI at the OS level.

Boot Guard

Boot Guard adds another level of protection to the Purley platform by performing a cryptographic Root of Trust for Measurement (RTM) of the early firmware into a platform storage device such as the Trusted Platform Module (TPM) or Intel® Platform Trust Technology (Intel® PTT). It can also cryptographically verify early firmware using OEM-provided policies. Unlike Intel® Trusted Execution Technology (Intel® TXT), Boot Guard doesn’t have any software requirements; it is enabled at the factory, and it cannot be disabled. Boot Guard operates independently of Intel TXT but is also compatible with it. Boot Guard reduces the chance of malware exploiting the hardware or software components.

Boot Guard secure boot options
Figure 23. Boot Guard secure boot options.

BIOS Guard 2.0

BIOS Guard is an augmentation of existing chipset-based BIOS flash protection capabilities. The Purley platform adds a fault-tolerant boot block update capability. The BIOS flash is segregated into protected and unprotected regions. Purley bypasses the top-swap feature and flash range register locks/protections, for explicitly enabled signed scripts, to facilitate the fault-tolerant boot block update. This feature protects the BIOS flash from modification without the platform manufacturer’s authorization, as well as during BIOS updates. It can also help defend the platform from low-level denial-of-service (DoS) attacks.

For more details see Intel® Hardware-based Security Technologies for Intelligent Retail Devices.

BIOS Guard 2.0 block diagram
Figure 24. BIOS Guard 2.0 block diagram.

Intel® Processor Trace

Intel® Processor Trace (Intel® PT) is an exciting feature with improved support on the Intel Xeon processor Scalable family that can be enormously helpful in debugging, because it exposes an accurate and detailed trace of activity with triggering and filtering capabilities to help with isolating the tracing that matters.

Intel PT provides the context around all kinds of events. Performance profilers can use Intel PT to discover the root causes of “response-time” issues—performance issues that affect the quality of execution, if not the overall runtime.

Further, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific back-edges and loop tripcounts, is easy to extract and report.

Debuggers can use Intel PT to reconstruct the code flow that led to the current location, whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call we just stepped over. They may even allow navigating in the recorded execution history via reverse stepping commands.

Another important use case is debugging stack corruptions. When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results. Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.

Operating systems could include Intel PT into core files. This would allow debuggers to not only inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally so that when an OS crash occurs, the trace can be saved as part of an OS crash dump mechanism and then used later to reconstruct the failure.

Intel PT can also help to narrow down data races in multi-threaded operating systems and user program code. It can log the execution of all threads with a rough time indication. While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.

To utilize Intel PT you need Intel® VTune™ Amplifier 2017.

For more information see Debug and fine-grain profiling with Intel processor trace, given by Beeman Strong, Senior, and Processor tracing by James Reinders.

Intel® Node Manager

Intel® Node Manager (Intel® NM) is a core set of power management features that provide a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel NM reports vital platform information, such as power, temperature, and resource utilization, using standards-based, out-of-band communications. Second, it provides fine-grained controls, such as helping to reduce overall power consumption or maximize rack loading, to limit platform power in compliance with IT policy. This feature can be found across Intel’s product segments, including the Intel Xeon processor Scalable family, providing consistency within the data center.

The Intel Xeon processor Scalable family on the Purley platform includes the fourth generation of Intel NM, which extends control and reporting to a finer level of granularity than the previous generation. To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Programmer’s Reference Kit is simple to use and requires no additional external libraries to compile or run; all that is needed is a C/C++ compiler, after which you run the configuration and compilation scripts.

Table 10. Intel Node Manager fourth-generation features

Table 10 Intel® Node Manager fourth-generation features

The Author: David Mulnix is a software engineer and has been with Intel Corporation for over 20 years. His areas of focus have included software automation, server power, and performance analysis, and he has contributed to the development support of the Server Efficiency Rating Tool™.

Contributors: Akhilesh Kumar and Elmoustapha Ould-ahmed-vall

Resources

Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM)

Intel® Architecture Instruction Set Extensions Programming Reference

Intel® Resource Director Technology (Intel® RDT)

Optimize Resource Utilization with Intel® Resource Director Technology

Intel® Memory Protection Extensions Enabling Guide

Intel® Scalable System Framework

Intel® Run Sure Technology

Intel® Hardware-based Security Technologies for Intelligent Retail Devices

Processor tracing by James Reinders

Debug and fine-grain profiling with Intel processor trace given by Beeman Strong, Senior

Intel® Node Manager Website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit for Intel® Node Manager

How to set up Intel® Node Manager

Intel® Performance Counter Monitor (Intel® PCM): a better way to measure CPU utilization

Intel® Memory Latency Checker (Intel® MLC): a Linux* tool for measuring the DRAM latency on your system

Intel® VTune™ Amplifier 2017: a rich set of performance insights into hotspots, threading, locks and waits, OpenCL bandwidth, and more, with a profiler to visualize results

The Intel® Xeon® processor-based server refresh savings estimator

Intel® Sound Analytic Engine and Intel® Smart Home Developer Kits: Use Cases and Applications


Overview

Speech cognitive technology is all around us. From the telephone payment systems at your local utility company, to the digital personal assistant in your phone, and more recently, the smart speaker sitting in your living room, speech cognition is a pervasive and rapidly growing technology.
Over the past five years, adoption of voice-enabled devices has grown exponentially. Driven by market leaders (Amazon, Google, and Microsoft), voice-controlled Smart Home devices are beginning to permeate the domestic environment. While we are most familiar with smart speakers, this is just the first phase in the evolution of the home from a place of shelter and comfort into a valuable tool that makes our lives easier. This presents an exciting opportunity for you as a product developer to add voice capabilities to new and innovative form factors. Whether it’s adding voice to a current design or building a new, voice-first product, Intel® technology provides the building blocks for prototyping and bringing new Smart Home solutions to market. But before determining what type of customer experience you intend to create, let’s look at some of the benefits and user requirements for adoption.

Benefits of Enabling Voice on Smart Home devices

The benefits of adding speech understanding to Smart Home devices for manufacturers can be grouped into three different categories:
  • Simplifying and accelerating access to the internet 
  • Learning and characterization of the user needs
  • Best-in-class hardware with over-the-top applications

Simplifying and Accelerating Access to the Internet 

Devices like smart speakers allow users to access information and services online with the intuitive power of their voice. These devices typically feature Personal Assistant technology and are connected to the internet to provide Natural Language Processing (NLP) in the cloud. This allows manufacturers to deliver cloud-based services seamlessly through voice-first devices. The device providers are also able to identify and learn customer preferences, which allows them to improve over time and grow adoption. 

Learning and Characterization of the User Needs

Speech can also be a valuable tool to understand your customers’ needs without requiring cloud-based services. Simple command and control functionality can be added to many different form factors (think coffee pots, dishwashers, microwaves) as an easy interface for users. In turn, this allows companies to better understand how their products are being used and improve the product lifecycle. 

Best-in-class Hardware with User-friendly Applications

Manufacturers looking to leverage their existing products to build high-fidelity, sound-capable devices gain a clear competitive advantage when adding voice to their platforms. These highly integrated, user-friendly designs often require little to no training for customers to be up and running. This allows companies to focus on continuing to develop best-in-class hardware while supporting a large application development infrastructure to bring a uniform and intuitive experience to users.

User Requirements

The rapid adoption of voice-first technology in the home is due in large part to the ease and instinctiveness of communicating with your natural voice. For that reason, it’s important for developers to focus on ease of use when building voice-enabled Smart Home products. To drive user adoption, voice-enabled Smart Home devices will require low latency, low word error rate (WER), a large vocabulary (local or cloud-based), and the ability to speak and be understood from a reasonable conversation distance. 

Intel® Technology for Smart Home devices

The Intel® Sound Analytic Engine is a dual DSP and neural network accelerator that provides silicon, algorithms, and a reference design microphone array built around complex, far-field signal processing algorithms that use high-dimensional microphone arrays to perform beamforming, echo cancellation, and noise reduction. This simplifies enabling speech across a range of form factors, allowing developers to add far-field voice, speech recognition, and high-quality acoustics to low-power devices. It addresses these user requirements by providing a building block for voice enabling that uses a silicon-based, Intel-developed Gaussian Network Accelerator (GNA).
Intel® Sound Analytic Engine provides you with a straightforward path to developing either a cloud-based voice recognition system or a large vocabulary local speech recognition system. It allows you to bring products to market quickly with a pre-established framework for a smart speaker design that can be integrated into many different form factors. 

Intel® Developer Kits for Creating Smart Home Products

Intel is introducing Smart Home Developer Kits to empower hardware and software developers to quickly bring new voice-enabled products to market. The primary technology in these kits is the Intel® Sound Analytic Engine. 
The first developer kit, the Intel® Speech Enabling Developer Kit, will be available for sale in October 2017. This kit contains the Intel® Sound Analytic Engine (a dual DSP with neural network accelerator), microphone arrays, a speaker mount, and a Raspberry Pi* connector cable to get you prototyping quickly with Alexa* Voice Services. Future developer kits will enable additional features, including imaging and sensors.

What you can Build for the Smart Home

There are two main categories of Smart Home devices that can utilize Intel® Sound Analytic Engine technology to enable speech understanding: smart appliances and smart speakers.
Transforming traditional appliances into “smart appliances” requires being able to interact with them directly. Rather than adding a keyboard and mouse or a touchscreen, which still requires users to physically interact with their devices, truly smart appliances should have voice as their primary interface. This will require far-field understanding, low latency, and low power for always-on capabilities. Low cost and low power are critical for the digital microphones and speakers that power speech interaction. Adding voice to existing form factors will also require flexible designs that can fit into established chassis, such as ovens, dishwashers, and tea kettles. When these requirements are satisfied, you will achieve true value for your users. For instance, enabling speech understanding on a coffee pot would allow you to start your coffee with your voice, freeing you to accomplish other morning tasks simultaneously. 
Smart speakers enabled with Personal Assistants are a rapidly growing segment of smart home products. Research from Parks Associates suggests that adoption doubled, from 5% to 10-11%, in the U.S. between 2015 and 2016, and total sales of smart speakers with Personal Assistants are estimated at 14 million units in 2016. This rapid adoption can be attributed to the intuitive interface and utility of these devices, for everything from playing music to accessing cloud-based services for information. The Intel® Speech Enabling Developer Kit can be leveraged to prototype smart speakers equipped with cloud-based Personal Assistants. The Intel® Sound Analytic Engine provides a straightforward path to developing either a cloud-based voice recognition system or a large-vocabulary local speech recognition system that delivers a high-quality speech recognition experience and a broad choice of pre- and post-processing capabilities. It allows you to get to market very quickly with a pre-established framework for a smart speaker design that can be integrated into many different form factors to enable a ubiquitous speech platform.

Conclusion

In the last five years, we have experienced an explosion of voice-enabled devices. However, we have just scratched the surface of what is possible when you add voice to the Smart Home. From dishwashers to washing machines, speech understanding can provide a clear competitive advantage by differentiating your product from that of your competitors. Not only will the developer kit help to differentiate your product, but it will also give you a head start on a future in which having a personal assistant is the household standard. It also provides a path to better understand your customers, more easily deliver cloud-based services, and maintain differentiators like best-in-class hardware. 
The Intel® Sound Analytic Engine platform can enable speech across a wide range of form factors by providing silicon, algorithms, and a reference design microphone array designed to meet the user requirements for Smart Home devices. It uses Intel’s silicon-hardened GNA (Gaussian Network Accelerator) to improve cloud-based and local speech recognition, acoustic context awareness, and power reduction. This technology is available in Intel® Smart Home Developer Kits to simplify prototyping. For more information about developer kits, check out the resources below.

Additional information

Links to:
  • How-to video
  • How-to workbook
  • Code samples
  • SmartHome.intel.com 

1 http://www.parksassociates.com/bento/shop/whitepapers/files/Parks%20Assoc%20%20Impact%20of%20Voice%20Whitepaper%202017.pdf

Get Started with IPsec Acceleration in the FD.io VPP Project


Introduction

This article looks at IPsec acceleration improvements in the FD.io VPP project based on the Data Plane Development Kit (DPDK) Cryptodev framework.

It gives a brief introduction to FD.io, VPP, DPDK, and the DPDK Cryptodev library, and shows how they are combined to provide enhanced IPsec performance and functionality.

It also shows how to install, build, configure, and run VPP with Cryptodev, and shows the type of performance gains that can be achieved.

Background

FD.io, the Fast Data Project, is an umbrella project for open source software aimed at providing high-performance networking solutions.

VPP, the Vector Packet Processing library, is one of the core projects in FD.io. VPP is an extensible framework that provides production-quality userspace switch/router functionality running on commodity CPUs.

The default implementation of VPP contains IPsec functionality, which relies on the OpenSSL library, as shown in Figure 1.

Graphic of a flowchart
Figure 1. Default IPsec implementation in VPP

While this implementation is feature rich, it is not optimal from a performance point of view. In VPP 17.01, the IPsec implementation was extended to include support for DPDK’s Cryptodev API to provide improved performance and features.

DPDK is a userspace library for fast packet processing. It is used by VPP as one of its optional I/O layers.

The DPDK Cryptodev library provides a crypto device framework for management and provisioning of hardware and software crypto poll mode drivers, and defines generic APIs that support a number of different Crypto operations.

The extended DPDK Cryptodev IPsec Implementation architecture in VPP is shown in Figure 2.

Graphic of a flowchart
Figure 2. DPDK Cryptodev IPsec Implementation in VPP

Enabling DPDK Cryptodev in VPP improves the performance for the most common IPsec encryption and authentication algorithms, including AES-GCM, which was not enabled in the default VPP implementation.

As such, enabling the DPDK Cryptodev feature not only increases performance but also provides access to additional options and flexibility such as:

  • Devices with Intel® QuickAssist Technology (Intel® QAT) for hardware offloading of all supported algorithms, including AES-GCM.
  • Intel® Multi-Buffer Crypto for IPsec Library for a heavily optimized software implementation.
  • Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) for heavily optimized software implementation of the AES-GCM algorithm.

The Intel QuickAssist Technology-based Crypto Poll Mode Driver provides support for a range of hardware accelerator devices. For more information see the DPDK documentation for Intel QuickAssist Technology.

Building VPP with DPDK Cryptodev

To try out VPP with DPDK and Cryptodev support you can download and build VPP as follows:

		$ git clone https://gerrit.fd.io/r/vpp
		$ cd vpp
		$ git checkout v17.04
		$ vpp_uses_dpdk_cryptodev_sw=yes make build-release -j

Note that the build command line enables DPDK Cryptodev support for software-optimized libraries and the specific VPP release. The build process will automatically download and build VPP, DPDK and the required software crypto libraries.

To start VPP with DPDK Cryptodev use the following command:

$ make run-release STARTUP_CONF=/vpp_test/vpp_conf/startup.conf

The “startup.conf” path should be changed to suit the specific location in the end user’s environment.

Testing VPP with DPDK Cryptodev

Figure 3 represents a typical test configuration.

Graphic of a flowchart
Figure 3. Test system setup

Pktgen, shown in Figure 3, is a software traffic generator based on DPDK that is used in this configuration to test the VPP IPsec implementation.

For this example we used the following hardware configuration:

  • 1 x Intel® Core™ i7-4770K series processor @ 3.5 GHz
  • 2 x Intel® Ethernet Controller 10 Gigabit 82599ES NICs, 10G, 2 ports
  • 1 x Intel® QuickAssist Adapter 8950

The VPP “startup.conf” configuration file used in the test setup is shown below:

		unix {
		  nodaemon
		  interactive
		  exec /vpp_test/vpp_conf/ipsec.cli
		}

		cpu {
		  main-core 0
		  corelist-workers 1-2
		}


		dpdk {
		  socket-mem 1024
		  uio-driver igb_uio

		  dev 0000:06:00.0
		  {
			workers 0
		  }
		  dev 0000:06:00.1
		  {
			workers 1
		  }


		  # Option 1: Leave both options below commented out
		  # to fall back to the default OpenSSL.

		  # Option 2: Multi-Buffer Crypto Library.
		  #enable-cryptodev
		  #vdev cryptodev_aesni_mb_pmd,socket_id=0

		  # Option 3: QAT hardware acceleration.
		  #enable-cryptodev
		  #dev 0000:03:01.1
		  #dev 0000:03:01.2
		}
		

The configuration file supports three options:

  1. The default VPP + OpenSSL option
  2. Cryptodev with the optimized Multi-Buffer software library
  3. Cryptodev with Intel QuickAssist Technology-based hardware acceleration

The example “startup.conf” file shown above sets up the system configuration. The devices “0000:06:00.0” and “0000:06:00.1” are the network ports, and “0000:03:01.1” and “0000:03:01.2” are the Intel QuickAssist Technology-based device VFs (Virtual Functions). These should be changed to match the end-user system where the test is replicated.

The VPP “ipsec.cli” file used for testing is shown below:

		set int ip address TenGigabitEthernet6/0/0 192.168.30.30/24
		set int promiscuous on TenGigabitEthernet6/0/0
		set int ip address TenGigabitEthernet6/0/1 192.168.30.31/24
		set int promiscuous on TenGigabitEthernet6/0/1

		ipsec spd add 1
		set interface ipsec spd TenGigabitEthernet6/0/1 1
		ipsec sa add 10 spi 1000 esp tunnel-src 192.168.1.1 tunnel-dst 192.168.1.2 crypto-key 4339314b55523947594d6d3547666b45 crypto-alg aes-cbc-128 integ-key 4339314b55523947594d6d3547666b45 integ-alg sha1-96
		ipsec policy add spd 1 outbound priority 100 action protect sa 10 local-ip-range 192.168.20.0-192.168.20.255 remote-ip-range 192.168.40.0-192.168.40.255
		ipsec policy add spd 1 outbound priority 90 protocol 50 action bypass

		ip route add 192.168.40.40/32 via 192.168.1.2 TenGigabitEthernet6/0/1
		set ip arp TenGigabitEthernet6/0/1 192.168.1.2 90:e2:ba:50:8f:19

		set int state TenGigabitEthernet6/0/0 up
		set int state TenGigabitEthernet6/0/1 up
		

This file sets up the network interfaces and the IPsec configuration.

In the configuration shown above, incoming packets on interface “TenGigabitEthernet6/0/0” with the destination IP “192.168.40.40” will be encrypted using the AES-CBC-128 algorithm and authenticated using the SHA1-96 algorithm. The packets are then sent out on the “TenGigabitEthernet6/0/1” interface. Again, this configuration should be adjusted to reflect the end-user’s environment.

Using the Pktgen traffic generator configured to send packets that match the IPsec configuration above (source 192.168.20.20, destination 192.168.40.40, port 80, and packet sizes of 64, 128, 256, 512, and 1024 bytes), performance data can be generated as shown in Figure 4. Note that the data shown is for illustration purposes only; actual values will depend on the hardware configuration and software versions. The “MB” data refers to the Multi-Buffer Crypto Library, which is an optimized software library.

Figure 4. Example throughput vs. packet size for different configurations

The illustrative data in Figure 4 shows a significant improvement using software-optimized libraries and an even larger improvement using hardware offloading, with the throughput being capped by the line rate rather than processing power.

Conclusion

This article gives an overview of the DPDK Cryptodev framework in VPP and shows how it can be used to improve performance and functionality in the VPP IPsec implementation.

About the Author

Radu Nicolau is a network software engineer with Intel. His work is currently focused on IPsec-related development of data plane functions and libraries. His contributions include enablement of the AES-GCM crypto algorithm in the VPP IPsec stack, IKEv2 initiator support for VPP, and inline IPsec enablement in DPDK.

Solve the SVD Problem for a Sparse Matrix with Intel® Math Kernel Library


Introduction

Intel® MKL now offers a parallel solution for large sparse SVD problems. The functionality is based on a subspace iteration method combined with recent techniques for estimating eigenvalue counts [1]. The idea of the underlying algorithm is to split the interval that contains all singular values into subintervals holding a close-to-equal number of values, and to apply a polynomial filter to each subinterval in parallel. This approach makes it possible to solve problems of size up to 10^6 and enables multiple levels of parallelism.
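Schematically (using illustrative notation that is not part of the package's interface), the spectrum-slicing idea described above can be written as:

		\[
		  [\sigma_{\min}, \sigma_{\max}] \;=\; \bigcup_{j=1}^{k} [a_{j-1}, a_j],
		  \qquad
		  \#\{\, i : \sigma_i \in [a_{j-1}, a_j] \,\} \;\approx\; \frac{n_\sigma}{k},
		\]

where \(n_\sigma\) is the total number of singular values of interest. Each subinterval \([a_{j-1}, a_j]\) is then handled independently by subspace iteration with a polynomial filter \(p_j\) chosen so that \(p_j(\sigma) \approx 1\) on \([a_{j-1}, a_j]\) and \(p_j(\sigma) \approx 0\) outside it, which is what makes the subproblems suitable for coarse-grained parallel execution.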

This functionality is now available for non-commercial use as a separate package that must be linked against an installed MKL version. The current version of the package provides two functions:

  • Shared-memory SVD, with OpenMP parallelization.
  • Cluster SVD, where the independent subproblems defined by each subinterval are parallelized with MPI and the matrix operations within each subinterval are parallelized with OpenMP.

Support for the trial package is currently limited to 64-bit Linux and Windows architectures, with a C interface only. These limitations will be removed in the future.

Product Overview

Problem statement: compute several of the largest/smallest singular values and the corresponding (left/right) singular vectors.
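In standard notation (not specific to the package), this is a partial SVD of a sparse matrix:

		\[
		  A \in \mathbb{R}^{m \times n} \ \text{(sparse)}, \qquad
		  A v_i = \sigma_i u_i, \quad A^{\top} u_i = \sigma_i v_i, \quad i = 1, \dots, k,
		\]

with \(\sigma_1 \ge \dots \ge \sigma_k\) the \(k\) requested largest (or smallest) singular values and \(u_i, v_i\) the corresponding left and right singular vectors; in typical use \(k \ll \min(m, n)\).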

Unlike the ?gesvd and p?gesvd functionality available in Intel® MKL LAPACK and ScaLAPACK [2,3], the new package supports a sparse input matrix and returns only a subset of the largest/smallest singular values, which allows larger problems to be solved more efficiently. The current implementation supports an input matrix in CSR format and dense singular vectors on output. The user must specify the number of largest/smallest singular values to find, as well as the desired solution tolerance. The number of requested singular values can be up to the size of the matrix; however, it is up to the user to ensure that enough space is available to store the dense singular vectors. On output, each MPI rank stores its own part of the singular values and singular vectors.
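Because the trial package's function names are not listed in this article, the sketch below does not call the solver; it only illustrates the CSR (three-array) input layout for a small example matrix. The array names and the printing loop are illustrative.

		#include <stdio.h>

		/* A 4x4 sparse matrix in CSR (three-array) form with 0-based indexing:
		 *
		 *   [ 1 0 2 0 ]
		 *   [ 0 3 0 0 ]
		 *   [ 0 0 4 5 ]
		 *   [ 6 0 0 7 ]
		 */
		int main(void)
		{
			double values[]   = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0 }; /* non-zeros, row by row   */
			int    columns[]  = { 0,   2,   1,   2,   3,   0,   3   }; /* column index of each    */
			int    rowIndex[] = { 0, 2, 3, 5, 7 };                     /* start of each row above */
			int    n = 4;

			for (int i = 0; i < n; i++)
				for (int j = rowIndex[i]; j < rowIndex[i + 1]; j++)
					printf("A[%d][%d] = %g\n", i, columns[j], values[j]);
			return 0;
		}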

The chart below shows the MPI scalability of the cluster SVD solver on a randomly generated matrix of size 4*10^5 with the 1000 largest singular values requested. It shows that close-to-linear scalability is achieved across many MPI processes.

The attached C examples demonstrate how to perform these computations.

Please let us know if you would like to try this evaluation package.

[1] E. Di Napoli, E. Polizzi, Y. Saad, Efficient Estimation of Eigenvalue Counts in an Interval

[2] https://software.intel.com/en-us/mkl-developer-reference-c-singular-value-decomposition-lapack-driver-routines

[3] https://software.intel.com/en-us/mkl-developer-reference-c-singular-value-decomposition-scalapack-driver-routines

 
