Channel: Intel Developer Zone Articles

Play audio from your Intel® Edison via Bluetooth* using Advanced Audio Distribution Profile (A2DP)


Requirements

  • A Bluetooth* audio device, such as a headset or speaker.

  • An Intel® Edison board connected to a Wi-Fi* network (see Step 3: Get Your Board Online).

  • A host computer on the same network with scp available, for copying music files over.

  • A terminal session to your board, via either the serial port or SSH.

Setup

Make your Bluetooth* audio device discoverable.

Type the following in a terminal connected to your board:

rfkill unblock bluetooth
bluetoothctl

Scan for devices.

scan on

Find your device and pair with it (replace the MAC address below with the MAC address of your device):

pair 78:24:AF:13:58:B9

Verify that your A2DP device (the LG* headset in this case) is recognized by PulseAudio as a sink device and that its sink name starts with bluez_sink:

pactl list sinks

Set the default PulseAudio sink with the following command, replacing the sink name with your device’s details:

pactl set-default-sink bluez_sink.00_18_6B_4E_A4_B8

Copy an audio file (*.wav, *.mp3, etc.) to the Intel® Edison board using scp, and play it using mplayer:

mplayer Wave4.wav


Benefits of Using Intel® Software Development Emulator


Introduction

New Intel processors introduce instruction set extensions that improve the performance or strengthen the security of an application. Extensions like Intel AVX and AVX2 [1] are used to improve performance, while the Intel SHA extensions [2] accelerate SHA computation to increase the security of an application.

What happens if developers want to create applications using these new instructions but their current hardware does not support them? How does a company justify buying new systems that support the new instructions without first knowing whether its applications can take advantage of them to improve performance?

The Intel® Software Development Emulator (Intel® SDE) is used to execute applications containing new instructions on systems that don’t support them.

This article discusses the benefits of using SDE to test code that uses new instructions.

What is the Intel Software Development Emulator (SDE)?

As its name implies, SDE is an emulator that allows code with new instruction sets to run on systems that don’t support those instructions. More information about SDE can be found at [3]. It should be noted that SDE is useful for assessing functionality, not performance, as it runs programs many times more slowly than native hardware. SDE can be downloaded here.

To test an application with new instruction sets using SDE, the application first needs to be compiled with a compiler that supports those instruction sets. For example, to compile applications containing AVX2 instructions, use the Intel compiler 14.0, gcc 4.7, or Microsoft* Visual Studio*, or later versions of these compilers. SDE lists all instructions in assembly language, covering not only the user application but also the libraries and the kernel.

To display the full list of SDE options, use the following command at the command prompt:

sde -help

Figure 1. List all SDE options

To display the longer list of SDE options type:

sde -help -long

How to Use SDE

This article will show how to use the two most common SDE options. A video showing where to download and how to install and run SDE can be found at [4].

mix

Figure 2. mix option

Figure 2 shows the mix option with the default or a user-defined output file. The -mix and -omix options list all dynamic instructions that are executed, along with instruction length, instruction category, and ISA extension grouping. Run SDE with this option by typing the following command at the command prompt:

        sde.exe -mix -- <application name>

The result will be written to a file called mix.out.  To specify a different name for the output file, use the following command:

        sde.exe -omix <user-defined output file name> -- <application name>

ast

Figure 3. ast option

Figure 3 shows the ast option with the default or a user-defined output file. Use the -ast or -oast options to detect transitions between SSE and AVX/AVX2 instructions. This option is very useful because such transitions cost many execution cycles; reducing them improves the performance of an application. Run SDE with this option by typing the following command at the command prompt:

        sde.exe -ast -- <application name>

The result will be written to a file called avx-sse-transition.out.  To specify a different name for the output file, use the following command:

         sde.exe -oast <user-defined output file name> -- <application name>

Note: The options can also be combined into one single command as follows:

sde.exe -mix -ast -- <application name>

or

sde.exe -omix <output mix file name> -oast <output ast file name> -- <application name>

A closer look at the output of the ‘mix’ Option

The output file produced by the mix option contains a great deal of information, not all of which is explained in this article. Let’s focus on certain portions of the file.

Figure 4. This portion of the output file shows that the application (linkpack_xeon64) has 28 threads currently running.

Figure 5. The output shows a portion of thread 0 (TID 0), listing the instructions used along with their instruction types (AVX, FMA, and SSE).

Figure 6. At the end of the thread there is a summary of which instructions were used in the threads along with how frequently they were called.

Figure 7. At the end of the output file, there is a summary of all instructions in the applications and libraries that are used by the application.

A closer look at the output of the ‘ast’ Option

The output from running SDE using the ast option helps identify transitions from SSE to AVX and vice versa.

Figure 8.  SSE-AVX and AVX-SSE transitions do not exist

Figure 8 shows the case where there are no SSE-AVX or AVX-SSE transitions. Transitions between SSE and AVX consume many valuable execution cycles, sometimes as many as 20,000, so it is important to reduce them. If transitions do exist [7], the transition output file will look like Figure 9.

Figure 9.  SSE-AVX and AVX-SSE transitions exist

There is a good article about how to avoid these transition penalties at [9].

Benefits of using SDE

Detecting Instructions

SDE counts only the instructions that are executed (dynamic) while the application is running, not the instructions present in the code (static). This makes it a good way to debug an application when certain instructions are expected to execute within certain portions of the code: if the expected instructions are not detected in the block of addresses corresponding to that code, something unexpected must have happened, such as an unintended branching condition. SDE provides the start-address and stop-address options to handle this situation. Note that these options are not documented [5]; hopefully they will be documented in future SDE releases.

Note that running an application with different options or inputs will trigger different behavior and dynamic execution, so a single run of an application cannot tell the whole story. It is always a good idea to run an application under different workloads to observe its behavior.

Potential Performance Improvement

SDE can detect SSE-AVX and AVX-SSE transitions.  By reducing these transitions, performance can be improved.

Checking for Bad Pointers and Data Misalignment

SDE also has the ability to check for bad pointers and data misalignment. Below is a snapshot of the options for these two features from the SDE documentation [8].

Figure 10. Options for debugging

Conclusion

SDE is used to test applications that utilize new Intel instruction sets in the absence of hardware supporting those instructions; thus it helps assess whether an application might benefit from a new or future Intel platform. It should be noted that SDE runs considerably more slowly than native hardware and is not intended to provide insight into future performance. SDE dynamically counts the instructions that are executed (not the instructions present in the code), as well as SSE-AVX and AVX-SSE transitions; these features can be used for debugging and optimization. SDE can also aid debugging by detecting bad pointers and data misalignment. These are only a few of the features offered by this emulation environment: we invite you to explore more by looking at the documentation included in the References section of this article.

References

[1] http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

[2] http://en.wikipedia.org/wiki/Intel_SHA_extensions

[3] https://software.intel.com/en-us/articles/intel-software-development-emulator

[4] http://goparallel.sourceforge.net/installing-running-intel-software-development-emulator/

[5] https://software.intel.com/en-us/forums/topic/533825

[6] http://en.wikipedia.org/wiki/Intrinsic_function

[7] https://software.intel.com/en-us/forums/topic/538142

[8] https://software.intel.com/en-us/articles/intel-software-development-emulator#BASIC

[9] https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

 

Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. 
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. Intel, the Intel logo, Intel Core, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. Copyright © 2015 Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.

Bringing SSL to Arduino* on Galileo Through wolfSSL*


The Intel® Galileo development board is an Arduino*-certified development and prototyping board. Built on the Yocto 1.4 Poky Linux* release, Galileo merges an Arduino development environment with a complete Linux-based computer system, allowing enthusiasts to incorporate Linux system calls and OS-provided services in their Arduino sketches.

One long-standing limitation of the Arduino platform has been its lack of SSL support. Without SSL, Arduino-based devices cannot securely transmit data using HTTPS and are forced to communicate insecurely over plain HTTP. To work around this limitation, devices participating in the build-out of the Internet of Things must rely on secondary devices that serve as bridges to the internet: the Arduino device communicates over HTTP with the bridge, which in turn communicates with the internet-based service over HTTPS.

This solution works well for devices that have a fixed network location, but it does require additional hardware and introduces a concentration point for multiple devices that itself may be vulnerable to attack. For mobile devices that may occasionally rely on public wireless networks, this approach can be entirely impractical. The best level of protection for connected devices is achieved with SSL support directly on the device itself.

On Galileo an Arduino sketch is just a C++ program that is cross-compiled into machine code and executed as a process that is managed by the operating system. That means that it has access to the same system resources as any other compiled program, and specifically that program can be linked against arbitrary, compiled libraries. The implication here is that adding SSL support is as simple as linking the Arduino sketch to an existing SSL library.

This paper examines two methods for adding SSL support to Arduino sketches running on Galileo via the wolfSSL library from wolfSSL, Inc.* (formerly named the CyaSSL library). The wolfSSL library is a lightweight SSL/TLS library that is designed for resource-constrained environments and embedded applications, and is distributed under the GPLv2 license.

This paper looks at two methods for linking the wolfSSL library to an Arduino sketch, but both of them follow the same basic steps:

  1. Build wolfSSL for Yocto
  2. Install the wolfSSL shared library onto your Galileo image
  3. Modify the compile patterns for the Arduino IDE for Galileo
  4. Install the wolfSSL build files onto the system hosting the Arduino IDE

This procedure is moderately complex and does require a firm grasp of the Linux environment, shell commands, software packages and software build procedures, as well as methods of transferring files to and from a Linux system. While this paper does go into some detail on specific Linux commands, it is not a step-by-step instruction manual and it assumes that the reader knows how to manipulate files on a Linux system.

These procedures should work on both Galileo and Galileo 2 boards.

Method 1: Dynamic linking

In the dynamic linking method the Arduino sketch is dynamically linked with the shared object library, libwolfssl.so. This method is the easiest to program for since the sketch just calls the library functions directly.

There are disadvantages to this approach, however:

  • The Arduino IDE for Galileo uses a single configuration for compiling all sketches, so the linker will put a reference to libwolfssl.so in the resulting executable whether or not it’s needed by a sketch. This is not a problem if the target Galileo system has the wolfSSL library installed on it, but if any sketch is compiled for another system that does not have the library then those sketches will not execute.
  • The system hosting the Arduino IDE for Galileo must have the cross-compiled wolfSSL library installed into the Arduino IDE build tree.

Method 2: Dynamic loading

In the dynamic loading method the Arduino sketch is linked with the dynamic linking loader library, libdl. The wolfSSL library and its symbols are loaded dynamically during execution using dlopen() and dlsym(). This method is more tedious to program for since the function names cannot be resolved directly by the linker and must be explicitly loaded by the code and saved as function pointers.

The advantages over the dynamic linking method are:

  • libdl is part of the Galileo SD card image, so arbitrary sketches compiled by the modified IDE will still run on other Galileo systems.
  • The system hosting the Arduino IDE for Galileo only needs to have the wolfSSL header files installed into the build tree.
  • Any dynamic library is available to the Arduino sketch with just this single modification.

The first step in bringing SSL support to the Arduino environment is to build the wolfSSL library for Yocto using uClibc as the C library. This is accomplished using the cross compiler that is bundled with Intel’s Arduino IDE for Linux. This step must be performed on a Linux system.

There have been multiple releases of the IDE since the original Galileo release and any of them will do, but because path names have changed from release to release this document assumes that you will be using the latest build as of this writing, which is the Intel bundle version 1.0.4 with Arduino IDE version 1.6.0.

Software archive:

http://www.intel.com/content/www/us/en/do-it-yourself/downloads-and-documentation.html

Target file:

Arduino Software 1.6.0 - Intel 1.0.4 for Linux

Choose the 32-bit or 64-bit archive, whichever is correct for your Linux distribution.

Configuring the cross-compiler

If you have already used this version of the IDE to build sketches for your Galileo device then it has already been configured properly and you can skip this task.

If you have not built a sketch with it yet, then you will need to run the installation script in order to correctly set the path names in the package configuration files. This script, install_script.sh, is located in the hardware/tools/i586 directory inside the root of your IDE package. Run it with no arguments:

~/galileo/arduino-1.6.0+Intel/hardware/tools/i586$ ./install_script.sh
Setting it up.../tmp/tmp.7FGQfwEaNz/relocate_sdk.sh /nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/relocate_sdk.sh
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/ld-linux-x86-64.so.2
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/libpthread.so.0
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/libnss_compat.so.2
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/librt.so.1
link:/nfs/common/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/x86_64-pokysdk-linux/lib/libresolv.so.2
…
SDK has been successfully set up and is ready to be used.

The cross-compiler is now ready for use.

Downloading the wolfSSL source

To build the wolfSSL library for Galileo you need to download the source code from wolfSSL, Inc. As of this writing, the latest version is 3.4.0 and is distributed as a Zip archive. Unzip the source into a directory of your choosing.

Building the library

In order to build the library, you must first set up your shell environment to reference the cross compiler. The environment setup files assume a Bourne shell environment so you must perform these steps in an appropriate and compatible shell such as sh or bash. Starting from a clean shell environment is strongly recommended.

First, source the environment setup file from the Intel Arduino IDE. Be sure to use the path to your Intel Arduino IDE instead of the path given in the example:

~/src/wolfssl-3.4.0$ . ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/environment-setup-i586-poky-linux-uclibc

This step will not generate any output.

Now you are ready to run the configure script for wolfSSL. You must provide configure with a number of options to properly initialize it for a cross-compile.

~/src/wolfssl-3.4.0$ ./configure --prefix=$HOME/wolfssl --host=i586-poky-linux-uclibc \
        --target=i586-poky-linux-uclibc

Note that you must supply absolute paths to the configure script, and cannot use ~ as a shortcut for your home directory. Use the $HOME shell variable instead.

The --prefix option tells the build system where to install the library. Since you won’t actually be installing the library on this system, any directory will do. This example shows it going in $HOME/wolfssl.

The --host and --target options tell the build system that this will be a cross-compile, targeting the architecture identified as i586-poky-linux-uclibc.

The configure script will generate a lot of output. When it finishes, assuming there are no errors, you can build the software using “make”.

~/src/wolfssl-3.4.0$ make
make[1]: Entering directory `/nfs/users/johnm/src/wolfssl-3.4.0'
  CC       wolfcrypt/test/testsuite_testsuite_test-test.o
  CC       examples/client/testsuite_testsuite_test-client.o
  CC       examples/server/testsuite_testsuite_test-server.o
  CC       examples/client/tests_unit_test-client.o
  CC       examples/server/tests_unit_test-server.o
  CC       wolfcrypt/src/src_libwolfssl_la-hmac.lo
  CC       wolfcrypt/src/src_libwolfssl_la-random.lo
…
  CCLD     examples/client/client
  CCLD     examples/echoserver/echoserver
  CCLD     testsuite/testsuite.test
  CCLD     tests/unit.test
make[1]: Leaving directory `/nfs/users/johnm/src/wolfssl-3.4.0'

And then install it to the local/temporary location via “make install”:

~/src/wolfssl-3.4.0$ make install

Your library will now be in the directory you specified to the --prefix option of configure, in the lib subdirectory:

~/src/wolfssl-3.4.0$ cd $HOME/wolfssl/lib
~/wolfssl/lib$ ls -CFs
total 188
  4 libwolfssl.la*    0 libwolfssl.so.0@        4 pkgconfig/
  0 libwolfssl.so@  180 libwolfssl.so.0.0.0*

You’re now ready to install the wolfSSL library onto Galileo.

There are two general approaches for installing the wolfSSL package onto Galileo: the first is to copy the files directly to the Galileo filesystem image, and the second is to copy the files onto a running Galileo system over a network connection. In either case, you need to know which image your system is running: the SD-Card Linux image or the IoT Developer Kit image.

For Galileo running the SD-Card Linux image

The SD-Card Linux image is the original system image for Galileo boards. It is a very minimal system image which is less than 312 MB in size. It lacks development tools (e.g., there is no compiler) and advanced Linux utilities. As of this writing, the latest version of the SD-Card image is 1.0.4.

Software archive:

http://www.intel.com/content/www/us/en/do-it-yourself/downloads-and-documentation.html

Target file:

SD-Card Linux Image (SDCard.1.0.4.tar.bz2)

Both installation methods are discussed below, but installing directly to the Galileo filesystem image is preferred because you have more powerful utilities at your disposal.

Installing wolfSSL to the filesystem image

This method is easier and less error-prone than the network method because you have file synchronization tools at your disposal and none of the added complexities of networking. All that is necessary is to mount the Galileo filesystem image as a filesystem on the build machine; then you can use rsync to copy the wolfSSL package into place. You can either copy the image file to your build system, or mount the microSD card containing the image directly on your Linux system using a card reader.

In the Galileo SD Card filesystem tree, the main Galileo filesystem image is called image-full-galileo-clanton.ext3 and it can be mounted using the loop device. Create a mount point (directory) on your build system—the example below uses /mnt/galileo—and then use the mount command to mount it:

~/wolfssl$ cd /mnt
/mnt$ sudo mkdir galileo
/mnt$ sudo mount -t ext3 -o loop /path/to/image-full-galileo-clanton.ext3 /mnt/galileo

The Galileo filesystem should now be visible as /mnt/galileo.

Use rsync to copy the shared library and its symlinks into place. They should be installed into /usr/lib on your Galileo system:

/mnt$ rsync -a $HOME/wolfssl/lib/lib* /mnt/galileo/usr/lib

Be sure to replace $HOME/wolfssl with the actual location of your local wolfSSL build.

Installing wolfSSL over the network

For this method, the Galileo system must be up and running with an active network connection and you will need to know its IP address. Because Galileo lacks file synchronization utilities such as rsync, files will have to be copied using tar to ensure that symbolic links are handled correctly.

First, use cd to switch to the lib subdirectory of your local wolfSSL build.

~/wolfssl$ cd $HOME/wolfssl/lib

Now use tar to create an archive of the shared library and its symlinks, then copy it to Galileo with scp.

~/wolfssl/lib$ tar cf /tmp/wolfssl.tar lib*
~/wolfssl/lib$ cd /tmp
/tmp$ scp wolfssl.tar root@192.168.1.2:/tmp
root@192.168.1.2’s password:

Be sure to enter the IP address of your Galileo instead of the example.

Now log in to your Galileo device and untar the archive:

/tmp$ ssh root@192.168.1.2
root@192.168.1.2’s password:
root@clanton:~# cd /usr/lib
root@clanton:/usr/lib# tar xf /tmp/wolfssl.tar

For Galileo running the IoT Developer Kit image

The IoT Developer Kit image is a much larger and more traditional Linux system image which includes developer tools and many useful system utilities and daemons. It is distributed as a raw disk image which includes both FAT32 and ext3 disk partitions, and it must be direct-written to an SD card.

Software archive:

https://software.intel.com/en-us/iot/downloads

Target file:

iotdk-galileo-image.bz2

Both installation methods are discussed below.

As of this writing, you also need to replace the uClibc library on your Developer Kit image with the one bundled with your Intel Arduino IDE. Due to differences in the build procedures used for these two copies of the library, not all of the symbols exported in the IDE version are present in the Developer Kit version, and that can lead to runtime crashes of Arduino sketches. The wolfSSL library, in particular, depends on one of these symbols that is missing from the Developer Kit’s build of uClibc; if you do not replace the library on the Galileo system, attempts to use libwolfssl will fail.

Installing wolfSSL to the filesystem image

This method is easiest if you connect an SD card reader to your Linux system. Since the Developer Kit image contains an ext3 partition, most Linux distributions will automatically mount it for you, typically under /media or /mnt. Use the df command with the -T option to help you determine the mount point.

~$ df -T | grep ext3
/dev/sde2      ext3        991896  768032    172664  82% /media/johnm/048ce1b1-be13-4a5d-8352-2df03c0d9ed8

In this case, the mount point is /media/johnm/048ce1b1-be13-4a5d-8352-2df03c0d9ed8:

~$ /bin/ls -CFs /media/johnm/048ce1b1-be13-4a5d-8352-2df03c0d9ed8
total 96
 4 bin/   4 home/          4 media/          4 proc/    4 sys/   4 www/
 4 boot/  4 lib/           4 mnt/            4 run/     4 tmp/
 4 dev/   4 lib32/         4 node_app_slot/  4 sbin/    4 usr/
 4 etc/   16 lost+found/   4 opt/            4 sketch/  4 var/

The libraries used by Arduino sketches are kept in /lib32. Use cd to change to that directory and copy the wolfSSL shared libraries and their symlinks into this directory using rsync in order to preserve the symbolic links.

~/wolfssl$ cd /path-to-mountpoint/lib32
lib32$ rsync -a $HOME/wolfssl/lib/lib* .

Be sure to replace path-to-mountpoint with the actual mount point for your SD card’s Galileo filesystem.

Now, you need to replace the Developer Kit’s uClibc library with the one from your Intel Arduino IDE package. Instead of removing it or overwriting it, the following procedure will simply rename it, effectively disabling the original copy of the library but without permanently deleting it:

lib32$ mv libuClibc-0.9.34-git.so libuClibc-0.9.34-git.so.dist
lib32$ cp ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/i586-poky-linux-uclibc/lib/libuClibc-0.9.34-git.so .

Remember to use your actual path to your Intel Arduino IDE in place of the example one.

Installing wolfSSL over the network

For this method, the Galileo system must be up and running with an active network connection and you will need to know its IP address. Because Galileo lacks file synchronization utilities such as rsync, files will have to be copied using tar to ensure that symbolic links are handled correctly.

First, use cd to switch to the lib subdirectory of your local wolfSSL build.

~/wolfssl$ cd $HOME/wolfssl/lib

Now use tar to create an archive of the shared library and its symlinks, then copy it to Galileo with scp.

~/wolfssl/lib$ tar cf /tmp/wolfssl.tar lib*
~/wolfssl/lib$ cd /tmp
/tmp$ scp wolfssl.tar root@192.168.1.2:/tmp
root@192.168.1.2’s password:

Be sure to enter the IP address of your Galileo instead of the example.

Now log in to your Galileo device and untar the archive:

/tmp$ ssh root@192.168.1.2
root@192.168.1.2’s password:
root@quark:~# cd /lib32
root@quark:/lib32# tar xf /tmp/wolfssl.tar

Next, you need to replace the Developer Kit’s uClibc library with the one from your Intel Arduino IDE package. Instead of removing it or overwriting it, the following procedure will simply rename it, effectively disabling the original copy of the library but without permanently deleting it (this will also prevent the actively running sketch from crashing):

root@quark:/lib32$ mv libuClibc-0.9.34-git.so libuClibc-0.9.34-git.so.dist

Log out of your Galileo system and use scp to copy the library from your Intel Arduino IDE to your Galileo:

~$ scp ~/galileo/arduino-1.6.0+Intel/hardware/tools/i586/sysroots/i586-poky-linux-uclibc/lib/libuClibc-0.9.34-git.so root@192.168.1.2:/lib32

Remember to use your actual path to your Intel Arduino IDE in place of the example one, and your Galileo’s IP address.

To compile sketches that use the wolfSSL library, you need to modify the compile patterns for the Arduino IDE for Galileo. The specific modification depends on the method you have chosen for linking to libwolfssl, but in either case the compile patterns live inside hardware/intel/i586-uclibc for Intel 1.0.4 with Arduino IDE 1.5.3 and later.

Modifying the compile patterns

Locating the compile patterns file

The file that holds your compile patterns is named platform.txt. You will be editing the line “recipe.c.combine.pattern”, which looks similar to this:

## Combine gc-sections, archives, and objects
recipe.c.combine.pattern="{compiler.path}{compiler.c.elf.cmd}" {compiler.c.elf.flags} -march={build.mcu} -o "{build.path}/{build.project_name}.elf" {object_files} "{build.path}/{archive_file}" "-L{build.path}" -lm -lpthread

Dynamic linking

If you are using the dynamic linking method, then you need to tell the linker to add libwolfssl to the list of libraries when linking the executable. Add -lwolfssl to the end of the line.

## Combine gc-sections, archives, and objects
recipe.c.combine.pattern="{compiler.path}{compiler.c.elf.cmd}" {compiler.c.elf.flags} -march={build.mcu} -o "{build.path}/{build.project_name}.elf" {object_files} "{build.path}/{archive_file}" "-L{build.path}" -lm -lpthread -lwolfssl

Be sure not to add any line breaks.

Dynamic loading

In the dynamic loading method, you need to tell the linker to add the dynamic loader library to the list of libraries. Add -ldl to the end of the line.

## Combine gc-sections, archives, and objects
recipe.c.combine.pattern="{compiler.path}{compiler.c.elf.cmd}" {compiler.c.elf.flags} -march={build.mcu} -o "{build.path}/{build.project_name}.elf" {object_files} "{build.path}/{archive_file}" "-L{build.path}" -lm -lpthread -ldl

Be sure not to add any line breaks.

The last step before you can compile sketches is to install the wolfSSL build files into the Arduino IDE for Galileo build tree. For the 1.6.0 release, the build tree is in hardware/tools/i586/i586-poky-linux-uclibc. In there you will find a UNIX-like directory structure containing directories etc, lib, usr, and var.

Installing the wolfSSL header files

Whether you are using the dynamic loading or dynamic linking method, you will need to have the wolfSSL header files installed where the Arduino IDE can find them so that you can include them in your sketches with:

#include <wolfssl/ssl.h>

You can find the header files in the local installation of wolfSSL that you created in Step 1, in the include subdirectory. For backward compatibility reasons, the wolfSSL distribution includes header files in both include/cyassl and include/wolfssl.

The wolfSSL header files must be installed into usr/include:

Copying the wolfssl header files into the includes directory
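The header copy can be done with cp. In the commands below, the wolfSSL install prefix ($HOME/wolfssl-install) is an assumed example from Step 1; substitute your actual install prefix and the path to your Arduino IDE:

```shell
# Assumed paths -- adjust WOLFSSL_LOCAL to the --prefix you used in Step 1,
# and BUILD_TREE to your Arduino IDE for Galileo installation.
WOLFSSL_LOCAL=$HOME/wolfssl-install
BUILD_TREE=hardware/tools/i586/i586-poky-linux-uclibc

cp -r "$WOLFSSL_LOCAL/include/wolfssl" "$BUILD_TREE/usr/include/"
cp -r "$WOLFSSL_LOCAL/include/cyassl"  "$BUILD_TREE/usr/include/"
```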

Installing the wolfSSL libraries

If you are using the dynamic linking method, then you must also install the cross-compiled libraries into usr/lib. You can skip this step if you are using the dynamic loading method.

The libraries are in the local installation that was created in Step 1, inside the lib directory. From there copy:

libwolfssl.la
libwolfssl.so
libwolfssl.so.*

All but one of the shared object files will be symlinks, but it is okay for them to be copied as just regular files.

Installing the wolfssl library files into lib
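As a sketch, the library copy looks like this (again, $HOME/wolfssl-install is an assumed install prefix; the shell glob picks up libwolfssl.so and its versioned names):

```shell
# Assumed paths -- adjust to your wolfSSL install prefix and Arduino IDE location.
WOLFSSL_LOCAL=$HOME/wolfssl-install
BUILD_TREE=hardware/tools/i586/i586-poky-linux-uclibc

# Copies libwolfssl.la plus libwolfssl.so and libwolfssl.so.*; cp dereferences
# symlinks by default, so they land as regular files, which is fine here.
cp "$WOLFSSL_LOCAL"/lib/libwolfssl.la  "$BUILD_TREE/usr/lib/"
cp "$WOLFSSL_LOCAL"/lib/libwolfssl.so* "$BUILD_TREE/usr/lib/"
```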

The following example sketches show how to interact with the wolfSSL library using both the dynamic linking and dynamic loading methods. They perform the same function: connect to a target web server and fetch a web page using SSL. The page source is printed to the Arduino IDE for Galileo’s serial console.

These sketches are licensed under the Intel Sample Source Code license. In addition to browsing the source here, you can download them directly.

Dynamic linking example

/*
Copyright 2015 Intel Corporation All Rights Reserved.

The source code, information and material ("Material") contained herein is owned
by Intel Corporation or its suppliers or licensors, and title to such Material
remains with Intel Corporation or its suppliers or licensors. The Material
contains proprietary information of Intel or its suppliers and licensors. The
Material is protected by worldwide copyright laws and treaty provisions. No part
of the Material may be used, copied, reproduced, modified, published, uploaded,
posted, transmitted, distributed or disclosed in any way without Intel's prior
express written permission. No license under any patent, copyright or other
intellectual property rights in the Material is granted to or conferred upon
you, either expressly, by implication, inducement, estoppel or otherwise. Any
license under such intellectual property rights must be express and approved by
Intel in writing.

Include any supplier copyright notices as supplier requires Intel to use.

Include supplier trademarks or logos as supplier requires Intel to use,
preceded by an asterisk. An asterisked footnote can be added as follows: *Third
Party trademarks are the property of their respective owners.

Unless otherwise agreed by Intel in writing, you may not remove or alter this
notice or any other notice embedded in Materials by Intel or Intel's suppliers
or licensors in any way.
*/

#include <LiquidCrystal.h>
#include <wolfssl/ssl.h>
#include <Ethernet.h>
#include <string.h>

const char server[]= "www.example.com"; // Set this to a web server of your choice
const char req[]= "GET /Main_Page HTTP/1.0\r\n\r\n"; // Get the root page

int repeat;

int wolfssl_init ();
int client_send (WOLFSSL *, char *, int, void *);
int client_recv (WOLFSSL *, char *, int, void *);

LiquidCrystal lcd(8, 9, 4, 5, 6, 7);

EthernetClient client;


WOLFSSL_CTX *ctx= NULL;
WOLFSSL *ssl= NULL;
WOLFSSL_METHOD *method= NULL;

void setup() {
	Serial.begin(9600);
	Serial.println("Initializing");

	lcd.begin(16,2);
	lcd.clear();

	if ( wolfssl_init() == 0 ) goto fail;

	Serial.println("OK");

	// Set the repeat count to a maximum of 5 times so that we aren't
	// fetching the same URL over and over forever.

	repeat= 5;
	return;

fail:
	Serial.print("wolfSSL setup failed");
	repeat= 0;
}

int wolfssl_init ()
{
	char err[17];

	// Create our SSL context

	method= wolfTLSv1_2_client_method();
	ctx= wolfSSL_CTX_new(method);
	if ( ctx == NULL ) return 0;

	// Don't do certification verification
	wolfSSL_CTX_set_verify(ctx, SSL_VERIFY_NONE, 0);

	// Specify callbacks for reading to/writing from the socket (EthernetClient
	// object).

	wolfSSL_SetIORecv(ctx, client_recv);
	wolfSSL_SetIOSend(ctx, client_send);

	return 1;
}

int client_recv (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int i= 0;

	// Read a byte while one is available, and while our buffer isn't full.

	while ( client.available() > 0 && i < sz) {
		buf[i++]= client.read();
	}

	return i;
}

int client_send (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int n= client.write((byte *) buf, sz);
	return n;
}

void loop() {
	char errstr[81];
	char buf[256];
	int err;

	// Repeat until the repeat count is 0.

	if (repeat) {
		if ( client.connect(server, 443) ) {
			int bwritten, bread, totread;

			Serial.print("Connected to ");
			Serial.println(server);

			ssl= wolfSSL_new(ctx);
			if ( ssl == NULL ) {
				err= wolfSSL_get_error(ssl, 0);
				wolfSSL_ERR_error_string_n(err, errstr, 80);
				Serial.print("wolfSSL_new: ");
				Serial.println(errstr);
			}

			Serial.println(req);
			bwritten= wolfSSL_write(ssl, (char *) req, strlen(req));
			Serial.print("Bytes written= ");
			Serial.println(bwritten);

			if ( bwritten > 0 ) {
				totread= 0;

				while ( client.available() || wolfSSL_pending(ssl) ) {
					bread= wolfSSL_read(ssl, buf, sizeof(buf)-1);
					totread+= bread;

					if ( bread > 0 ) {
						buf[bread]= '\0';
						Serial.print(buf);
					} else {
						Serial.println();
						Serial.println("Read error");
					}
				}

				Serial.print("Bytes read= ");
				Serial.println(totread);
			}

			if ( ssl != NULL ) wolfSSL_free(ssl);

			client.stop();
			Serial.println("Connection closed");
		}

		--repeat;
	}

	// Be polite by sleeping between iterations

	delay(5000);
}

Dynamic loading example

/*
Copyright 2015 Intel Corporation All Rights Reserved.

The source code, information and material ("Material") contained herein is owned
by Intel Corporation or its suppliers or licensors, and title to such Material
remains with Intel Corporation or its suppliers or licensors. The Material
contains proprietary information of Intel or its suppliers and licensors. The
Material is protected by worldwide copyright laws and treaty provisions. No part
of the Material may be used, copied, reproduced, modified, published, uploaded,
posted, transmitted, distributed or disclosed in any way without Intel's prior
express written permission. No license under any patent, copyright or other
intellectual property rights in the Material is granted to or conferred upon
you, either expressly, by implication, inducement, estoppel or otherwise. Any
license under such intellectual property rights must be express and approved by
Intel in writing.

Include any supplier copyright notices as supplier requires Intel to use.

Include supplier trademarks or logos as supplier requires Intel to use,
preceded by an asterisk. An asterisked footnote can be added as follows: *Third
Party trademarks are the property of their respective owners.

Unless otherwise agreed by Intel in writing, you may not remove or alter this
notice or any other notice embedded in Materials by Intel or Intel's suppliers
or licensors in any way.
*/

#include <dlfcn.h>
#include <wolfssl/ssl.h>
#include <Ethernet.h>
#include <string.h>

/*
Set this to the location of your wolfssl shared library. By default you
shouldn't need to specify a path unless you put it somewhere other than
/usr/lib (SD-Card image) or /lib32 (IoT Developer Kit image).
*/
#define WOLFSSL_SHLIB_PATH "libwolfssl.so"

const char server[]= "www.example.com"; // Set this to a web server of your choice
const char req[]= "GET / HTTP/1.0\r\n\r\n"; // Get the root page
int repeat;

int wolfssl_dlload ();
int wolfssl_init ();
int client_send (WOLFSSL *, char *, int, void *);
int client_recv (WOLFSSL *, char *, int, void *);

void *handle;

EthernetClient client;


WOLFSSL_CTX *ctx= NULL;
WOLFSSL *ssl= NULL;
WOLFSSL_METHOD *method= NULL;

typedef struct wolfssl_handle_struct {
	WOLFSSL_METHOD *(*wolfTLSv1_2_client_method)();
	WOLFSSL_CTX *(*wolfSSL_CTX_new)(WOLFSSL_METHOD *);
	void (*wolfSSL_CTX_set_verify)(WOLFSSL_CTX *, int , VerifyCallback);
	int (*wolfSSL_connect)(WOLFSSL *);
	int (*wolfSSL_shutdown)(WOLFSSL *);
	int (*wolfSSL_get_error)(WOLFSSL *, int);
	void (*wolfSSL_ERR_error_string_n)(unsigned long, char *, unsigned long);
	WOLFSSL *(*wolfSSL_new)(WOLFSSL_CTX *);
	void (*wolfSSL_free)(WOLFSSL *);
	void (*wolfSSL_SetIORecv)(WOLFSSL_CTX *, CallbackIORecv);
	void (*wolfSSL_SetIOSend)(WOLFSSL_CTX *, CallbackIORecv);
	int (*wolfSSL_read)(WOLFSSL *, void *, int);
	int (*wolfSSL_write)(WOLFSSL *, void *, int);
	int (*wolfSSL_pending)(WOLFSSL *);
} wolfssl_t;

wolfssl_t wolf;

void setup() {
	Serial.begin(9600);
	Serial.println("Initializing");

	if ( wolfssl_dlload() == 0 ) goto fail;
	if ( wolfssl_init() == 0 ) goto fail;

	// Set the repeat count to a maximum of 5 times so that we aren't
	// fetching the same URL over and over forever.

	repeat= 5;
	return;

fail:
	Serial.print("wolfSSL setup failed");
	repeat= 0;
}

int wolfssl_init ()
{
	char err[17];

	// Create our SSL context

	method= wolf.wolfTLSv1_2_client_method();
	ctx= wolf.wolfSSL_CTX_new(method);
	if ( ctx == NULL ) return 0;

	// Don't do certificate verification
	wolf.wolfSSL_CTX_set_verify(ctx, SSL_VERIFY_NONE, 0);

	// Specify callbacks for reading to/writing from the socket (EthernetClient
	// object).

	wolf.wolfSSL_SetIORecv(ctx, client_recv);
	wolf.wolfSSL_SetIOSend(ctx, client_send);

	return 1;
}

int wolfssl_dlload ()
{
	// Dynamically load our symbols from libwolfssl.so

	char *err;

	// goto is useful for constructs like this, where we need everything to succeed or
	// it's an overall failure and we abort. If just one of these fails, print an error
	// message and return 0.

	handle= dlopen(WOLFSSL_SHLIB_PATH, RTLD_NOW);
	if ( handle == NULL ) {
		err= dlerror();
		goto fail;
	}

	wolf.wolfTLSv1_2_client_method= (WOLFSSL_METHOD *(*)()) dlsym(handle, "wolfTLSv1_2_client_method");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_CTX_new= (WOLFSSL_CTX *(*)(WOLFSSL_METHOD *)) dlsym(handle, "wolfSSL_CTX_new");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_CTX_set_verify= (void (*)(WOLFSSL_CTX* , int , VerifyCallback)) dlsym(handle, "wolfSSL_CTX_set_verify");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_connect= (int (*)(WOLFSSL *)) dlsym(handle, "wolfSSL_connect");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_get_error= (int (*)(WOLFSSL *, int)) dlsym(handle, "wolfSSL_get_error");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_ERR_error_string_n= (void (*)(unsigned long, char *, unsigned long)) dlsym(handle, "wolfSSL_ERR_error_string_n");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_new= (WOLFSSL *(*)(WOLFSSL_CTX *)) dlsym(handle, "wolfSSL_new");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_free= (void (*)(WOLFSSL *)) dlsym(handle, "wolfSSL_free");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_SetIORecv= (void (*)(WOLFSSL_CTX *, CallbackIORecv)) dlsym(handle, "wolfSSL_SetIORecv");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_SetIOSend= (void (*)(WOLFSSL_CTX *, CallbackIORecv)) dlsym(handle, "wolfSSL_SetIOSend");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_read= (int (*)(WOLFSSL *, void *, int)) dlsym(handle, "wolfSSL_read");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_write= (int (*)(WOLFSSL *, void *, int)) dlsym(handle, "wolfSSL_write");
	if ( (err= dlerror()) != NULL ) goto fail;

	wolf.wolfSSL_pending= (int (*)(WOLFSSL *)) dlsym(handle, "wolfSSL_pending");
	if ( (err= dlerror()) != NULL ) goto fail;

	Serial.println("OK");

	return 1;

fail:
	Serial.println(err);
	return 0;
}

int client_recv (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int i= 0;

	// Read a byte while one is available, and while our buffer isn't full.

	while ( client.available() > 0 && i < sz) {
		buf[i++]= client.read();
	}

	return i;
}

int client_send (WOLFSSL *_ssl, char *buf, int sz, void *_ctx)
{
	int n= client.write((byte *) buf, sz);
	return n;
}

void loop() {
	char errstr[81];
	char buf[256];
	int err;

	// Repeat until the repeat count is 0.

	if (repeat) {
		if ( client.connect(server, 443) ) {
			int bwritten, bread, totread;

			Serial.print("Connected to ");
			Serial.println(server);

			ssl= wolf.wolfSSL_new(ctx);
			if ( ssl == NULL ) {
				err= wolf.wolfSSL_get_error(ssl, 0);
				wolf.wolfSSL_ERR_error_string_n(err, errstr, 80);
				Serial.print("wolfSSL_new: ");
				Serial.println(errstr);
			}

			Serial.println(req);
			bwritten= wolf.wolfSSL_write(ssl, (char *) req, strlen(req));
			Serial.print("Bytes written= ");
			Serial.println(bwritten);

			if ( bwritten > 0 ) {
				totread= 0;

				while ( client.available() || wolf.wolfSSL_pending(ssl) ) {
					bread= wolf.wolfSSL_read(ssl, buf, sizeof(buf)-1);
					totread+= bread;

					if ( bread > 0 ) {
						buf[bread]= '\0';
						Serial.print(buf);
					} else {
						Serial.println();
						Serial.println("Read error");
					}
				}

				Serial.print("Bytes read= ");
				Serial.println(totread);
			}

			if ( ssl != NULL ) wolf.wolfSSL_free(ssl);

			client.stop();
			Serial.println("Connection closed");
		}

		--repeat;
	}

	// Be polite by sleeping between iterations

	delay(5000);
}

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:  http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2012 Intel Corporation. All rights reserved.


Quick Start Guide for the Intel(r) Xeon Phi(tm) Processor X200


This document is under development.

Abaqus/Standard Performance Case Study on Intel® Xeon® E5-2600 v3 Product Family


Background

The whole point of simulation is to model the behavior of a design, and of potential changes to it, under various conditions to determine whether it responds as expected. Simulating in software is far cheaper than building hardware, performing a physical test, and modifying the hardware model after each iteration.

Dassault Systèmes [1], through its SIMULIA* brand, is creating a new paradigm that establishes Finite Element Analysis and multiphysics simulation software as an integral business process in the engineering value chain. More information about SIMULIA can be found here [2].

The Abaqus* Unified Finite Elements Analysis product suite, from Dassault Systèmes* SIMULIA, offers powerful and complete solutions for both routine and sophisticated engineering problems covering a vast spectrum of industrial applications in Automotive, Aerospace, Consumer Packaged Goods, Energy, High Tech, Industrial Equipment and Life Sciences. As an example,  automotive industry engineering work groups are able to consider full vehicle loads, dynamic vibration, multibody systems, impact/crash, nonlinear static, thermal coupling, and acoustic-structural coupling using a common model data structure and integrated solver technology.

What is Finite Element Analysis (FEA)?

FEA is a computerized method of simulating the behavior of engineering structures and components under a variety of conditions. It is the application of the Finite Element Method (FEM) [3] [8]. It works by breaking an object down into a large number of finite elements, each represented by an equation. By integrating all of the elements' equations, the whole object can be mathematically modeled.

How Abaqus/Standard takes advantage of Intel® AVX2

Abaqus/Standard is a general-purpose FEA tool that includes many analysis capabilities. According to the Dassault Systèmes web site, it “employs solution technology ideal for static and low-speed dynamic events where highly accurate stress solutions are critically important. Examples include sealing pressure in a gasket joint, steady-state rolling of a tire, or crack propagation in a composite airplane fuselage. Within a single simulation, it is possible to analyze a model both in the time and frequency domain. For example, one may start by performing a nonlinear engine cover mounting analysis including sophisticated gasket mechanics. Following the mounting analysis, the pre-stressed natural frequencies of the cover can be extracted, or the frequency domain mechanical and acoustic response of the pre-stressed cover to engine induced vibrations can be examined.” More information about Abaqus/Standard can be found at [9].

According to the Dassault Systèmes web site, Abaqus/Standard uses Hilber-Hughes-Taylor time integration [12] by default. The time integration is implicit, meaning that the operator matrix must be inverted and a set of simultaneous nonlinear dynamic equilibrium equations must be solved at each time increment. This solution is done iteratively using Newton's method [13]. It relies on a function called DGEMM [5] (Double-Precision General Matrix Multiplication) in the Intel® Math Kernel Library (Intel® MKL [4]) to handle matrix multiplication involving double-precision values.

Analysis of Abaqus workloads using performance monitoring tools, such as Intel® VTune™ Amplifier, showed that a significant number of them spend 40% to 50% of their runtime in DGEMM. Further analysis showed that DGEMM makes extensive use of the multiply-add operation, since DGEMM is, essentially, matrix multiplication.

One of the new features of the Intel® Xeon® E5-2600 v3 Product Family is support for a new instruction set extension called Intel AVX2 [7]. One of the new instructions in Intel AVX2 is the three-operand fused multiply-add (FMA3 [6]). By implementing the combined multiply-add operation in hardware, the speed of this operation is considerably improved.

Abaqus/Standard uses Intel® MKL's DGEMM implementation. Note that in Intel MKL version 11 update 5 and later, DGEMM was optimized to use the Intel AVX2 extensions, allowing DGEMM to run optimally on the Intel® Xeon® E5-2600 v3 Product Family.

Performance test procedure

To demonstrate the performance improvement brought by a newer DGEMM implementation that takes advantage of Intel AVX2, we performed tests on two platforms: one system equipped with the Intel Xeon E5-2697 v3 and the other with the Intel Xeon E5-2697 v2. The duration of the tests was measured in seconds.

Performance test Benchmarks

The following four benchmarks from Abaqus/Standard were used: s2a, s3a, s3b and s4b.

Figure 1. S2a is a nonlinear static analysis of a flywheel with centrifugal loading.

Figure 2. S3 extracts the natural frequencies and mode shapes of a turbine impeller.

S3 has three versions, two of which were used here.

S3a is a 360,000 degrees-of-freedom (DOF) version using the Lanczos eigensolver [11].

S3b is a 1,100,000 DOF version using the Lanczos eigensolver.

Figure 3. S4 is a benchmark that simulates the bolting of a cylinder head onto an engine block.

S4b is the S4 version with 5,000,000 DOF using the direct solver.

Note that these pictures are the property of Dassault Systèmes*. They are reprinted with permission from Dassault Systèmes.

Test configurations

System equipped with Intel Xeon E5-2697 v3

  • System: Pre-production
  • Processors: Xeon E5-2697 v3 @2.6GHz
  • Memory: 128GB DDR4-2133MHz

System equipped with Intel Xeon E5-2697 v2

  • System: Pre-production
  • Processors: Xeon E5-2697 v2 @2.7GHz
  • Memory: 64GB DDR3-1866MHz

Operating System: Red Hat* Enterprise Linux Server release 6.4

Application: Abaqus/Standard benchmarks version 6.13-1

Note:

1) Although the system equipped with the Intel® Xeon® E5-2697 v3 processor has more memory, the memory capacity does not affect the test results, as the largest workload only used 43GB of memory.

2) The duration was measured by wall-clock time in seconds.

Test Results 

Figure 4. Comparison between Intel Xeon E5-2697 v3 and E5-2697 v2

Figure 4 shows the benchmarks running on a system equipped with the Intel Xeon E5-2697 v3 and on a system equipped with the E5-2697 v2. The performance improvement, due to Intel AVX2 together with the hardware advantage, ranges from 1.11X to 1.39X.

 

Figure 5. Comparison between benchmarks with Intel AVX2 enabled and disabled

Figure 5 shows the results of the benchmarks with Intel AVX2 enabled and disabled on a system equipped with the Intel Xeon E5-2697 v3. Using Intel AVX2 allows the benchmarks to finish faster. The performance increase due to Intel AVX2 ranges from 1.03X to 1.11X.

Note: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Conclusion

Simulation software performance is critical, since faster software can significantly reduce model development and analysis time. Abaqus/Standard is a well-known FEA product whose solvers rely on DGEMM. With the introduction of Intel® AVX2 in the Intel® Xeon® E5-2600 v3 Product Family, and with Intel MKL updated to take advantage of Intel AVX2, a simple change to Abaqus/Standard to use the latest libraries yielded a considerable performance improvement.

References

[1] www.3ds.com

[2] http://www.3ds.com/products-services/simulia/

[3] http://en.wikipedia.org/wiki/Finite_element_method

[4] http://en.wikipedia.org/wiki/Math_Kernel_Library

[5] https://software.intel.com/en-us/node/429920

[6] http://en.wikipedia.org/wiki/FMA_instruction_set

[7] http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

[8] http://people.maths.ox.ac.uk/suli/fem.pdf

[9] http://www.3ds.com/products-services/simulia/products/abaqus/abaqusstandard/

[10] http://www.simulia.com/support/v66/v66_performance.html#s2

[11] http://en.wikipedia.org/wiki/Lanczos_algorithm

[12] http://sbel.wisc.edu/People/schafer/mdexperiments/node13.html

[13] http://en.wikipedia.org/wiki/Newton%27s_method

 

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, the Intel logo, Intel Core, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

Copyright © 2015 Intel Corporation. All rights reserved.

*Other names and brands may be claimed as the property of others.

Getting Ready for Intel® Xeon Phi™ x200 Product Family


This article demonstrates some of the techniques application developers can use to best prepare their applications for the upcoming Intel® Xeon Phi™ x200 product family – codename Knights Landing (KNL).

Quick links

1.    Introduction

The Intel® Xeon Phi™ x100 family of coprocessors was the first generation of the Intel® Xeon Phi™ product family. It offered energy-efficient scaling, enhanced vectorization capabilities, and high local memory bandwidth. Its important features include more than 60 cores (240+ threads), up to 16 GB of GDDR5 memory with 352 GB/s of bandwidth, and the ability to run Linux* with standard tools and languages. Some applications used these many-core processors by offloading compute-intensive workloads, while others simultaneously used both the Intel® Xeon® host system and the Intel® Xeon Phi™ coprocessors, each crunching its own portion of the workload.

Some applications perform well under this paradigm, while for others the benefit of accelerated computing is not enough to make up for the cost of moving data between the host and the coprocessor over PCIe. From the application developer's perspective, this can be a serious problem.

The Intel® Xeon Phi™ x200 product family – codename Knights Landing (KNL) – is offered both as a processor and as a coprocessor. As a processor, KNL needs no host to support it; it can boot a full operating system by itself. For applications that were limited by the overhead of data transfer on Knights Corner (KNC), all data processing can now be completed on the KNL node itself, either in high-bandwidth near memory or in slower DDR4, without moving data back and forth across a PCIe bus between host and accelerator. The coprocessor version of KNL offers an offload paradigm similar to KNC, the first generation of Intel® Xeon Phi™ coprocessors, but with the added advantages of improved parallelism and greater single-thread performance. For both the processor and coprocessor versions of the Intel® Xeon Phi™ x200 product family, however, it is important that applications effectively use as many cores as possible in parallel, and also explore and utilize the enhanced vectorization capabilities, to achieve significant performance gains. Cluster applications must also support fabric scaling. Moreover, applications optimized for Knights Corner are highly likely to also perform well on the next generation of the Intel® Xeon Phi™ product family.

2.    About This Document

The first part of this document lists important features of the Intel® Xeon Phi™ x200 product family. It then demonstrates how currently available tools like the Intel® Software Development Emulator and Intel® VTune™ Amplifier can be used to prepare for the upcoming KNL processors and coprocessors. It also lists programming and optimization techniques already known from the Intel® Xeon Phi™ x100 products, along with new techniques suited to the Intel® Xeon Phi™ x200 processors and coprocessors. Wherever possible, pointers are given to already published best known methods that can assist application developers in applying these optimization techniques. This document does not explain architecture or instruction-level details, nor is it intended for system administrators who want to set up or manage their Knights Landing systems. Most of the discussion focuses on the KNL processor, as it will be the first release of the Intel® Xeon Phi™ x200 product family.

3.    Intel® Xeon Phi™ x200 Product Overview

Figure 1. KNL package[1] overview

3.1    On Package Micro-architecture Information

Some of the architectural highlights for Intel® Xeon Phi™ x200 product family – codename Knights Landing - are as follows:

  • Up to 72 cores (in 36 Tiles) connected in a 2D Mesh architecture with improved on-package latency
  • 6 channels of DDR4 supporting up to 384GB with a sustained bandwidth of more than 80 GB/s
  • Up to 16GB of high performance on-package memory (MCDRAM) with a sustained bandwidth of ~500 GB/s, supporting flexible memory modes including cache and flat
  • Each tile can be drawn as follows:

 

Figure 2. KNL Tile

  • Each core is based on an Intel® Atom™ core with many HPC enhancements such as:
    • 4 Threads/core
    • Deep Out-of-Order buffers
    • Gather/scatter in hardware
    • Advanced branch prediction
    • High cache bandwidth
  • 2x 512b Vector Processing Units per core with support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • 3x Single-thread performance compared to Knights Corner
  • Binary compatible with Intel® Xeon® processors
  • Cache-coherent
  • Support for Intel® Omni Scale™ fabric integration

3.2    Performance

  • 3+ Teraflops of double-precision peak theoretical performance per single KNL node
  • Power efficiency (over 25% better than discrete coprocessor)[2] – over 10 GF/W
  • Standalone bootable processor with ability to run Linux and Windows OS
  • Platform memory capacity comparable to Intel® Xeon® processors

3.3    Programming Standards Supported

  • OpenMP
  • Message Passing Interface (MPI)
  • Fortran
  • Intel® Threading Building Blocks and Intel® Cilk™ Plus
  • C/C++

4.   Application Readiness for Knights Landing

Similar to the first generation of Intel® Xeon Phi™ coprocessors, scaling and vectorization are two fundamental considerations to achieve high performance on Knights Landing. Moreover, the Intel® Xeon Phi™ x200 processors have the ability to use high bandwidth memory (MCDRAM) as a separate addressable memory. For certain memory bound applications, modifying allocations of some data structures to utilize this high bandwidth memory can also boost the performance further.

4.1    Scaling

In order to obtain performance benefits with Intel® Xeon Phi™ product families, it is very important for the application to scale with respect to the increasing number of cores. To check scaling, you can create a graph of performance as you run your application with various numbers of threads either on Intel® Xeon® processors or Intel® Xeon Phi™ x100 coprocessors. Depending on your programming environment, you can either change an appropriate environment variable (for example, OMP_NUM_THREADS for OpenMP) or configuration parameters to vary the number of threads. In some cases, as you increase the number of cores, you may also want to increase the size of the dataset to ensure there is enough work for all the threads and the benefits of parallel performance are not subsumed by overhead in thread creation and maintenance.
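As a minimal sketch (the kernel and sizes are illustrative, not from this document), a loop parallelized with OpenMP can be timed under different OMP_NUM_THREADS settings to produce such a scaling graph:

```c
/* Illustrative kernel for a strong-scaling check: time it with
   OMP_NUM_THREADS=1, 2, 4, ... and compare elapsed times. Ideally the
   runtime halves as the thread count doubles, provided n is large enough
   to amortize threading overhead. */
double sum_squares(long n) {
    double sum = 0.0;
    long i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += (double)i * (double)i;
    return sum;
}
```

Run the compiled binary repeatedly, e.g. `OMP_NUM_THREADS=1 ./a.out`, then `OMP_NUM_THREADS=2 ./a.out`, and plot the elapsed times against the thread count.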

4.2    Vectorization

The Intel® Xeon Phi™ x200 product family supports Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions in addition to the Intel® SSE, AVX, and AVX2 instruction sets. This enables processing of twice the number of data elements as AVX/AVX2 with a single instruction, and four times that of SSE. These instructions also represent a significant leap over the 512-bit SIMD support that was available with the first-generation Intel® Xeon Phi™ coprocessors.

With AVX-512, the Intel® Xeon Phi™ x200 product family offers higher performance for the most demanding computational tasks. It features the AVX-512 foundation instructions to support 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high speed math instructions, and compact representation of large displacement value. In addition to foundation instructions, Knights Landing will also support three additional capabilities: Intel® AVX-512 Conflict Detection Instructions (CDI), Intel® AVX-512 Exponential and Reciprocal Instructions (ERI) and Intel® AVX-512 Prefetch Instructions (PFI). These capabilities provide efficient conflict detection to allow more loops to be vectorized, exponential and reciprocal operations, and new prefetch capabilities, respectively.

As part of the application readiness efforts for the Intel® Xeon Phi™ x200 product family, support for AVX-512 can currently be evaluated using the Intel® Software Development Emulator (Intel® SDE) on an Intel® Xeon® processor. It has been extended for AVX-512 and is available at https://software.intel.com/en-us/articles/intel-software-development-emulator. Intel® SDE is a software emulator mainly used to emulate future instructions. It is not cycle accurate and can be very slow (up to 100x). However, with the instruction mix report it can provide useful information such as dynamic instruction execution counts and a function-based instruction count breakdown for evaluating compiler code generation.

The compiler switch to enable AVX-512 for KNL is -xMIC-AVX512 (Intel® Compilers 14.0 and later).

4.2.1   SDE Example

Sample code, as shown in Appendix B can be used to demonstrate how Intel® SDE can help evaluate differences in compiler code generation for AVX, AVX2 and AVX-512.

  • Download the latest version of Intel® SDE from https://software.intel.com/en-us/articles/intel-software-development-emulator. The version used in the following example is 7.15.
  • Extract the SDE and set the environment variable to use sde/sde64
    $ tar -xjvf sde-external-7.15.0-2015-01-11-lin.tar.bz2
    $ cd sde-external-7.15.0-2015-01-11-lin
    $ export PATH=`pwd`:$PATH 
  • Use the latest Intel® Compilers (14.0+) and compile with the “-xMIC-AVX512” knob to generate a Knights Landing (KNL) binary
    //Compiling for KNL
    $ icc -openmp -g -O3 -xMIC-AVX512 -o simpleDAXPY_knl simpleDAXPY.c
    //Compiling for Haswell
    $ icc -openmp  -g -O3 -xCORE-AVX2 -o simpleDAXPY_hsw simpleDAXPY.c
    //Compiling for Ivy Bridge
    $ icc -openmp -g -O3 -xCORE-AVX-I -o simpleDAXPY_ivb simpleDAXPY.c
  • In order to simplify the analysis, set the number of threads to 1
    $ export OMP_NUM_THREADS=1
  • Generate instruction mix reports for AVX, AVX2[3]  and AVX-512 to compare performance metrics
    //Generating report for KNL
    $ sde -knl -mix -top_blocks 100 -iform 1 -omix sde-mix-knl.txt -- ./simpleDAXPY_knl 64 40
    
    // Generating report for Ivy Bridge
    $ sde -ivb -mix -top_blocks 100 -iform 1 -omix sde-mix-ivb.txt -- ./simpleDAXPY_ivb 64 40
    
    // Generating report for Haswell
    $ sde -hsw -mix -top_blocks 100 -iform 1 -omix sde-mix-hsw.txt -- ./simpleDAXPY_hsw 64 40
  • We compare the dynamic count of total instructions executed to get a rough estimate of the overall improvement in application performance when running AVX-512 against the AVX and AVX2 instruction sets. This can be done quickly by parsing the generated instruction mix reports as follows:
    //Getting instruction count with AVX-512
    $ grep total sde-mix-knl.txt | head -n 1
    *total                                                   5493008680
    
    //Getting instruction count with AVX2
    $ grep total sde-mix-hsw.txt | head -n 1
    *total                                                   6488866275
    
    //Getting instruction count with AVX
    $ grep total sde-mix-ivb.txt | head -n 1
    *total                                                   7850210690
  • Reduction in total dynamic instruction execution count
Change of Instruction Set    Reduction in Dynamic Instruction Count
AVX  -> AVX-512              30.03%
AVX2 -> AVX-512              15.34%
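The percentages above follow directly from the total counts in the mix reports; as a quick sketch (the function name is ours):

```c
/* Percent reduction in dynamic instruction count between two SDE "total"
   lines, e.g. pct_reduction(7850210690.0, 5493008680.0) for the
   AVX -> AVX-512 comparison. */
double pct_reduction(double baseline_total, double new_total) {
    return 100.0 * (baseline_total - new_total) / baseline_total;
}
```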

Thus it can be observed that current and future generations of Intel hardware strongly rely on SIMD[4] performance. In order to write efficient and unconstrained parallel programs, it is important that the application developers fully exploit vectorization capabilities of hardware and understand benefits of using explicit vector programming. This can be achieved by either restructuring vector loops, using explicit SIMD directives (#pragma simd) or using compiler intrinsics. Compiler auto-vectorization may also help achieve the goal in most of the cases.
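For example, an explicit OpenMP simd directive (the standard OpenMP 4.0 counterpart of #pragma simd) asserts that a reduction loop is safe to vectorize; the dot-product kernel below is illustrative, not taken from the sample source:

```c
/* Explicit vector programming: the simd directive promises the compiler
   that the loop has no vectorization-unsafe dependences and declares the
   reduction on sum so partial sums can be accumulated in vector lanes. */
double dot(const double *a, const double *b, long n) {
    double sum = 0.0;
    long i;
#pragma omp simd reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```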

4.3 High Bandwidth Memory and Supported Memory Modes

4.3.1 Introduction to MCDRAM

The next generation of the Intel® Xeon Phi™ product family can include up to 16GB of on-package high bandwidth memory – Multi-Channel DRAM (MCDRAM). It can provide up to 5x the bandwidth of DDR and 5x the power efficiency of GDDR5. MCDRAM supports NUMA[5] and can be configured in cache, flat, and hybrid modes. The mode must be selected and configured at boot time.

4.3.2   Cache Mode

In cache mode, all of MCDRAM behaves as a memory-side direct-mapped cache in front of DDR4. As a result, there is only a single visible pool of memory, and MCDRAM appears as a high-bandwidth, high-capacity L3 cache. The advantage of using MCDRAM as cache is that legacy applications do not require any modifications. If your application cares about data locality, is not memory bound (i.e., DDR bandwidth bound), and the majority of its critical data structures fit in MCDRAM, then this mode will work well for you.

4.3.3   Flat Mode

In flat mode, MCDRAM is used as a SW visible and OS managed addressable memory (as a separate NUMA node), so that memory can be selectively allocated to your advantage on either DDR4 or MCDRAM. With slight modifications to your software to enable use of both types of memory at the same time, the flat model can deliver uncompromising performance. If your application is DDR bandwidth limited, you can certainly boost your application performance by investigating bandwidth critical hotspots and selectively allocating critical data structures to high bandwidth memory.

4.3.4   Hybrid Mode

The hybrid model offers a bit of both worlds – some MCDRAM is configured as addressable memory and some as cache. This mode is enabled at boot by configuring a portion (25% or 50%) of MCDRAM as cache and leaving the remainder in flat mode.

4.3.5   DDR Bandwidth Analysis

One of the important steps in deciding which memory mode will work best for you is to analyze the memory behavior of your application. Start by asking whether your application is DDR bandwidth bound. If yes, is it possible to find hotspots where peak bandwidth is attained for some of the data structures involved, i.e., can you identify which data structures are bandwidth critical? Is it possible to fit those bandwidth-critical data structures in MCDRAM?

In order to help us answer the above questions, we will use the sample source as shown in Appendix B to demonstrate how Intel® VTune Amplifier can be used to analyze peak DDR bandwidth and also identify bandwidth critical data structures.

4.3.5.1 Sample Kernels

DAXPY – We use a simplified DAXPY routine in which a vector is multiplied by a constant and added to another vector. It has been modified to use an OpenMP parallel loop with explicit vectorization via the simd clause.

//A simple DAXPY kernel
void run_daxpy(double A[], double PI, double B[], unsigned long vectorSize){
       unsigned long i = 0;
#pragma omp parallel for simd
        for(i=0; i<vectorSize; i++){
              B[i] = PI*A[i] + B[i];
        }
       return;
}


swap_low_and_high() – A dummy subroutine to do a special swap as given below:

Input Array:   A B C D E F G H
Output Array:  A E C G B F D H

A similar kind of rearrangement is commonly seen after the filtering step of Discrete Wavelet Transform to separate low and high frequency elements. 

//Rearranging Odd and Even Position elements into Low and High Vectors
void swap_low_and_high(unsigned long vectorSize, double C[]){
       unsigned long i = 0;
       unsigned long half = vectorSize/2;
       double temp = 0.0;

#pragma omp parallel for private(temp)
       for(i=0; i<half; i+=2){
              //Swap odd-position element C[i+1] with its counterpart C[half+i]
              temp = C[i+1];
              C[i+1] = C[half+i];
              C[half+i] = temp;
       }
       return;
}
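The rearrangement shown in the table can be checked with a short standalone sketch; here the swap of each odd-position element C[i+1] with its counterpart C[half+i] is written serially for clarity, with letters A..H mapped to the doubles 1..8:

```c
#include <stddef.h>

/* Serial sketch of the table's rearrangement: swap each odd-position
   element C[i+1] with its counterpart C[half+i]. */
void swap_low_and_high_serial(size_t vectorSize, double C[]) {
    size_t half = vectorSize / 2;
    for (size_t i = 0; i < half; i += 2) {
        double temp = C[i + 1];
        C[i + 1] = C[half + i];
        C[half + i] = temp;
    }
}
```

Applied to {1,2,3,4,5,6,7,8} (A..H), this produces {1,5,3,7,2,6,4,8}, i.e. A E C G B F D H.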

4.3.5.2 Analysis Using Sample Source

  • Set up the environment for Compiler and Intel® VTune Amplifier
$ source /opt/intel/composerxe/bin/compilervars.sh intel64
$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
  • Compile and profile bandwidth for simpleDAXPY application
$ icc -g -O3 -o simpleDAXPY_ddr simpleDAXPY.c -openmp –lpthread
$ amplxe-cl --collect bandwidth -r daxpy_swap_BW -- numactl --membind=0 --cpunodebind=0 ./simpleDAXPY_ddr 512 5

Note
-   In order to simplify the analysis, we bind the application to run on only one socket. 
-   The number of array elements selected here is 512M (512 x 1024 x 1024), and both DAXPY and SWAP_LOW_HIGH are repeated 5 times to generate enough samples for analysis

  • Analyze the bandwidth profile using Intel® VTune Amplifier
$ amplxe-gui daxpy_swap_BW

Figure 3. Bandwidth Profile

From Figure 3, it can be seen that the simpleDAXPY application attains a single-socket bandwidth of ~57 GB/s, which is comparable to the practical peak bandwidth for a Haswell[6] system with 4 DDR channels per socket. Hence it can be inferred that the application is DDR memory bandwidth bound.

Note
-   Peak bandwidth here is referenced as per the STREAM benchmark Triad results in GB/s
-   Single socket peak theoretical bandwidth for experimental setup can be given as
2133 (MT/s) * 8 (Bytes/Clock) * 4 (Channels/socket)/1000 = ~68 GB/s

  • Identify bandwidth critical data structures

To identify which data structures should be allocated in high bandwidth memory, it is important to look at core counters that correlate with DDR bandwidth. Two such counters of interest are MEM_LOAD_UOPS_RETIRED.L3_MISS and MEM_LOAD_UOPS_RETIRED.L2_MISS_PS. These core hardware event counters can be collected by profiling the application with Intel® VTune Amplifier as follows:

$ amplxe-cl -collect-with runsa -data-limit=0 -r daxpy_swap_core_counters -knob event-config=UNC_M_CAS_COUNT.RD,UNC_M_CAS_COUNT.WR,CPU_CLK_UNHALTED.THREAD,CPU_CLK_UNHALTED.REF_TSC,MEM_LOAD_UOPS_RETIRED.L3_MISS,MEM_LOAD_UOPS_RETIRED.L2_MISS_PS -- numactl --membind=0 --cpunodebind=0 ./simpleDAXPY_ddr 512 5
  • Analyze the core hardware event counters using Intel® VTune Amplifier GUI
$ amplxe-gui daxpy_swap_core_counters

With Hardware Event Counts viewpoint selected, we look at PMU Events graph.

In order to see the filtered graph as shown in Figure 4, the Thread box is unchecked and MEM_LOAD_UOPS_RETIRED.L3_MISS is selected in the Hardware Event Count drop-down box.

Now, looking at the sources of maximum LLC[7] misses, the data structures that contribute to peak bandwidth can be identified. As seen in Figure 5, the region of peak LLC misses can be zoomed in on and filtered by selection to get further information about the contributors to peak LLC misses.

Figure 4. MEM_LOAD_UOPS_RETIRED.L3_MISS Profile

 

Figure 5. Zoom In and Filter the region of peak L3 cache misses

 

As shown in Figure 6, for the top contributor right click and view the source to get the exact line in source code where the LLC misses are at peak.

Figure 6. Top Contributor of L3 Misses
 

Figure 7. Suggestions for BW Critical data structures

 

  • Changing allocations to high bandwidth memory

HBWMALLOC is a new memory allocation library that abstracts NUMA programming details and helps application developers use high bandwidth memory on the Intel® Xeon® and Intel® Xeon Phi™ x200 product families in both Fortran and C. Using this interface is as simple as replacing malloc calls with hbw_malloc in C, or placing explicit declarations in Fortran as shown below:

C - Application program interface

//Allocate 1000 floats from DDR
float   *fv;
fv = (float *)malloc(sizeof(float) * 1000);


//Allocate 1000 floats from MCDRAM
float   *fv;
fv = (float *)hbw_malloc(sizeof(float) * 1000);

 

FORTRAN – Application program interface

//Allocate arrays from MCDRAM & DDR
c     Declare arrays to be dynamic
      REAL, ALLOCATABLE:: A(:), B(:), C(:)
!DEC$ ATTRIBUTES FASTMEM :: A
      NSIZE=1024
c
c     allocate array ‘A’ from MCDRAM
c
      ALLOCATE (A(1:NSIZE))
c
c     Allocate arrays that will come from DDR
c
      ALLOCATE  (B(NSIZE), C(NSIZE))

Please refer to the Appendix A for further details about hbwmalloc - high bandwidth memory allocation library.
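A common pattern with this interface is to fall back to DDR when high bandwidth memory is not available; below is a minimal sketch (the helper name is ours, and the hbwmalloc path is assumed to be compiled only when USE_HBW is defined):

```c
#include <stdlib.h>
#ifdef USE_HBW
#include <hbwmalloc.h>
#endif

/* Prefer MCDRAM when the hbwmalloc interface reports it is available,
   otherwise fall back to an ordinary DDR allocation. */
void *alloc_pref_hbw(size_t bytes) {
#ifdef USE_HBW
    if (hbw_check_available() == 0)      /* 0 means HBW nodes exist */
        return hbw_malloc(bytes);        /* must be released with hbw_free */
#endif
    return malloc(bytes);                /* must be released with free */
}
```

Callers must track which allocator succeeded so the matching release routine (hbw_free vs. free) is used.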

The above profiling results suggest that array A and array B are the data structures that should be preferentially allocated in high bandwidth memory.

In order to allocate array A and array B in high bandwidth memory, the following modifications are done:

.
.
.
#ifndef USE_HBW      //If not using high bandwidth memory
       double *A = (double *)_mm_malloc(limit * sizeof(double),ALIGNSIZE);
       double *B = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
       double *C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#else
       //Allocating A and B in High Bandwidth Memory
       double *A, *B, *C;
       hbw_posix_memalign((void**)(&(A)), ALIGNSIZE, limit*sizeof(double));
       if(A == NULL){
              printf("Unable to allocate on HBM: A");
       }
       printf("Allocating array A in High Bandwidth Memory\n");
       hbw_posix_memalign((void**)(&(B)), ALIGNSIZE, limit*sizeof(double));
       if(B == NULL){
              printf("Unable to allocate on HBM: B");
       }
       printf("Allocating array B in High Bandwidth Memory\n");
       C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#endif

.
.
.

#ifndef USE_HBW
       _mm_free(A);
       _mm_free(B);
       _mm_free(C);
#else
       hbw_free(A);
       hbw_free(B);
       _mm_free(C);

#endif

.
.
.

Note
-   In order to run your application with high bandwidth memory allocations, the LD_LIBRARY_PATH environment variable must be updated to include paths to the memkind and jemalloc libraries
-   For example: $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/build/memkind/lib:$HOME/build/jemalloc/lib

  • Simulation of low bandwidth and high bandwidth performance gap

Figure 8. Simulating KNL NUMA behavior using 2 socket Intel® Xeon®

At this time, since we do not have access to actual KNL hardware, we will study the behavior of high bandwidth memory using the concept of Non Uniform Memory Access (NUMA). We simulate the scenario of low bandwidth and high bandwidth regions by allocating and accessing arrays from two separate NUMA nodes (i.e. near memory and far memory).

Example:

Let's first compile and execute simpleDAXPY without high bandwidth memory allocations. Here we bind the application to socket 0 and bind all memory allocations to socket 1 (the far memory).

$ icc -g -O3 -o simpleDAXPY_ddr simpleDAXPY.c -openmp –lpthread
$ numactl --membind=1 --cpunodebind=0 ./simpleDAXPY_ddr 512 5

//Output
Running with selected parameters:
No. of Vector Elements : 512M
Repetitions = 5
Threads = 16
Time - DAXPY (ms): 4074
Time – SWAP_LOW_HIGH (ms): 2051

Set the NUMA node 0 (Socket 0) as High Bandwidth Memory Node as follows:

$  export MEMKIND_HBW_NODES=0

Note
-   Explicit configuration of the HBW node is only required for simulation. In the presence of actual high bandwidth memory (MCDRAM), the memkind library automatically identifies high bandwidth memory nodes

Now we will compile and execute simpleDAXPY with high bandwidth allocations using the memkind library. Since we bind memory allocations to node 1 (i.e., socket 1) and bind the application to node 0 (i.e., socket 0), by default all allocations are done in far memory; memory on NUMA node 0 is selected only when high bandwidth allocations are explicitly made using hbw_malloc* calls. This simulates the KNL behavior we might observe when MCDRAM is configured in flat or hybrid mode. 

$ icc -O3 -DUSE_HBW -I$HOME/build/memkind/include -L$HOME/build/memkind/lib/ -L$HOME/build/jemalloc/lib/ -o simpleDAXPY_hbm simpleDAXPY.c -openmp -lmemkind -ljemalloc -lnuma -lpthread
$ numactl --membind=1 --cpunodebind=0 ./simpleDAXPY_hbm 512 5

//Output
Running with selected parameters:
No. of Vector Elements : 512M
Repetitions = 5
Allocating array A in High Bandwidth Memory
Allocating array B in High Bandwidth Memory
Threads = 16
Time - DAXPY (ms): 2355
Time – SWAP_LOW_HIGH (ms): 2068

Note
-   The performance improvement reported here is due to both reduced latency and improved bandwidth
-   At the time this white paper was written, the difference in execution times with and without hbw_malloc could only be observed on systems with RHEL* 7.0 and later. This could be due to a software bug in the handling of membind in operating systems before RHEL* 7.0

 

4.4 Other Optimization Techniques

In addition to better scaling, increased vectorization intensity and exploring high bandwidth memory, there are a number of possible user-level optimizations which can be applied to improve application performance. These advanced techniques proved successful for the first generation of Intel® Xeon Phi™ coprocessors and should be helpful for application development on the Intel® Xeon Phi™ x200 product family as well. Some of these optimizations aid compilers while others involve restructuring code to extract additional performance for your application. In order to achieve peak performance, the following optimizations should be kept in mind:

  • Cache blocking – Related reading: Cache-blocking-techniques
  • Loop unrolling, Prefetching, Tiling, Unit-stride memory access – Related reading: Optimization and Performance tuning for Intel® Xeon Phi™ Coprocessors
  • Array of Structures (AoS) to Structure of Arrays (SoA) transformation – Related reading: Case study comparing AoS and SoA data layouts for compute intensive loop
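The AoS-to-SoA transformation can be sketched as follows (the types and kernel are illustrative): in a Structure of Arrays layout each field is contiguous, so a loop over one field has unit-stride accesses that vectorize cleanly, unlike the stride-3 accesses an Array of Structures layout produces:

```c
#include <stddef.h>

/* AoS: fields interleaved per element -> stride-3 access to x alone. */
typedef struct { double x, y, z; } PointAoS;

/* SoA: each field contiguous -> unit-stride, SIMD-friendly access. */
typedef struct { double *x, *y, *z; } PointsSoA;

void scale_x_soa(PointsSoA *p, double s, size_t n) {
    size_t i;
#pragma omp simd
    for (i = 0; i < n; i++)
        p->x[i] *= s;     /* contiguous loads/stores over x */
}
```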

5.    References

Intel® Xeon Phi™ Coprocessor code named “Knights Landing” - Application Readiness (https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-code-named-knights-landing-application-readiness)

What disclosures has Intel made about Knights Landing (https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing) 

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors (https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf) 

Knights Corner: Your Path to Knights Landing (https://software.intel.com/en-us/videos/knights-corner-your-path-to-knights-landing)

Intel® Software Development Emulator (https://software.intel.com/en-us/articles/intel-software-development-emulator)

Intel® Architecture Instruction Set Extensions Programming Reference - Intel® AVX-512 is detailed in Chapters 2-7 (https://software.intel.com/en-us/intel-architecture-instruction-set-extensions-programming-reference)

AVX-512 Instructions (https://software.intel.com/en-us/blogs/2013/avx-512-instructions)

High Bandwidth Memory (HBM): how will it benefit your application? (https://software.intel.com/en-us/articles/high-bandwidth-memory-hbm-how-will-it-benefit-your-application)

GitHub - memkind and jemalloc (https://github.com/memkind)  

 

Appendix A – HBWMALLOC

NAME

       hbwmalloc - The high bandwidth memory interface

SYNOPSIS

       #include <hbwmalloc.h>

       Link with -ljemalloc -lnuma -lmemkind -lpthread

       int hbw_check_available(void);
       void* hbw_malloc(size_t size);
       void* hbw_calloc(size_t nmemb, size_t size);
       void* hbw_realloc (void *ptr, size_t size);
       void hbw_free(void *ptr);
       int hbw_posix_memalign(void **memptr, size_t alignment, size_t size);
       int hbw_posix_memalign_psize(void **memptr, size_t alignment, size_t size, int pagesize);
       int hbw_get_policy(void);
       void hbw_set_policy(int mode);

Installing jemalloc

//jemalloc and memkind can be downloaded from https://github.com/memkind
$ unzip jemalloc-memkind.zip
$ cd jemalloc-memkind
$ autoconf
$ mkdir obj
$ cd obj/
$ ../configure --enable-autogen --with-jemalloc-prefix=je_ --enable-memkind --enable-safe --enable-cc-silence --prefix=$HOME/build/jemalloc
$ make
$ make build_doc
$ make install

Installing memkind

$ unzip memkind-master.zip
$ cd memkind-master
$ ./autogen.sh
$ ./configure --prefix=$HOME/build/memkind --with-jemalloc=$HOME/build/jemalloc
$ make && make install

Update LD_LIBRARY_PATH to include locations of memkind and jemalloc

 

Appendix B – simpleDAXPY.c

/*
 *  Copyright (c) 2015 Intel Corporation.
 *  Intel Corporation All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */

#include<stdio.h>
#include<stdlib.h>
#include<sys/time.h>
#include<omp.h>

#define ALIGNSIZE 64

//A simple DAXPY kernel
void run_daxpy(double A[], double PI, double B[], unsigned long vectorSize){
       unsigned long i = 0;
#pragma omp parallel for simd
        for(i=0; i<vectorSize; i++){
              B[i] = PI*A[i] + B[i];
        }
       return;
}

//Rearranging Odd and Even Position into Low and High Vectors
void swap_low_and_high(unsigned long vectorSize, double C[]){
       unsigned long i = 0;
       unsigned long half = vectorSize/2;
       double temp = 0.0;
#pragma omp parallel for private(temp)
       for(i=0; i<half; i+=2){
              //Swap odd-position element C[i+1] with its counterpart C[half+i]
              temp = C[i+1];
              C[i+1] = C[half+i];
              C[half+i] = temp;
       }
       return;
}


int main  (int argc, char * argv[]){

       struct timeval tBefore, tAfter;
       unsigned long timeDAXPY = 0,timeAverage=0;
       unsigned long i = 0;
       unsigned int j = 0;
       unsigned long limit = 0;
       unsigned int repetitions = 0;

       if (argc < 3){
              printf("Enter Number of Elements in Millions and number of repetitions\nEg: ./simpleDAXPY 64 5\n");
              printf("Running with default settings:\n");
              printf("No. of Vector Elements : 64M\nRepetitions = 1\n");

              limit = 64 * 1024 * 1024;
              repetitions = 1;
       }
       else
       {
              limit = atoi(argv[1]) * 1024 * 1024;
              repetitions = atoi(argv[2]);
              printf("Running with selected parameters:\n");
              printf("No. of Vector Elements : %dM\nRepetitions = %d\n", atoi(argv[1]), atoi(argv[2]));
        }

#ifndef USE_HBW
       double *A = (double *)_mm_malloc(limit * sizeof(double),ALIGNSIZE);
       double *B = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
       double *C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#else
       double *A, *B, *C;
      //Allocating A and B in High Bandwidth Memory
       hbw_posix_memalign((void**)(&(A)), ALIGNSIZE, limit*sizeof(double));
        if(A == NULL){
              printf("Unable to allocate on HBM: A");
       }
       printf("Allocating array A in High Bandwidth Memory\n");
       hbw_posix_memalign((void**)(&(B)), ALIGNSIZE, limit*sizeof(double));
        if(B == NULL){
              printf("Unable to allocate on HBM: B");
       }
       printf("Allocating array B in High Bandwidth Memory\n");
       C = (double *)_mm_malloc(limit * sizeof(double), ALIGNSIZE);
#endif


#pragma omp parallel for simd
        for(i=0; i<limit; i++){
                A[i] = (double)1.0*i;
                B[i] = (double)2.0*i;
                C[i] = (double)4.0*i;
        }

       double PI = (double)22/7;
       printf("Threads = %d\n", omp_get_max_threads());
       for(j = 0; j<repetitions; j++){

              gettimeofday(&tBefore, NULL);
              run_daxpy(A, PI*(j+1), B, limit);
              gettimeofday(&tAfter, NULL);
              timeDAXPY += ((tAfter.tv_sec - tBefore.tv_sec)*1000L +(tAfter.tv_usec - tBefore.tv_usec)/1000);


              gettimeofday(&tBefore, NULL);
              swap_low_and_high(limit, C);
              gettimeofday(&tAfter, NULL);
              timeAverage += ((tAfter.tv_sec - tBefore.tv_sec)*1000L +(tAfter.tv_usec - tBefore.tv_usec)/1000);

       }

       printf("Time - DAXPY (ms): %ld\n", timeDAXPY);
       printf("Time – SWAP_LOW_HIGH (ms): %ld\n", timeAverage);

#ifndef USE_HBW
       _mm_free(A);
       _mm_free(B);
       _mm_free(C);
#else
       hbw_free(A);
       hbw_free(B);
       _mm_free(C);
#endif

       return 0;
}
 

[1]  This diagram is for conceptual purposes only and only illustrates a processor and memory – it is not to scale and does not include all functional areas of the processor, nor does it represent actual component layout. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

[2] Projected result based on internal Intel analysis using estimated performance and power consumption of a rack sized deployment of Intel® Xeon® processors and Knights Landing coprocessors as compared to a rack with Knights Landing processors only 

[3] The instruction mix report for AVX2 was generated using the non-OpenMP version of the code. At this time, the version of SDE available externally fails to generate an instruction mix report when used with the "-hsw" flag and OpenMP. This issue is fixed and will be included in the next version of SDE.

[4] SIMD – Single Instruction Multiple Data

[5] NUMA – Non Uniform Memory Access

[6]Experiment Setup

  • 2 Socket 14 core Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz

  • 4 DDR channels per socket

  • Red Hat* Enterprise Linux Server release 7.0

[7]LLC – Last Level Cache

 

 

Intel® Parallel Computing Center at the University of Cambridge


Principal Investigator:

Professor Paul Shellard.
Director, Stephen Hawking Centre for Theoretical Cosmology
University of Cambridge

Paul Shellard is Professor of Cosmology at the Department of Applied Mathematics and Theoretical Physics in the University of Cambridge. He studied at the Universities of Sydney and Cambridge and he also held a postdoctoral fellowship at MIT in the Department of Physics. His primary research interest is to advance the confrontation between theories of the early universe and empirical cosmology, focusing especially on the primordial fluctuations which seeded the formation of large-scale structure. He is a Planck Scientist within the Planck satellite consortium which in 2013-15 provided and analysed high resolution maps of the cosmic microwave background (CMB). Since 1997 he has coordinated oversight of COSMOS, the UK National Cosmology Supercomputer, which has been an essential tool for quantitative progress in theoretical cosmology.

Description:

The COSMOS Intel® Parallel Computing Center is part of the Stephen Hawking Centre for Theoretical Cosmology (CTC) at the University of Cambridge. The Centre is also home to the COSMOS supercomputer, with which Intel has a long history of collaboration since 2003. The latest joint venture is centred on the Xeon Phi, and work underlying the present IPCC began well before these centres came into existence. We have a particular interest in complex workflows with simulation and data analysis pipelines which are best-suited to heterogeneous platforms.

The scientific interests of the COSMOS Intel® PCC align with those of the CTC in the following areas:

  1. Developing data analysis techniques and software pipelines to extract information from cosmological data sets, including the cosmic microwave background and galaxy surveys.
  2. Characterising the fundamental nature of inflation and the primordial perturbations from which the structure in our universe formed.
  3. Understanding the extreme universe, including violent phase transitions in the very early universe and the mergers of black holes with the gravitational waves they generate.

Our goal is to ensure that this important science program makes full use of new and emerging parallel computing platforms, especially the Intel® Xeon® Phi™ Coprocessor and its future incarnations. To this effect, we have had considerable success porting and optimising our "home-grown" scientific codes both to the many-core and the multi-core platform, taking advantage of the fact that almost all improvements in performance on the many-core side translate to improvements on the multi-core as well. With our Intel® Xeon® Phi™ Coprocessors embedded within a large shared-memory SGI system, we are able to pursue a wide range of programming paradigms to accelerate development and optimisation of our codes, notably native Intel® Xeon® Phi™ Coprocessor, native Intel® Xeon®, and Intel® Xeon® Phi™ Coprocessor offload.

Related websites:

http://www.cosmos.damtp.cam.ac.uk/

Natural Interaction with Intuitive Computer Control


Transforming the UI—Designing Tomorrow’s Interface Today (5 of 5):

Natural Interaction with Intuitive Computer Control

By Dominic Milano

Human-like senses on devices with Intel® RealSense™ technology are broadening perceptual computing. Explore apps that will let you navigate through virtual worlds with gestures and words.

Advances in perceptual computing are bringing human-like senses to devices. Able to see and hear the world around them, devices equipped with Intel® RealSense™ technology recognize hand and finger gestures, analyze facial expressions, understand and synthesize spoken words, and more. As a result, people are enjoying more natural interactivity—pushing, pulling, lifting, and grabbing virtual objects; starting, stopping, and pausing content creation and entertainment apps; and navigating virtual worlds without touching their devices.

In this article, the last in a five-part series, the Virtual Air Guitar Company (VAGC) and Intel software developers share insights on using Intel RealSense technology to move human-computer interaction (HCI) beyond the keyboard and mouse.

Virtual Air Guitar Company

Founded in 2006 by computer vision and virtual reality researchers based in Espoo, Finland, VAGC specializes in creating unique motion games and applications that utilize full-body actions and precise fingertip control. The indie studio’s portfolio includes console and PC games as well as Windows* and Android* perceptual computing applications.

For Aki Kanerva, lead designer and founder, the natural interaction (NI) models enabled by Intel RealSense technology make computer use easier and more enjoyable. Although most of us learned to use computers that relied on traditional keyboard and mouse or trackpad-based interaction, touch screens opened a world of new possibilities. “Give a touch-screen device to a two-year-old and they’ll be able to use it almost immediately,” Kanerva said. “There’s no learning curve to speak of, which makes devices more fun to use. And even for serious tasks, you can work more intuitively.”

Whether you’re using a mouse, trackpad, or touch screen, the interaction takes place in two dimensions. Natural interaction based on gestures or voice commands provides a greater degree of freedom. When gesturing with your hands, you’re in 3D space, which gives six degrees of freedom and enables new use cases. “Let’s say you’re cooking or using a public terminal and don’t want to touch the device. Gesture and voice commands provide touch-free interaction,” said Kanerva.

Natural interaction can also be used to perform complex tasks, eliminating the need to memorize keyboard shortcuts. Or as Kanerva put it, “Natural interaction enables complex controls without being complicated.”

Simple Isn’t Easy

Kanerva spent years researching human-computer interaction models and knows from experience how difficult it can be to distill complex tasks and make them appear easy to do. The mouse and keyboard act as middlemen and translate user intention. In a game, for example, moving a mouse left or right could translate to directing a character’s movement. “It takes practice to correctly make that motion.” Kanerva said. “With natural interaction, you’re removing the middlemen. With five or six degrees of freedom, tracking the moves you make with your hands—up, down, left, right, forward, back—translates to character motion on the screen in a more intuitive manner.”

“A ‘natural interface’ can combine all six degrees of movement in such a way that the user never has to think about individual commands,” he continued. Creating a UI that responds to such a complex set of user actions is no simple feat. And Kanerva advises against designing UIs capable of responding to every possible motion.

The key is to create experiences in which the interface lets the users know what they can do before they do it. “Limit interaction to only the things that are necessary for your application, and focus your coding efforts on making those things feel responsive.”

Working with Intel, VAGC is developing an app that will let users fly a virtual helicopter through any location (Figure 1) using hand gestures tracked with an Intel® RealSense™ 3D camera. “By design, our helicopter cannot do barrel rolls—rotate around its axis. That could be fun, but adding a control for doing such a roll would have overlapped with the other controls, so we didn’t include that capability.”

Figure 1: The Helicopter’s Point of View

Anatomy of the Gesture-controlled App

The Web-based app supports Microsoft Internet Explorer, Google Chrome, and Mozilla Firefox on Windows*. It uses a browser extension to provide localization services, was written in JavaScript and HTML, and runs in the browser. Its code is reliable and lightweight.

“This project has a lot of moving parts,” Kanerva explained. “The Web itself is a moving target, especially when it comes to extensions because we needed to write a different extension for each supported browser.”

Intel supplied VAGC with invaluable feedback and VAGC reciprocated by demonstrating real use cases that helped the Intel team refine the Intel RealSense SDK. “One of the early challenges with the SDK was that it had difficulty seeing a flat hand held vertically,” Kanerva explained (see Figure 2). “You cannot prevent a user from doing that.” Thanks to VAGC’s input, the latest version of the SDK supports that condition.

Figure 2: Camera view of vertical flat hand

Testing, Testing, Testing

Previous motion games by VAGC employed a proprietary automated test suite, but Kanerva prefers a different tack when testing gesture interaction. “With natural interaction, users have a knack for doing the unexpected, so we hire a usability testing firm.” Kanerva said that the best results come from shooting video of experienced and inexperienced users playing through the flight.

Tutorials, in spite of being expensive to produce, are an essential ingredient for NI projects. “Because the range of possible movements is so broad, you need very clear tutorials or animations that give users a starting point,” Kanerva said. “Put hand here. Hold in position 30cm (12 inches) from the screen. Now do this... You must be very specific.” Kanerva advises developers to plan tutorials early in the design process. “It’s tempting to leave tutorials to the last minute, but they are critical. Even if you need to change them several times throughout your development workflow, tutorials should be an integral part of your process.”

Lessons Learned

Summarizing his experience with the Intel RealSense SDK and Intel RealSense 3D camera, Kanerva offered these rules of thumb for developers:

  • Design applications that don’t lend themselves to—or aren’t even possible with— traditional input modalities.
  • Emulating traditional input is a recipe for disaster.
  • Design natural interactions specific to your use case.
  • For control devices in flight simulator designs, arrow keys aren’t as effective as joysticks, and gesture control is an effective alternative.
  • Try to be continuous and minimize control latency.
  • Don’t wait long or expect a gesture to be complete before translating motion into your experience.
  • Provide immediate feedback—user input should be reflected on screen at all times.
  • Remember not to fatigue your users.

For game developers, continuous feedback is key—games typically use button pushes to produce actions. Mapping traditional game input to natural interaction creates too much latency and requires a steep learning curve.

Regarding fatigue, Kanerva counsels that you design controls that don’t require a user to be still for a long time. “I like the term ‘motion control,’ because it’s a reminder that you want users to move around so they’re not getting tired. We don’t ask users to rotate their wrist at right angles. For example, to fly forward, they simply point the hand straight while keeping it relaxed. Avoid strain and remind users it’s their responsibility to take breaks.”

Natural Entertainment

“Natural interaction is great for entertainment apps,” Kanerva concluded. His company got its name from their first app, which let users play air-guitar chords and solos using nothing more than hand gestures. “Hollywood often depicts people using gestures to browse data. It looks cool and there’s a wow factor, but simplicity driven by natural interaction will make computing accessible to a wider user base.”

To that end, the idea of having Intel RealSense technology embedded in tablets, 2 in 1s, and all-in-one devices thrills Kanerva. “Ubiquitous access to natural interaction through Intel RealSense technology will be great for marketing and sales, but it will be priceless to developers.”

Driving Windows with Hand Gestures

Yinon Oshrat spent seven years at Omek Studio, the first developer to use the nascent Intel RealSense SDK, before the studio was acquired by Intel. As a member of the Intel RealSense software development team, Oshrat has been working on a standalone application called Intel® RealSense™ Navigator. “Intel RealSense Navigator controls the Windows UI, enhancing current touch-based experiences and enabling hand gesture-based interaction.” The app allows users to launch programs and scroll through selections using hand gestures (Figure 3).

Figure 3: Natural human-computer interaction

Intel RealSense Navigator relies on the Intel® RealSense™ 3D Camera to track hand gestures at distances up to 60 cm (24 inches). “Think of Navigator as a driver and a mouse,” Oshrat said. “It’s an active controller. To use it, you simply enable it for zooming, mouse simulation, and so on.”

Building Blocks for Developers

In creating Intel RealSense Navigator, the Intel team has been providing building blocks for the Intel RealSense SDK that will give developers the ability to implement the same experiences that Navigator enables in other standalone Windows applications.

Like other modules of the Intel RealSense SDK, the Intel RealSense Navigator module is available in both C and C++. JavaScript support is planned for a future release.

Describing Intel RealSense Navigator gesture support, Oshrat said, “Think of it as a language. Tap to select. Pinch to grab. Move left or right to scroll. Move hand forward or back to zoom in or out. A hand wave returns users to the Start Screen.” Intel RealSense Navigator ships with video tutorials that demonstrate how to make accurate hand gestures. “The videos are much more effective than written documentation.”

Inventing a Language

For more than a year, Oshrat and his colleagues designed and tested gestures by literally approaching people on the street and watching them react to a particular motion. “We worked in cycles, experimenting with what worked, what users’ expectations were versus what we thought they’d be. We discovered what felt natural and what didn’t.”

For example, people interpreted three “swipes” in a row as a “wave” and not a swipe. “That taught us to separate those gestures,” Oshrat said. It surprised him that defining effective gestures was so difficult. “It’s not programming. It’s working with user experiences.”

Lessons Learned

For developers interested in implementing gesture control in their applications using Intel RealSense Navigator, Oshrat offered this advice:

  • Utilize the gesture models and building blocks supplied with the SDK; they will save you time by jump-starting your project with ideas that have been carefully vetted.
  • Take advantage of Intel support channels such as user forums.

Like Aki Kanerva, Oshrat advises developers to think of gesture-based interaction as something completely different from mouse and keyboard-based interactions. “Don’t try to convert a mouse/keyboard experience to gestures. Movement takes place at a completely different speed when you’re gesturing versus using a mouse.”

Oshrat also noted that user behavior changes based on input modality. “We built a game where users had to run and jump to the side,” he explained. “On screen, a sign directed users to GO LEFT. When using the game controller, users noticed signs and instructions in the area surrounding the avatar. But when their body was used as a controller (gesture control), they were so focused on the avatar that they ignored everything else in the game. They even ran head-on into a sign despite the fact that instructions filled 90 percent of the screen!”

Current and Future Use Cases

For Oshrat, one of the more practical applications of Intel RealSense Navigator that he saw involved controlling a PowerPoint* presentation without a mouse or physical controller. “You can just wave your hand and change slides back and forth.” It’s also easy—and handy—to control a computer screen when you’re on the phone.

Asked whether gestures could be useful in controlling a video-editing application, Oshrat said, “Yes! That’s a holy grail. Many people want to do that. We’ve been working with games and gesture control of other applications for eight years. We’re heading in the right direction for video editing and similar experiences.”

What’s next? Oshrat envisions a future in which users no longer have to learn a new gesture vocabulary. “Our devices will know us better and understand more about what we want. Gestures, voice commands, facial analysis, 3D capture and share... all of the capabilities enabled by Intel RealSense technology will open even more exciting possibilities.”

Resources

Explore Intel RealSense technology further, learn about Intel RealSense SDK for Windows, and download a Developer Kit here.

Is your project ready to demonstrate? Join the Intel® Software Innovator Program. It supports developers who have forward-looking projects and provides speakership and demo opportunities.

Read part 1, part 2, part 3, and part 4 of this “Transforming the UI—Designing Tomorrow’s Interface Today” series.


Easy SIMD through Wrappers


By Michael Kopietz

Download PDF

1. Introduction

This article aims to change your thinking on how SIMD programming can be applied in your code. By thinking of SIMD lanes as functioning similarly to CPU threads, you will gain new insights and be able to apply SIMD more often in your code.

Intel has been shipping CPUs with SIMD support for about twice as long as it has been shipping multi-core CPUs, yet threading is far more established in software development. One reason is the abundance of tutorials that introduce threading in a simple “run this entry function n times” manner, skipping all the possible traps. SIMD tutorials, on the other hand, tend to focus on achieving the final 10% speedup, which requires you to double the size of your code. When such tutorials provide example code, it is hard to absorb all the new information and at the same time come up with a simple and elegant way of using it. Showing a simple, useful way of using SIMD is therefore the topic of this paper.

First, the basic principle of SIMD code: alignment. Probably all SIMD hardware either demands or at least prefers some natural alignment, and explaining the basics could fill a paper of its own [1]. In general, unless you are running out of memory, it is important to allocate memory in a cache-friendly way. For Intel CPUs that means allocating memory on a 64 byte boundary, as shown in Code Snippet 1.

inline void* operator new(size_t size)
{
	return _mm_malloc(size, 64);
}

inline void* operator new[](size_t size)
{
	return _mm_malloc(size, 64);
}

inline void operator delete(void *mem)
{
	_mm_free(mem);
}

inline void operator delete[](void *mem)
{
	_mm_free(mem);
}

Code Snippet 1: Allocation functions that respect cache-friendly 64 byte boundaries

2. The basic idea

The way to begin is simple: assume every lane of a SIMD register executes as a thread. In the case of Intel® Streaming SIMD Extensions (Intel® SSE), you have 4 threads/lanes, with Intel® Advanced Vector Extensions (Intel® AVX) 8 threads/lanes, and 16 threads/lanes on Intel® Xeon Phi™ coprocessors.

To have a 'drop in' solution, the first step is to implement classes that behave mostly like primitive data types. Wrap 'int', 'float' etc. and use those wrappers as the starting point for every SIMD implementation. For the Intel SSE version, replace the float member with __m128, int and unsigned int with __m128i and implement operators using Intel SSE intrinsics or Intel AVX intrinsics as in Code Snippet 2.

// SSE 128-bit
inline	DRealF	operator+(DRealF R)const{return DRealF(_mm_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(DRealF R)const{return DRealF(_mm_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(DRealF R)const{return DRealF(_mm_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(DRealF R)const{return DRealF(_mm_div_ps(m_V, R.m_V));}

// AVX 256-bit
inline	DRealF	operator+(const DRealF& R)const{return DRealF(_mm256_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(const DRealF& R)const{return DRealF(_mm256_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(const DRealF& R)const{return DRealF(_mm256_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(const DRealF& R)const{return DRealF(_mm256_div_ps(m_V, R.m_V));}

Code Snippet 2: Overloaded arithmetic operators for SIMD wrappers

3. Usage Example

Now let’s assume you're working on two HDR images, where every pixel is a float and you blend between both images.

void CrossFade(float* pOut,const float* pInA,const float* pInB,size_t PixelCount,float Factor)
{
	const DRealF BlendA(1.f - Factor);
	const DRealF BlendB(Factor);
	for(size_t i = 0; i < PixelCount; i += THREAD_COUNT)
		*(DRealF*)(pOut + i) = *(DRealF*)(pInA + i) * BlendA + *(DRealF*)(pInB + i) * BlendB;
}

Code Snippet 3: Blending function that works with both primitive data types and SIMD data

The executable generated from Code Snippet 3 runs natively on normal registers as well as on Intel SSE and Intel AVX. It's not the vanilla way you'd usually write it, but every C++ programmer should still be able to read and understand it. Let's walk through it and see whether it behaves the way you expect. The first and second lines of the implementation initialize the blend factors of our linear interpolation by replicating the parameter across whatever width your SIMD register has.

The third line is nearly a normal loop. The only special part is “THREAD_COUNT”. It's 1 for normal registers, 4 for Intel SSE and 8 for Intel AVX, representing the count of lanes of the register, which in our case resembles threads.

The fourth line indexes into the arrays; both input pixels are scaled by the blend factors and summed. Depending on your preference, you might want to use some temporaries, but there is no intrinsic you need to look up and no per-platform implementation.
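To make the pattern concrete, here is a minimal scalar build of such a wrapper. This is only a sketch under the article's naming, reduced to the two operators CrossFade needs, with THREAD_COUNT set to 1 so that the identical function body compiles for plain registers:

```cpp
#include <cstddef>

// Scalar fallback sketch: THREAD_COUNT is 1 here; the SSE and AVX builds
// of the same wrapper would define 4 or 8 instead.
#define THREAD_COUNT 1

struct DRealF
{
	float m_V;
	DRealF() {}
	explicit DRealF(float V) : m_V(V) {}
	DRealF operator+(DRealF R) const { return DRealF(m_V + R.m_V); }
	DRealF operator*(DRealF R) const { return DRealF(m_V * R.m_V); }
};

// Identical body to Code Snippet 3: the wrapper type and THREAD_COUNT
// are the only things that change between the scalar and SIMD builds.
void CrossFade(float* pOut, const float* pInA, const float* pInB,
               size_t PixelCount, float Factor)
{
	const DRealF BlendA(1.f - Factor);
	const DRealF BlendB(Factor);
	for(size_t i = 0; i < PixelCount; i += THREAD_COUNT)
		*(DRealF*)(pOut + i) = *(DRealF*)(pInA + i) * BlendA +
		                       *(DRealF*)(pInB + i) * BlendB;
}
```

Swapping in the SSE or AVX wrapper from Code Snippet 2 recompiles the same source to process 4 or 8 pixels per iteration.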

4. Drop in

Now it's time to prove that it actually works. Let's take a vanilla MD5 hash implementation and use all of your available CPU power to find the pre-image.  To achieve that, we'll replace the primitive types with our SIMD types. MD5 is running several “rounds” that apply various simple bit operations on unsigned integers as demonstrated in Code Snippet 4.

#define LEFTROTATE(x, c) (((x) << (c)) | ((x) >> (32 - (c))))
#define BLEND(a, b, x) SelectBit(a, b, x)

template<int r>
inline DRealU Step1(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(d, c, b);
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step2(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(c, b, d);
	return b + LEFTROTATE((a + f + k + w),r);
}

template<int r>
inline DRealU Step3(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = b ^ c ^ d;
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step4(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = c ^ (b | (~d));
	return b + LEFTROTATE((a + f + k + w), r);
}

Code Snippet 4: MD5 step functions for SIMD wrappers

Besides the type naming, there is really just one change that could look a little bit like magic — the “SelectBit”. If a bit of x is set, the respective bit of b is returned; otherwise, the respective bit of a; in other words, a blend. The main MD5 hash function is shown in Code Snippet 5.
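Before moving on, SelectBit deserves a word: in a scalar build it reduces to a plain bitwise blend. A sketch follows; the exact signature in the article's wrappers is not shown, so this one is an assumption:

```cpp
#include <cstdint>

// Bitwise blend: for each bit, take b where the mask bit is set, else a.
// SIMD versions would use and/andnot/or intrinsics to the same effect.
inline uint32_t SelectBit(uint32_t a, uint32_t b, uint32_t x)
{
	return (a & ~x) | (b & x);
}
```

Because it operates per bit, the same expression works unchanged whether the mask is a single bit, a full ~0/0 word, or a whole SIMD register of comparison results.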

inline void MD5(const uint8_t* pMSG,DRealU& h0,DRealU& h1,DRealU& h2,DRealU& h3,uint32_t Offset)
{
	const DRealU w0  =	Offset(DRealU(*reinterpret_cast<const uint32_t*>(pMSG + 0 * 4) + Offset));
	const DRealU w1  =	*reinterpret_cast<const uint32_t*>(pMSG + 1 * 4);
	const DRealU w2  =	*reinterpret_cast<const uint32_t*>(pMSG + 2 * 4);
	const DRealU w3  =	*reinterpret_cast<const uint32_t*>(pMSG + 3 * 4);
	const DRealU w4  =	*reinterpret_cast<const uint32_t*>(pMSG + 4 * 4);
	const DRealU w5  =	*reinterpret_cast<const uint32_t*>(pMSG + 5 * 4);
	const DRealU w6  =	*reinterpret_cast<const uint32_t*>(pMSG + 6 * 4);
	const DRealU w7  =	*reinterpret_cast<const uint32_t*>(pMSG + 7 * 4);
	const DRealU w8  =	*reinterpret_cast<const uint32_t*>(pMSG + 8 * 4);
	const DRealU w9  =	*reinterpret_cast<const uint32_t*>(pMSG + 9 * 4);
	const DRealU w10 =	*reinterpret_cast<const uint32_t*>(pMSG + 10 * 4);
	const DRealU w11 =	*reinterpret_cast<const uint32_t*>(pMSG + 11 * 4);
	const DRealU w12 =	*reinterpret_cast<const uint32_t*>(pMSG + 12 * 4);
	const DRealU w13 =	*reinterpret_cast<const uint32_t*>(pMSG + 13 * 4);
	const DRealU w14 =	*reinterpret_cast<const uint32_t*>(pMSG + 14 * 4);
	const DRealU w15 =	*reinterpret_cast<const uint32_t*>(pMSG + 15 * 4);

	DRealU a = h0;
	DRealU b = h1;
	DRealU c = h2;
	DRealU d = h3;

	a = Step1< 7>(a, b, c, d, k0, w0);
	d = Step1<12>(d, a, b, c, k1, w1);
	.
	.
	.
	d = Step4<10>(d, a, b, c, k61, w11);
	c = Step4<15>(c, d, a, b, k62, w2);
	b = Step4<21>(b, c, d, a, k63, w9);

	h0 += a;
	h1 += b;
	h2 += c;
	h3 += d;
}

Code Snippet 5: The main MD5 function

The majority of the code again reads like a normal C function, except that the first lines prepare the data by replicating the passed parameters into our SIMD registers. In this case we load the SIMD registers with the data we want to hash. One specialty is the “Offset” call: since we don't want every SIMD lane to do exactly the same work, this call offsets the register by the lane index. It's like a thread-id you would add. See Code Snippet 6 for reference.

Offset(Register)
{
	for(i = 0; i < THREAD_COUNT; i++)
		Register[i] += i;
}

Code Snippet 6: Offset is a utility function for dealing with different register widths

That means, our first element that we want to hash is not [0, 0, 0, 0] for Intel SSE or [0, 0, 0, 0, 0, 0, 0, 0] for Intel AVX. Instead the first element is [0, 1, 2, 3] and [0, 1, 2, 3, 4, 5, 6, 7], respectively. This replicates the effect of running the function in parallel by 4 or 8 threads/cores, but in case of SIMD, instruction parallel.
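The same semantics can be emulated with a plain scalar array, which also makes a handy reference when testing the SIMD builds. This is a sketch; Splat is a hypothetical helper name for the replicating constructor the wrappers provide:

```cpp
#include <cstdint>

// A 4-lane "register" emulated with a plain array, to illustrate how
// Offset() turns one base value into distinct per-lane candidates.
const int THREAD_COUNT = 4;

struct DRealU { uint32_t Lane[THREAD_COUNT]; };

// Replicate one value into every lane, as the wrapper constructors do.
DRealU Splat(uint32_t V)
{
	DRealU R;
	for(int i = 0; i < THREAD_COUNT; i++) R.Lane[i] = V;
	return R;
}

// Add the lane index to each lane, like a per-thread id.
DRealU Offset(DRealU Register)
{
	for(int i = 0; i < THREAD_COUNT; i++) Register.Lane[i] += i;
	return Register;
}
```

Offset(Splat(0)) yields [0, 1, 2, 3], exactly the first batch of candidates described above.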

We can see the results for our 10 minutes of hard work to get this function SIMD-ified in Table 1.

Table 1: MD5 performance with primitive and SIMD types

Type          Time        Speedup
x86 integer   379.389s    1.0x
SSE4          108.108s    3.5x
AVX2          51.490s     7.4x

5. Beyond Simple SIMD-threads

The results are satisfying, though not linearly scaling, as there is always some non-threaded part (you can easily identify it in the provided source code). But we're not aiming for the last 10% at twice the work. As a programmer, you'd probably prefer other quick solutions that maximize the gain. One consideration always arises: would it be worthwhile to unroll the loop?

MD5 hashing frequently depends on the result of previous operations, which is not friendly to CPU pipelines, but if you unroll you could become register bound. Our wrappers help us evaluate that easily. Unrolling is the software version of hyper-threading: we emulate twice the number of threads by repeating the execution of every operation on twice as much data as there are SIMD lanes. To do so, create a duplicate type and implement the unrolling inside it by duplicating every operation in our basic operators, as in Code Snippet 7.

struct __m1282
{
	__m128		m_V0;
	__m128		m_V1;
	inline		__m1282(){}
	inline		__m1282(__m128 C0, __m128 C1):m_V0(C0), m_V1(C1){}
};

inline	DRealF	operator+(DRealF R)const
	{return __m1282(_mm_add_ps(m_V.m_V0, R.m_V.m_V0),_mm_add_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator-(DRealF R)const
	{return __m1282(_mm_sub_ps(m_V.m_V0, R.m_V.m_V0),_mm_sub_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator*(DRealF R)const
	{return __m1282(_mm_mul_ps(m_V.m_V0, R.m_V.m_V0),_mm_mul_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator/(DRealF R)const
	{return __m1282(_mm_div_ps(m_V.m_V0, R.m_V.m_V0),_mm_div_ps(m_V.m_V1, R.m_V.m_V1));}

Code Snippet 7: These operators are re-implemented to work with two SSE registers at the same time

That's it, really, now we can again run the timings of the MD5 hash function.

Table 2: MD5 performance with loop unrolling SIMD types

Type          Time        Speedup
x86 integer   379.389s    1.0x
SSE4          108.108s    3.5x
SSE4 x2       75.659s     4.8x
AVX2          51.490s     7.4x
AVX2 x2       36.014s     10.5x

The data in Table 2 shows that unrolling is clearly worthwhile. We achieve speedups beyond the SIMD lane count, probably because the x86 integer version was already stalling the pipeline with operation dependencies.

6. More complex SIMD-threads

So far our examples were simple in the sense that the code was the usual candidate to be vectorized by hand. There is nothing complex besides a lot of compute-demanding operations. But how would we deal with more complex scenarios like branching?

The solution is again quite simple and widely used: speculative calculation and masking. If you've worked with shader or compute languages, you'll likely have encountered this before. Let's take the basic branch in Code Snippet 8 and rewrite it with the ?: operator as in Code Snippet 9.

int a = 0;
if(i % 2 == 1)
	a = 1;
else
	a = 3;

Code Snippet 8: Calculates the mask using if-else

int a = (i % 2) ? 1 : 3;

Code Snippet 9: Calculates the mask with ternary operator ?:

If you recall our bit-select operator of Code Snippet 4, we can also use it to achieve the same with only bit operations in Code Snippet 10.

int Mask = (i % 2) ? ~0 : 0;
int a = SelectBit(3, 1, Mask);

Code Snippet 10: Use of SelectBit prepares for SIMD registers as data

Now, that might seem pointless if we still need a ?: operator to create the mask, and the compare no longer results in true or false but in all bits set or cleared. Yet this is not a problem, because all bits set or cleared is exactly what the comparison instructions of Intel SSE and Intel AVX return.
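This can be checked directly with the intrinsics: the SSE integer compare fills each lane with all ones where the comparison holds and all zeros elsewhere. A minimal sketch, where LaneMask is a hypothetical helper and an SSE2-capable x86 target is assumed:

```cpp
#include <emmintrin.h>
#include <cstdint>

// _mm_cmpeq_epi32 produces a per-lane bit mask: 0xFFFFFFFF where the
// lanes are equal, 0 where they are not. Extract one lane to inspect it.
uint32_t LaneMask(int Index, int A0, int A1, int A2, int A3, int B)
{
	__m128i A   = _mm_set_epi32(A3, A2, A1, A0);   // lane 0 holds A0
	__m128i Cmp = _mm_cmpeq_epi32(A, _mm_set1_epi32(B));
	uint32_t Out[4];
	_mm_storeu_si128(reinterpret_cast<__m128i*>(Out), Cmp);
	return Out[Index];
}
```

The all-ones/all-zeros lanes are exactly what SelectBit consumes as its mask.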

Of course, instead of assigning just 3 or 1, you can call functions and select the returned result you want. That might improve performance even in non-vectorized code, as you avoid branching and the CPU never suffers from branch mispredictions; on the other hand, the more complex the functions you call, the more costly it is to always execute both. Even in vectorized code, we can avoid executing unneeded long branches by checking for the special cases where all elements of our SIMD register have the same comparison result, as demonstrated in Code Snippet 11.

int Mask = (i % 2) ? ~0 : 0;
int a = 0;
if(All(Mask))
	a = Function1();
else
if(None(Mask))
	a = Function3();
else
	a = BitSelect(Function3(), Function1(), Mask);

Code Snippet 11: Shows an optimized branchless selection between two functions

This detects the special cases where all of the elements are 'true' or all are 'false'. Those cases run on SIMD the same way as on x86; only the last 'else' case, where the execution flow diverges, needs the bit select.
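The All and None helpers used above are not spelled out in the article; for the SSE case they can be sketched with a move-mask instruction that gathers each lane's sign bit into an integer:

```cpp
#include <emmintrin.h>

// Gather each lane's top bit into a 4-bit integer: 0xF means all lanes
// set, 0 means none. Since SSE compares yield all-ones or all-zero lanes,
// the sign bit alone is enough to test the whole mask.
inline bool All(__m128i Mask)
{
	return _mm_movemask_ps(_mm_castsi128_ps(Mask)) == 0xF;
}

inline bool None(__m128i Mask)
{
	return _mm_movemask_ps(_mm_castsi128_ps(Mask)) == 0x0;
}
```

The AVX build would use _mm256_movemask_ps and compare against 0xFF instead.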

If Function1 or Function3 modify any data, you'd need to pass the mask down the call and explicitly bit select the modifications just like we've done here. For a drop-in solution, that's a bit of work, but it still results in code that’s readable by most programmers.

7. Complex example

Let's again take some source code and drop in our SIMD types. A particularly interesting case is raytracing of distance fields. For this, we'll use the scene from Iñigo Quilez's demo [2] with his friendly permission, as shown in Figure 1.

Figure 1: Test scene from Iñigo Quilez's raycasting demo

The “SIMD threading” is placed at the spot where you'd usually add threading. Every thread handles a pixel, traversing the world until it hits something; subsequently a little bit of shading is applied and the pixel is converted to RGBA and written to the frame buffer.

The scene traversal is done iteratively. Every ray takes an unpredictable number of steps until a hit is recognized. For example, a close-up wall is reached after a few steps, while some rays might reach the maximum trace distance without hitting anything at all. Our main loop in Code Snippet 12 handles both cases using the bit-select method discussed in the previous section.

DRealU LoopMask(RTrue);
for(; a < 128; a++)
{
	DRealF Dist   = SceneDist(O.x, O.y, O.z, C);
	DRealU DistU  = *reinterpret_cast<DRealU*>(&Dist) & DMask(LoopMask);
	Dist          = *reinterpret_cast<DRealF*>(&DistU);
	TotalDist     = TotalDist + Dist;
	O             += D * Dist;
	LoopMask      = LoopMask && Dist > MinDist && TotalDist < MaxDist;
	if(DNone(LoopMask))
		break;
}

Code Snippet 12: Raycasting with SIMD types

The LoopMask variable marks whether a ray is active (~0) or done (0). At the end of the loop, we test whether no ray is active anymore, and in that case we break out of the loop.

In the line above it, we evaluate our conditions for the rays: whether we're close enough to an object to call it a hit, or whether the ray is already beyond the maximum distance we want to trace. We logically AND this with the previous result, as the ray might already have been terminated in one of the earlier iterations.

“SceneDist” is the evaluation function for our tracing. It runs for all SIMD lanes and is the heavyweight function that returns the current distance to the closest object. The next line sets the distance elements to 0 for rays that are no longer active; each ray then steps this amount further for the next iteration.

The original “SceneDist” had some assembler optimizations and material handling that we don't need for our test, so the function is reduced to the minimum needed for a complex example. Inside are still some if-cases, handled exactly the same way as before. Overall, “SceneDist” is quite large and rather complex, and rewriting it by hand for every SIMD platform again and again would take a while. You might need to convert it all in one go, and typos might generate completely wrong results. Even if it works, you'll end up with only a few functions that you really understand, and the maintenance cost is much higher. Doing it by hand should be the last resort. Compared to that, our changes are relatively minor: the code stays easy to modify, and you can extend the visual appearance without worrying about optimizing it all over again or being the only maintainer who understands the code, just as if you had added real threads.

But we've done that work to see results, so let’s check the timings in Table 3.

Table 3: Raytracing performance with primitive and SIMD types, including loop unrolling types

Type      FPS        Speedup
x86       0.992 FPS  1.0x
SSE4      3.744 FPS  3.8x
SSE4 x2   3.282 FPS  3.3x
AVX2      6.960 FPS  7.0x
AVX2 x2   5.947 FPS  6.0x

You can clearly see that the speedup does not scale linearly with the element count, which is mainly because of divergence: some rays might need 10 times more iterations than others.

8. Why not let the compiler do it?

Compilers nowadays can vectorize to some degree, but the highest priority for the generated code is to deliver correct results; you would not use binaries that are 100 times faster but deliver wrong results even 1% of the time. Some assumptions we make, such as that the data will be aligned for SIMD and that we allocate enough padding to not overwrite consecutive allocations, are out of scope for the compiler. You can get annotations from the Intel compiler about all the opportunities it had to skip because of assumptions it could not guarantee, and you can try to rearrange code and make promises to the compiler so it will generate the vectorized version. But that's work you have to do every time you modify your code, and in more complex cases like branching, you can only guess whether the result will be branchless bit selection or serialized code.

The compiler also has no inside knowledge of what you intend to create. You know whether threads will be diverging or coherent and can implement a branched or bit-selecting solution accordingly. You see the point of attack, the loop that would make the most sense to convert to SIMD, whereas the compiler can only guess whether it will run 10 times or 1 million times.

Relying on the compiler might be a win in one place and a pain in another. It's good to have an alternative solution you can rely on, just like your hand-placed thread entries.

9. Real threading?

Yes, real threading is useful, and SIMD-threads are not a replacement; both are orthogonal. SIMD-threads are still not as simple to get running as real threading is, but you'll also run into less trouble with synchronization and rare bugs. The really nice advantage is that every core Intel sells can run your SIMD-thread version with all the 'threads'. A dual-core CPU will run 4 or 8 times faster, just like your quad-socket 15-core Haswell-EP. Some results for our benchmarks in combination with threading are summarized in Table 4 through Table 7.1

Table 4: MD5 Performance on Intel® Core™ i7 4770K with both SIMD and threading

Threads  Type         Time      Speedup
1T       x86 integer  311.704s  1.00x
8T       x86 integer  47.032s   6.63x
1T       SSE4         90.601s   3.44x
8T       SSE4         14.965s   20.83x
1T       SSE4 x2      62.225s   5.01x
8T       SSE4 x2      12.203s   25.54x
1T       AVX2         42.071s   7.41x
8T       AVX2         6.474s    48.15x
1T       AVX2 x2      29.612s   10.53x
8T       AVX2 x2      5.616s    55.50x

Table 5: Raytracing Performance on Intel® Core™ i7 4770K with both SIMD and threading

Threads  Type         FPS         Speedup
1T       x86 integer  1.202 FPS   1.00x
8T       x86 integer  6.019 FPS   5.01x
1T       SSE4         4.674 FPS   3.89x
8T       SSE4         23.298 FPS  19.38x
1T       SSE4 x2      4.053 FPS   3.37x
8T       SSE4 x2      20.537 FPS  17.09x
1T       AVX2         8.646 FPS   4.70x
8T       AVX2         42.444 FPS  35.31x
1T       AVX2 x2      7.291 FPS   6.07x
8T       AVX2 x2      36.776 FPS  30.60x

Table 6: MD5 Performance on Intel® Core™ i7 5960X with both SIMD and threading

Threads  Type         Time      Speedup
1T       x86 integer  379.389s  1.00x
16T      x86 integer  28.499s   13.34x
1T       SSE4         108.108s  3.51x
16T      SSE4         9.194s    41.26x
1T       SSE4 x2      75.694s   5.01x
16T      SSE4 x2      7.381s    51.40x
1T       AVX2         51.490s   3.37x
16T      AVX2         3.965s    95.68x
1T       AVX2 x2      36.015s   10.53x
16T      AVX2 x2      3.387s    112.01x

Table 7: Raytracing Performance on Intel® Core™ i7 5960X with both SIMD and threading

Threads  Type         FPS         Speedup
1T       x86 integer  0.992 FPS   1.00x
16T      x86 integer  6.813 FPS   6.87x
1T       SSE4         3.744 FPS   3.774x
16T      SSE4         37.927 FPS  38.23x
1T       SSE4 x2      3.282 FPS   3.31x
16T      SSE4 x2      33.770 FPS  34.04x
1T       AVX2         6.960 FPS   7.02x
16T      AVX2         70.545 FPS  71.11x
1T       AVX2 x2      5.947 FPS   6.00x
16T      AVX2 x2      59.252 FPS  59.76x

1Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As you can see, the threading results vary depending on the CPU, while the SIMD-thread results scale similarly on both. It is striking that you can reach high two-digit speedup factors if you combine both ideas. It makes sense to go for the 8x speedup on a dual core, and it makes just as much sense to go for an additional 8x speedup on highly expensive server hardware.

Join me, SIMD-ify your code!

About the Author

Michael Kopietz is Render Architect at Crytek's R&D, where he leads a team of engineers developing the rendering of CryEngine® and also guides students during their theses. He has worked, among other things, on the cross-platform rendering architecture, software rendering, and highly responsive servers, always with high-performance and reusable code in mind. Before that, he worked on ship-battle and soccer simulation game rendering. Having his roots in assembler programming on the earliest home consoles, he still wants to make every cycle count.

Code License

All code in this article is © 2014 Crytek GmbH, and released under the https://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement license. All rights reserved.

References

[1] Memory Management for Optimal Performance on Intel® Xeon Phi™ Coprocessor: Alignment and Prefetching https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and

[2] Rendering Worlds with Two Triangles by Iñigo Quilez http://www.iquilezles.org/www/material/nvscene2008/nvscene2008.htm

Analyzing Media Workloads Using Intel® Integrated Native Developer Experience (Intel® INDE)


Intel® Integrated Native Developer Experience (Intel® INDE) is a cross-architecture productivity suite that provides developers with the tools needed to analyze media workloads. If you need to verify that the workload does in fact take advantage of Intel hardware as intended, the System Analyzer tool (included with Intel INDE) can provide that insight. For the purpose of this series, we will use Intel INDE in a Windows* environment.

The Intel INDE starter edition can be downloaded for free via the following link: https://registrationcenter.intel.com/RegCenter/ComForm.aspx?ProductID=2329

For more information regarding Intel INDE go to: https://software.intel.com/en-us/intel-inde

In this tutorial, we will use System Analyzer version 2014 R4 14.4.238908. Note that future versions might introduce updates that will cause specific parts of this tutorial to deviate.

With the System Analyzer tool, you can display real-time statistics. You also have the option to export these statistics to a CSV file to log the information.

To start, you must first launch the "Graphics Monitor". It will appear in your task bar. Right-clicking this icon will bring up a set of options. Select "System Analyzer". Once it opens, you will be prompted with the image below.

Make sure you connect to "This Machine". Next, select "System View" so that you see the image below.

For the scope of this tutorial, we can minimize every metric except "Media". Under "Media" you should see the following:

EU Engine Usage
GHAL3D
GPU Overall Usage
MFX Decode Usage
MFX Encode Usage
MFX Engine Usage
OpenCL/MDF Usage
VPP DXVA1
VPP DXVA2
VPPDXVAHD

Dragging any of the above metrics into the right panes will allow you to see its real-time usage. To learn more about each metric, go to: https://software.intel.com/sites/products/documentation/gpa/14.4/win/index.htm

And then expand "Analyzing Windows* OS Graphics Applications" -> "Performance Metrics Reference" -> "Media Metrics".

In the example below, I am transcoding a video file. In this context, transcoding converts my video file from one format to another. As you can see, this operation is being hardware accelerated by the media fixed function (MFX) hardware on my Intel® Core™ processor-based system. The decoding and encoding metrics are showing 27.3% and 34.5% usage respectively.

The System Analyzer tool also gives you the option to export the real-time data to a file. When the "CSV" icon is activated, it glows red. During this time, all data from active metrics is exported to a file called "System View_<date and time>.csv", located in the Documents folder under GPA_2014_R4.

This is very helpful if you would like to see the data over time or calculate averages.
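If you want to post-process the exported file yourself, computing a column average takes only a few lines. The sketch below (Node-style JavaScript) uses a hypothetical two-column layout; the actual GPA export has more columns, so check the header row of your own "System View_<date and time>.csv" and adjust the column name:

```javascript
// Average one column of a GPA-style CSV export.
// The column name and layout here are illustrative, not the exact GPA format.
function averageColumn(csvText, columnName) {
  const lines = csvText.trim().split(/\r?\n/);
  const header = lines[0].split(",");
  const col = header.indexOf(columnName);
  if (col < 0) throw new Error("column not found: " + columnName);
  const values = lines.slice(1).map(line => parseFloat(line.split(",")[col]));
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical sample matching the illustrative layout:
const sampleCsv = "Time,MFX Decode Usage\n0,25.0\n1,30.0\n2,27.5";
console.log(averageColumn(sampleCsv, "MFX Decode Usage")); // → 27.5
```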

Conclusion

In summary, you have been exposed to one of the many capabilities the Intel INDE tool suite provides. With the System Analyzer, you can monitor applications in real time to verify that hardware-accelerated features are actually being used. This allows you to test and validate apps more effectively.

In the next article, we will take a look at a different tool, Intel® VTune™ Amplifier, which provides metrics not covered by the System Analyzer.

Other Related Articles

https://software.intel.com/en-us/blogs/2014/06/16/getting-started-with-intel-inde

https://software.intel.com/en-us/videos/intel-inde-tools-for-developers

https://software.intel.com/en-us/articles/intel-inde-2015-release-notes

About the Author

Leland Martin is a technical marketing engineer in the Software Solutions Group. He works in the Developer Relations Division to ultimately help enable developers to create applications that are optimized to take advantage of Intel hardware.

Intel® Parallel Computing Center at the University of Chicago


Principal Investigators:

Mark F. Adams received his Ph.D. in Civil Engineering in 1998. He works in the Scalable Solvers Group at Lawrence Berkeley National Laboratory, and as an adjunct research scientist in the Applied Physics and Applied Mathematics Department at Columbia University.

 

Matthew G. Knepley received his B.S. in Physics from Case Western Reserve University in 1994, an M.S. in Computer Science from the University of Minnesota in 1996, and a Ph.D. in Computer Science from Purdue University in 2000. In 2009, he joined the Computation Institute as a Senior Research Associate.

 

Jed Brown received his doctor of science degree from ETH, Zurich, in 2011. He is an assistant computational mathematician at Argonne National Laboratory and is an adjunct assistant professor at the University of Colorado Boulder.

 

Description:

PETSc will provide a new solver interface to structured adaptive mesh refinement (SAMR), enabling the efficient representation of multi-scale phenomena while maintaining the simplicity of structured grid kernel computations. We will use the most asymptotically efficient solvers for strongly nonlinear equations: matrix-free full approximation scheme (FAS) nonlinear full multigrid (FMG) methods. These formulations have a unique opportunity to leverage modern architectures to deliver fast, accurate, versatile solvers for complex, multi-physics applications. In particular, we are working closely with the premier open source, state-of-the-art massively parallel subsurface flow and reactive transport code, PFLOTRAN. This work is significant in that it radically changes algorithms, data access, and low-level computational organization in order to maximize performance and scalability on modern Intel architectures, and encapsulates this knowledge in the PETSc libraries for the broadest possible impact.

Related websites:
http://www.mcs.anl.gov/petsc/

 

Adding RESTful XMLHttpRequest To an HTML5 Web App Using Intel® XDK



 

Introduction

This article is a continuation of the prior document I published on creating an HTML5 Web App Using Intel® XDK. This document demonstrates how web developers can use JavaScript* and an XMLHttpRequest object to send an HTTP request to an Apache* Tomcat backend server. It outlines the steps developers need to take to receive and process the JSON response from the server. It also illustrates how to enable cross-origin requests to allow access across domain boundaries.

In the previous article, we reviewed how to create a log-in form using HTML5, JavaScript and CSS3. The log-in form is used in the Little Chef Restaurant web based application to allow customers or managers to log in. When the customers log in, they can see their order history, reward points, and coupons. If the users are managers, they can update customer information, add a new customer or delete an existing customer. We will continue using Intel® XDK to develop, test, and debug the web app to the backend server.

 

Building a RESTful Web Service with Spring IO*

Spring IO* provides a framework that allows a web service client and an Apache Tomcat server to communicate over a REST interface. The server side uses Ubuntu* Linux* 14 for the operating system, MongoDB* for the database, and the Spring IO platform for the REST service. The article below shows how to set up the server and client for the Restaurant Android* app to access the MongoDB database. Whether the client is an Android app or a web app, the environment setup for the server side is the same. Accessing the database and parsing the JSON response for a web app is different and we’ll talk about it later in this article. To set up the environment for the Apache Tomcat backend server side, see the link below:

Accessing a REST Based Database Backend From an Android* App

 

Enabling Cross Origin Requests

Enabling cross-origin resource sharing is needed when a client web application loaded in one domain interacts with resources in a different domain. Setting Access-Control-Allow-Origin to "*" indicates that requests from any origin are allowed. SimpleCORSFilter adds these headers to every response it filters. Access-Control-Allow-Methods defines which HTTP methods (POST, GET, PUT, OPTIONS, and DELETE) clients are allowed to use. See the steps below for enabling cross-origin requests on the server side.

After setting up the Spring starter project from the article Accessing a REST Based Database Backend From an Android* App, change into the initial source directory

  • cd gs-accessing-mongodb-data-rest/initial/src/main/java/hello

  • Create the SimpleCORSFilter object (SimpleCORSFilter.java)

package hello;
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;

@Component
public class SimpleCORSFilter implements Filter {

 public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain) throws IOException, ServletException {
  HttpServletResponse response = (HttpServletResponse) res;
  response.setHeader("Access-Control-Allow-Origin", "*");
  response.setHeader("Access-Control-Allow-Methods", "POST, GET, PUT, OPTIONS, DELETE");
  response.setHeader("Access-Control-Max-Age", "3600");
  response.setHeader("Access-Control-Allow-Headers", "Origin, x-requested-with, Content-Type, Accept");
  chain.doFilter(req, res);
 }

 public void init(FilterConfig filterConfig) {}

 public void destroy() {}
}

Code Example 1: SimpleCORSFilter.java - Enable Cross Origin Requests

 

  • Add @ComponentScan to the Application object (Application.java)

package hello;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Import;
import org.springframework.data.mongodb.repository.config.EnableMongoRepositories;
import org.springframework.data.rest.webmvc.config.RepositoryRestMvcConfiguration;

@Configuration
@EnableMongoRepositories
@Import(RepositoryRestMvcConfiguration.class)
@EnableAutoConfiguration
@ComponentScan
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

Code Example 2: Application.java - Enable Cross Origin Requests

 

The main() method in Application.java uses Spring Boot’s SpringApplication.run() method to launch the application. @ComponentScan tells Spring to look for other components, configurations, and services in the hello package, allowing it to find SimpleCORSFilter. For further information about enabling cross origin requests, visit:

Enabling Cross Origin Requests for a RESTful Web service

 

Communicating with the Server

To communicate with the Apache Tomcat server, the web app sends HTTP requests in JavaScript via XMLHttpRequest. The methods for a request-response between a client and server are GET, POST, PUT and DELETE. The HTTP GET is for the login call, HTTP POST is for the register call, HTTP PUT is for the update call, and HTTP DELETE is for the delete call. When the customer or manager logs in to our Restaurant web app, we’ll send the REST requests to the server to retrieve the user information.

Web app Log-in

Figure 1: Web app Log-in

Making a GET Call to the Server

When the user logs in, we make a GET call to our Apache Tomcat backend server. The send() method will send the request to the server. It accepts any of the following types: DOMString, Document, FormData, File, Blob and ArrayBufferView. The example below demonstrates sending data using DOMString, which is the default. The "true" flag means it is an asynchronous request. If the request is asynchronous, this method returns as soon as the request is sent, which lets the browser continue to work as normal while the server handles the request. If the request is synchronous, this method doesn’t return until the response has arrived.

After a successful request, the XMLHttpRequest’s responseText property will contain the requested data as a DOMString; it will be null if the request was unsuccessful. In the example below, the array of all users is returned in XMLHttpRequest.responseText after a successful GET call. Later we call validateUsernamePassword() to validate the user information.

usersList.readyState stores the state of the XMLHttpRequest; its value varies between 0 and 4. When the request is complete and the response is ready, readyState holds the value 4. usersList.status stores the HTTP status code of the response.

var url_user = "http://10.2.3.55:8080/users";

function xmlhttpRequestUser() {
    var usersList = new XMLHttpRequest();

    usersList.onreadystatechange = function() {
        if (usersList.readyState === 4 && usersList.status === 200) {
            // Save the responseText for later use
            usersRespObj.value = usersList.responseText;
            …
        }
    };

    // The "true" flag means it is an asynchronous request
    usersList.open("GET", url_user, true);
    usersList.send();
}

Code Example 3: GET Call to Rest Based Database Backend Server **

Returning a JSON Response

For the GET call to our Spring IO server, XMLHttpRequest.responseText returns a JSON response. The following JSON response example from the GET call defines a users object with an array of two user records. This user information is saved in the REST database on the Apache Tomcat server.

{
  "_links" : {
    "self" : {
      "href" : "http://192.168.128.33:8080/users{?page,size,sort}",
      "templated" : true
    },
    "search" : {
      "href" : "http://192.168.128.33:8080/users/search"
    }
  },
  "_embedded" : {
    "users" : [ {
      "userId" : 12345,
      "email" : "dplate@gmail.com",
      "password" : "password123",
      "firstName" : "Don",
      "lastName" : "Plate",
      "accessLevel" : "manager",
      "_links" : {
        "self" : {
          "href" : "http://192.168.128.33:8080/users/5500e58e0cf26b1628989a4c"
        }
      }
    }, {
      "userId" : 12346,
      "email" : "jsmith@gmail.com",
      "password" : "password",
      "firstName" : "Joe",
      "lastName" : "Smith",
      "accessLevel" : "customer",
      "_links" : {
        "self" : {
          "href" : "http://192.168.128.33:8080/users/5500e79d0cf26b1628989a4d"
        }
      }
    } ]
  },
  "page" : {
    "size" : 20,
    "totalElements" : 3,
    "totalPages" : 1,
    "number" : 0
  }
}

JSON Response Example in XMLHttpRequest.responseText **

 

Parsing the JSON Response

The JavaScript JSON.parse() call converts a JSON string into a JavaScript object arr. We then use dot notation or “[]” to retrieve its members. To access the users array, we use arr["_embedded"].users[index]. The validateUsernamePassword() function below loops through the users array to access the user information. If the user's username and password are found in the response text, validateUsernamePassword() returns true. The function also saves the href of the user for the PUT and DELETE calls.

function validateUsernamePassword () {
    // usersRespObj.value is XMLHttpRequest.responseText, a string as JSON returned from the
    // above GET call to the REST based database backend server.
    var arr = JSON.parse(usersRespObj.value);
    var index = 0;

    for (index = 0; index < arr["_embedded"].users.length; index++) {
       
        // The type of userId in the server is integer.
        var userId = parseInt(userObj.username);
        if ((arr["_embedded"].users[index].userId === userId) &&
            (arr["_embedded"].users[index].password.toLowerCase() === userObj.password.toLowerCase())) {
            console.log("Successfully logged in ");
           
            // Retrieve and save the user info except the password. Save href of the user for the PUT call.
            userObj.accesslevel = arr["_embedded"].users[index].accessLevel;
            userObj.firstname = arr["_embedded"].users[index].firstName;
            userObj.lastname = arr["_embedded"].users[index].lastName;
            userObj.username = arr["_embedded"].users[index].userId;
            userObj.id = arr["_embedded"].users[index]._links.self.href;
            userObj.email = arr["_embedded"].users[index].email;
           
            return true;
        }
    }
    return false;
}

Code Example 4: Parsing the JSON response from the GET call **
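As a side note, the lookup loop above can be expressed more compactly with Array.prototype.find. A sketch, assuming the same response shape as the GET call; the sample data here is illustrative:

```javascript
// Compact lookup over the parsed users array using Array.prototype.find.
// The response shape mirrors the GET call above; the sample data is illustrative.
function findUser(responseText, username, password) {
  const arr = JSON.parse(responseText);
  const userId = parseInt(username, 10);
  return arr._embedded.users.find(u =>
    u.userId === userId &&
    u.password.toLowerCase() === password.toLowerCase()
  ) || null;
}

// Illustrative usage:
const sampleResp = JSON.stringify({
  _embedded: { users: [{ userId: 12346, password: "password", firstName: "Joe" }] }
});
console.log(findUser(sampleResp, "12346", "PASSWORD").firstName); // → "Joe"
```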

 

Making a POST Call to the Server

To perform a first-time registration, we make a POST call to our Apache Tomcat backend server to add the new user to the REST database. The user information is stored in a MongoDB database on the Apache Tomcat server. We send the proper header information along with the request. The following example demonstrates sending data in JSON format.

function xmlhttpRequestRegister(admin_form) {   
    var user_list = new XMLHttpRequest();

    user_list.onreadystatechange = function() {
        if (user_list.readyState === 4 && user_list.status === 201) {
            alert("Registration Complete");
            handleLoginSuccess(admin_form);
        }
    };
   
    var firstnameTag = "firstName";
    var lastnameTag = "lastName";
    var passwordTag = "password";
    var usernameTag = "userId";
    var accessLevelTag = "accessLevel";
    var emailTag = "email";
  
    // The user registration information was retrieved from registration pop-up form and saved
    // in the userObj. accessLevel was default to customer.
    var params = '{"' + firstnameTag + '":"' + userObj.firstname + '","' + lastnameTag + '":"' + userObj.lastname + '","' + usernameTag + '":"' + userObj.username + '","' + passwordTag + '":"' + userObj.password + '","' + accessLevelTag + '":"' + userObj.accesslevel + '","' + emailTag + '":"' + userObj.email + '"}';
   
    // The "true" flag means it is an asynchronous request
    user_list.open("POST", url_user, true);
    // Send the proper header information
    user_list.setRequestHeader("Content-Type", "application/json");
    user_list.send(params);
}

Code Example 5: POST Call to Rest Based Database Backend Server **
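The hand-concatenated params string above is easy to get wrong (a stray quote inside any field breaks the payload). JSON.stringify builds the same JSON body from a plain object. A sketch, assuming the same userObj fields; the helper name and sample values are hypothetical:

```javascript
// Build the registration payload with JSON.stringify instead of string
// concatenation. userObj here is a hypothetical stand-in with the same fields.
function buildRegisterParams(userObj) {
  return JSON.stringify({
    firstName: userObj.firstname,
    lastName: userObj.lastname,
    userId: userObj.username,
    password: userObj.password,
    accessLevel: userObj.accesslevel,
    email: userObj.email
  });
}

// Illustrative usage:
const regParams = buildRegisterParams({
  firstname: "Joe", lastname: "Smith", username: "12346",
  password: "password", accesslevel: "customer", email: "jsmith@gmail.com"
});
console.log(JSON.parse(regParams).email); // → "jsmith@gmail.com"
```

The resulting string can be passed to user_list.send() exactly as in Code Example 5.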

 

Making a PUT Call to the Server

When an existing user needs to reset the password or the manager needs to update user information, we make a GET call to the Apache Tomcat server to find the user's href, followed by a PUT call to update the user information in the REST-based database. As with the POST call, we send the data in JSON format and send the proper header information along with the request.

function xmlhttpRequestResetPassword(admin_form) {     
    var user_list = new XMLHttpRequest();

    user_list.onreadystatechange = function() {
        if (user_list.readyState === 4 && user_list.status === 204) {
            usersRespObj.value = user_list.responseText;
            alert("Reset Password Complete");
           
            // If the usersTable exists, just update the table. Otherwise, display the table
            var tableExist = document.getElementById("usersTable");
            if (tableExist) {
                UpdateUserRow();
                close_popup("modal_reset_pw_wrapper");
            } else {
                handleLoginSuccess(admin_form);
            }
        }
    };
   
    var firstnameTag = "firstName";
    var lastnameTag = "lastName";
    var passwordTag = "password";
    var usernameTag = "userId";
    var accessLevelTag = "accessLevel";
    var emailTag = "email";
   
    var params = '{"' + firstnameTag + '":"' + userObj.firstname + '","' + lastnameTag + '":"' + userObj.lastname + '","' + usernameTag + '":"' + userObj.username + '","' + passwordTag + '":"' + userObj.newPassword + '","' + accessLevelTag + '":"' + userObj.accesslevel + '","' + emailTag + '":"' + userObj.email + '"}';
   
    // The "true" flag means it is an asynchronous request
    user_list.open("PUT", userObj.id, true);
    user_list.setRequestHeader("Content-Type", "application/json");
    user_list.send(params);
}

Code Example 6: PUT Call to Rest Based Database Backend Server **

 

Making a DELETE Call to Server

Only a manager has the authority to remove a user account. As with the PUT call, the DELETE call requires the href of the user on the server. First we make a GET call to the Apache Tomcat server to retrieve the href. Then we perform a DELETE call to the server to remove the user information from the REST-based database.

function xmlhttpRequestUserDelete() {
    var del_user = new XMLHttpRequest();

    del_user.onreadystatechange = function() {
        if (del_user.readyState === 4 && del_user.status === 200) {
            DeleteUserRow(row);
        }
    };

    // href of the user was saved in userObj.id from the GET call right before this DELETE call
    var url_del = userObj.id;
    console.log("url_del = " + url_del);
    del_user.open("DELETE", url_del, true);
    del_user.send();   
}

Code Example 7: DELETE Call to Rest Based Database Backend Server **

 

Summary

This article demonstrated how a client web app communicates with an Apache Tomcat server using HTTP calls via XMLHttpRequest and parses the JSON responses received from the server. It also showed how to enable cross-origin requests to allow client web apps to access resources across domain boundaries.

 

References

http://json.org/

HTML5 Web App Using Intel XDK

Accessing a REST Based Database Backend From an Android* App

Enabling Cross Origin Requests for a RESTful Web service

 

About the Author

Nancy Le is a software engineer at Intel in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.

 

*Other names and brands may be claimed as the property of others.

**This sample source code is released under the Intel Sample Source Code License

Use cases benefiting from the optimization of small networking data packets using Intel® DPDK an open source solution


Intel® Data Plane Development Kit (Intel® DPDK) is a set of optimized data plane software libraries and drivers that can be used to accelerate packet processing on Intel® architecture. The performance of Intel DPDK scales with improvements in processor technology from Intel® Atom™ to Intel® Xeon® processors. In April 2013, 6WIND established dpdk.org, an open source project where Intel DPDK is offered under the open source BSD* license. Whether using the open source solution or the Intel DPDK, developers now have the ability to accelerate network applications across a broad spectrum, including telecom, enterprise, and cloud applications. The advantages of combining Intel DPDK with Intel hardware include portability, scalability, and integration with other Intel hardware solutions for even more performance gains. This blog covers various use cases, including virtual switching, big data, and next-generation firewalls, where Intel DPDK packet handling has been of value.

Network Function Virtualization

Intel DPDK can be very useful when incorporated within virtualized environments. For example, a recent trend in Software Defined Networks (SDN) is increasing demand for fast host-based packet handling and a move towards Network Functions Virtualization (NFV). NFV is a new way to provide network functions such as firewalls, domain name service, and network address translation as a fully virtualized infrastructure. One example of this is Open vSwitch*, which is an open source solution capable of providing virtual switching. Intel DPDK has been combined with Open vSwitch to provide an accelerated experience.

Telecommunications Industry

The telecommunications industry is increasingly moving towards virtualization in an effort to provide more agility, flexibility, and standardization within its network environments, which over time have traditionally grown in a more heterogeneous way. In the white paper “Carrier Cloud Telecoms – Exploring the Challenges of Deploying Virtualization and SDN in Telecoms Networks”, Tieto, in collaboration with Intel, demonstrated a cloud telecom use case that combines SDN, NFV, Intel DPDK, OpenFlow*, and Open vSwitch. They looked at multiple scenarios: dynamic provisioning of 4G/LTE traffic and resources in a virtualized SDN environment; high-performance, energy-efficient packet processing and protocol distribution using Intel DPDK and the Tieto IP stack (TIP); a 4G/LTE-to-3G video stream handover scenario; and a Packet Data Network Gateway scenario where SDN is used for the handover of Internet traffic.

The white paper NEC* Virtualized EPC Innovation Powered by Multi Core Intel® Architecture Processors, discusses how NEC was able to deploy a virtualized Evolved Packet Core (vEPC), which is a framework for converging data and voice on 4G Long-Term Evolution (LTE) networks, on a common Intel architecture server platform with Intel DPDK and achieve carrier grade service. NEC adopted the Intel DPDK for its vEPC in order to significantly improve the data plane forwarding performance in a virtualization environment.

Next-generation Firewalls

The need for continued refinement in network security has led to improved implementations of firewalls, another growing segment that can benefit from Intel DPDK. These next-generation firewalls may also be part of an NFV solution. Basic firewalls used for simple packet filtering have evolved in recent years to perform more advanced functions such as intrusion detection and prevention (IPS), network antivirus, IPsec, SSL, application control, and more. These features all reside in the data plane and require deep packet inspection of the data streams, cryptographic and compression capabilities, and heavy processing of the packet contents. A next-generation firewall was designed using Wind River Network Acceleration Platform with Intel DPDK and Intel® QuickAssist Technology. Intel provides the hardware to receive and transmit network traffic efficiently, along with fast CPUs and large caches, which are ideal for these data-intensive applications. Intel DPDK provides mechanisms that support high-performance alternatives to Linux* system calls, bypassing the generic issues of the Linux kernel. Finally, Wind River Network Acceleration Platform builds on the Intel infrastructure to accelerate native Linux applications such as an Apache server and to provide even higher acceleration for security applications ported onto the network acceleration engine. For more information on this use case, see the white paper “Multi-Core Networking For The New Data Plane” and watch a live demonstration here, which shows a next-generation firewall capable of analytics to monitor user traffic applications and of content inspection for malware. On a related note, Intel DPDK is combined with Hyperscan and other Intel technologies for a next-generation IPS solution, which is included as part of the Intel® Security Controller.

Big Data Analytics

For a use case involving Big Data analytics, Aspera and Intel investigated ultra-high-speed data transfer solutions built on Aspera’s fasp* transport technology and the Intel® Xeon® processor E5-2600 v3 product family. The solution was able to achieve predictable ultra-high WAN transfer speeds on commodity Internet connections on both bare metal and virtualized hardware platforms, including over networks with hundreds of milliseconds of round-trip time and several percentage points of packet loss characteristic of typical global-distance WANs. By using Intel DPDK, software engineers were able to reduce the number of memory copies needed to send and receive a packet. This enabled Aspera to boost single stream data transfer speeds to 37.75 Gbps on the tested system1, which represents network utilization of 39 Gbps when Ethernet framing and IP packet headers are accounted for. The team also began preliminary investigation of the transfer performance on virtualized platforms by testing on a kernel-based virtual machine (KVM) hypervisor and obtained initial transfer speeds of 16.1 Gbps. The KVM solution was not yet NUMA or memory optimized, and thus the team expects to obtain even faster speeds as they apply these optimizations in the future. For details about performance findings, system specifications, software specifications, etc. see the white paper Big Data Technologies for Ultra-High-Speed data Transfer and Processing.

Power Grid

As the world’s largest electric utility company, the State Grid Corporation of China (SGCC) provides power to about 1.1 billion people. SGCC relies on a high performance computing cluster to ensure the power grid’s safe and stable operation. With the expansion of China’s power grid size, SGCC’s Advanced Digital Power System Simulator* (ADPSS*) had to be enhanced to meet the state’s increasing power supply demands. The white paper “An integrated Intel® architecture based solution for power grid simulation” explores these challenges. Intel DPDK was one of the key ingredients used to reduce the latency to within 50 microseconds for ADPSS. This was a requirement for creating a large scale power system simulation for 3,000 generators and 30,000 grid transmission lines.

Summary

Intel DPDK can help with a broad spectrum of use cases including NFV, next-generation firewalls, and big data across different industries such as telecommunications, energy, and information technology. It can provide optimization any time you have high-performance applications handling small (64 byte) networking data packets. It offers a simple software programming model that scales from Intel Atom processors to the latest Intel Xeon processors, providing flexible system configurations to meet any customer requirements for performance and scalable I/O. The benefits provided by Intel DPDK can be combined with other Intel technologies for additional improvements, including Intel QuickAssist Technology, a cryptographic accelerator, and Hyperscan, a deep packet inspection solution.

Resources

Open-Source Project
dpdk.org
Intel® DPDK: Overview
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/dpdk-packet-processing-ia-overview-presentation.html
Intel® DPDK: Installation and Configuration Guide
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-getting-started-guide.html
Intel® DPDK: Programmer’s Guide
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-programmers-guide.html
Intel® DPDK: API Reference Documentation
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-api-reference.html
Intel® DPDK: Sample Applications
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/intel-dpdk-sample-applications-user-guide.html
Intel® DPDK: Latest Source Code Packages for the Intel® DPDK Library
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/dpdk-source-code.html

About the Author

David Mulnix has been a software engineer with Intel Corporation for over 15 years. His areas of focus have included software automation, server power and performance analysis, and cloud security.

Intel® Xeon Phi™ Processor x200 Product Family Architecture Articles


Currently there are no articles.

Use Edison SDK + chroot to build up performance application


Extract more performance via Intel software techniques

Edison is a powerful IoT platform with a dual-core 500 MHz Atom CPU inside. By deploying Intel software techniques, more performance can be extracted from it. The following comparison test was performed on Edison, and the results clearly show the performance benefit gained by using Intel optimized libraries and compiler software techniques. Can't wait to try it on your own Edison? Check it on http://software.intel.com/en-us/intel-system-studio and follow the steps in this article.

 

* detection time varies among different cases of pictures/video streams.
* bottom labels are image resolutions
* IPP stands for Intel® Integrated Performance Primitives
* TBB stands for Intel® Threading Building Blocks
* ICC stands for Intel® C++ Compiler
 

Edison SDK + chroot to facilitate the development works

In addition, Edison is designed for quick prototype/product development, so it provides its own SDK to facilitate application development. Without the SDK, you can still build libraries and sample applications directly on the Edison target, but it normally takes half a day to get the job done.

The Edison SDK provides a cross-compile setup script, and here we offer another tip: use the chroot Linux utility together with the Edison SDK to ease the development environment setup. Building all the OpenCV libraries and applications this way takes less than 3 minutes on our Core i7 @ 3.3 GHz machine. Check the figure below to review the final binaries.

Where do you get the Edison SDK? You can either download it from the Edison support website (search “SDK”) or build your own. The BSP reference guide contains detailed information about how to generate the Edison SDK and how to customize your Edison image. The following figure shows the final output of the SDK image: a single file combining the installation script and bzip-format payloads.

Once you have successfully executed the installation shell script above, you should find the SDK under the following directory structure.

By applying chroot to the specified root folder “…/core2-32-poky-linux”, you get a Linux shell working in the same root directory structure as on Edison. For example, you could update certain library packages via opkg to build, or even debug, the code in your current desktop environment first (with the running kernel of the desktop machine) before deploying the test software to the real Edison target.

More information and detailed steps to build an Edison OpenCV application

The attached txt file below contains the detailed commands and steps to build an OpenCV face detection application under the Edison SDK + chroot setup. To use the Intel® C++ Compiler, you have to install the latest version of Intel® System Studio via http://software.intel.com/en-us/intel-system-studio. The Intel® C++ Compiler is part of Intel System Studio.

Download detailSteps.txt

Before getting this Edison SDK setup ready, you may want to know more about Edison:
* Edison Hardware information
* How to assemble Edison boards –  Video
* How to flash the image
* How to customize Edison kernel

See also

- Building Yocto* Applications using Intel® C++ Compiler with Yocto Project* Application Development Toolkit
- Improved sysroot support in Intel C++ Compiler for cross compile
- Build and Debug Applications for Intel® Edison with Intel® System Studio

 


Easy SIMD through Wrappers


By Michael Kopietz, Crytek Render Architect

Download PDF

1. Introduction

This article aims to change your thinking on how SIMD programming can be applied in your code. By thinking of SIMD lanes as functioning similarly to CPU threads, you will gain new insights and be able to apply SIMD more often in your code.

Intel has been shipping CPUs with SIMD support for about twice as long as it has been shipping multi-core CPUs, yet threading is more established in software development. One reason is the abundance of tutorials that introduce threading in a simple “run this entry function n times” manner, skipping all the possible traps. SIMD tutorials, on the other hand, tend to focus on achieving the final 10% speed up that requires you to double the size of your code. If these tutorials provide example code, it can be hard to absorb all the new information and at the same time come up with a simple, elegant way of using it. Showing a simple, useful way of using SIMD is therefore the topic of this paper.

First, the basic principle of SIMD code: alignment. Probably all SIMD hardware either demands or at least prefers some natural alignment, and explaining the basics could fill a paper of its own [1]. In general, unless you are running out of memory, it is important to allocate memory in a cache-friendly way. For Intel CPUs that means allocating memory on a 64 byte boundary, as shown in Code Snippet 1.

#include <xmmintrin.h> // _mm_malloc / _mm_free
#include <cstddef>     // size_t

inline void* operator new(size_t size)
{
	return _mm_malloc(size, 64);
}

inline void* operator new[](size_t size)
{
	return _mm_malloc(size, 64);
}

inline void operator delete(void *mem)
{
	_mm_free(mem);
}

inline void operator delete[](void *mem)
{
	_mm_free(mem);
}

Code Snippet 1: Allocation functions that respect cache-friendly 64 byte boundaries

2. The basic idea

The way to begin is simple: assume every lane of a SIMD register executes as a thread. In the case of Intel® Streaming SIMD Extensions (Intel® SSE) you have 4 threads/lanes, with Intel® Advanced Vector Extensions (Intel® AVX) 8 threads/lanes, and 16 threads/lanes on Intel® Xeon Phi™ coprocessors.

To have a 'drop in' solution, the first step is to implement classes that behave mostly like primitive data types. Wrap 'int', 'float' etc. and use those wrappers as the starting point for every SIMD implementation. For the Intel SSE version, replace the float member with __m128, int and unsigned int with __m128i and implement operators using Intel SSE intrinsics or Intel AVX intrinsics as in Code Snippet 2.

// SSE 128-bit
inline	DRealF	operator+(DRealF R)const{return DRealF(_mm_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(DRealF R)const{return DRealF(_mm_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(DRealF R)const{return DRealF(_mm_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(DRealF R)const{return DRealF(_mm_div_ps(m_V, R.m_V));}

// AVX 256-bit
inline	DRealF	operator+(const DRealF& R)const{return DRealF(_mm256_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(const DRealF& R)const{return DRealF(_mm256_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(const DRealF& R)const{return DRealF(_mm256_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(const DRealF& R)const{return DRealF(_mm256_div_ps(m_V, R.m_V));}

Code Snippet 2: Overloaded arithmetic operators for SIMD wrappers

3. Usage Example

Now let’s assume you're working on two HDR images, where every pixel is a float and you blend between both images.

void CrossFade(float* pOut,const float* pInA,const float* pInB,size_t PixelCount,float Factor)
{
	const DRealF BlendA(1.f - Factor);
	const DRealF BlendB(Factor);
	for(size_t i = 0; i < PixelCount; i += THREAD_COUNT)
		*(DRealF*)(pOut + i) = *(DRealF*)(pInA + i) * BlendA + *(DRealF*)(pInB + i) * BlendB;
}

Code Snippet 3: Blending function that works with both primitive data types and SIMD data

The executable generated from Code Snippet 3 runs natively on normal registers as well as on Intel SSE and Intel AVX. It's not really the vanilla way you'd usually write it, but every C++ programmer should still be able to read and understand it. Let’s see whether it works the way you expect. The first and second lines of the implementation initialize the blend factors of our linear interpolation by replicating the parameter to whatever width your SIMD register has.

The third line is nearly a normal loop. The only special part is “THREAD_COUNT”. It's 1 for normal registers, 4 for Intel SSE and 8 for Intel AVX, representing the count of lanes of the register, which in our case resembles threads.

The fourth line indexes into the arrays; both input pixels are scaled by the blend factors and summed. Depending on your preference, you might want to use some temporaries, but there is no intrinsic you need to look up and no per-platform implementation.

4. Drop in

Now it's time to prove that it actually works. Let's take a vanilla MD5 hash implementation and use all the available CPU power to find the pre-image. To achieve that, we'll replace the primitive types with our SIMD types. MD5 runs several “rounds” that apply various simple bit operations on unsigned integers, as demonstrated in Code Snippet 4.

#define LEFTROTATE(x, c) (((x) << (c)) | ((x) >> (32 - (c))))
#define BLEND(a, b, x) SelectBit(a, b, x)

template<int r>
inline DRealU Step1(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(d, c, b);
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step2(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(c, b, d);
	return b + LEFTROTATE((a + f + k + w),r);
}

template<int r>
inline DRealU Step3(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = b ^ c ^ d;
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step4(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = c ^ (b | (~d));
	return b + LEFTROTATE((a + f + k + w), r);
}

Code Snippet 4: MD5 step functions for SIMD wrappers

Besides the type naming, there is really just one change that could look a little bit like magic: the “SelectBit”. If a bit of x is set, the respective bit of b is returned; otherwise, the respective bit of a is returned. In other words, a blend. The main MD5 hash function is shown in Code Snippet 5.

inline void MD5(const uint8_t* pMSG,DRealU& h0,DRealU& h1,DRealU& h2,DRealU& h3,uint32_t Offset)
{
	const DRealU w0  =	Offset(DRealU(*reinterpret_cast<const uint32_t*>(pMSG + 0 * 4) + Offset));
	const DRealU w1  =	*reinterpret_cast<const uint32_t*>(pMSG + 1 * 4);
	const DRealU w2  =	*reinterpret_cast<const uint32_t*>(pMSG + 2 * 4);
	const DRealU w3  =	*reinterpret_cast<const uint32_t*>(pMSG + 3 * 4);
	const DRealU w4  =	*reinterpret_cast<const uint32_t*>(pMSG + 4 * 4);
	const DRealU w5  =	*reinterpret_cast<const uint32_t*>(pMSG + 5 * 4);
	const DRealU w6  =	*reinterpret_cast<const uint32_t*>(pMSG + 6 * 4);
	const DRealU w7  =	*reinterpret_cast<const uint32_t*>(pMSG + 7 * 4);
	const DRealU w8  =	*reinterpret_cast<const uint32_t*>(pMSG + 8 * 4);
	const DRealU w9  =	*reinterpret_cast<const uint32_t*>(pMSG + 9 * 4);
	const DRealU w10 =	*reinterpret_cast<const uint32_t*>(pMSG + 10 * 4);
	const DRealU w11 =	*reinterpret_cast<const uint32_t*>(pMSG + 11 * 4);
	const DRealU w12 =	*reinterpret_cast<const uint32_t*>(pMSG + 12 * 4);
	const DRealU w13 =	*reinterpret_cast<const uint32_t*>(pMSG + 13 * 4);
	const DRealU w14 =	*reinterpret_cast<const uint32_t*>(pMSG + 14 * 4);
	const DRealU w15 =	*reinterpret_cast<const uint32_t*>(pMSG + 15 * 4);

	DRealU a = h0;
	DRealU b = h1;
	DRealU c = h2;
	DRealU d = h3;

	a = Step1< 7>(a, b, c, d, k0, w0);
	d = Step1<12>(d, a, b, c, k1, w1);
	.
	.
	.
	d = Step4<10>(d, a, b, c, k61, w11);
	c = Step4<15>(c, d, a, b, k62, w2);
	b = Step4<21>(b, c, d, a, k63, w9);

	h0 += a;
	h1 += b;
	h2 += c;
	h3 += d;
}

Code Snippet 5: The main MD5 function

The majority of the code again reads like a normal C function, except that the first lines prepare the data by replicating the passed parameters into our SIMD registers. In this case we load the SIMD registers with the data we want to hash. One specialty is the “Offset” call: since we don't want every SIMD lane to do exactly the same work, this call offsets the register by the lane index, like a thread ID you would add. See Code Snippet 6 for reference.

Offset(Register)
{
	for(i = 0; i < THREAD_COUNT; i++)
		Register[i] += i;
}

Code Snippet 6: Offset is a utility function for dealing with different register widths

That means our first element to hash is not [0, 0, 0, 0] for Intel SSE or [0, 0, 0, 0, 0, 0, 0, 0] for Intel AVX. Instead the first element is [0, 1, 2, 3] and [0, 1, 2, 3, 4, 5, 6, 7], respectively. This replicates the effect of running the function in parallel on 4 or 8 threads/cores, but in the case of SIMD, instruction-parallel.

We can see the results for our 10 minutes of hard work to get this function SIMD-ified in Table 1.

Table 1: MD5 performance with primitive and SIMD types

Type         Time       Speedup
x86 integer  379.389s   1.0x
SSE4         108.108s   3.5x
AVX2          51.490s   7.4x

5. Beyond Simple SIMD-threads

The results are satisfying, though not scaling linearly, as there is always some non-parallel part (you can easily identify it in the provided source code). But we're not aiming for the last 10% at twice the work. As a programmer, you'd probably prefer other quick solutions that maximize the gain. Some considerations always arise, like: would it be worthwhile to unroll the loop?

MD5 hashing is frequently dependent on the result of previous operations, which is not really friendly for CPU pipelines, but you could become register bound if you unroll. Our wrappers can help us evaluate that easily. Unrolling is the software version of hyper-threading: we emulate twice the number of threads by repeating the execution of every operation on twice the data the SIMD lanes provide. Therefore, create a similar duplicated type and implement the unrolling inside it by duplicating every operation of our basic operators, as in Code Snippet 7.

struct __m1282
{
	__m128		m_V0;
	__m128		m_V1;
	inline		__m1282(){}
	inline		__m1282(__m128 C0, __m128 C1):m_V0(C0), m_V1(C1){}
};

inline	DRealF	operator+(DRealF R)const
	{return __m1282(_mm_add_ps(m_V.m_V0, R.m_V.m_V0),_mm_add_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator-(DRealF R)const
	{return __m1282(_mm_sub_ps(m_V.m_V0, R.m_V.m_V0),_mm_sub_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator*(DRealF R)const
	{return __m1282(_mm_mul_ps(m_V.m_V0, R.m_V.m_V0),_mm_mul_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator/(DRealF R)const
	{return __m1282(_mm_div_ps(m_V.m_V0, R.m_V.m_V0),_mm_div_ps(m_V.m_V1, R.m_V.m_V1));}

Code Snippet 7: These operators are re-implemented to work with two SSE registers at the same time

That's it, really, now we can again run the timings of the MD5 hash function.

Table 2: MD5 performance with loop unrolling SIMD types

Type         Time       Speedup
x86 integer  379.389s   1.0x
SSE4         108.108s   3.5x
SSE4 x2       75.659s   4.8x
AVX2          51.490s   7.4x
AVX2 x2       36.014s   10.5x

The data in Table 2 shows that it's clearly worth unrolling. We achieve a speedup beyond the SIMD lane count, probably because the x86 integer version was already stalling the pipeline with operation dependencies.

6. More complex SIMD-threads

So far our examples were simple in the sense that the code was the usual candidate for hand vectorization. There is nothing complex besides a lot of compute-demanding operations. But how would we deal with more complex scenarios like branching?

The solution is again quite simple and widely used: speculative calculation and masking. Especially if you've worked with shader or compute languages, you'll likely have encountered this before. Let’s take a look at a basic branch of Code Snippet 8 and rewrite it to a ?: operator as in Code Snippet 9.

int a = 0;
if(i % 2 == 1)
	a = 1;
else
	a = 3;

Code Snippet 8: Calculates the mask using if-else

int a = (i % 2) ? 1 : 3;

Code Snippet 9: Calculates the mask with ternary operator ?:

If you recall our bit-select operator of Code Snippet 4, we can also use it to achieve the same with only bit operations in Code Snippet 10.

int Mask = (i % 2) ? ~0 : 0;
int a = SelectBit(3, 1, Mask);

Code Snippet 10: Use of SelectBit prepares for SIMD registers as data

Now, that might seem pointless if we still have a ?: operator to create the mask, and the comparison results not in true or false but in all bits set or cleared. Yet this is not a problem, because all bits set or cleared is exactly what the comparison instructions of Intel SSE and Intel AVX return.

Of course, instead of assigning just 3 or 1, you can call functions and select the returned result you want. That might lead to a performance improvement even in non-vectorized code, as you avoid branching and the CPU never suffers from branch misprediction, although the more complex the functions you call, the more likely a misprediction becomes. Even in vectorized code, we'll avoid executing unneeded long branches by checking for the special cases where all elements of our SIMD register have the same comparison result, as demonstrated in Code Snippet 11.

int Mask = (i % 2) ? ~0 : 0;
int a = 0;
if(All(Mask))
	a = Function1();
else
if(None(Mask))
	a = Function3();
else
	a = BitSelect(Function3(), Function1(), Mask);

Code Snippet 11: Shows an optimized branchless selection between two functions

This detects the special cases where all of the elements are 'true' or where all are 'false'. Those cases run on SIMD the same way as on x86, just the last 'else' case is where the execution flow would diverge, hence we need to use a bit-select.

If Function1 or Function3 modify any data, you'd need to pass the mask down the call and explicitly bit select the modifications just like we've done here. For a drop-in solution, that's a bit of work, but it still results in code that’s readable by most programmers.

7. Complex example

Let's again take some source code and drop in our SIMD types. A particularly interesting case is raytracing of distance fields. For this, we'll use the scene from Iñigo Quilez's demo [2] with his friendly permission, as shown in Figure 1.

Figure 1: Test scene from Iñigo Quilez's raycasting demo

The “SIMD threading” is placed at a spot where you'd add threading usually. Every thread handles a pixel, traversing the world until it hits something, subsequently a little bit of shading is applied and the pixel is converted to RGBA and written to the frame buffer.

The scene traversal is done iteratively. Every ray takes an unpredictable number of steps until a hit is recognized. For example, a close-up wall is reached after a few steps, while some rays might reach the maximum trace distance without hitting anything at all. Our main loop in Code Snippet 12 handles both cases using the bit select method we've discussed in the previous section.

DRealU LoopMask(RTrue);
for(; a < 128; a++)
{
      DRealF Dist             =     SceneDist(O.x, O.y, O.z, C);
      DRealU DistU            =     *reinterpret_cast<DRealU*>(&Dist) & DMask(LoopMask);
      Dist                    =     *reinterpret_cast<DRealF*>(&DistU);
      TotalDist               =     TotalDist + Dist;
      O                       +=    D * Dist;
      LoopMask                =     LoopMask && Dist > MinDist && TotalDist < MaxDist;
      if(DNone(LoopMask))
            break;
}

Code Snippet 12: Raycasting with SIMD types

The LoopMask variable marks whether a ray is still active: ~0 means active, 0 means we are done with that ray. At the end of the loop we test whether no ray is active anymore, and in that case we break out of the loop.

In the line above we evaluate our conditions for the rays: whether we're close enough to an object to call it a hit, or whether the ray is already beyond the maximum distance we want to trace. We logically AND it with the previous result, as the ray might already have been terminated in one of the previous iterations.

“SceneDist” is the evaluation function for our tracing. It's run for all SIMD lanes and is the heavyweight function that returns the current distance to the closest object. The next line sets the distance elements to 0 for rays that are no longer active and steps this amount further for the next iteration.

The original “SceneDist” had some assembler optimizations and material handling that we don't need for our test, so the function is reduced to the minimum needed for a complex example. Inside are still some if-cases that are handled exactly as before. Overall, “SceneDist” is quite large and rather complex, and it would take a while to rewrite it by hand for every SIMD platform again and again. You might need to convert it all in one go, and typos might generate completely wrong results. Even if it works, you'll have only a few functions that you really understand, and the maintenance cost is much higher. Doing it by hand should be the last resort. Compared to that, our changes are relatively minor. The code stays easy to modify, and you can extend the visual appearance without worrying about optimizing it again or being the only maintainer who understands it, just as it would be with real threads.

But we've done that work to see results, so let’s check the timings in Table 3.

Table 3: Raytracing performance with primitive and SIMD types, including loop unrolling types

Type         FPS        Speedup
x86          0.992 FPS  1.0x
SSE4         3.744 FPS  3.8x
SSE4 x2      3.282 FPS  3.3x
AVX2         6.960 FPS  7.0x
AVX2 x2      5.947 FPS  6.0x

You can clearly see the speed up is not scaling linearly with the element count, which is mainly because of the divergence. Some rays might need 10 times more iterations than others.

8. Why not let the compiler do it?

Compilers nowadays can vectorize to some degree, but the highest priority for the generated code is to deliver correct results, as you would not use 100-times-faster binaries that deliver wrong results even 1% of the time. Some assumptions we make, such as that data will be aligned for SIMD and that we allocate enough padding not to overwrite consecutive allocations, are out of scope for the compiler. You can get annotations from the Intel compiler about all the opportunities it had to skip because of assumptions it could not guarantee, and you can try to rearrange code and make promises to the compiler so it'll generate the vectorized version. But that's work you have to do every time you modify your code, and in more complex cases like branching, you can only guess whether it will result in branchless bit selection or serialized code.

The compiler also has no inside knowledge of what you intend to create. You know whether threads will diverge or stay coherent and can implement a branched or bit-selecting solution accordingly. You see the point of attack, the loop that makes the most sense to convert to SIMD, whereas the compiler can just guess whether it will run 10 times or 1 million times.

Relying on the compiler might be a win in one place and a pain in another. It's good to have this alternative solution you can rely on, just like your hand-placed thread entries.

9. Real threading?

Yes, real threading is useful, and SIMD-threads are not a replacement; both are orthogonal. SIMD-threads are still not as simple to get running as real threading is, but you'll also run into less trouble with synchronization and fewer rare bugs. The really nice advantage is that every core Intel sells can run your SIMD-thread version with all the 'threads'. A dual-core CPU will run 4 or 8 times faster, just like your quad-socket 15-core Haswell-EP. Some results for our benchmarks in combination with threading are summarized in Table 4 through Table 7.1

Table 4: MD5 Performance on Intel® Core™ i7 4770K with both SIMD and threading

Threads  Type         Time       Speedup
1T       x86 integer  311.704s   1.00x
8T       x86 integer   47.032s   6.63x
1T       SSE4          90.601s   3.44x
8T       SSE4          14.965s   20.83x
1T       SSE4 x2       62.225s   5.01x
8T       SSE4 x2       12.203s   25.54x
1T       AVX2          42.071s   7.41x
8T       AVX2           6.474s   48.15x
1T       AVX2 x2       29.612s   10.53x
8T       AVX2 x2        5.616s   55.50x

Table 5: Raytracing Performance on Intel® Core™ i7 4770K with both SIMD and threading

Threads  Type         FPS         Speedup
1T       x86 integer   1.202 FPS  1.00x
8T       x86 integer   6.019 FPS  5.01x
1T       SSE4          4.674 FPS  3.89x
8T       SSE4         23.298 FPS  19.38x
1T       SSE4 x2       4.053 FPS  3.37x
8T       SSE4 x2      20.537 FPS  17.09x
1T       AVX2          8.646 FPS  4.70x
8T       AVX2         42.444 FPS  35.31x
1T       AVX2 x2       7.291 FPS  6.07x
8T       AVX2 x2      36.776 FPS  30.60x

Table 6: MD5 Performance on Intel® Core™ i7 5960X with both SIMD and threading

Threads  Type         Time       Speedup
1T       x86 integer  379.389s   1.00x
16T      x86 integer   28.499s   13.34x
1T       SSE4         108.108s   3.51x
16T      SSE4           9.194s   41.26x
1T       SSE4 x2       75.694s   5.01x
16T      SSE4 x2        7.381s   51.40x
1T       AVX2          51.490s   3.37x
16T      AVX2           3.965s   95.68x
1T       AVX2 x2       36.015s   10.53x
16T      AVX2 x2        3.387s   112.01x

Table 7: Raytracing Performance on Intel® Core™ i7 5960X with both SIMD and threading

Threads  Type         FPS         Speedup
1T       x86 integer   0.992 FPS  1.00x
16T      x86 integer   6.813 FPS  6.87x
1T       SSE4          3.744 FPS  3.774x
16T      SSE4         37.927 FPS  38.23x
1T       SSE4 x2       3.282 FPS  3.31x
16T      SSE4 x2      33.770 FPS  34.04x
1T       AVX2          6.960 FPS  7.02x
16T      AVX2         70.545 FPS  71.11x
1T       AVX2 x2       5.947 FPS  6.00x
16T      AVX2 x2      59.252 FPS  59.76x

1Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As you can see, the threading results vary depending on the CPU, while the SIMD-thread results scale similarly across both. It is striking that you can reach speed-up factors in the high double digits if you combine both ideas. It makes sense to go for the 8x speed up on a dual core, and it makes just as much sense to go for an additional 8x speed up on highly expensive server hardware.

Join me, SIMD-ify your code!

About the Author

Michael Kopietz is Render Architect at Crytek's R&D, where he leads a team of engineers developing the rendering of CryEngine(R) and also guides students during their theses. He has worked, among other things, on the cross-platform rendering architecture, software rendering, and highly responsive servers, always with high-performance and reusable code in mind. Prior to that, he worked on ship-battle and soccer simulation game rendering. Having his roots in assembler programming on the earliest home consoles, he still wants to make every cycle count.

Code License

All code in this article is © 2014 Crytek GmbH, and released under the https://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement license. All rights reserved.

References

[1] Memory Management for Optimal Performance on Intel® Xeon Phi™ Coprocessor: Alignment and Prefetching https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and

[2] Rendering Worlds with Two Triangles by Iñigo Quilez http://www.iquilezles.org/www/material/nvscene2008/nvscene2008.htm

AES-GCM Encryption Performance on Intel® Xeon® E5 v3 Processors


This case study examines the architectural improvements made to the Intel® Xeon® E5 v3 processor family in order to improve the performance of the Galois/Counter Mode of AES block encryption. It looks at the impact of these improvements on the nginx* web server when backed by the OpenSSL* SSL/TLS library. With this new generation of Xeon processors, web servers can obtain significant increases in maximum throughput by switching from AES in CBC mode with HMAC+SHA1 digests to AES-GCM.

Background

The goal of this case study is to examine the impact of the microarchitecture improvements made in the Intel Xeon v3 line of processors on the performance of an SSL web server. Two significant enhancements relating to encryption performance were latency reductions in the Intel® AES New Instructions (Intel® AES-NI) instructions and a latency reduction in the PCLMULQDQ instruction. These changes were designed specifically to increase the performance of the Galois/Counter Mode of AES, commonly referred to as AES-GCM.

One of the key features of AES-GCM is that the Galois field multiplication that is used for message authentication can be computed in parallel with the block encryption. This permits a much higher level of parallelization than is possible with chaining modes of AES, such as the popular Cipher Block Chaining (CBC) mode. The performance gain of AES-GCM over AES-CBC with HMAC+SHA1 digests was significant even on older generation CPUs such as the Xeon v2 family, but the architectural improvements to the Xeon v3 family further widen the performance gap.

Figure 1 shows the throughput gains realized from OpenSSL’s speed tests by choosing the aes-128-gcm EVP over aes-128-cbc-hmac-sha1 on both Xeon E5 v2 and Xeon E5 v3 systems. The hardware and software configuration behind these tests is given in Table 1. The data shows that AES-GCM outperforms AES-CBC with HMAC+SHA1 by as much as 2.5x on Xeon E5 v2, and by nearly 4.5x on Xeon E5 v3; in other words, the performance gap between GCM and CBC nearly doubles from Xeon E5 v2 to v3.
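
The relative numbers in Figure 1 come from the summary table that `openssl speed -evp <cipher>` prints. A minimal sketch of that comparison follows; the sample rows are illustrative placeholders, not measured data:

```python
# Sketch: compare `openssl speed -evp` results for two EVP ciphers.
# The sample rows below are illustrative placeholders, not measured data.

def parse_speed_row(row: str) -> dict:
    """Parse one summary row of `openssl speed -evp` output.

    Rows look like 'aes-128-gcm 410000.00k 700000.00k ...', with
    throughput in kB/s for each block size tested.
    """
    fields = row.split()
    name, rates = fields[0], [float(f.rstrip("k")) for f in fields[1:]]
    return {"name": name, "rates_kBps": rates}

gcm = parse_speed_row("aes-128-gcm 410000.00k 700000.00k 1100000.00k")
cbc = parse_speed_row("aes-128-cbc-hmac-sha1 180000.00k 220000.00k 250000.00k")

# Relative speedup at the largest block size tested.
speedup = gcm["rates_kBps"][-1] / cbc["rates_kBps"][-1]
print(f"GCM vs CBC+HMAC-SHA1: {speedup:.1f}x")
```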


Table 1. Hardware and software configurations for OpenSSL speed tests.

In order to assess how this OpenSSL raw performance translates to SSL web server throughput, this case study looks at the maximum throughput achievable by the nginx web server when using these two encryption ciphers.


Figure 1. Relative OpenSSL 1.0.2a speed results for the aes-128-gcm and aes-128-cbc-hmac-sha1 EVPs on Xeon E5 v2 and v3 processors

The Test Environment

The performance limits of nginx were tested for the two ciphers by generating a large number of parallel connection requests, and repeating those connections as fast as possible for a total of two minutes. At the end of those two minutes, the maximum latency across all requests was examined along with the resulting throughput. The number of simultaneous connections was adjusted between runs to find the maximum throughput that nginx could achieve for the duration without connection latencies exceeding 2 seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.
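
The tuning loop described above can be sketched as a simple search over connection counts. `run_load_test` is a hypothetical stand-in for a real two-minute benchmark run returning (throughput in Gbps, worst-case latency in seconds):

```python
# Sketch of the test methodology: find the highest connection count whose
# worst-case latency stays under 2 seconds, and report its throughput.
# run_load_test is a hypothetical stand-in for a real 2-minute benchmark.

LATENCY_LIMIT_S = 2.0

def find_max_throughput(run_load_test, counts):
    best = (0, 0)  # (connections, throughput in Gbps)
    for n in counts:
        throughput, max_latency = run_load_test(n)
        if max_latency > LATENCY_LIMIT_S:
            break  # server saturated: latencies exceed the 2 s budget
        if throughput > best[1]:
            best = (n, throughput)
    return best

# Toy model: throughput grows then flattens; latency blows up past 800
# simultaneous connections.
def fake_run(n):
    return min(n // 20, 40), 0.5 if n <= 800 else 3.0

print(find_max_throughput(fake_run, range(100, 1601, 100)))
```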

Nginx was installed on a pre-production, two-socket Intel Xeon server system populated with two production E5-2697 v3 processors clocked at 2.60 GHz with Turbo on and Hyper-Threading off. The system was running Ubuntu* Server 13.10. Each E5 processor had 14 cores for a total of 28 hardware threads. Total system RAM was 64 GB.

The SSL capabilities for nginx were provided by the OpenSSL library. OpenSSL is an Open Source library that implements the SSL and TLS protocols in addition to general purpose cryptographic functions and the 1.0.2 branch is optimized for the Intel Xeon v3 processor. More information on OpenSSL can be found at http://www.openssl.org/. The tests in this case study were made using 1.0.2-beta3 as the production release was not yet available at the time these tests were run.

The server load was generated by up to six client systems as needed; a mixture of Xeon E5 and Xeon E5 v2 class hardware. Each system was connected to the nginx server with multiple 10 Gbit direct connect links. The server had two 4x10 Gbit network cards, and two 2x10 Gbit network cards. Two of the clients had 4x10 Gbit cards, and the remaining four had a single 10 Gbit NIC.

The network diagram for the test environment is shown in Figure 2.


Figure 2. Test network diagram.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an Open Source utility that is included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to client CPU demands, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results.
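
The synchronization wrapper itself was not published, but the collation step can be sketched: each ab instance prints a "Transfer rate" line, and the wrapper sums those rates across instances. The report strings below are illustrative stand-ins for real ab output:

```python
import re

# Sketch: collate "Transfer rate" lines from multiple Apache Benchmark (ab)
# runs into an aggregate throughput. The report text below is an
# illustrative stand-in for real ab output.

RATE_RE = re.compile(r"Transfer rate:\s+([\d.]+) \[Kbytes/sec\]")

def transfer_rate_kbps(ab_report: str) -> float:
    match = RATE_RE.search(ab_report)
    if not match:
        raise ValueError("no Transfer rate line found")
    return float(match.group(1))

reports = [
    "Requests per second: 512.3\nTransfer rate: 1048576.00 [Kbytes/sec] received",
    "Requests per second: 498.7\nTransfer rate: 917504.00 [Kbytes/sec] received",
]

total_kbytes_per_sec = sum(transfer_rate_kbps(r) for r in reports)
total_gbps = total_kbytes_per_sec * 1024 * 8 / 1e9
print(f"aggregate throughput: {total_gbps:.2f} Gbps")
```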

The Test Plan

The goal of the tests was to determine the maximum throughput that nginx could sustain over two minutes of repeated, incoming connection requests for a target file, and to compare the results for the AES128-SHA cipher to those of the AES128-GCM-SHA256 cipher on the Xeon E5-2697 v3 platform. Note that in the GCM cipher suites, the _SHA suffix refers to the SHA hashing function used as the Pseudo Random Function algorithm in the cipher, in this case SHA-256.


Table 2. Selected TLS Ciphers

Each test was repeated for a fixed target file size, starting at 1 MB and increasing by powers of four up to 4 GB, where 1 GB = 1024 MB, 1 MB = 1024 KB, and 1 KB = 1024 bytes. The use of files 1MB and larger minimized the impact of the key exchange on the session throughput. Keep-alives were disabled so that each connection resulted in fetching a single file.
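
The size ladder above can be generated directly:

```python
# Target file sizes: 1 MB up to 4 GB, increasing by powers of four,
# using binary units (1 GB = 1024 MB, 1 MB = 1024 KB, 1 KB = 1024 bytes).
sizes_mb = [4 ** i for i in range(7)]            # 1, 4, 16, ..., 4096 MB
sizes_bytes = [mb * 1024 * 1024 for mb in sizes_mb]
print(sizes_mb)  # [1, 4, 16, 64, 256, 1024, 4096]
```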

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket)
  • 4 cores enabled (2 cores per socket)
  • 8 cores enabled (4 cores per socket)
  • 16 cores enabled (8 cores per socket)

Hyper-threading was disabled in all configurations. Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that nginx performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the projected performance of a system with many cores.

The many-core runs test the scalability of the system, and also introduce the possibility of system resource limits beyond just CPU utilization.

System Configuration and Tuning

Nginx was configured to operate in multi-process mode, with one worker for each physical thread on the system.

An excerpt from the configuration file, nginx.conf, is shown in Figure 3.

worker_processes 16; # Adjust this to match the core count

events {
        worker_connections 8192;
        multi_accept on;
}

Figure 3. Excerpt from nginx configuration

To support the large number of simultaneous connections that might occur at the smaller target file sizes, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

*               soft    nofile          150000
*               hard    nofile          180000

Figure 4. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30

# Increase system IP port limits to allow for more connections

net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_window_scaling = 1

# number of packets to keep in backlog before the kernel starts
# dropping them
net.ipv4.tcp_max_syn_backlog = 3240000

# increase socket listen backlog
net.ipv4.tcp_max_tw_buckets = 1440000

# Increase TCP buffer sizes
net.core.rmem_default = 8388608
net.core.wmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 16777216 16777216 16777216

Figure 5. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated SSL/TLS web server.

No other adjustments were made to the stock Ubuntu 13.10 server image.

Results

The maximum throughput in Gbps achieved for each cipher by file size is shown in Figure 6. At the smallest file size, 1 MB, the differences between the GCM and CBC ciphers are modest because the SSL handshake dominates the transaction, but for the larger file sizes the GCM cipher outperforms the CBC cipher by 2 to 2.4x. Raw GCM performance is roughly 8 Gbps/core. This holds true up until 8 cores, when the maximum throughput is no longer CPU-limited. This is the point where other system limitations prevent the web server from achieving higher transfer rates, revealed more dramatically in the 16-core case. Here, both ciphers see only a modest increase in throughput, though the CBC cipher realizes a larger benefit.

This is more clearly illustrated in Figure 7, which plots the maximum CPU utilization of nginx during the 2-minute run for each case. In the 2- and 4-core cases, %CPU for both ciphers is in the high 90’s, and in the 8-core case it ranges from 80% for large files to 98% for smaller ones.


Figure 6. Maximum nginx throughput by file size for given core counts

It is the 16-core case where system resource limits begin to show significantly, along with large differences in the performance of the ciphers themselves. Here, the total throughput has only increased incrementally from the 8-core case, and it’s apparent that this is because the additional cores simply cannot be put to use. The GCM cipher is using only 50 to 70% of the available CPU. It’s also clear that the GCM cipher is doing more—specifically, providing a great deal more throughput than the CBC cipher—with significantly less compute power.


Figure 7. Max %CPU utilization at maximum nginx throughput

Conclusions

The architectural changes to the Xeon v3 family of processors have a significant impact on the performance of the AES-GCM cipher, and they provide a very compelling argument for choosing it over AES-CBC with HMAC+SHA1 digests for SSL/TLS web servers.

In the raw OpenSSL speed tests, the performance gap between GCM and CBC nearly doubles from the Xeon E5 v2 family to v3. In the web server tests, the use of the AES-GCM cipher led to roughly 2 to 2.4x the throughput of the AES-CBC cipher, and absolute data rates of about 8 Gbps/core. The many-core configurations are able to achieve total data transfer rates in excess of 50 Gbps before hitting system limits. This level of throughput was achieved on an off-the-shelf Linux installation with very minimal system tuning.

It may be necessary to continue to support AES-CBC with HMAC+SHA1 digests due to the large number of clients that cannot take advantage of AES-GCM, but AES-GCM should certainly be enabled on web servers running on Xeon v3 family processors in order to provide the best possible performance, not to mention the added security, that this cipher offers.

 

Optimizing Android* Game mTricks Looting Crown on the Intel® Atom™ Platform


Abstract

Games for smartphones and tablets are the most popular category on app stores. In the early days, mobile devices had significant CPU and GPU constraints that affected performance. So most games had to be simple. Now that CPU and GPU performance has increased, more high-end games are being produced. Nevertheless, a mobile processor still has less performance than a PC processor.

With the growth in the mobile market, many PC game developers are now making games for the mobile platform. However, traditional game design decisions and the graphic resources of a PC game are not a good fit for mobile processors and may not perform well. This article shows how to analyze and improve the performance of a mobile game and how to optimize graphic resources for a mobile platform, using mTricks Looting Crown as an example. The Looting Crown IA version is now available at the following link:

https://play.google.com/store/apps/details?id=com.barunsonena.looting

mTricks Looting Crown
Figure 1. mTricks Looting Crown

1. Introduction

mTricks has significant experience in PC game development using a variety of commercial game engines. While planning its next project, mTricks forecasted that the mobile market was ready for a complex MMORPG, given the performance growth of mobile CPUs and GPUs. So it changed the game target platform for its new project from the PC to mobile.

mTricks first ported the PC codebase to Android*. However, the performance was less than expected on the target mobile platforms, including an Intel® Atom™ processor-based platform (code named Bay Trail).

mTricks was encountering two problems that often face PC developers who transition to mobile:

  1. The low processing power of the mobile processor means that traditional PC graphic resources and designs are unsuitable.
  2. Due to capability and performance variations among mobile CPUs and GPUs, game display and performance vary on different target platforms.

2. Executive summary

Looting Crown is an SNRPG (Social Network + RPG) style game supporting full 3D graphics and various multiplayer modes (PvP, PvE, and Clan vs. Clan). mTricks developed and optimized it on a Bay Trail reference design; the specification is listed in Table 1.

Table 1. Bay Trail reference design specification and 3DMark score

Bay Trail reference design 10”
CPU: Intel® Atom™ processor Quad Core 1.46 GHz
RAM: 2 GB
Resolution: 2560 x 1440
3DMark ICE Storm Unlimited Score: 15,094
Graphics score: 13,928
Physics score: 21,348

mTricks used Intel® Graphics Performance Analyzers (Intel® GPA) to find CPU and GPU bottlenecks during development and used the analysis to solve issues of graphic resources and performance.

The baseline performance was 23 fps, and Figure 2 shows GPU Busy and Target App CPU Load statistics during a 2 minute run. The average of GPU Busy is about 91%, and the Target App CPU Load is about 27%.

Intel® GPA System Analyzer
Figure 2. Comparing CPU and GPU load of the baseline version with Intel® GPA System Analyzer

3. Where is the bottleneck between CPU and GPU?

There are two ways to know where the bottleneck is between CPU and GPU. One is to use an override mode, and the other is to change CPU frequency.

Intel GPA System Analyzer provides the “Disable Draw Calls” override mode to help developers find where the bottleneck is between CPU and GPU. After running this override mode, compare each result with/without the override mode and check the following guidelines:

Table 2. How to analyze games with Disable Draw Calls override mode

Performance change with “Disable Draw Calls” override mode | Bottleneck
If FPS doesn’t change much | The game is CPU bound; use the Intel® GPA Platform Analyzer or Intel® VTune™ Amplifier to determine which functions are taking the most time
If FPS improves | The game is GPU bound; use the Intel GPA Frame Analyzer to determine which draw calls are taking the most time

Intel GPA System Analyzer can simulate the application performance with various CPU settings, which is useful for bottleneck analysis. To determine whether your application performance is CPU bound, do the following:

  1. Verify that your application is not Vertical Sync (Vsync) bound.
    Check the Vsync status. Vsync is enabled if you see the gray highlight in the Intel GPA System Analyzer Notification pane.
    • If Vsync is disabled, proceed to step 2.
    • If Vsync is enabled, review the frame rate in the top-right corner of the Intel GPA System Analyzer window. If the frame rate is around 60 FPS, your application is Vsync bound, and there is no opportunity to increase FPS. Otherwise, proceed to step 2.
  2. Force a different CPU frequency using the sliders in the Platform Settings pane (Figure 3) of the Intel GPA System Analyzer window. If the FPS value changes when you modify the CPU frequency, the application is likely to be CPU bound.

Platform Settings pane
Figure 3. Modify the CPU frequency in the Platform Settings pane

Table 3 shows the simulation results for Looting Crown. With “Disable Draw Calls” override on, the FPS remained unchanged. This would normally indicate the game was CPU bound. However, the “Highest CPU freq” override also didn’t change FPS, implying that Looting Crown was GPU bound. To resolve this, we returned to the data in Figure 2, which showed that the GPU load was about 91% and CPU load was about 27% on the Bay Trail device. The CPU could not be utilized well due to the GPU bottleneck. We proceeded with the plan to optimize the GPU usage first and then retest.

Table 3. The FPS result of the baseline version with Disable Draw Calls and Highest CPU Frequency.

Bay Trail device | FPS
Original | 23
Disable Draw Calls | 23
Highest CPU freq. | 23
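
The decision rules from Tables 2 and 3 can be expressed as a small helper. Note that when neither override moves the FPS, as happened here, the FPS numbers alone are inconclusive and you must fall back on the GPU Busy and CPU load counters:

```python
def classify_bottleneck(fps_original, fps_no_draw_calls, fps_max_cpu_freq,
                        tolerance=0.05):
    """Apply the Table 2/3 heuristics for Intel GPA override modes."""
    def changed(a, b):
        return abs(a - b) / a > tolerance

    if changed(fps_original, fps_no_draw_calls):
        return "GPU bound"    # removing GPU work raised FPS
    if changed(fps_original, fps_max_cpu_freq):
        return "CPU bound"    # a higher CPU frequency raised FPS
    return "inconclusive"     # check GPU Busy / CPU load counters instead

# Baseline run (Table 3): neither override helps -> look at the 91% GPU load.
print(classify_bottleneck(23, 23, 23))   # inconclusive
# 1st optimization (Table 9): disabling draw calls helps -> GPU bound.
print(classify_bottleneck(40, 60, 40))   # GPU bound
```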

4. Identifying GPU bottlenecks

We found that the performance bottleneck was in the GPU. As a next step, we analyzed the cause of the GPU bottleneck with Intel GPA Frame analyzer. Figure 4 shows the captured frame information of the baseline version.

 Intel® GPA Frame Analyzer
Figure 4. Intel® GPA Frame Analyzer view of the baseline version

4.1 Decrease the number of draw calls by merging hundreds of static meshes into one and using a bigger texture

Tables 4 and 5 show the information captured by Intel GPA Frame analyzer.

Table 4. The captured frame information of the baseline version

Total Ergs: 1,726
Total Primitive Count: 122,204
GPU Duration: 23 ms
Time to show frame: 48 ms

Table 5. Draw call cost of the baseline version

Type | Erg | Time (ms) | %
Clear | 0 | 0.2 ms | 0.5 %
Ocean | 1 | 6 ms | 13.7 %
Terrain | 2~977 | 20 ms | 41.9 %
Grass | 19~977 | 18 ms | 39.0 %
Character, building and effect | 978~1676 | 19 ms | 40.6 %
UI | 1677~1725 | 1 ms | 3.4 %

The total time of “Terrain” is 20 ms, of which “Grass” accounts for 18 ms, about 90% of the “Terrain” processing time. So we analyzed further to see why “Grass” processing takes so long.

Figures 5 and 6 show the output of the ergs for “Terrain” and “Grass”.

the terrain
Figure 5. The terrain

texture of grass
Figure 6. Texture of “Grass”

Looting Crown drew the terrain by drawing a small grass quad repeatedly, so the number of draw calls in “Terrain” was 960. The drawing time of one small grass quad is very small; however, each draw call has overhead, which makes it an expensive operation. So we recommended decreasing the number of draw calls by merging hundreds of static meshes into one static mesh and using a bigger texture. Table 6 shows the changed result.

Table 6. Comparison of draw cost between small and big texture

Small texture: 18 ms
Number of ergs: 960
Big texture: 6 ms
Number of ergs: 1

the changed terrain
Figure 7. The changed terrain

Even simplified, the tile-based terrain required a lot of draw calls; by reducing the number of draw calls, we saved 12 ms on drawing the “Grass”.
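
The batching idea can be sketched in a few lines: concatenate many small meshes into one vertex/index buffer so the whole terrain is submitted in a single draw call. The mesh representation here is simplified to (vertices, indices) tuples; a real engine would also pack the grass textures into the bigger texture described above:

```python
# Sketch: merge many small static meshes into one buffer so the whole
# terrain can be submitted as a single draw call. Meshes are simplified
# to (vertices, indices) tuples.

def merge_meshes(meshes):
    vertices, indices = [], []
    for verts, idx in meshes:
        base = len(vertices)                 # re-base indices into the big buffer
        vertices.extend(verts)
        indices.extend(base + i for i in idx)
    return vertices, indices

# 960 copies of a two-triangle grass quad, each at a different offset.
quad_idx = [0, 1, 2, 2, 3, 0]
quads = []
for n in range(960):
    x = float(n)
    quads.append(([(x, 0.0), (x + 1, 0.0), (x + 1, 1.0), (x, 1.0)], quad_idx))

verts, idx = merge_meshes(quads)
print(len(quads), "draw calls ->", 1)        # one merged mesh, one draw call
print(len(verts), "vertices,", len(idx) // 3, "triangles")
```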

4.2 Optimizing graphics resources

Tables 7 and 8 show the new information captured by Intel GPA Frame analyzer after applying the big texture for grass.

Table 7. The captured frame information of the 1st optimization version

Total Ergs: 179
Total Primitive Count: 27,537
GPU Duration: 24 ms
Time to show frame: 27 ms

Table 8. Draw call cost of the 1st optimization version

Type | Erg | Time (ms) | %
Clear | 0 | 2 ms | 10.4 %
Ocean | 18 | 6 ms | 23.6 %
Terrain | 1~17, 19, 23~96 | 14 ms | 54.3 %
Grass | 19 | 6 ms | 23.2 %
Character, building and effect | 20~22, 97~131 | 1 ms | 5.9 %
UI | 132~178 | 1 ms | 5.7 %

We checked whether the game was still GPU bound by repeating the measurement with the “Disable Draw Calls” and “Highest CPU Frequency” simulations.

Table 9. The FPS result of 1st optimization version with “Disable Draw Calls” and “Highest CPU Frequency”

Bay Trail device | FPS
Original | 40
Disable Draw Calls | 60
Highest CPU freq. | 40

In Table 9, the “Disable Draw Calls” simulation increased the FPS while the “Highest CPU Frequency” simulation did not change it, so we knew Looting Crown was still GPU bound. We also checked CPU load and GPU Busy again.

 Intel® GPA System Analyzer
Figure 8. CPU and GPU load of the 1st optimization version with Intel® GPA System Analyzer

Figure 8 shows that GPU load is about 99% and CPU load is about 13% on Bay Trail. The CPU still could not be a source of speedup due to the GPU bottleneck.

Looting Crown was originally developed for PCs, so the existing graphic resources were not suitable for mobile devices, which have lower GPU and CPU processing power. We did several optimizations to the graphic resources as follows.

  1. Minimizing Draw Calls
    1. Reduced the number of materials: The number of object materials was reduced from 10 to 2.
    2. Reduced the number of particle layers.
  2. Minimizing the number of polygons
    1. Applied LOD (level of detail) for characters using the “Simplygon” tool.
      progressively reduced LOD

      Figure 9. A character with progressively reduced LOD

    2. Minimized number of polygons used for terrain: First, we minimized the number of polygons for faraway mountains that did not require much detail. Second, we minimized the number of polygons for flat terrain that could be represented by two triangles.
  3. Using optimized light maps
    1. Removed the dynamic lights for “Time of Day”.
    2. Minimized the light map size of each mesh: Reduced the number of light maps used for the background.
  4. Minimizing the changes of render states
    1. Reduced the number of materials, which also reduced render state changes and texture changes.
  5. Decoupling the animation part in static mesh
    1. Havok engine didn’t support a partial update of an animated part of an object. An object with only a small moving mesh was being updated even for the static mesh part of the object. So, we separated the animated part (smoke, red circle on Figure 10) from the rest of the object, dividing it into two separate object models.

decoupled animation
Figure 10. Decoupled animation of the smoke from the static mesh

4.3 Apply Z-culling efficiently

When an object is rendered by the 3D graphics card, the three-dimensional data is changed into two-dimensional (x-y) data, and the Z-buffer or depth buffer is used to store the depth information (z coordinate) of each screen pixel. If two objects of the scene must be rendered in the same pixel, the GPU compares the two depths and overrides the current pixel if the new object is closer to the observer. The Z-buffer thus reproduces the usual depth perception correctly. Z-culling is the process of drawing the closest objects first, so that a closer object hides a farther one. Z-culling improves performance by avoiding the rendering of hidden surfaces.

In Looting Crown, there were two kinds of terrain drawing: Ocean drawing and Grass drawing. Because large portions of ocean were behind grass, lots of ocean areas were hidden. However, the ocean was rendered earlier than grass, which prevented efficient Z-culling. Figures 11 and 12 show the GPU duration time of drawing ocean and grass, respectively; erg 18 is for ocean and erg 19 is for grass. If grass is rendered before ocean, then the depth test would indicate that the ocean pixels would not need to be drawn. It would result in decreased GPU duration of drawing ocean. Figure 13 shows the ocean drawing cost on the second optimization. The GPU duration decreased from 6 ms to 0.3 ms.

ocean drawing cost first optimization
Figure 11. Ocean drawing cost of 1st optimization

grass drawing cost of first optimization
Figure 12. Grass drawing cost of 1st optimization

Ocean draw cost of second optimization
Figure 13. Ocean draw cost of 2nd optimization
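
The reordering described above amounts to sorting opaque draw calls front to back by depth so the Z-test can reject occluded pixels before they are shaded; a minimal sketch:

```python
# Sketch: sort opaque draw calls front to back by view-space depth so
# Z-culling can reject occluded pixels (e.g., ocean behind grass) before
# they are shaded. Depth values are illustrative.

draws = [
    {"name": "ocean",   "depth": 120.0},  # far, mostly hidden
    {"name": "grass",   "depth": 10.0},   # near, covers most of the ocean
    {"name": "terrain", "depth": 40.0},
]

front_to_back = sorted(draws, key=lambda d: d["depth"])
print([d["name"] for d in front_to_back])  # ['grass', 'terrain', 'ocean']
```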

Results

By taking these steps, mTricks optimized all graphics resources for mobile devices without compromising graphics quality. The erg count decreased from 1,726 to 124, and the primitive count decreased from 122,204 to 9,525.

mTricks Looting Crown
Figure 14. The change of graphics resource

Figure 15 and Table 10 show the outcome of all these optimizations. After optimizations, FPS changed from 23 FPS to 60 FPS on the Bay Trail device.

FPS Increase
Figure 15. FPS Increase

Table 10. Changed FPS, GPU Busy, and App CPU Load

 | Baseline | 1st Optimization | 2nd Optimization
FPS | 23 | 45 | 60
GPU Busy (%) | 91% | 99% | 71%
App CPU Load (%) | 27% | 13% | 22%

After the first optimization, Bay Trail still was GPU bound. We did the second optimization to reduce the GPU workload by optimizing the graphic resources and z-buffer usage. Finally the Bay Trail device hit the maximum (60) FPS. Because Android uses Vsync, 60 FPS is the maximum performance on the Android platform.

Conclusion

When you start to optimize a game, first determine where the application bottleneck is. Intel GPA can help you do this with some powerful analytic tools. If your game is CPU bound, Intel VTune Amplifier is a helpful tool; if it is GPU bound, you can find more detail using Intel GPA. To fix GPU bottlenecks, try to find an efficient way of reducing draw calls, polygon count, and render state changes. You can also check the right size of terrain textures, animation objects, and light maps, and the right order of z-buffer culling.

About the Authors

Tai Ha is an application engineer focusing on enabling online games in APAC region. He has been working for Intel since 2005 covering Intel® Architecture optimization on Healthcare, Server, Client, and Mobile platforms. Before joining Intel, Tai worked for biometric companies based in Santa Clara, USA as a security middleware architect since 1999. He received his BS in Computer Science from Hanyang University, Korea.

Jackie Lee is an Applications Engineer with Intel's Software Solutions Group, focused on performance tuning of applications on Intel® Atom™ platforms. Prior to Intel, Jackie Lee worked at LG in the electronics CTO department. He received his MS and BS in Computer Science and Engineering from The ChungAng University.

References

The Looting Crown IA version is now released on Google Play:

https://play.google.com/store/apps/details?id=com.barunsonena.looting

Intel® Graphics Performance Analyzers
https://software.intel.com/en-us/vcsource/tools/intel-gpa

Havok
http://www.havok.com

mTricks
https://www.facebook.com/mtricksgame

Intel, the Intel logo, and Atom are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

Intel® System Studio - Solutions, Tips and Tricks


“Flashcard” Speech Recognition App Using Intel® RealSense™ SDK


Download PDF[PDF 603KB]

Download Zip File[ZIP 798KB]

Abstract

The Flashcard code sample demonstrates some of the speech recognition features in the Intel® RealSense™ SDK for Windows*. The SDK includes speech modules for integrating dictation and verbal command control in your applications. These two modes of operation provide the following:

  • Dictation - The SDK module returns the user’s dictated sentence.
  • Command and Control - The application defines a list of words as the command list and the SDK module recognizes speech based solely on the command list.

The flashcard app uses the Command and Control mode to accept verbal input from the user. It does not demonstrate any Dictation features. The app displays simple multiplication problems and matches the user’s spoken response to the correct answer.
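
The app's core loop, generating a random multiplication problem and checking an answer, can be sketched independently of the SDK; the names here are illustrative, not the sample's actual code:

```python
import random

# Sketch of the flashcard logic, independent of the RealSense SDK:
# generate a random multiplication problem and check a (recognized) answer.
# These names are illustrative, not the sample's actual code.

def new_problem(rng):
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return a, b, a * b

def check_answer(expected, spoken_value):
    return spoken_value == expected

rng = random.Random(7)                 # seeded for a repeatable demo
a, b, answer = new_problem(rng)
print(f"{a} x {b} = ?")
print("Correct!" if check_answer(answer, answer) else f"Answer: {answer}")
```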

Introduction

This code sample demonstrates the basics of using the Command and Control speech recognition capabilities of the SDK. The app displays randomly generated multiplication problems and waits for verbal input from the user.

Figure 1: The Flashcard sample recognizes spoken numbers as input

If the user says the correct answer as shown in Figure 1, the app responds by displaying the user’s answer in green and indicating “Correct!” on the screen. After a short delay, the app displays another randomly created multiplication problem and awaits a response from the user.

Figure 2: Incorrect answers are displayed in red

If the user says the incorrect answer as shown in Figure 2, the app responds by displaying the user’s answer in red and shows the correct answer on the screen.

Purpose

The purpose of this code sample is to distill the complexities of the SDK down to the basics of using the speech recognition module and present this information in a simple use case scenario.

Development Environment

The sample app can be built using Microsoft Visual Studio* Express 2013 for Windows Desktop or the professional versions of Visual Studio 2013.

Configuring the Speech Recognition Module

A method named ConfigureRealSense() is called on startup to prepare the app for accepting speech commands from the user. This method performs the following actions:

  • Instantiates session and audio source objects
  • Selects the audio device
  • Sets the audio recording volume
  • Creates a speech recognition instance
  • Initializes the speech recognition module
  • Builds and sets the active grammar
  • Displays device information

The sample app selects the first audio device (index 0) from the audio source device list; however, the SDK provides a mechanism to scan and enumerate audio devices on the computer to allow a user to select the desired input device. This technique is shown in the SDK documentation.

In the sample app the recording volume is set to a fixed value, but in a full-featured app it is recommended to provide a control for setting this parameter and give visual feedback indicating if the user's volume is adjusted adequately.

Handling Speech Recognition Events

An OnRecognition() event handler is implemented to capture data from the speech recognition module when active recognition results are available. The RecognitionData structure passed to the handler describes details of the recognition event (e.g., confidence, sentence, etc.)

The sample app uses a fixed threshold for evaluating the confidence level returned by the speech recognition module; however, the SDK documentation suggests that you “use thresholding to increase or decrease certain aspect of voice recognition. For example, your application may expose a graphical user interface control to let the user adjust what is the acceptable recognition rate. The application can use 50% as the baseline.”

Setting the Active Grammar

When using the Command and Control mode, the speech recognition module uses a list of commands (referred to as the “grammar”) and ignores any words or phrases not contained in the list. The commands can be loaded using either the BuildGrammarFromStringList() method to define the list programmatically or the BuildGrammarFromFile() method to read the grammar from a Java* Speech Grammar Format (JSGF) file. We use the latter method so that we can take advantage of a shorthand for our grammar, and not have to enter all possible answer numbers as distinct strings.

The Flashcard app uses the SDK’s BuildGrammarFromFile() method to open the grammar.jsgf file and build its grammar using the file’s contents. (For more information on the JSGF file format refer to http://www.w3.org/TR/jsgf/). The contents of grammar.jsgf are shown in the following table.

#JSGF V1.0;
grammar Digits;
public <Digits> = ( <digit> ) + ;
<digit> = ( zero | one | two | three | four | five | six | seven | eight | nine | ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety );

The notation used in this code sample is similar to examples shown in the SDK’s documentation (RSSDK_DIR\doc\PDF\sdkmanuals.pdf), and you are encouraged to review this for a more thorough explanation of the different formatting options that are available. The <digit> rule identifies the grammar that will be used by the speech recognition module. The “+” sign signifies that whatever comes before it should occur one or more times. This format permits not only single words like “four” to be recognized, but also accepts phrases like “forty four”.
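
To compare a recognized phrase such as "forty four" against a numeric answer, the word sequence accepted by this grammar has to be converted to an integer. A minimal sketch of that conversion (not part of the SDK) follows:

```python
# Sketch: convert a phrase accepted by the Digits grammar ("forty four",
# "seven") into an integer for answer checking. Not part of the SDK.

UNITS = {w: n for n, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * n for n, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}

def phrase_to_int(phrase: str) -> int:
    value = 0
    for word in phrase.lower().split():
        if word in TENS:
            value += TENS[word]
        elif word in UNITS:
            value += UNITS[word]
        else:
            raise ValueError(f"word not in grammar: {word}")
    return value

print(phrase_to_int("forty four"))   # 44
print(phrase_to_int("seven"))        # 7
```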

Check It Out

Download the app and learn more about how speech recognition works in the Intel RealSense SDK for Windows.

About Intel® RealSense™ Technology

To get started and learn more about the Intel RealSense SDK for Windows, go to https://software.intel.com/en-us/intel-realsense-sdk

About the Author

Bryan Brown is a software applications engineer in the Developer Relations Division at Intel.
