Channel: Intel Developer Zone Articles

Optimization Techniques for the Intel® MIC Architecture: Part 2 of 3


Abstract

This is part 2 of a 3-part educational series of publications introducing select topics on optimization of applications for Intel’s multi-core and many-core architectures (Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors).

In this paper we discuss data parallelism. Our focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, we construct and optimize a micro-kernel for binning particles.

Similar workloads occur in applications such as Monte Carlo simulations, particle physics software, and statistical analysis.

The optimization technique discussed in this paper leads to code vectorization, which results in an order-of-magnitude performance improvement on an Intel Xeon processor. Performance on an Intel Xeon Phi coprocessor, compared to that on a high-end Intel Xeon processor, is 1.4x greater in single precision and 1.6x greater in double precision.
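The full micro-kernel is in the downloadable article. As a rough illustration of the binning pattern discussed here, consider the following hypothetical sketch; the function name, bounds, and clamping policy are ours, not the paper's:

```c
/* Hypothetical particle-binning micro-kernel: map each particle's
   coordinate to a histogram bin. Illustrative only. */
void bin_particles(const float* x, int n,
                   float xmin, float xmax, int nbins,
                   int* counts)
{
    const float inv_dx = (float)nbins / (xmax - xmin); /* hoisted divide */
    for (int i = 0; i < n; i++) {
        int b = (int)((x[i] - xmin) * inv_dx);
        if (b < 0) b = 0;                 /* clamp out-of-range particles */
        if (b >= nbins) b = nbins - 1;
        counts[b]++;  /* the scattered increment is what makes naive
                         vectorization of this loop difficult */
    }
}
```

As written, the histogram increment creates a potential conflict between vector lanes, which is exactly the kind of obstacle the optimization technique in the article addresses.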

Download the full article


Elevating Head of the Order* Gameplay with Gesture Control


By Edward J. Correia

Intel is aiming to revolutionize the way users interface with traditional PCs, and Jacob Pennock is among the movement's primary champions. Back in 2013, Pennock won the Intel Perceptual Computing Challenge with Head of the Order*, a game that cast a spell on perceptual computing. Originally built with the Intel® Perceptual Computing SDK and Creative Senz3D* camera, the game has since evolved with the implementation of the new Intel® RealSense™ SDK and the Intel® RealSense™ 3D (front facing) camera. Pennock's experiences—and those of his coworkers—are plotting a course through the new APIs and creating a navigational aid that other developers can use to steer their own perceptual apps.

Armed with the new Intel RealSense SDK and a new company, Livid Interactive, Pennock and his team set out to transform the user experience and enhance Head of the Order (Figure 1) by implementing improved gesture controls and 3D hand-tracking points in the Intel RealSense SDK that were not possible with the previous Intel Perceptual Computing SDK.

Figure 1: Livid Interactive’s Head of the Order* trailer.

 

From Perceptual Computing to the Intel® RealSense™ SDK

Gesture Control Improvements 

The Head of the Order team was particularly interested in the new hand- and finger-tracking capabilities of the Intel RealSense SDK. These capabilities provide 3D positioning of each joint of the user's hand, with 22 points of hand and joint tracking (Figure 2) for greater precision.  Control of the hands is everything to this game; hands are used to craft and cast off spells and to combine multiple spells to form more powerful ones. 

Figure 2: The hand can be tracked in 3-D using 22 landmark data points.

3D Hand Tracking

With the original SDK, a user’s hands could only be represented as flat, 2D images superimposed on the screen (Figure 3—left). To achieve this, Pennock had to create a hand-rendering system that resampled the low-resolution 2D hand images, and then add them to the game’s rendering stack at multiple depths through custom processing with his own code.

According to Pennock, the implementation of the fine-grained hand tracking in the new Intel RealSense SDK allowed the gameplay experience to become far more life-like and engaging, with much better 3D positioning.  Hands are now seen as 3D models (Figures 3 and 4—right) that interact within the game space.

Figure 3: Original 2D spell crafting (left) and the improved 3D hand rendering (right).

This capability enhances the immersive nature of the game and allows Head of the Order to run on virtual-reality headsets. The Intel RealSense SDK also greatly improves the position tracking of all the finger joints, providing much more depth and accuracy when it comes to casting spells and navigating the virtual game space.

Figure 4: Original 2D hand spell casting off (left) and the improved 3D hand rendering (right).

Switching from the Intel Perceptual Computing SDK to the Intel RealSense SDK wasn't an entirely smooth ride; it took time for the functionality to ramp up in the new SDK. But by the time the Intel RealSense SDK Gold R2 version release was ready, Pennock and his team had replicated and extended what they had achieved with the predecessor SDK.

Challenges

Head of the Order is controlled entirely with hand gestures, and spells are created by drawing simple shapes in the air. Over time, players learn to master the art of combining gestures to craft the most powerful spells. The learning curve for gesture-based input can be steep, particularly for players accustomed to traditional mouse-and-keyboard interfaces or game controllers. One of the biggest challenges Pennock and his team faced was communicating to players what they wanted them to do.

Because many user movements can be picked up by the Intel RealSense 3D camera's gesture-recognition capabilities, Pennock and his team noticed that if a player makes random or unrecognizable movements, or is too far from or too close to the camera, the camera can't process what the player is attempting to do. Players accustomed to traditional interfaces can quickly become frustrated in this situation. "Even if it's working perfectly well," said Pennock, "they may be interacting in an unexpected way so it appears that the system isn't functioning. This can be difficult to amend on the development end."

To resolve this issue, Pennock and his team created a 5-minute guided tutorial with narration and video examples to demonstrate proper input techniques (Figure 5) and step the players through gameplay scenarios. This idea came about during the first contest, as early testers had trouble realizing that three steps were required to create a spell.

Figure 5: The tutorial demonstrates proper input technique.

Because visual cues also play a key role during gameplay, Pennock addressed this issue by factoring gesture speed into the visualization. Now, a trail is drawn on the screen only when the hand movement speed is within acceptable limits. "Other than that, when we're actually tracking for a particular gesture, your hands glow," he said. 

For users who want to play an “easy” game, Head of the Order offers characters that perform very simple gestures.

Advice for Developers

Testing

With gesture-based apps, perhaps more than with traditional input types, Pennock stresses the importance of letting new users try the app and observing how they interact with it. He also emphasizes using outsiders to test the app, as opposed to those involved with its development, because it's easy to make something work correctly when you already know how to use it. The advantage of having new users test the app is being able to see what doesn't feel right to them and then considering ways to address the problem.

Not surprisingly, tests with younger audiences are generally more successful. "For the kids who grew up with motion controls such as Microsoft Kinect*, it's intuitive to them and they don't usually have a big issue with gesture control," said Pennock, adding that testing the system on the more mature crowds at trade shows is when most problems arise.

Performance Over Implementation

Pennock and his team acknowledge that performance can be an issue: the enormous data streams coming from the cameras create latency. This is particularly true if using more than one tracking module at a time or opting for a large number of tracking points.

It initially took some time for functionality to ramp up in the new RealSense SDK, but Pennock said that, “The Gold R2 release was a great update.”  In this latest RealSense SDK, the level of noise in tracking finger joints is improved, and smoothing functions are better.  

Looking Forward

The target systems on which Head of the Order can actually be played are continuing to emerge. The software is designed only for systems equipped with a natural user interface such as that provided by the Intel RealSense 3D camera. Intel offers versions tailored to the desired application. Today, this includes tablets, conventional and two-in-one laptops, and all-in-one PCs equipped with Intel RealSense technology. What's more, an increasing number of technology companies such as Acer, Asus, Dell, Fujitsu, Hewlett-Packard, Lenovo, and NEC currently offer or have announced systems that feature Intel RealSense technology.

The Intel RealSense SDK and the technologies it's tied to provide facial detection and tracking, emotion detection, depth-sensing photography, 3D scanning, background removal, and the tracking of 22 joints in each hand for accurate touch-free gesture recognition. In the future, Pennock believes that Intel RealSense technology could find applications in automotive control, robotics, home automation systems, and on the industrial side, "I see a wealth of opportunity for measurement devices."

About the Developer

The original version of Head of the Order was built and submitted to the Intel Perceptual Computing contest under Pennock's development company—Unicorn Forest Games. That company has since combined forces with Helios Interactive, where Pennock was employed as a developer. The result was a new entity: San Francisco-based Livid Interactive.

According to Michael Schaiman, managing partner at Helios, his company was asked to develop concepts for the Intel® Experience, a set of "hands-on experience zones" being set up at 50 Best Buy flagship stores across the United States. The zones are designed to showcase cutting-edge Intel® technologies for people of all ages and technical abilities. Schaiman was tasked with demonstrating Intel RealSense technology. "One concept we came up with was to build a special version of Head of the Order that consumers could play with," said Schaiman. "They really loved that idea." The Head of the Order demo is set to hit the stores sometime in June 2015, with the game to follow later in the summer.

Intel Resources

Throughout the development process, the Head of the Order team held monthly "innovator calls" with engineers at Intel. These calls allowed Livid developers to stay abreast of Intel RealSense SDK features, sample code, and documentation that were about to be released and to have a structure for providing feedback on what had come before.

For more information, check out the Intel RealSense SDK and Support pages.

Get more information on Livid Interactive here.

How to Develop a Cross-Platform Native Application with Intel® INDE for Windows* and Android*


Hello, I am going to show you how to develop a native app for Windows* and Android* that reuses shared app logic written in C++. This will be a simple hello-world app to illustrate the process. I am using Android Studio, installed by Intel® INDE 2015 Update 2, for building the Android UI (you can update it to 1.2.2 without losing the Intel INDE plugins), and Visual Studio 2013 for building the Windows UI; both call the native C++ API that is shared between the two apps. This article assumes you are familiar with the basic process of building and running apps in Android Studio and Visual Studio.

Step 1:

The structure of your workspace directory should look like this.

Workspace/CoreCommon (Shared C/C++ files)

  • Hello.c (contains print_hello(), which is called by both apps)
  • Hello.h

Workspace/CrossPlatformAndroid

Workspace/CrossPlatformWindows
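The article does not show the contents of the shared files; based on the print_hello() call that appears later in NativeCode.cpp, a minimal sketch might look like this (the returned string is a placeholder, not the article's actual text):

```c
/* Hello.h -- hypothetical sketch of the shared header */
#ifndef HELLO_H
#define HELLO_H
char* print_hello(void);
#endif

/* Hello.c -- hypothetical sketch of the shared implementation */
char* print_hello(void)
{
    return "Hello from shared native code";  /* placeholder string */
}
```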

Step 2:

Let's start by building the Android project in Android Studio.

New Project -> Choose the directory that you just created for Android Project

And choose Intel® INDE plugin “Blank Activity with NDK”

How does Intel® INDE help with building this app?

First of all, the NDK build process is made very simple in Android Studio by the IDE integration feature of Intel® INDE. The template contains the boilerplate code for a simple hello-jni app that can be compiled right away. The Gradle build script that comes with the template invokes ndk-build as part of the app compilation process and makes app compilation dependent on the success of ndk-build; that is, the rest of the app compiles only if the C++ library builds successfully. You don't have to invoke ndk-build separately.

import org.apache.tools.ant.taskdefs.condition.Os

task ndkBuild(type: Exec) {
    if (Os.isFamily(Os.FAMILY_WINDOWS)) {
        def ndkDir = System.getenv("ANDROID_NDK_ROOT")
        commandLine 'cmd', '/C', "$ndkDir/ndk-build",
                'NDK_PROJECT_PATH=build',
                'APP_BUILD_SCRIPT=src/main/jni/Android.mk',
                'NDK_APPLICATION_MK=src/main/jni/Application.mk',
                'NDK_APP_LIBS_OUT=src/main/jnilibs'
    } else {
        def ndkDir = System.getenv("NDKROOT")
        commandLine "$ndkDir/ndk-build",
                'NDK_PROJECT_PATH=build',
                'APP_BUILD_SCRIPT=src/main/jni/Android.mk',
                'NDK_APPLICATION_MK=src/main/jni/Application.mk',
                'NDK_APP_LIBS_OUT=src/main/jnilibs'
    }
}

// App compilation depends on a successful ndk-build
tasks.withType(JavaCompile) {
    compileTask -> compileTask.dependsOn ndkBuild
}

Step 3:

Now, change the jni/Android.mk file to include the files under CoreCommon to build the MyLib library. The Gradle script above then builds MyLib.so, which can be found under app/src/jnilibs.
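The modified Android.mk is not reproduced in the article; a hedged sketch, with the relative paths guessed from the Step 1 workspace layout, might look like this:

```
# jni/Android.mk -- hypothetical sketch; adjust the relative path to
# CoreCommon to match where your workspace actually sits.
LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)
LOCAL_MODULE     := MyLib
# JNI glue plus the shared C sources from CoreCommon
LOCAL_SRC_FILES  := NativeCode.c ../../../../CoreCommon/Hello.c
LOCAL_C_INCLUDES := $(LOCAL_PATH)/../../../../CoreCommon
include $(BUILD_SHARED_LIBRARY)
```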

Step 4:

Now go to the NativeCode.c file and modify it so that the print_hello() function is called from the shared library.

If you look at src/main/java/MainActivity.java, the JNI function getStringFromNative() is called, which in turn calls into the shared library to print the hello string.

Step 5:

Build -> Rebuild Project and you will see in Gradle Build terminal (abbreviated):

:app:generateDebugSources

:app:ndkBuild

[x86] Compile        : MyLib <= NativeCode.c

[x86] Compile        : MyLib <= hello.c

[x86] SharedLibrary  : libMyLib.so

:app:compileDebugAndroidTestNdk UP-TO-DATE

:app:compileDebugAndroidTestSources

Information:BUILD SUCCESSFUL

Step 6: Run the app in the Android Emulator.

Along similar lines, let's build the Windows* app in Visual Studio 2013.

First, build the shared C++ library.

Step 1: Create an empty C++ project

Open Solution Explorer and add NativeCode.h to the Header Files and NativeCode.cpp to the Source Files. Contents below:

NativeCode.h:

#include <Windows.h>
#include "hello.h"

#if defined(_BUILD_DLL)
#   define DLLAPI __declspec(dllexport) // Export when building the DLL
#else
#   define DLLAPI __declspec(dllimport) // Import when using it in another project
#endif

#ifdef __cplusplus
extern "C" {
#endif

DLLAPI char* DisplayHelloFromDLL();

#ifdef __cplusplus
}
#endif


NativeCode.cpp:

#include "NativeCode.h"

char* DisplayHelloFromDLL()
{
    char* str = print_hello();
    return str;
}

Step 2: Add hello.c from CoreCommon to the Source Files of this project and hello.h to Header Files of this project.

Right click Header Files -> Add -> Existing Item -> ../CoreCommon/hello.h

Right click Source Files -> Add -> Existing Item -> ../CoreCommon/hello.c

Remember, you may have to add the CoreCommon directory to the include paths in your Project Properties (C/C++ > General > Additional Include Directories).

Step 3: Right click project -> Build.

If your build is successful, you will see the DLL under ..\Workspaces\CrossPlatformWindows\Debug.

Your shared library is successfully built for your Windows app.

Step 4: Now, let's build the UI for the Windows app. It can be a simple text box that displays the string from the hello.c file.

Right click Solution Explorer -> Add -> New Project -> Visual C#

Right Click the project and “Set as StartUp Project”

Step 5:

Open the main .cs file and add a reference to the DLL that was created in Step 3. Remember to include System.Runtime.InteropServices.

Add a simple UI and an event handler that calls the shared C++ code to print "Hello World".

Step 6:

BUILD -> Rebuild Solution.

If your build is successful, you should see:

1>------ Rebuild All started: Project: NativeCPPLib, Configuration: Debug Win32 ------

2>------ Rebuild All started: Project: HelloCSharp, Configuration: Debug Any CPU ------

2>C:\Program Files (x86)\MSBuild\12.0\bin\Microsoft.CSharp.CurrentVersion.targets(448,9): warning MSB3052: The parameter to the compiler is invalid, '/define:/clr' will be ignored.

2>  HelloCSharp -> C:\INDE\Workspaces\CrossPlatformWindows\HelloCSharp\bin\Debug\HelloCSharp.exe

1>  hello.c

1>  NativeCode.cpp

1>     Creating library C:\INDE\Workspaces\CrossPlatformWindows\Debug\NativeCPPLib.lib and object C:\INDE\Workspaces\CrossPlatformWindows\Debug\NativeCPPLib.exp

1>  NativeCPP.vcxproj -> C:\INDE\Workspaces\CrossPlatformWindows\Debug\NativeCPPLib.dll

========== Rebuild All: 2 succeeded, 0 failed, 0 skipped ==========

Step 7:

Run the solution (press F5) and voila, you will see your Windows app in action!

You have successfully built your cross-platform app for Windows and Android, and this is how you reuse shared native app logic across different operating systems. The native app logic is where many of Intel® INDE's SDKs, such as the Media SDK, and libraries such as Integrated Performance Primitives and Threading Building Blocks, fit in to make your apps run in the most optimized way for the underlying architecture.

The code for this application is uploaded from the root directory level for your reference. Please post your feedback or questions below.

What is Code Modernization?


Modern high-performance computers are built with a combination of resources: multi-core processors, many-core processors, large caches, high-speed memory, high-bandwidth inter-processor communication fabrics, and high-speed I/O capabilities. High-performance software needs to be designed to take full advantage of this wealth of resources. Whether re-architecting and/or tuning existing applications for maximum performance, or architecting new applications for existing or future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources. Consider this a starting point for information on code modernization. When it comes to performance, your code matters!

Building parallel versions of software can enable applications to run a given data set in less time, run multiple data sets in a fixed amount of time, or run large-scale data sets that are prohibitive with un-optimized software. The success of parallelization is typically quantified by measuring the speedup of the parallel version relative to the serial version. In addition to that comparison, however, it is also useful to compare that speedup relative to the upper limit of the potential speedup. That issue can be addressed using Amdahl's Law and Gustafson's Law.
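Amdahl's Law makes that upper limit concrete: with parallel fraction p and n processors, speedup = 1 / ((1 - p) + p/n). The helper below is ours, for illustration only:

```c
/* Amdahl's Law: the speedup attainable with parallel fraction p on n
   processors. As n grows, the bound approaches 1 / (1 - p), no matter
   how many processors are added. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, a program that is 90% parallel (p = 0.9) can never exceed a 10x speedup, however many cores are used.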

Good code design takes into consideration several levels of parallelism.

  • The first level of parallelism is Vector parallelism (within a core) where identical computational instructions are performed on large chunks of data.  Both scalar and parallel portions of code will benefit from the efficient use of vector computing.
  • A second level of parallelism, called Thread parallelism, is characterized by a number of cooperating threads of a single process, communicating via shared memory and collectively cooperating on a given task.
  • The third level applies to the many codes developed in the style of independent cooperating processes, communicating with each other via some message passing system. This is called distributed memory Rank parallelism, so named because each process is given a unique rank number.

Developing code which uses all three levels of parallelism effectively, efficiently, and with high performance is optimal for modernizing code.

Factoring into these considerations is the impact of the memory model of the machine: amount and speed of main memory, memory access times with respect to location of memory, cache sizes and numbers, and requirements for memory coherence.

Poor data alignment for vector parallelism can have a huge performance impact. Data should be organized in a cache-friendly way; if it is not, performance will suffer when the application requests data that is not in the cache. The fastest memory access occurs when the needed data is already in cache. Data transfers to and from cache occur in cache lines, so if the next piece of data is not within the current cache line, or is scattered among multiple cache lines, the application may have poor cache efficiency.
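To make the cache-line point concrete, here is an illustrative comparison (ours, not the article's) of two traversal orders over a row-major C array. Both compute the same sum, but only the first walks memory contiguously:

```c
#define N 64

/* Row-major C arrays store each row contiguously, so iterating the
   inner loop over the column index walks memory cache line by
   cache line. */
double sum_row_major(double a[N][N])
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];      /* unit stride: cache friendly */
    return s;
}

/* Swapping the loops makes every access jump N * sizeof(double) bytes,
   so consecutive loads land on different cache lines. */
double sum_col_major(double a[N][N])
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];      /* large stride: cache hostile */
    return s;
}
```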

Division and transcendental math functions are expensive even when directly supported by the instruction set. If your application uses many division and square root operations in its run-time code, the resulting performance may be degraded because of the limited functional units within the hardware; the pipelines to these units may become saturated. Since these instructions are expensive, the developer may wish to cache frequently used values to improve performance.
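A small example (ours) of caching an expensive value as suggested above: when the divisor is loop-invariant, compute its reciprocal once and multiply instead.

```c
void scale_slow(float* y, const float* x, int n, float d)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] / d;            /* one divide per element */
}

void scale_fast(float* y, const float* x, int n, float d)
{
    const float inv_d = 1.0f / d;   /* one divide in total */
    for (int i = 0; i < n; i++)
        y[i] = x[i] * inv_d;        /* cheap multiplies */
}
```

Compilers may perform this transformation at higher optimization levels, but because the result can differ in the last bit, it is not always done automatically.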

There is no “one recipe, one solution” technique. A great deal depends on the problem being solved and the long term requirements for the code, but a good developer will pay attention to all levels of optimization, both for today’s requirements and for the future.

Intel has built a full suite of tools to aid in code modernization - compilers, libraries, debuggers, performance analyzers, parallel optimization tools and more. Intel even has webinars, documentation, training examples, and best known methods and case studies which are all based on over thirty years of experience as a leader in the development of parallel computers.

Code Modernization 5 Stage Framework for Multi-level Parallelism

The Code Modernization optimization framework takes a systematic approach to application performance improvement. This framework takes an application through five optimization stages, each stage iteratively improving the application performance. But before you start the optimization process, you should consider whether the application needs to be re-architected (given the guidelines below) to achieve the highest performance, and then follow the Code Modernization optimization framework.

By following this framework, an application can achieve the highest performance possible on Intel® Architecture. The stepwise approach helps the developer achieve the best application performance in the shortest possible time. In other words, it allows the program to maximize its use of all parallel hardware resources in the execution environment. The stages:

  1. Leverage optimization tools and libraries: Profile the workload using Intel® VTune™ Amplifier to identify hotspots. Use Intel C++ compiler to generate optimal code and apply optimized libraries such as Intel® Math Kernel Library, Intel® TBB, and OpenMP* when appropriate.
  2. Scalar, serial optimization: Maintain the proper precision, type constants, and use appropriate functions and precision flags.
  3. Vectorization: Utilize SIMD features in conjunction with data layout optimizations. Apply cache-aligned data structures, convert from arrays of structures to structures of arrays, and minimize conditional logic.
  4. Thread Parallelization: Profile thread scaling and affinitize threads to cores. Scaling issues typically are a result of thread synchronization or inefficient memory utilization.
  5. Scale your application from multicore to many core (distributed memory Rank parallelism): Scaling is especially important for highly parallel applications. Minimize the changes and maximize the performance as the execution target changes from one flavor of the Intel architecture (Intel® Xeon® processor) to another (Intel® Xeon Phi™ Coprocessor).

5 Stages of code modernization

Code Modernization – The 5 Stages in Practice

Stage 1
At the beginning of your optimization project, select an optimizing development environment. The decision you make at this step will have a profound influence on the later steps. Not only will it affect the results you get, it could substantially reduce the amount of work required. The right optimizing development environment can provide you with good compiler tools; optimized, ready-to-use libraries; and debugging and profiling tools to pinpoint exactly what the code is doing at runtime.

Stage 2
Once you have exhausted the available optimization solutions, in order to extract greater performance from your application you will need to begin the optimization process on the application source code. Before you begin active parallel programming, you need to make sure your application delivers the right results before you vectorize and parallelize it. Equally important, you need to make sure it does the minimum number of operations to get that correct result. You should look at the data and algorithm related issues such as:

  • Choosing the right floating point precision
  • Choosing the right approximation method accuracy; polynomial vs. rational
  • Avoiding jump algorithms
  • Reducing the loop operation strength by using iteration calculations
  • Avoiding or minimizing conditional branches in your algorithms
  • Avoiding repetitive calculations, using previously calculated results.
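As an illustration of the strength-reduction item above (our example, not from the article): a per-iteration multiply can be replaced by a running add.

```c
/* Multiply version: one multiply per iteration. */
void fill_grid_mul(double* x, int n, double dx)
{
    for (int i = 0; i < n; i++)
        x[i] = i * dx;
}

/* Strength-reduced version: one add per iteration. */
void fill_grid_add(double* x, int n, double dx)
{
    double v = 0.0;
    for (int i = 0; i < n; i++) {
        x[i] = v;
        v += dx;
    }
}
```

Note that for very long loops the running sum can accumulate rounding error that the multiply form does not, so check the trade-off against the precision requirements chosen earlier.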

You may also have to deal with language-related performance issues. If you have chosen C/C++, the language related issues are:

  • Use explicit typing for all constants to avoid auto-promotion
  • Choose the right types of C runtime functions, e.g., doubles vs. floats: exp() vs. expf(); abs() vs. fabs()
  • Explicitly tell the compiler about pointer aliasing
  • Explicitly inline function calls to avoid overhead
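To illustrate the abs() vs. fabs() item (our example): abs() operates on int, so applying it to floating-point data silently discards the fraction.

```c
#include <stdlib.h>
#include <math.h>

/* What calling abs() on a float amounts to: the argument is converted
   to int first, truncating toward zero. The casts make that explicit. */
float magnitude_wrong(float x)
{
    return (float)abs((int)x);   /* -2.7f becomes 2.0f */
}

/* fabsf() is the single-precision function for float operands. */
float magnitude_right(float x)
{
    return fabsf(x);             /* -2.7f stays 2.7f */
}
```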

Stage 3
Try vector-level parallelism. First try to vectorize the innermost loop. For efficient vector loops, make sure that there is minimal control flow divergence and that memory accesses are coherent. Outer-loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. But in some cases, the number of iterations in the innermost loop is small, and inner-loop vectorization is not profitable. However, if an outer loop contains more work, a combination of elemental functions, strip-mining, and pragma/directive SIMD can force vectorization at this outer, profitable level.

  1. SIMD performs best on “packed” and aligned input data, and by its nature penalizes control divergences. In addition, good SIMD and thread performance on modern hardware can be obtained if the application implementation puts a focus on data proximity.
  2. If the innermost loop does not have enough work (e.g., the trip count is very low, so the performance benefit of vectorization is negligible) or there are data dependencies that prevent vectorizing the innermost loop, try vectorizing the outer loop. The outer loop is likely to have control flow divergence, especially if the trip count of the inner loop differs for each iteration of the outer loop. This will limit the gains from vectorization. The memory accesses of the outer loop are also more likely to be divergent than those of an inner loop, resulting in gather/scatter instructions instead of vector loads and stores and significantly limiting the scaling from vectorization. Data transformations, such as transposing a two-dimensional array, may alleviate these problems; alternatively, look at switching from arrays of structures to structures of arrays.
  3. When the loop hierarchy is shallow, the above guideline may result in a loop that needs to be both parallelized and vectorized. In that case, that loop has to both provide enough parallel work to compensate for the overhead and also maintain control flow uniformity and memory access coherence.
  4. Check out the Vectorization Essentials for more details.
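The arrays-of-structures to structures-of-arrays switch mentioned in point 2 can be sketched as follows (illustrative types, not from the article):

```c
/* Array of Structures: x, y, z of one particle sit together, so a loop
   over all x values strides sizeof(struct) bytes per access. */
struct ParticleAoS { float x, y, z; };

/* Structure of Arrays: all x values are contiguous, giving the packed,
   unit-stride accesses that SIMD loads want. */
struct ParticlesSoA { float *x, *y, *z; };

float sum_x_aos(const struct ParticleAoS* p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].x;    /* 12-byte stride: gathers likely */
    return s;
}

float sum_x_soa(const struct ParticlesSoA* p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p->x[i];   /* 4-byte stride: vector loads possible */
    return s;
}
```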

Stage 4
Now we get to thread level parallelization. Identify the outermost level and try to parallelize it. Obviously, this requires taking care of potential data races and moving data declaration to inside the loop as necessary. It may also require that the data be maintained in a cache efficient manner, to reduce the overhead of maintaining the data across multiple parallel paths. The rationale for the outermost level is to try to provide as much work as possible to each individual thread. Amdahl’s law states: The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Since the amount of work needs to compensate for the overhead of parallelization, it helps to have as large a parallel effort in each thread as possible. If the outermost level cannot be parallelized due to unavoidable data dependencies, try to parallelize at the next-outermost level that can be parallelized correctly.

  1. If the amount of parallel work achieved at the outermost level appears sufficient for the target hardware and likely to scale with a reasonable increase of parallel resources, you are done. Do not add more parallelism, as the overhead will be noticeable (thread control overhead will negate any performance improvement) and the gains are unlikely.
  2. If the amount of parallel work is insufficient, e.g., as measured by core scaling that only scales up to a small core count and not to the actual core count, attempt to parallelize an additional layer, as far out as possible. Note that you don’t necessarily need to scale the loop hierarchy to all the available cores, as there may be additional loop hierarchies executing in parallel.
  3. If step 2 did not result in scalable code, there may not be enough parallel work in your algorithm. This may mean that partitioning a fixed amount of work among many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work. Perhaps the algorithms can be scaled to do more work, for example by trying on a bigger problem size.
  4. Make sure your parallel algorithm is cache efficient. If it is not, rework it to be cache efficient, as cache inefficient algorithms do not scale with parallelism.
  5. Check out the Intel Guide for Developing Multithreaded Applications series for more details.
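The steps above can be sketched with OpenMP* (our example; if OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result): parallelize the outermost loop and let a reduction clause handle the shared accumulator.

```c
/* Outermost-loop threading: each thread takes a contiguous block of
   rows (plenty of work per thread), and reduction(+ : s) removes the
   data race on the accumulator. */
double sum_of_squares(const double* a, int rows, int cols)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < rows; i++) {
        double r = 0.0;
        for (int j = 0; j < cols; j++)
            r += a[i * cols + j] * a[i * cols + j];
        s += r;        /* combined safely by the reduction */
    }
    return s;
}
```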

Stage 5
Lastly we get to multi-node (Rank) parallelism. To many developers, the message passing interface (MPI) is a black box that "just works" behind the scenes to transfer data from one MPI task (process) to another. The beauty of MPI for the developer is that the algorithmic coding is hardware independent. The concern developers have is that, with many-core architectures of 60+ cores, the communication between tasks may create a communication storm, either within a node or across nodes. To mitigate these communication bottlenecks, applications should employ hybrid techniques, using a few MPI tasks and many OpenMP threads.

A well-optimized application should address vector parallelization, multi-threading parallelization, and multi-node (Rank) parallelization. However, to do this efficiently, it is helpful to use a standard step-by-step methodology to ensure each stage is considered. The stages described here can be (and often are) reordered depending on the specific needs of each individual application; you can iterate in a stage more than once to achieve the desired performance.

Experience has shown that all stages must at least be considered to ensure an application delivers great performance on today’s scalable hardware as well as being well positioned to scale effectively on upcoming generations of hardware.

Give it a try!

Analyze and Optimize Windows* Game Applications Using Intel® INDE Graphics Performance Analyzers (GPA)


Download Intel INDE Graphics Performance Analyzer.pdf

Intel® INDE Graphics Performance Analyzers (GPA) are powerful, agile tools enabling game developers to utilize the full performance potential of their gaming platform. GPA visualizes performance data from your application, enabling you to understand system-level and individual frame performance issues, as well as allowing you to perform “what-if” experiments to estimate potential performance gains from optimizations. GPA tools are available as part of the Intel® INDE tool suite or as standalone from here.

This article describes the GPA tools and walks through a sample game application for Windows*, showing individual frame performance issues and optimizing with the Graphics Frame Analyzer for DirectX*.

Graphics Monitor

Graphics Monitor is used to view, graph, and configure metrics in-game.  You can also take trace and frame captures as well as enable graphics pipeline overrides and experiments in real-time.

The sample game application used in this article (CitiRacer.exe) comes as part of the installation and is used as the example throughout this article.

Once you download and install GPA (see link above), click Analyze Application as shown below, and the Analyze Application window opens.


1. Graphics monitor


2. Analyze Application window to launch the game

Click the Run button, and you can start analyzing the application. The application automatically loads and displays the FPS (frames per second) as shown below. Press CTRL + F1 three times to see the screenshot shown below with different settings and metrics displayed.


3. Game running with all the metrics shown

Now we will capture one particular frame and analyze it using the Graphics Frame Analyzer for DirectX tool that is installed with the Intel GPA toolkit. You can take the frame capture by pressing CTRL + SHIFT + C or by using the System Analyzer tool described below.

System Analyzer

System Analyzer provides access to system-wide metrics for your game, including CPU, GPU, API, and the graphics driver. The metrics available vary depending on your platform, but you will find a large collection of useful metrics to help quantify key aspects of your application's use of system resources. In the System Analyzer you can also perform various "what-if" experiments to diagnose at a high level where your game's performance bottlenecks are concentrated.

If the System Analyzer finds that your game is CPU-bound, perform additional fine-tuning of your application using Platform Analyzer.

If the System Analyzer finds that your game is GPU-bound, use the Graphics Frame Analyzer for DirectX*/OpenGL* to drill down within a single graphics frame to pinpoint specific rendering problems, such as texture bandwidth, pixel shader performance, level-of-detail issues, or other bottlenecks within the rendering pipeline.

Open System Analyzer, installed as part of Intel INDE.


4. Connecting using the System Analyzer

If the application you are analyzing is running on the same machine where Intel INDE is installed, click Connect. If the application is running on a remote machine, enter the IP address of that machine and then click Connect.

You will see the application in the System Analyzer as shown below.


5. Click the application to open the System Analyzer

Once the next screen opens, you can drag and drop the metrics you are interested in. In this example, we are monitoring the Aggregated CPU Load, GPU Duration, GPU Busy, and GPU Frequency metrics. Hold the CTRL key to drag multiple metrics simultaneously. Click the Camera button to capture a frame that is taking a long GPU duration and delivering a low FPS. We will capture this frame and analyze it using the Graphics Frame Analyzer for DirectX.


6. Capturing a frame using the System Analyzer

Analyzing a frame using the Intel® INDE Graphics Frame Analyzer for DirectX*

Once you open the Frame Analyzer, the captured frames will be automatically loaded. Select the latest frame that you captured and want to analyze and click Open.


7. Opening the captured frame with the Graphics Frame Analyzer for DirectX*

Now let’s start analyzing this particular frame that we captured.


8. Captured frame when opened with the Graphics Frame Analyzer for DirectX*

On the left-hand side, RT0, RT1, RT2, and RT3 are the render targets generated during this frame. Different games can use a different number of render targets to build the whole frame; this frame uses four.

The graph below shows the draw calls made during the frame. GPA calls them “ergs,” after the scientific unit of work.


9. Graphical view of the ergs with GPU duration on X and Y Axes

You can filter the metrics that are shown. The X and Y axes show GPU duration by default; you can change the metric on either axis from its dropdown. This gives a quick view of how long each erg takes on the GPU, making it easy to spot the ergs that might need optimization.

Right-click on RT1 and choose “Select ergs in this render target,” which highlights all the ergs used to generate this render target. You can analyze metrics on how long it took to generate the render target. An example of a render target is shown below.


10. Selecting all the ergs in the render target

Let’s dive further into this render target. Click the erg that takes the longest GPU duration to see the details of just that erg. The Geometry tab shows what geometry is rendered, as shown below, and the Shaders tab shows the vertex and fragment shaders for the erg.


11. Geometry rendered for the selected erg

Let’s explore the tabs at the bottom of the screen. “Selected” refers to the erg you have selected. If we select “Highlighted,” the selected erg is drawn highlighted, corresponding to the geometry we see on the right-hand side.

“Other” covers all the other ergs of the render target; selecting “Hidden” hides them entirely. “Draw only to last selected” draws, for this render target, only the ergs up to the erg we have selected; if we unselect it, all the ergs for this render target are shown.


12. How to highlight the selected erg


13. Highlighted erg shown in blue color

If you click the Texture tab, you can see which textures are bound to this erg. Not every texture shown may actually be used by this erg; some might have been used by a previous erg. In general, though, the Texture tab shows what textures are used and how big they are. It is a good way to find uncompressed textures that may add GPU duration, so you can go back and compress them.


14. Textures bound with this erg

Experiments

Now let’s talk about the Experiments tab. It allows you to override what the GPU does and look at the net results. In this example, the entire frame runs at 27 ms, or 37 FPS, as shown in the box in the top-right corner (indicated by the arrow). You can toggle between FPS and GPU duration by clicking that box.


15. Click the top-right toggle button to switch between GPU duration and FPS readings

Now, if you click the Frame Overview tab, you’ll see stats for the entire frame, while the Details tab provides stats for only the erg you selected. In this example, the Frame Overview tab shows the different metrics you can experiment with, as shown below.


16. Frame Overview tab that gives the stats for the entire frame

Let’s click the Experiments tab and try completely disabling this erg so that this erg does not even render.


17. Experiment tab: Before disabling the erg


18. Experiment tab: After disabling the erg

If you go to the Frame Overview, you can see the difference in the GPU duration, execution units, etc. We can look at the general overall performance to see how much difference there is between the old and new values. In the example shown below, the delta value for GPU duration is -8 ms, the new value of the GPU duration is around 18 ms, and the percentage decrease in the GPU duration is around 35%.


19. Frame Overview and difference in GPU duration after disabling the selected erg

Anything significantly bad is marked in red. Most of these ergs are draw calls. If nothing is highlighted when you select an erg, that can indicate a clear call. Clear calls are sometimes unnecessary: if everything in the render target renders correctly without the clear call, try disabling it and see whether the GPU duration improves.

The API Log tab shows the draw calls used for the ergs you have selected, or indicates whether an erg is a clear call.

You can also filter by primitive count to see how many primitives and triangles are being rendered. Set the X-axis to GPU duration and the Y-axis to primitive count as shown below, then look at the Geometry tab to examine the ergs with the most primitives.


20. Selecting primitive count on Y-Axis


21. Primitive count for the selected render target

You can also sort by render target to see how long each render target takes. It is worth experimenting to see what the hardware is doing: disable and change things, then check the Frame Overview to see whether performance increases or decreases.

About Author

Praveen Kundurthy works in the Software & Services Group at Intel Corporation. He has a Master’s degree in Computer Engineering. His main interests are mobile technologies, Windows*, and game development.

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2015 Intel Corporation.

Performance of Multibuffer AES-CBC on Intel® Xeon® Processors E5 v3


This paper examines the impact of the multibuffer enhancements to OpenSSL* on the Intel® Xeon® processor E5 v3 family when performing AES block encryption in CBC mode. It focuses on the performance gains seen by the Apache* web server when managing a large number of simultaneous HTTPS requests using the AES128-SHA and AES128-SHA256 ciphers, and how they stack up against the more modern AES128-GCM-SHA256 cipher. With the E5 v3 generation of processors, web servers such as Apache can obtain significant increases in maximum throughput when using multibuffer-enabled algorithms for CBC mode encryption.

 

Background

One of the performance-limiting characteristics of the CBC mode of AES encryption is that it is not parallelizable. Each block of plaintext in the stream to be encrypted depends upon the encryption of the previous block as an input, as shown in Figure 1. Only the first block has no such dependency and substitutes an initialization vector, or IV, in its place.

Figure 1. CBC mode encryption

Mathematically, this is defined as:

C_i = E_k(P_i XOR C_(i-1))

where

C_0 = IV

From this definition, it is clear that there are no opportunities for parallelization within the algorithm for the encryption of a single data stream. To encrypt any given data block P_n, it must first be XOR’d with the previous cipher block C_(n-1), which means that all previous blocks must be encrypted, in order, from 1 to n-1. The CBC mode of encryption is a classically serial operation.
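The serial chaining can be made concrete with a small sketch; the XOR “cipher” below is a toy stand-in for AES, purely illustrative:

```python
def toy_block_cipher(block, key):
    # Stand-in for AES block encryption: XOR each byte with a key byte.
    return bytes(b ^ key for b in block)

def cbc_encrypt(blocks, key, iv):
    ciphertext = []
    prev = iv
    for p in blocks:
        # C_i = E_k(P_i XOR C_{i-1}): each block needs the previous result,
        # so the blocks of one stream cannot be encrypted in parallel.
        mixed = bytes(a ^ b for a, b in zip(p, prev))
        prev = toy_block_cipher(mixed, key)
        ciphertext.append(prev)
    return ciphertext

out = cbc_encrypt([b"\x01\x02", b"\x03\x04"], 0xFF, b"\x00\x00")
assert out == [b"\xfe\xfd", b"\x02\x06"]
```

Changing any earlier plaintext block changes every later ciphertext block, which is exactly the dependency that rules out intra-stream parallelism.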

The multibuffer approach introduced in “Processing Multiple Buffers in Parallel for Performance” describes a procedure for parallelizing algorithms such as CBC that are serial in nature. Operations are interleaved such that the latencies incurred while processing one data block are masked by active operations on another, independent data block. Through careful ordering of the machine instructions, the multiple execution units within a CPU core can be utilized to process more than one data stream in parallel within a single thread.

Multibuffer solutions generally require a job scheduler and an asynchronous application model, but the OpenSSL library is a synchronous framework, so a job scheduler is not an option. The solution in this case is to break the application buffer into TLS records of equal size that can be processed in parallel, which the explicit IV makes possible, as described in “Improving OpenSSL Performance”. For a web server, the implication is that only file downloads from server to client—page fetches, media downloads, etc.—will see a performance boost. File uploads to the server will not.
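The record-splitting approach can be sketched as follows. Because each record carries its own explicit IV, records are mutually independent, so the order in which they are processed does not matter (a toy XOR cipher stands in for AES; real IVs must be unpredictable, not fixed as here):

```python
def toy_block_cipher(block, key):
    # Illustrative stand-in for AES (XOR with a key byte).
    return bytes(b ^ key for b in block)

def encrypt_record(record, iv, key):
    # One self-contained CBC "record": chaining starts from this record's
    # own explicit IV, with no dependency on any other record.
    prev, out = iv, b""
    for i in range(0, len(record), 2):          # toy 2-byte blocks
        mixed = bytes(a ^ b for a, b in zip(record[i:i + 2], prev))
        prev = toy_block_cipher(mixed, key)
        out += prev
    return out

buf = b"abcdefgh"
records = [buf[i:i + 4] for i in range(0, len(buf), 4)]
ivs = [b"\x00\x00", b"\x01\x01"]                # deterministic for the demo only

# Records can be processed in any order and still give the same result,
# which is what makes interleaved (multibuffer) processing legal.
forward = [encrypt_record(r, iv, 0x5A) for r, iv in zip(records, ivs)]
backward = [encrypt_record(r, iv, 0x5A)
            for r, iv in reversed(list(zip(records, ivs)))][::-1]
assert forward == backward
```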

 

The Test Environment

The performance limits of Apache were tested by generating a large number of parallel connection requests and repeating those connections as rapidly as possible for a total of two minutes. At the end of those two minutes, the maximum connection latency across all requests was examined along with the resulting throughput. The number of simultaneous connections was adjusted between runs to find the maximum throughput that Apache could achieve for the duration without connection latencies exceeding two seconds. This latency limit was taken from the research paper “A Study on tolerable waiting time: how long are Web users willing to wait?”, which concluded that two seconds is the maximum acceptable delay in loading a small web page.

Apache was installed on a pre-production, two-socket Intel Xeon processor-based server system populated with two production E5-2697 v3 processors clocked at 2.60 GHz with Intel® Turbo Boost Technology on and Intel® Hyper-Threading Technology (Intel® HT Technology) off. The system was running SUSE Linux* Enterprise Server 12. Each E5 processor had 14 cores for a total of 28 hardware threads. Total system RAM was 64 GB. Networking for the server load was provided by a pair of Intel® Ethernet Converged Network Adapters, XL710-QDA2 (Intel® Ethernet CNA XL710-QDA2).

The SSL capabilities for Apache were provided by the OpenSSL library. OpenSSL is an open source library that implements the SSL and TLS protocols in addition to general-purpose cryptographic functions. The 1.0.2 release is optimized for the Intel Xeon processor v3 and contains the multibuffer enhancements. For more information on OpenSSL see http://www.openssl.org/. The tests in this case study were made using 1.0.2a.

Two versions of OpenSSL 1.0.2a were built so that the performance of the multibuffer enhancements could be compared to unenhanced code on the same release. Multibuffer support was forcibly removed by defining the preprocessor symbol OPENSSL_NO_MULTIBLOCK:

$ ./Configure -DOPENSSL_NO_MULTIBLOCK options

The server load was generated by up to six client systems as needed, a mixture of Intel Xeon processor E5 v2 and E5 v3 class hardware. Load generators were connected to the Apache server through 40 Gbps links. Two of the clients had a single Intel Ethernet CNA XL710-QDA2 card and were connected to one of the dual-port Intel Ethernet CNA XL710-QDA2 cards on the server. The remaining four load clients each had a single port 10 Gbit card and their bandwidth was aggregated via a 40 Gbit switch.

The network diagram for the test environment is shown in Figure 2.

Figure 2. Test network diagram.

The actual server load was generated using multiple instances of the Apache* Benchmark tool, ab, an open source utility included in the Apache server distribution. A single instance of Apache Benchmark was not able to create a load sufficient to reach the server’s limits, so it had to be split across multiple processors and, due to network bandwidth and client CPU limitations, across multiple hosts.

Because each Apache Benchmark instance is completely self-contained, however, there is no built-in mechanism for distributed execution. A synchronization server and client wrapper were written to coordinate the launching of multiple instances of ab across the load clients, their CPUs, and their network interfaces, and then collate the results. Loads were distributed based on a simple weighting system that accounted for an individual client’s network bandwidth and processing power.
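Each ab instance was run with parameters along these lines; the host address, target file, and counts below are illustrative, not the study’s exact values:

```
# Two-minute run, 200 concurrent connections, forcing the cipher under test.
# ab disables keep-alive unless -k is given, matching the test plan.
ab -t 120 -c 200 -Z AES128-SHA https://192.168.1.10/test_16MB.bin
```

The wrapper’s job was then simply to launch several such commands across clients in parallel and merge their latency and throughput reports.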

 

The Test Plan

The goal of the tests was to determine the maximum throughput that Apache could sustain throughout two minutes of repeated, incoming connection requests for a target file, and to compare the results for the multibuffer-enabled version of OpenSSL against the unenhanced version. Multibuffer benefits CBC mode encryption, so the AES128-SHA and AES128-SHA256 ciphers were chosen for analysis.

The secondary goal was to compare the multibuffer results against the more modern GCM mode of block encryption. For that comparison the AES128-GCM-SHA256 cipher was chosen.

This resulted in the following cases:

  • AES128-SHA, multibuffer ON
  • AES128-SHA, multibuffer OFF
  • AES128-SHA256, multibuffer ON
  • AES128-SHA256, multibuffer OFF
  • AES128-GCM-SHA256

For each case, performance tests were repeated for a fixed target file size, starting at 1 MB and increasing by powers of four up to 4 GB, where 1 GB = 1024 MB, 1 MB = 1024 KB, and 1 KB = 1024 bytes. The use of 1 MB files and larger minimized the impact of the key exchange on the session throughput. Keep-alives were disabled so that each connection resulted in fetching a single file.

Tests for each cipher were run for the following hardware configurations:

  • 2 cores enabled (1 core per socket)
  • 4 cores enabled (2 cores per socket)
  • 8 cores enabled (4 cores per socket)
  • 16 cores enabled (8 cores per socket)
  • 28 (all) cores enabled (14 cores per socket)

Intel HT Technology was disabled in all configurations. Reducing the system to one active core per socket, the minimum configuration in the test system, effectively simulates a low-core-count system and ensures that Apache performance is limited by the CPU rather than other system resources. These measurements can be used to estimate the overall performance per core, as well as estimate the projected performance of a system with many cores.

The many-core runs test the scalability of the system and introduce the possibility of system resource limits beyond just CPU utilization.

 

System Configuration and Tuning

Apache was configured to use the event Multi-Processing Module (MPM), which implements a hybrid multi-process, multi-threaded server. This is Apache’s highest performance MPM and the default on systems that support both multiple threads and thread-safe polling.

To support the large number of simultaneous connections that might occur at the smaller target file sizes, some system and kernel tuning was necessary. First, the number of file descriptors was increased via /etc/security/limits.conf:

Figure 3. Excerpt from /etc/security/limits.conf

And several kernel parameters were adjusted (some of these settings are more relevant to bulk encryption):

Figure 4. Excerpt from /etc/sysctl.conf

Some of these parameters are very aggressive, but the assumption is that this system is a dedicated TLS web server.
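For reference, settings of this kind typically look like the following; the parameter names are standard Linux tunables, but the values are purely illustrative and not the exact excerpts used in this study:

```
# /etc/security/limits.conf: raise the per-process file descriptor limit
*    soft    nofile    1000000
*    hard    nofile    1000000

# /etc/sysctl.conf: widen connection backlogs and the ephemeral port range
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
```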

No other adjustments were made to the stock SLES 12 server image.

 

System Performance Limits

Before running the tests, the throughput limit of the server system was explored using unencrypted HTTP. Tests on the same target file sizes, with all cores active and the same constraint of a 2-second maximum connection latency, saw a maximum achievable throughput of just over 77 Gbps (with very little CPU utilization).

The exact reason for this performance limit is not known. A cursory investigation suggested that there may have been a configuration issue with the dual-port NIC and the use of both ports simultaneously, leading to much less than the maximum throughput for the adapter. In-depth debugging was not done however due to time constraints.

Results

The maximum throughputs in Gbps achieved for the AES128-SHA and AES128-SHA256 ciphers by file size are shown in Figure 5 and Figure 6. At the smallest file size, 1 MB, the multibuffer enhancements result in about a 44% gain on average, and at the larger file sizes this gain is as high as 115%. This holds true up through 8 cores. At 16 cores, the gains begin to drop off as the throughput reaches the ceiling of 77 Gbps. In the 28-core case, the unenhanced code has nearly reached the throughput ceiling, but with significantly higher CPU utilization as shown in Figure 7.

Figure 5. Maximum throughput on Apache* server by file size for given core counts using the AES128-SHA cipher

The AES128-SHA256 cipher shows even larger gains for the multibuffer enhanced code, with about a 65% improvement for 1 MB files and jumping to 130% at larger file sizes. Because the SHA256 hashing is more CPU intensive, the overall throughput is significantly lower than the SHA1-based cipher. A side effect of this lower performance is that the multibuffer code scales through the 16-core case, and the unenhanced code never reaches the throughput ceiling even when all 28 cores are active.

Figure 6. Maximum throughput on Apache* server by file size for given core counts using the AES128-SHA256 cipher

Figure 7. Maximum CPU utilization for Apache* server: AES128-SHA cipher and 28 cores

The performance of the multibuffer-enhanced ciphers is compared to the AES128-GCM-SHA256 cipher in Figure 8. The GCM cipher outperforms both of the multibuffer-enhanced ciphers, though AES128-SHA stays within about 20% of the GCM throughput. The AES128-SHA256 cipher is the lowest performer due to the larger CPU demands of the SHA256 hashing.

Figure 8. Maximum throughput on Apache* server by file size for given core counts, comparing CBC + multibuffer to GCM encryption

 

Conclusions

The multibuffer enhancements to AES CBC encryption in OpenSSL 1.0.2 provide a significant performance boost, yielding over 2x performance in some cases. Web sites that need to retain these older ciphers in their negotiation list can achieve performance that is nearly on par with GCM for page and file downloads.

Web site administrators considering moving to AES128-SHA256 to obtain the added security from the SHA256 hashing will certainly see a significant performance boost from multibuffer, but if at all possible they should switch to GCM, which offers significantly higher performance due to its design.

Video Quality Caliper quick overview


Video Quality Caliper

Available in Intel® Media Server Studio Professional Edition and Intel® Video Pro Analyzer packages

Video Quality Caliper serves multiple purposes in a video developer’s hands:

  • Single file view and playback
  • Objective metrics graph visualization (only reference metrics for now)
  • RD graph for a group of coded files compressed from the same reference origin
  • Side-by-side comparison of two compressed streams as a single picture (e.g., left side from one, right side from the other)
  • Side-by-side comparison of a compressed stream with the original uncompressed stream as a single picture
  • Detection of unexpected visual artefacts
  • Investigation of compression artefacts

VQC side-by-side video quality comparison

VQC can be used as a file viewer for a single stream. It supports:

  • Elementary streams and containers: MP2TS, MP4, MKV
  • All popular compression standards - HEVC, VP8, VP9, MPEG2, AVC
  • Uncompressed files, planar and interleaved
  • 4:2:0/4:2:2/4:4:4 chroma formats and 8-16 bit depth
  • Auto detection of compressed bitstreams
  • Auto detection of uncompressed streams, with an option to customize your format if it is not properly recognized (adjust resolution, color format, bit depth, etc.)
  • Playback with any zoom factor
  • Works on Windows* and Linux*

The major function of the tool is to visualize objective metrics:

  • PSNR for Y, U, V and Overall = (6Y+U+V)/8
  • SSIM and SSIMb
  • MWDVQM
  • Allows you to save data to a file and the chart as a picture
  • Visualizes two metrics on a single graph, one on the primary (left) axis and one on the secondary (right) axis
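The overall metric above is a weighted average of the per-plane PSNR values; a trivial sketch:

```python
def overall_psnr(psnr_y, psnr_u, psnr_v):
    # Overall = (6*Y + U + V) / 8: the luma (Y) plane dominates because the
    # eye is more sensitive to luma distortion than to chroma distortion.
    return (6 * psnr_y + psnr_u + psnr_v) / 8

assert overall_psnr(40.0, 32.0, 32.0) == 38.0
```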

VQC graph

The GUI is an essential component of productivity:

  • Open a file or group of files, independently or by a user-defined template
  • Flexible navigation between frames, with a metrics graph preview of the whole stream
  • Average stream-based metric versus stream bitrate on the RD graph
  • Clickable RD graph, with each point acting as a control for a graph line
  • Quick downsampled stream preview
  • Mouse wheel and hot keys
  • Zoom all the way down to pixel values
  • Customizable chart area
  • Clickable chart area to select frame preview in a separate window
  • Legend for a file mapping to a graph
  • Splits for multiple streams preview

VQC multi window split

  • Extended view of multiple streams
  • Offset view for multiple streams
  • Synchronous view for multiple streams
  • Intel® Video Pro Analyzer interoperability to inspect selected frame
  • Cached data size control
  • Restart with previously used options

Books - High Performance Parallelism Pearls


The two “Pearls” books contain an outstanding collection of examples of code modernization, complete with discussions by the software developers of how the code was modified, including commentary on what worked as well as what did not! Code for these real-world applications is available for download from http://lotsofcores.com whether or not you have bought the books. The figures are freely available as well, a real bonus for instructors who choose to use these examples when teaching code modernization techniques. The books, edited by James Reinders and Jim Jeffers, had 67 contributors for volume one and 73 contributors for volume two.

Experts wrote about their experiences adding parallelism to their real-world applications. Most examples illustrate results on both Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor. The key issues of scaling, locality of reference, and vectorization are recurring themes, as each contributed chapter explains the thinking behind adding parallelism to the application. The actual code is shown and discussed, with step-by-step reasoning and analysis of the results. While OpenMP* and MPI are the dominant methods for parallelism, the books also cover TBB, OpenCL, and other models. There is a balance of Fortran, C, and C++ throughout. With such a diverse collection of real-world examples, the opportunities to learn from other experts are remarkable.

 

Volume 1 includes the following chapters:

Foreword by Sverre Jarp, CERN.

Chapter 1: Introduction

Chapter 2: From ‘Correct’ to ‘Correct & Efficient’: A Hydro2D Case Study with Godunov’s Scheme

Chapter 3: Better Concurrency and SIMD on HBM

Chapter 4: Optimizing for Reacting Navier-Stokes Equations

Chapter 5: Plesiochronous Phasing Barriers

Chapter 6: Parallel Evaluation of Fault Tree Expressions

Chapter 7: Deep-Learning and Numerical Optimization

Chapter 8: Optimizing Gather/Scatter Patterns

Chapter 9: A Many-Core Implementation of the Direct N-body Problem

Chapter 10: N-body Methods

Chapter 11: Dynamic Load Balancing Using OpenMP 4.0

Chapter 12: Concurrent Kernel Offloading

Chapter 13: Heterogeneous Computing with MPI

Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor

Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment

Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors

Chapter 17: NWChem: Quantum Chemistry Simulations at Scale

Chapter 18: Efficient Nested Parallelism on Large-Scale Systems

Chapter 19: Performance Optimization of Black-Scholes Pricing

Chapter 20: Data Transfer Using the Intel COI Library

Chapter 21: High-Performance Ray Tracing

Chapter 22: Portable Performance with OpenCL

Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations

Chapter 24: Profiling-Guided Optimization

Chapter 25: Heterogeneous MPI optimization with ITAC

Chapter 26: Scalable Out-of-Core Solvers on a Cluster

Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization

Chapter 28: Morton Order Improves Performance

 

Volume 2 includes the following chapters:

Foreword by Dan Stanzione, TACC

Chapter 1: Introduction

Chapter 2: Numerical Weather Prediction Optimization

Chapter 3: WRF Goddard Microphysics Scheme Optimization

Chapter 4: Pairwise DNA Sequence Alignment Optimization

Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery     

Chapter 6: Amber PME Molecular Dynamics Optimization

Chapter 7: Low Latency Solutions for Financial Services

Chapter 8: Parallel Numerical Methods in Finance    

Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization

Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism In Practice  

Chapter 11: Visual Search Optimization

Chapter 12: Radio Frequency Ray Tracing

Chapter 13: Exploring Use of the Reserved Core

Chapter 14: High Performance Python Offloading

Chapter 15: Fast Matrix Computations on Asynchronous Streams 

Chapter 16: MPI-3 Shared Memory Programming Introduction

Chapter 17: Coarse-Grain OpenMP for Scalable Hybrid Parallelism  

Chapter 18: Exploiting Multilevel Parallelism with OpenMP

Chapter 19: OpenCL: There and Back Again

Chapter 20: OpenMP vs. OpenCL: Difference in Performance?      

Chapter 21: Prefetch Tuning Optimizations

Chapter 22: SIMD functions via OpenMP

Chapter 23: Vectorization Advice  

Chapter 24: Portable Explicit Vectorization Intrinsics

Chapter 25: Power Analysis for Applications and Data Centers

 


Parallel Programming Books


Use these parallel programming resources to optimize with your Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor.

High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches ›
by James Reinders and James Jeffers | Publication Date: November 17, 2014 | ISBN-10: 0128021187 | ISBN-13: 978-0128021187

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming model, illustrating the most effective ways to better tap the computational potential of systems with Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors or other multicore processors.

More details on the 1st and (new) 2nd volume of the High Performance Parallelism Pearls can be found here


Structured Parallel Programming: Patterns for Efficient Computation ›
by Michael McCool, James Reinders and Arch Robison | Publication Date: July 9, 2012 | ISBN-10: 0124159931 | ISBN-13: 978-0124159938

This book fills a need for learning and teaching parallel programming, using an approach based on structured patterns which should make the subject accessible to every software developer. It is appropriate for classroom usage as well as individual study.


Intel® Xeon Phi™ Coprocessor High Performance Programming ›
by Jim Jeffers and James Reinders – Now available!

The key techniques emphasized in this book are essential to programming any modern parallel computing system whether based on Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors, or other high performance microprocessors.


Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors

Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors ›
by Colfax International

This book will guide you to the mastery of parallel programming with Intel® Xeon® family products: Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. It includes a detailed presentation of the programming paradigm for Intel® Xeon® product family, optimization guidelines, and hands-on exercises on systems equipped with the Intel® Xeon Phi™ coprocessors, as well as instructions on using Intel software development tools and libraries included in Intel Parallel Studio XE.


Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers ›
by Reza Rahman

Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers provides developers a comprehensive introduction and in-depth look at the Intel Xeon Phi coprocessor architecture and the corresponding parallel data structure tools and algorithms used in the various technical computing applications for which it is suitable. It also examines the source code-level optimizations that can be performed to exploit the powerful features of the processor.


Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops ›
by Alexander Supalov

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interface (MPI) and OpenMP for multi-threading to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters.

ANSYS* Scales Simulation Performance


The Need for Speed in Simulation-Based Design

Engineering simulation software has changed how companies design products, enabling them to explore and test more design options faster, while reducing the need for physical prototyping. ANSYS software has played a central role in this transition and is now used by 96 of the top 100 industrial companies on the FORTUNE 500 list.

ANSYS* customers have an insatiable need for higher computing performance, so they can model and test bigger assemblies with greater fidelity. Many of them also want to generate deeper insights with each simulation, by integrating additional physics variables, exploring nonlinear and composite materials, and evaluating more complex and dynamic environmental conditions. Yet fast time to results remains a critical factor for most customers to meet aggressive time-to-market requirements. As the size and complexity of the simulations grow, the speed and capacity of the computing platform must also grow.

Check out the full ANSYS case study.

Modernize Your Code for Higher Performance

The Intel® Modern Code Developer Program offers extensive resources, insights, and collaborative opportunities that can help you achieve next-generation performance through multi-level parallelism—while achieving a high return on your investment.

Learn more at http://software.intel.com/moderncode

 


 

Virtual Archers Put Gesture to the Test with Longbow*


By John Tyrrell

Developer Jason Allen reworked Longbow*, an archery mini-game initially released in 2013 for mobile devices, for Intel® RealSense™ technology-enabled laptops, PCs, and tablets. The game—slated for release in August 2015—is a relatively simple archery simulation set in a series of medieval-style environments where players take into account distance, wind speed, and wind direction to hit the bull’s eye (Figure 1).

Players fire at traditional-looking targets in a rustic setting
Figure 1: Players fire at traditional-looking targets in a rustic setting.

Intrigued by the 3D input possibilities of the Intel® RealSense™ SDK and Intel® RealSense™ 3D Camera (F200), Allen saw the archery-based gameplay as a perfect opportunity to use hand and gesture tracking for the action of aiming and firing arrows.

 

Optimizations and Challenges

 

Gesture Controls

Longbow for Intel RealSense technology is played using the hand-tracking capabilities of the Intel RealSense 3D camera, with the player’s forward bow-holding hand used for aiming and the rear hand used to fire. To aim arrows, the game detects the first hand raised to the camera and records its initial 3D position, which then becomes the center point. The player then aims by moving the forward hand, with the game tracking the distance and direction of it in relation to the center point. Mimicking the action of a real archer, the natural choice is to hold the aiming hand in a fist, but this is not actually required by the game.

The following code tracks the user's hand movements. It first queries the mass center using the Intel RealSense SDK member function QueryMassCenterImage from the PXC[M]HandData interface.  If it finds that there hasn't been an initial orientation, it assumes the user has just raised their hand to the camera and recalibrates the starting point. Otherwise, it measures the distance the user has moved their hand from the initial point and interprets that movement in the game (in Longbow's case, it rotates the camera).

PXCMPointF32 imageLocation = data.QueryMassCenterImage();

// Normalize the image-space mass center to the range [-1, 1].
Vector3 normalizedImageLocation = new Vector3(
    imageLocation.x / input.colorWidth * 2f - 1f,
    imageLocation.y / input.colorHeight * 2f - 1f,
    0f);

// No starting point yet: calibrate one over a short delay.
if (initialOrientation == Vector3.zero)
{
    if (!calibrating)
    {
        calibrationElapsed = 0f;
        moveDelta = Vector3.zero;
        calibrating = true;
    }
    else if (calibrationElapsed > .3f)
    {
        initialOrientation = new Vector3(-normalizedImageLocation.y, normalizedImageLocation.x, 0f);
        moveDelta = Vector3.zero;
        calibrating = false;
    }
}

if (initialOrientation == Vector3.zero) return;

// Distance moved from the initial point, scaled by sensitivity.
moveDelta = (initialOrientation - new Vector3(-normalizedImageLocation.y, normalizedImageLocation.x, 0f)) * sensitivity;

The next code snippet measures the user's hand depth in the same way. First, it finds if there is an existing point of reference, and if not, calibrates one. It then measures the delta (this time in depth) from the first point to get how far the player has pulled their hand back.

if (initialDepth == 0f)
{
    // The first reading becomes the reference depth.
    initialDepth = data.QueryMassCenterWorld().z;
}

// How far the hand has moved toward or away from the camera.
myDepth = data.QueryMassCenterWorld().z - initialDepth;

The player uses the other hand to pull back the arrow while the game uses depth tracking to measure its distance from the forward aiming hand (Figure 2). When the rear firing hand reaches a predetermined distance from the aiming hand, the arrow automatically fires.

Players use their hands to mimic the gestures of aiming and drawing a bow
Figure 2: Players use their hands to mimic the gestures of aiming and drawing a bow.

Allen originally tested a different and more realistic firing gesture whereby the action of opening the rear hand would release the arrow. However, the responsiveness of this gesture proved inconsistent and hence frustrating for players. As a result, the hand-opening gesture was abandoned during development in favor of the simpler depth-based mechanism.

Despite players occasionally misfiring arrows as they became accustomed to the maximum distance they could pull back without firing, the depth-based system proved to be much more reliable and accurate, resulting in a more enjoyable experience.

Interpolating Data

When using human physical motion as an input device―either through the use of a 3D camera or an accelerometer in a handheld device―a common issue is jittery on-screen movements. This is caused by the constant minute movements of the hand and the sensitivity of the input device―in this case the Intel RealSense camera―and the sheer volume of precise data it generates.

With Longbow, Allen used the Unity 3D function, “Lerp” (linear interpolation), a process that averages the input data to deliver smooth on-screen movement. The process identified the optimum number of times per second the game needed to pull the hand-detection data from the camera to prevent detectable lag for the user. This turned out to be 5 to 10 times per second (considerably lower than the game’s frame rate of 30 frames per second). Next, linear interpolation is applied to the data, which averages the input data and estimates where the hand will be. This process results in a smooth and accurate on-screen rendering of the player’s movements. Allen smoothed the camera’s rotation based on the moveDelta value calculated earlier. The smoothness value determines how much to smooth the input; too much and you get lagged movements, and too little causes the movement to jump around by tiny amounts.

transform.rotation = Quaternion.Lerp (transform.rotation,
Quaternion.Euler (moveDelta + new Vector3 (0f, yOffset, 0f)),
Time.deltaTime * smoothness);

Allen also discovered that pulling data from the Intel RealSense camera as infrequently as possible and applying interpolation reduces the load on the processor, which helps the game maintain a steady frame rate and run more smoothly. This is particularly helpful when running the game on less powerful devices and ultimately improves the overall user experience.

 

Optimizing the UX

 

The biggest issue Allen faced during development was adapting the game’s user experience for the Intel RealSense camera. He initially explored applying gesture controls to the game’s entire user interface, from the menu selections right through to gameplay, to make the game accessible without the need for touch or a mouse and keyboard. Using gestures to stop, start, navigate the menus, and make selections worked on a functional and technical level, but Allen found that the process fell significantly short of delivering an enjoyable user experience.
 

The first problem was the complexity of teaching the user which actions to use and where to use them. Players were required to memorize a set of three specific hand gestures to navigate the menu and start the game. Allen found that players would frequently confuse gestures, resulting in unwanted outcomes. Additionally―particularly when the original closed-to-open fist firing gesture was still in the game―Allen found that players would sometimes trigger an unwanted action such as pausing the game, adding to their frustration.

No Offense

Another interesting challenge that Allen faced while implementing the initial gesture-controlled interface was making sure that the gestures recognized by the Intel RealSense SDK were appropriate for an international audience. For example, the “two-finger pinch” or OK symbol, made by bringing together the tips of the thumb and forefinger, has a potentially offensive meaning in Brazilian culture. The inability to use certain commonly recognized gestures, and the need to create original ones, made the process of creating a gesture control scheme that users could memorize even more complex.

Heavy Hands

One unexpected issue that Allen found with the gesture controls was the physical discomfort players experienced from having to hold their hands in front of the camera throughout the game. This led to aching arms, which significantly reduced the fun factor. To address this issue, Allen modified the game to allow players to drop their hands between rounds, instructing the Intel RealSense camera to go through the process of detecting the hands again at the start of each new round.

Keeping With Tradition

Overall, the game’s initial gesture-only interface proved non-intuitive to players and added a layer of complexity to the navigation. In the end, both Allen and Intel agreed that the menu interface would work better using touch and traditional mouse and keyboard controls. In the case of Longbow, where the game is played in close proximity to the camera and screen, these traditional interface controls are easy and accessible for the player and deliver a significantly more intuitive and comfortable user experience.

 

Testing and Analysis

 

As an independent developer, Allen had no testing pool and conducted the local testing alone using only his own computer. Fortunately for Allen, working with the Intel RealSense SDK meant he was able to count on Intel’s support at each stage of development. He used the Intel RealSense SDK documentation provided during the early phases, relying more heavily on the support of Intel engineers as the project took shape. Throughout their collaboration, Intel provided valuable feedback on the implementation of the gesture controls, including for the interface and the actions of drawing and firing arrows.

The main problems that arose through testing were the arrow-release mechanism and the user interface as described previously. The initial firing mechanism involved opening the fist to release the arrow, and testing showed that many users were unable to consistently fire this way. This led directly to the implementation of the modified firing mechanism based on drawing distance, whereby the arrow is fired when the drawing hand reaches a certain distance away from the forward aiming hand. Testing also led to the return to traditional mouse, keyboard, and touch controls for the game’s main navigation.

 

Intel RealSense SDK: Looking Forward

 

Following his Intel-inspired discovery of the Windows* Store, Allen now develops games for web and Windows devices in addition to his core work for the mobile market. His keen interest in developing for emerging platforms is what led to his involvement with Intel and his work in bringing Longbow to the Intel RealSense SDK platform.

Developing for the Intel RealSense SDK opened Allen’s mind to a world of new possibilities, the first being head tracking and simulations, either in a game or in an actual simulator where, for example, the user is being taught a new skill. The ability to look around a virtual world without having to wear head gear is a capability that Allen has already experimented with in his previously released game Flight Theory*.

Allen believes that Intel RealSense technology is a new frontier offering exciting new user experiences that will be available to growing numbers of consumers once the hardware begins its commercial rollout.

 

What’s Next for Longbow

 

Longbow was initially developed for mobile platforms, and the Windows version currently uses the same art assets (Figure 3). Allen intended to upgrade the graphics when he began developing the Intel RealSense SDK-enabled version of the game, but unexpected UX challenges sidelined the task, although a visual update is still high on the list of priorities.

Allen borrowed from the past to add more fun and a frisson of danger to Longbow*
Figure 3: Allen borrowed from the past to add more fun and a frisson of danger to Longbow*.

Now that Allen has the Intel RealSense SDK Gold release, he might revisit the original finger-tracking gesture control for firing arrows, using the release finger movement rather than the pullback distance-sensitive release mechanism.

 

About the Developer

 

Driftwood Mobile is the studio of independent developer Jason Allen based in Tasmania, Australia. Allen initially founded the studio in 2008 to develop games for the blind and visually impaired, having noted that few experiences were available that adapted to that audience. Around the same time, the mobile gaming and app market was beginning to explode, a shift that Jason has successfully capitalized on with the release of five separate mobile titles to date. Collectively, the games have accumulated over 40 million downloads over the last three years, with bowling game Galaxy Bowling* being the most successful, both in terms of user numbers (currently approximately one million active users) and revenue.

Allen is currently exploring how to make Galaxy Bowling (Figure 4) accessible for the blind and visually impaired with vital support from the community. According to Allen, the core challenge in adapting a game for visually impaired players is distilling the large amount of information simultaneously displayed on a screen into comprehensible audio-based directions, which need to be delivered in a linear sequence so the player can process them. Allen aims to take the experience beyond the coded bleeps of early games, using more realistic sound effects to direct the player, and his experiments so far have proved surprisingly successful in delivering a fun experience.

Galaxy Bowling* for iOS* and Android* devices is Allen’s most successful title to date
Figure 4: Galaxy Bowling* for iOS* and Android* devices is Allen’s most successful title to date.

 

Additional Resources

 

Driftwood Mobile developer website

Intel® Developer Zone for RealSense™ Technology

Intel RealSense SDK

Intel® RealSense™ Developer Kit

Intel RealSense Technology Tutorials

Intel® Fortran Compiler 16.0 Release Notes


This page is under construction - please ignore it for now.

This page provides links to the current Release Notes for the Intel® Fortran Compiler 16.0 component of Intel® Parallel Studio XE 2016 for Windows*, Linux* and OS X*. 

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

For Release Notes for other versions, visit Intel® Compiler Release Notes.

 

Intel® Fortran Compiler 16.0 for
Windows*

Intel® Fortran Compiler 16.0 for
Linux*

Intel® Fortran Compiler 16.0 for
OS X*

Initial Release Notes
English

Initial Release Notes
English

Initial Release Notes
English

Intel® C++ Compiler 16.0 Release Notes


This page is under construction - please ignore it for now.

This page provides links to the current Release Notes for the Intel® C++ Compiler 16.0 component of Intel® Parallel Studio XE 2016 for Windows*, Linux* and OS X*. 

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

For Release Notes for other versions, visit Intel® Compiler Release Notes.

 

Intel® C++ Compiler 16.0 for
Windows*

Intel® C++ Compiler 16.0 for
Linux*

Intel® C++ Compiler 16.0 for
OS X*

Initial Release Notes
English

Initial Release Notes
English

Initial Release Notes
English

Conduct validation and debugging to meet industry compliance with Intel® Stress Bitstreams and Encoder.


To prove compliance with a video standard (HEVC, VP9), Stress Bitstreams provides a complete set of test vectors that can be used for a short sanity check or for full-range validation. These bitstreams put a decoder in worst-case decoding conditions, stressing memory access and maximizing computational complexity. To hunt for holes in an implementation, the Random Encoder supports exploratory testing: every new “seed” produces a new bitstream with new cross-syntax combinations, and a flexible configuration file keeps the value range of every syntax element under your control.

Media Server Studio is a proven set of development and validation tools for meeting a high bar of codec quality. Stress Bitstreams and Encoder (SBE) for VP9 and HEVC is one of the leading products in this domain for designing and developing decoders to standard requirements. Because it is impractical to test a decoder against every encoder in existence, SBE provides the Random Encoder to make the encoder footprint less predictable. This encoder is not focused on quality or compression; its main purpose is control over the syntax elements specified in the standard. Intel SBE addresses codec developers’ needs: maximized coverage of values from the allowed ranges, randomized cross-combinations of syntax elements, and maximized source-code coverage of the tested decoder with a minimal bitstream footprint.

SBE benefits developers in multiple areas: independent integrated circuit and intellectual-property silicon codec designers, set-top box vendors and digital TV manufacturers, enterprise and consumer software vendors, and video player and transcoder solution developers and integrators. SBE is designed for validation-process automation and is easy to integrate into a test system through its command-line API and configuration files. MD5 checksums and reference decoders are provided as part of the product. Joint code-branch and syntax-element coverage in HTML form is included in the package for each codec profile, and the reference decoder source is used to showcase code coverage.

The randomized approach degrades the visual picture, introducing strong visual artifacts, especially when residual randomization is enabled. SBE therefore also offers visually clean bitstreams for digital TV, STB, and other naked-eye end-user validation. Randomization is the key to creating the provided bitstreams: convergence is guaranteed by a uniform distribution, while a fixed “seed” makes every result reproducible.

Every development and validation team will find its own part of this multi-purpose product:

  • Debug streams: All standard features (for a profile) are distributed into a few buckets: Intra, Inter, and Extra. Each bucket contains several bitstreams indexed by complexity from 001 to 250. Codec designers can use these streams to make incremental improvements during development.
  • Stress streams (worst-case performance; decoding with maximized read-memory bandwidth): When a product claims a particular feature set, it has to be validated against the worst conditions assumed for it. Stress streams help identify issues with memory traffic, computation, or the binary arithmetic engines. Test your solution under the expected worst-case performance conditions.
  • Visually clean streams: The randomization approach creates a lot of noise in the visual picture. If heavily artifacted pictures are a concern, analogous bitstreams labeled “Visual” are available. These are well suited to non-automated digital TV testing, where the human eye is the testing instrument.
  • Make your own test clips: The Random Encoder is a unique tool for making a custom bitstream for your needs. It is highly configurable, letting you control every syntax element and compile the required stream. Do not expect coding efficiency from this tool: compliance is its only target. The tool puts the power of this approach in your hands.
  • Short “killer” streams: Quick smoke validation is key to time-to-market. Speed up your design turnaround with tiny-footprint bitstreams that still provide high syntax coverage.
  • Special cases: The product includes special corner-case tests. We are also happy to serve special requests that address your design limitations.

Learn more and Download a Free Evaluation at https://software.intel.com/en-us/intel-stress-bitstreams-and-encoder

Attend IDF'15: HPC Developer Showcase


Intel has just launched the Intel® Modern Code Developer Community to help HPC developers code for maximum performance on current and future hardware. The initiative is targeted to more than 400,000 HPC-focused developers and partners. 

As part of Intel’s Modern Code initiative, you will have access to the tools, training, knowledge and the expert support you need to move ahead with your code modernization efforts. This includes modernization techniques such as vectorization, memory and data layout, multi-threading and multi-node programming.

But that’s not all.  Coming up in August is a conference you won’t want to miss.

Intel Developer Forum (IDF)

The dates: August 18-20, 2015. 

The place: San Francisco, Moscone Center

Register today and take advantage of your special HPC perks and discounts: Use Promo Code BUBHPC for a $695 Full Conference Pass or CPDHPC for a Complimentary “Pick Your Day” Pass.

Come Join Us

If you’re immersed in or contemplating code modernization, this year’s IDF should be at the top of your to do list.  Join your colleagues, innovators, and community experts in learning how you can modernize your existing code and write new code to achieve breakthrough performance on your HPC systems.

IDF will showcase the experiences of intrepid HPC pioneers who have completed, or are in the process of, full-on code modernization. Learn about scalable systems solutions while you meet the experts. Find out what’s next in HPC – its hot new technologies and their impact on tomorrow’s innovations.

Here are a few important sessions you’ll want to attend:

  • Code Modernization Roundtable – A panel discussion with Intel Black Belt software developers. They will discuss: why parallel performance is important; the need to modernize your code; which applications can benefit; and how Intel can help.

  • Parallel Programming Pearls – Inspired by the Intel Xeon Phi Products Course, which dives into real-world parallel programming optimization examples from around the world. Experience the wit and wisdom of enthusiast, author, editor and evangelist James Reinders.

  • Hands-on Lab – An introduction to programming and optimization with Intel Xeon and Intel Xeon Phi processors.

  • Code Modernization Best Practices – A course on multi-level parallelism for Intel Xeon and Intel Xeon Phi processors. This session with Robert Geva, Intel Principal Engineer, will provide details on the growth in hardware resources and characterize performance using different levels of parallelism.

  • Software-Defined Visualization: Fast, Flexible Solutions for Rendering Big Data – Get the latest information on industry progress for an open-source software-defined visualization rendering stack on Intel® Xeon® processors and the Intel® Xeon Phi™ product family without the need for specialized hardware such as GPUs.

And those are just for openers. There are many more great talks, poster chats, and keynotes at this event.

To learn more, visit the IDF website: Intel Developer Forum



Optimizing legacy molecular dynamics software with directive-based offload



Directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In this paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result in speedups on the CPU; in extreme cases up to 4.7X. We provide results for LAMMPS benchmarks and for production molecular dynamics simulations using the Stampede hybrid supercomputer with both Intel® Xeon Phi™ coprocessors and NVIDIA GPUs. The optimizations presented have increased simulation rates by over 2X for organic molecules and over 7X for liquid crystals on Stampede. The optimizations are available as part of the “Intel package” supplied with LAMMPS.

 

Programming and Compiling for Intel® Many Integrated Core Architecture


Compiler Methodology for Intel® MIC Architecture

This article is part of the Intel® Modern Code Developer Community documentation, which supports developers in improving application performance through a systematic, step-by-step optimization framework and methodology. This article addresses parallelization.

This methodology enables you to determine your application's suitability for performance gains using Intel® Many Integrated Core Architecture (Intel® MIC Architecture). The following links will help you understand the programming environment and evaluate the suitability of your application for the Intel® Xeon® and Intel® MIC environment.

The Intel® MIC Architecture provides two principal programming models: the native model covers compiling applications to run directly on the coprocessor, while the heterogeneous offload model covers running a main host program and offloading work to the coprocessor, including standard offload and the Cilk_Offload model. The following chapter gives you insights into the applicability of these models to your application.

The next chapter covers topics in parallelization. This includes Rank parallelization and Thread parallelization with links to various parallelization methods and resources along with tips and techniques for getting optimal parallel performance. In this chapter, you will learn techniques for the Intel OpenMP* runtime library provided with the Intel compilers, Intel® MPI, Intel® Cilk™ Plus, and Intel® Threading Building Blocks (Intel® TBB).

The third level of parallelism associated with code modernization is vectorization and SIMD instructions.  The Intel compilers recognize a broad array of vector constructs and are capable of enabling significant performance boosts for both scalar and vector code.  The following chapter provides detailed information on ways to maximize your vector performance.

Because of the rich and varied programming environments provided by the Intel Xeon and Xeon Phi processors, the Intel compilers offer a wide variety of switches and options for controlling the executable code that they produce. This chapter provides the information necessary to ensure that a user gets the maximum benefit from the compilers.

The final chapter in the section provides insight into some advanced optimization topics, including floating point accuracy, data movement, thread scheduling, and many more. This is a good chapter for users who are still not seeing their desired performance or who are looking for the last level of performance enhancements.

 

A Structured Performance Optimization Framework for Simultaneous Heterogeneous Computing


Heterogeneous computing platforms with a multicore host system and many-core accelerator devices have taken a major step forward in the mainstream HPC computing market this year with the announcement of the HP Apollo 6000 System’s ProLiant XL250a server with support for Intel® Xeon Phi™ coprocessors. Although many application developers attempt to use it in the same way as GPGPU acceleration platforms, doing so forfeits the processing capability of the multicore host processors and introduces power inefficiency in corporate IT operations. In this paper, we propose an application optimization framework to turn a sequential legacy application into a highly parallel application that makes use of the hardware resources both on the host CPU and on the accelerator devices to enable simultaneous heterogeneous computing. As a case study, we look at how to apply this framework and adopt a structured methodology to adapt a European option pricing application to take advantage of a heterogeneous computing environment.

Download the complete PDF.

 

The DRNG Library and Manual


Download

Download the static binary libraries, source code, and documentation:

This software is distributed under the Intel Sample Source Code license.

About

This is the DRNG Library, a project designed to bring the rdrand and rdseed instructions to customers who would traditionally not have access to them.

The "rdrand" and "rdseed" instructions are available at all privilege levels to any programmer, but tool chain support on some platforms is limited. One goal of this project is to provide access to these instructions via pre-compiled, static libraries. A second level goal is to provide a consistent, small, and very easy-to-use API to access the "rdrand" and "rdseed" instruction for various sizes of random data.

The source code and build system are provided for the entire library allowing the user to make any needed changes, or build dynamic versions, for incorporation into their own code.

Getting Started

For ease of use, this library is distributed with static libraries for Microsoft* Windows* and Microsoft* Visual Studio*, Linux Ubuntu* 14.10, and OS X* Yosemite*. The library can also be built from source; this requires Visual Studio with the Intel® C++ Compiler (or Visual Studio 2013) on Windows, or GNU* gcc* on Linux and OS X*. See the Building section for more details.

Once the static library is compiled, it is simply a matter of linking the library with your code and including the header in the header search path. Linking the static library is beyond the scope of this documentation, but for demonstration, a simple Microsoft* Visual Studio* project is included, named test, as well as a simple project with a Makefile for Linux or OS X. Source for the test is in main.c, and the test project on Linux uses the top-level Makefile. The rdrand.sln solution includes the test project.

Rdrand is only supported on 3rd generation Intel® Core processors and beyond, and rdseed is only supported on 5th generation Intel® Core processors and Core M processors and beyond. It therefore makes sense to determine whether these instructions are supported by the CPU, which is done by examining the appropriate feature bits after calling cpuid. To ease use, the library handles this automatically, and stores the results internally and transparently to the end user. This result is stored in global data and is thread-safe, given that if one thread of execution supports rdrand, they all will. Users may find it more practical, however, to call the RdRand_isSupported() and RdSeed_isSupported() functions when managing multiple potential code paths in an application.

The API was designed to be as simple and easy-to-use as possible, and consists of these functions:

int rdrand_16(uint16_t* x, int retry);
int rdrand_32(uint32_t* x, int retry);
int rdrand_64(uint64_t* x, int retry);

int rdseed_16(uint16_t* x, int retry_count);
int rdseed_32(uint32_t* x, int retry_count);
int rdseed_64(uint64_t* x, int retry_count);

int rdrand_get_n_64(unsigned int n, uint64_t* x);
int rdrand_get_n_32(unsigned int n, uint32_t* x);

int rdseed_get_n_64(unsigned int n, uint64_t* x, unsigned int skip, int max_retries);
int rdseed_get_n_32(unsigned int n, uint32_t* x, unsigned int skip, int max_retries);

int rdrand_get_bytes(unsigned int n, unsigned char *buffer);
int rdseed_get_bytes(unsigned int n, unsigned char *buffer, unsigned int skip, int max_retries);

Each function calls rdrand or rdseed internally for a specific data-size of random data to return to the caller.

The return value of these functions indicates success, that the hardware was not ready (if no retry was specified), or that the host hardware does not support the desired instruction at all.

Building

Building the DRNG Library is supported under Microsoft* Visual Studio 2013*, Linux*, and OS X*. Use of the Intel® C++ Compiler is optional unless you are using a version of Visual Studio earlier than 2013 on Microsoft Windows*.

To build the library, open the rdrand Microsoft Visual Studio solution (rdrand.sln) and build the project as normal from the build menu. Included are two projects: rdrand, the actual library, and test, the demonstration program.

On Linux and OS X the build is wrapped with GNU* autoconf*. To build:

$ ./configure
$ make

Release Notes

The DRNG Library is simple and as of this release is functionally complete. There are no known issues.

© 2015 Intel Corporation. All rights reserved.

Add Intel® RealSense™ Runtime to your App Installer

Because your "end users" will not have the Intel® RealSense™ SDK installed on their system, your app installer MUST include the RealSense Runtime Installer.

The Runtime Installer version MUST MATCH the SDK version on which the app was built. A system can simultaneously hold multiple versions of the runtime, as they are foldered separately (look under \\Program Files (x86)\Common Files\Intel\RSSDK\v(n)).

The Runtime Installer for versions prior to v4 (Gold R2) is included in the SDK itself. For v4, the runtime installer is downloadable from here.
For Gold R3 (v5), the runtime installer is downloadable from here (scroll down after the SDK and DCMs). See which to use below.

We strongly recommend (and require for Showcase Apps) that the Runtime Installer be run silently. If you do not run the installer silently, the user may override the modules chosen and make your app inoperable. Use the command
--silent --no-progress --acceptlicense=yes

The Runtime Installer components MUST MATCH the modules used in the app. You may choose to use the entire package, but the size is very large (especially if you don't use voice). You can list the components to add using --finstall=<feature-list> --fnone-all with commas separating the features. The list of components is shown in Features and Components.

There are some compatibility issues possible if you do not install the runtime properly, since other apps may have already installed some (or none) of the same version's components on the user's system. This can be caused by the installer type used to add the runtime.

There are multiple runtime installers mentioned in the SDK Manual. However, the intel_rs_sdk_runtime_YYYY.exe installer is the one most likely to provide what your user needs. With great thanks to Tion, here's the best use model:

  • The package installer that you can create from the SDK itself (from the full intel_rs_sdk_offline_package_r_(v#).exe) should only be used on a clean system with no previously installed versions of the runtimes (packaged with any app) and no SDK package installed. Assume for most apps that you can NOT know that the user has a clean system. (However, for custom kiosks or similar clean environments, it is an option.) If other apps have installed newer versions of the runtime, using the full SDK to create the package installer will FAIL (possibly silently, and the problem will not be seen until the app itself fails).
    Do NOT confuse this SDK package (which may include the words 'offline package') with the websetup package (which is probably what you need).

     
  • The Core runtime installer (intel_rs_sdk_runtime_core_r_(v#).exe) will ONLY provide raw streaming functionality, and will not include support for modules with algorithms. So again, it is only for custom-environment apps. This is called the "Capture Only" installer because it only provides the captured frame.
  • For any shared-apps system that needs algorithm support, we recommend the intel_rs_sdk_runtime_(v#).exe installer, which can also be downloaded in sections if you use the intel_rs_sdk_websetup_(v#).exe installer. These two installers are the ones to use with the silent-install and component-list command lines listed above. These runtime downloads are available near the bottom of the page at https://software.intel.com/en-us/intel-realsense-sdk/download for R2 and R3 RealSense (v4 and v5).

For more information, see also the SDK Developer's Guide Runtime section.

Other Installation Considerations:

During installation and the app's initial run, you should also check the following:

Language packs are very large and have separate installations for size and licensing reasons. See the SDK documentation Speech Runtime and language packs.

 
