
Performance Gains for Ayasdi Analytics* on the Intel® Xeon® Processor E7-8890 V3


Introduction

Ayasdi deploys vertical applications that utilize Topological Data Analysis to extract value from large and complex data. The Ayasdi platform incorporates statistical, geometric, and machine-learning methods through a topological framework to more precisely segment populations, detect anomalies, and extract features.

This paper describes how Ayasdi’s Analytics* running on systems equipped with the Intel® Xeon® processor E7-8890 v3 gained a performance improvement over running on systems with the previous generation of Intel® Xeon® processor E7-4890 v2.

Ayasdi Analytics and the Intel® Xeon® Processor E7-8890 v3

 

                                     Number of Cores   Number of Threads   Memory   Vectorization
Intel® Xeon® processor E7-4890 v2          15                 30            DDR3    Intel® Advanced Vector Extensions
Intel® Xeon® processor E7-8890 v3          18                 36            DDR4    Intel® Advanced Vector Extensions 2

Table 1. Comparison between processors

Table 1 shows a comparison between the Intel Xeon processor E7-8890 v3 and the Intel Xeon processor E7-4890 v2. The Intel Xeon processor E7-8890 v3 has 18 cores compared to the 15 cores of the Intel Xeon processor E7-4890 v2, which allows more parallelism and results in a performance improvement. Furthermore, the Intel Xeon processor E7-8890 v3 has higher memory bandwidth than the Intel Xeon processor E7-4890 v2 and uses DDR4 memory while the Intel Xeon processor E7-4890 v2 uses DDR3 memory, which further speeds up execution.

In terms of software advantages, the Intel Xeon processor E7-8890 v3 supports Intel® Advanced Vector Extensions 2 (Intel® AVX2) while the Intel Xeon processor E7-4890 v2 supports only Intel® Advanced Vector Extensions (Intel AVX). In addition, the Intel Xeon processor E7-8890 v3 introduces Bit Manipulation Instruction sets, BMI1 and BMI2. These instruction sets speed up vector and matrix operations and the core computations of complex machine-learning algorithms.

Ayasdi Analytics was optimized for the Intel Xeon processor E7-8890 v3 by using the new Intel AVX2 intrinsic functions, especially fused multiply-add (FMA) and the BMI instructions. This optimization was accomplished by hand-coding in C++ and through the use of the Intel® Math Kernel Library (Intel® MKL), version 11.2.
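Ayasdi's hand-tuned kernels are not shown in this article. As a purely illustrative sketch of the kind of Intel AVX2 fused multiply-add intrinsic work described above (the function name and loop are ours, not Ayasdi's), consider:

// Illustrative sketch only (not Ayasdi code): accumulate y += a * x over
// double-precision arrays using the AVX2/FMA intrinsic _mm256_fmadd_pd.
// Assumes n is a multiple of 4 and the compiler is invoked with FMA support
// enabled (for example, -mfma or -xCORE-AVX2).
#include <immintrin.h>

void fma_axpy(double a, const double* x, double* y, int n)
{
	__m256d va = _mm256_set1_pd(a);               // broadcast the scalar a
	for (int i = 0; i < n; i += 4) {
		__m256d vx = _mm256_loadu_pd(x + i);      // load 4 doubles of x
		__m256d vy = _mm256_loadu_pd(y + i);      // load 4 doubles of y
		vy = _mm256_fmadd_pd(va, vx, vy);         // y = a*x + y in one instruction
		_mm256_storeu_pd(y + i, vy);              // store the result
	}
}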

Performance Test Procedure

To show that Intel AVX2 along with the new microarchitecture in the Intel Xeon processor E7 v3 Family increase the throughput of Ayasdi Analytics, we performed tests on two platforms. One system was equipped with the Intel Xeon processor E7-8890 v3 and the other with the Intel Xeon processor E7-4890 v2.

Performance is measured in terms of the following:

  • The throughput of analyses (analyses per hour) that can be supported by the cluster, with acceptable latency.
  • The job latency of the analyses in minutes. The job latencies were measured with nine users concurrently accessing the systems.

Test Configurations

System equipped with the Intel Xeon processor E7-8890 v3

  • System: Pre-production
  • Processor: Intel Xeon processor E7-8890 v3 @2.5 GHz
  • Cores: 18
  • Memory: 1 TB DDR4-1600 MHz

System equipped with the Intel Xeon processor E7-4890 v2

  • System: Pre-production
  • Processor: Intel Xeon processor E7-4890 v2 @2.8 GHz
  • Cores: 15
  • Memory: 1 TB DDR3-1333 MHz

Operating system: Red Hat Enterprise Linux* 7.0

Application: Ayasdi Analytics Benchmark

Test Results


Figure 1. Performance comparison between processors.

Figure 1 shows a 1.85x performance gain of the system with the Intel Xeon processor E7-8890 v3 over that of the system with the Intel Xeon processor E7-4890 v2. The performance gain is due to the enhanced microarchitecture, increase in core count, better memory type (DDR4 over DDR3), and Intel AVX2.


Figure 2. Latency comparison between processors.

Figure 2 shows the reduction in latency on the system with the Intel Xeon processor E7-8890 v3 over that of the system with the Intel Xeon processor E7-4890 v2. The decrease in latency is attributed to the enhanced microarchitecture, the increase in core count, the better memory type (DDR4 over DDR3), and Intel AVX2.

Conclusion

More cores, an enhanced microarchitecture, and support for DDR4 memory contributed to the performance improvement of Ayasdi Analytics on systems equipped with the Intel Xeon processor E7-8890 v3 compared to those with the Intel Xeon processor E7-4890 v2. With the introduction of Intel AVX2, matrix manipulations get a performance boost. In addition, applications that make use of Intel MKL will see a performance improvement without having to change their source code, since the Intel MKL functions are optimized using Intel AVX2.

For More Information

Ayasdi Official Website http://www.ayasdi.com

Topological Data Analysis http://www.ayasdi.com/blog/topology/topological-data-analysis-a-framework-for-machine-learning/
http://www.ayasdi.com/wp-content/uploads/2015/02/Topology_and_Data.pdf

Machine learning http://www.sas.com/en_us/insights/analytics/machine-learning.html

Fused Multiply-Add http://rd.springer.com/chapter/10.1007%2F978-0-8176-4705-6_5

Intel Math Kernel Library (Intel MKL) https://software.intel.com/en-us/intel-mkl

Intel Advanced Vector Extensions https://software.intel.com/en-us/articles/intel-mkl-support-for-intel-avx2

Intel Xeon E7 v3 processor product family https://software.intel.com/en-us/articles/intel-xeon-e7-4800-v3-family


Game Development with Unity* - An Example


Download PDF

Collaboration between Intel and Unity Technologies has resulted in x86 native support for Android*, which is available in Unity* 5. To demonstrate this native support, this blog describes how to build a simple game with Unity in the style of Pac-Man* 3D.

What is Unity?

Unity is a cross-platform game engine developed by Unity Technologies and is used to develop video games for PCs, consoles, mobile devices, and websites. Unity Technologies is notable for its ability to target games for multiple platforms. Within a project, developers have control over delivery to mobile devices, web browsers, desktops, and consoles. Supported platforms for this popular game engine include BlackBerry* 10, Microsoft Windows Phone* 8, Microsoft Windows*, Apple OS X*, Android, Apple iOS*, Unity Web Player* (including Facebook*), Sony PlayStation* 3, PlayStation 4, PlayStation Vita, Microsoft Xbox* 360, Xbox One*, Nintendo Wii U*, Nintendo 3DS* line, and Wii*.

Unity is actively used by major companies such as Blizzard, Ubisoft, and Electronic Arts, as well as various indie studios. For indie studios, Unity is attractive because it has a free version and is relatively simple to use. Also, the Unity Technologies Asset Store lets you buy or get some free assets to create your own game. For more information about Unity, to buy a Unity Pro version, or to get a free personal version, go to: http://unity3d.com/. For more information about the Asset Store, go to: https://www.assetstore.unity3d.com/en/.

Game Development

Introduction

Unity is simple to use, and even a beginning developer can create a real game with Unity. For this blog, I use the popular childhood game, Pac-Man, as the basis for the game, adding some elements from geometric games, which are popular in Google Play*. Thus we will create a game that will resemble a real commercial game.

Preparation

First we need to create a scene. I will not spend a lot of time explaining how to move objects or where to put them, because those actions are subjective. You can find all available game objects in the "GameObject" menu. The menu provides various 2D and 3D objects, lighting sources, audio objects, and others.

For this example, the protagonist is a sphere, which we will call Player. There are some other spheres in the scene, which we will call Points. Player picks up these points. We also want to count the points that accumulate.

We can assign a script to each object in the scene. These scripts control how their own objects move around the scene. Scripts can be written in C# or JavaScript*. The commonly accepted convention is to end each script name with "Controller," for example, PlayerController or CameraController.

To create a script, follow these steps:

  • In the Project panel, select Create.
  • In the drop-down menu, select C# or JavaScript (in this case, C#).

When you have created the script:

  • Click the object in the Hierarchy panel for which you wrote the script.
  • In the Inspector panel, click the Add Component button, and then choose Script.

Moving Around the Scene

Here is the code for moving around the scene:

using UnityEngine;
using System.Collections;
public class PlayerController : MonoBehaviour
{
	private Rigidbody RigidBody;
	public int SpeedUp = 10;
	void Start()
	{
		RigidBody = GetComponent<Rigidbody> ();
	}
	void FixedUpdate()
	{
		float VerticalMove = Input.GetAxis ("Vertical");
		float HorizontalMove = Input.GetAxis ("Horizontal");

		Vector3 Move = new Vector3 (HorizontalMove, 0, VerticalMove);
		RigidBody.AddForce (Move * SpeedUp);
	}
}

Let's look at this code in more detail.

First, we create a variable of the type RigidBody to tell Unity that this object can physically interact with other objects in the scene and move around the stage.

The Start() function does everything that happens at the moment the game starts. Inside this function we assign a RigidBody to our variable. Thus we can control our object with the help of a variable.

The FixedUpdate() function does everything that happens during the runtime of the game. With the help of Input.GetAxis, we get the vertical and horizontal movement when we press the control keys. You will need to create the appropriate variables and write the values into them.

Next we need to create a vector, which will move our object. There are two kinds of vectors in Unity: Vector2 and Vector3. We need to use Vector3 because we are developing a 3D game. Let’s create a variable with the type Vector3 and push the variables with horizontal and vertical data into it.

With the help of RigidBody.AddForce, the vector will make our object move. But our object will move slowly, so I created a SpeedUp variable to correct the movement speed.

Interaction with Points

We want to make sure that when Player touches Points, it destroys them. Also, let’s create a counter that counts the number of selected Points and types a message in the center of the screen, such as: “Congratulations! You picked up all points!”

Let’s add this code to our script:

void OnTriggerEnter(Collider other)
{
	if (other.gameObject.CompareTag ("point"))
	{
		other.gameObject.SetActive (false);
	}
}

Colliders are used to detect and handle interactions between objects. In our case, when the Collider of Player contacts the Collider of another object, that object is destroyed; that is why the Collider of the other object is passed as a parameter to the OnTriggerEnter function. We can assign a tag to every object in the scene. This tag tells us which object has interacted with Player. The call other.gameObject.CompareTag signals a collision with a Point with the help of the tag "point". Of course, we need to assign the tag "point" to all the points we have and check Is Trigger in the Collider settings.

The next step is to "destroy" the object by calling the other.gameObject.SetActive(false) function.

You can change the tag of objects in this section of Inspector:


Next we start to work with the UI, because we want to create a counter of picked points. To do so, we need to add the following line to the top of the script:

using UnityEngine.UI;

Before we start coding, we’ll add two text objects via the GameObject > UI > Text menu. Let's call them CountText and Message.

We’ll also add a private int variable CountOfPickedPoint and two public Text variables called CountText and Message.

The Start() function will be as follows:

void Start()
{
	RigidBody = GetComponent<Rigidbody> ();
	CountOfPickedPoint = 0;
	SetCountOfPickedPointsText ();
	Message.text = "";
}

The OnTriggerEnter() function will become as follows:

void OnTriggerEnter(Collider other)
{
	if (other.gameObject.CompareTag ("point"))
	{
		other.gameObject.SetActive (false);
		CountOfPickedPoint = CountOfPickedPoint + 1;
		SetCountOfPickedPointsText ();
	}
}

The SetCountOfPickedPointsText() function is as follows:

void SetCountOfPickedPointsText()
{
	CountText.text = "Count of Picked Points : " + CountOfPickedPoint.ToString();
	// 47 is the total number of points placed in this scene.
	if (CountOfPickedPoint >= 47)
	{
		Message.text = "Congratulations! You picked up all of the points!";
	}
}

Thus, our text variables and counter will reset at the moment the game starts. Counter will increase and CountText will change when our Player interacts with Points. If the value of Counter is equal to or greater than the number of points on the scene, we get a message saying we successfully finished the game. Thus, we’ve finished our PlayerController.

Camera

Now we want to correct the camera options to follow Player. To do this, we change the CameraController.

  • We measure the distance between the camera and Player at the moment the game starts.
  • During runtime, we change the location of the camera to maintain that distance.

Thus, our camera follows Player everywhere. The CameraController script is as follows:

using UnityEngine;
using System.Collections;

public class CameraController : MonoBehaviour
{
	public GameObject Player;
	private Vector3 OffSet;

	void Start ()
	{
		OffSet = transform.position - Player.transform.position;
	}

	void LateUpdate ()
	{
		transform.position = Player.transform.position + OffSet;
	}
}

That’s it! Plain and simple!

Build the game

Here are the steps to build the game:

Step 1

Build the game - step 1

Step 2

Build the game - step 2

Step 3

After clicking Build, specify the save location. That’s it! Your APK now has native x86 support.

About the Author

Artem Gruzdev works in the Software & Services Group at Intel Corporation. His main interests are performance optimization and parallel programming. In his current role as a Software Engineering Intern, Artem works closely with the PARSEC-3.0 Benchmark Suite developers to achieve the best possible performance on Intel platforms. Artem holds a Bachelor’s degree in Applied Mathematics and Computer Science from the N.I. Lobachevsky State University of Nizhni Novgorod.

Intel Optimizations in the Android* Compiler


Download Document

By Jean Christophe Beyler

Acknowledgements (alphabetical):
Johnnie L Birch Jr, Dong-Yuan Chen, Olivier Come, Chao-Ying Fu, Jean-Philippe Halimi, Aleksey V Ignatenko, Rahul Kandu, Serguei I Katkov, Maxim Kazantsev, Razvan A Lupusoru, Yixin Shou, Alexei Zavjalov

Introduction

As the Android* ecosystem continues to evolve, Intel is working with OEMs to provide an optimized version of the Android runtime, thus providing better performance on Intel® processors. One of the ecosystem components is the compiler, which has been available for a few years but has recently undergone massive changes.

In the Android ecosystem, the application developer generally writes a section of the application in the Java* language. This Java code is then transformed into an Android-specific bytecode, called the Dex bytecode. In the various flavors of Android’s major releases, there have been several ways to go from the bytecode format to the actual binary format that is run on the processor. For example, in the Jelly Bean* version, there was a just-in-time (JIT) compiler in the runtime, called the Dalvik* VM. From KitKat* on, a new Android runtime (ART) appeared, which added an ahead-of-time (AOT) compiler.

When a compiler is young, there are often stepping-stones to achieving the state-of-the-art transformations one can expect from a more mature compiler. Since the ART compiler is young, the way a programmer writes the code can impact the ability of the compiler to generate optimal code. (For tips on how you can optimize your code for this compiler, see 5 Ways to Optimize Your Code for Android 5.0 Lollipop.) On the other hand, developing a more generic compiler is a huge effort, and it takes time to implement the initial infrastructure and optimizations.

This article describes the basic infrastructure and two optimizations that Intel implemented in the ART compiler. Both optimizations provide a means to simplify compile-time evaluated loops. These elements are only part of Intel’s commitment—amid much more infrastructure and optimizations—to provide the best user experience for the Android OS and are presented here to show how synthetic benchmarks are greatly improved by these optimizations.

Optimizations

Since 2008, there have been three variations of Google’s Android compiler. The Dalvik JIT compiler, available since Froyo, was replaced by the Quick AOT compiler in the Lollipop version of Android. This succession represented a major change in the paradigm for transforming the Dex bytecode, which is Android’s version of the Java bytecode, into binary code. As a result of this change, most of the middle-end optimization logic from the Dalvik VM was copied over to Quick, and, though AOT compilation might provide the compiler more time for complex optimizations, Quick provided infrastructure and optimizations similar to that of the Dalvik VM. In late 2014, the AOSP added a new compiler named Optimizing. This latest version of the compiler is a full rewrite of the Quick compiler and, over time, seems to be adding more and more infrastructure to enable more advanced optimizations, which were not possible in previous compiler versions.

At the same time, Intel has been working on adding its own infrastructure and optimizations to the compiler to provide the best user experience for Intel processor-based devices. Before presenting the optimizations that were implemented by Intel in the Android compiler, the next section shows two small optimizations that are classic in production compilers and help provide a better understanding of the two Intel optimizations.

Constant Folding and Store Sinking

Constant Folding and Store Sinking are well-known optimizations in the compiler literature. This section presents them and shows how the Android* compiler builds upon them to deliver more complex optimizations. The Constant Folding pass used in the Android* compiler was developed by Google, while the Store Sinking algorithm was implemented by Intel.

Consider the code:

a = 5 + i + j;

If the compiler is able to determine that j is actually equal to 4, the compiler simplifies the code into:

a = 5 + i + 4;

Then, with a bit of rewriting, it obtains:

a = 5 + 4 + i;

At this point, the compiler is generally able to simplify this into:

a = 9 + i;

The transformation of (5 + 4) to 9 is called constant folding. In a way, the compiler has deemed that the addition can be calculated during compilation and doesn’t have to wait for the generated code to figure out that the result is 9. This is a key concept used during the Trivial Loop Evaluator optimization presented next.

Now, what happens when we have a loop in which the constant is added?

for (i = 0; i < 42; i++) {
	a += 4;
}

At this point, it is not clear that Constant Folding can simplify the code. However, as the article will show, both the Constant Calculation Sinking and the Trivial Loop Evaluator optimizations can.

Another common optimization is to delay the store of a variable until after the loop is done. For example:

for (i = 0; i < 42; i++) {
	a += 4;
	final_value = a;
}

Since the loop is not using the value stored in final_value, the compiler can sink the assignment after the loop, transforming the code into:

for (i = 0; i < 42; i++) {
	a += 4;
}
final_value = a;

Note that the current implementation only does this for class members and not local variables. Both these simple optimizations showcase simple transformations that the compiler can do. The following two optimizations are more complex but build on these two concepts: if you know that you can delay the calculation, then you can put the code after the loop.

The next sections present two optimizations implemented in the Android compiler that build on Google’s implementation of Constant Folding and Intel’s implementation of Store Sinking.

Constant Calculation Sinking

The Constant Calculation Sinking optimization sinks any calculation based on constants that can be evaluated at compile time.

Consider the following loop:

public int AddEasyBonusPoints(int basePoints) {
	for (int i = 0; i < GetBricks(); i++) {
		basePoints += GetEasyPointsPerBrick(i);
	}
	return basePoints;
}

Assume that the function is from a Breakout-style game, where the player tries to destroy all the bricks using a bouncing ball. In the game, the user can trigger a bonus score. The bonus score is the sum of the brick points of the level.

GetBricks simply returns the integer 42 because that version of the level has 42 bricks and GetEasyPointsPerBrick simply returns 10 as the value because, for example, at the Easy level each brick has the same cost. This is a plausible, albeit simple, piece of code and if the compiler has a good inliner, the loop is transformed into the version shown below.

Inlining

Before performing this optimization, the first step is to inline the two methods as stated previously. Once inlined, the code looks like this:

public int AddEasyBonusPoints(int basePoints) {
	for (int i = 0; i < 42; i++) {
		basePoints += 10;
	}
	return basePoints;
}

At this point, it is clear that the calculation does not require a loop. Intuitively, the calculation can be simplified to:

public int AddEasyBonusPoints(int basePoints) {
	return basePoints + 420;
}

The rest of this section explains how the compiler supports and handles this and how the optimization handles other operations such as subtraction or multiplication.

Inner Workings: Integer Cases

The optimization rewrites statements of the form:

var += cst;

by moving the calculation after the loop (an operation known as sinking). The calculation can also be a subtraction, remainder, division, or multiplication. The following table shows the possible transformations.

Operation                Transformed Into
var += cst               var += cst * N, where N is the number of iterations
var -= cst               var -= cst * N, where N is the number of iterations
var %= cst               var %= cst
var *= cst, var /= cst   var = result, if the result does not overflow and the operation can be safely sunk;
                         var = 0, if 0 is obtained during evaluation of the calculation

Of course, for the optimization to be safe, a few checks must take place. First, the variable must not be used in the loop. If the accumulation is being used, sinking the end result creates a problem because the intermediate value is required during the loop’s execution. A second requirement is that the bounds of the loop must be known at compile time.

For multiplication and division, as the algorithm calculates the value’s progression, the calculation hits a fixed point if the value becomes zero. Zero multiplied or divided by anything is zero; therefore the calculation can be simplified to 0.
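As a purely conceptual sketch (this is not the ART compiler's actual code), the compile-time evaluation of a sunk multiplication might look like the following, bailing out on overflow and stopping early at the zero fixed point:

// Conceptual sketch (not the ART implementation): evaluate "var *= cst"
// executed 'iterations' times at compile time. Writes the final value and
// returns true if the whole progression stays within 32 bits; otherwise
// returns false so the loop is left untouched.
#include <cstdint>

bool EvaluateMultiplyLoop(int32_t initial, int32_t cst, int64_t iterations, int32_t* result)
{
	int64_t value = initial;
	for (int64_t i = 0; i < iterations; i++) {
		value *= cst;
		if (value == 0) {                 // fixed point reached: zero stays zero
			*result = 0;
			return true;
		}
		if (value > INT32_MAX || value < INT32_MIN) {
			return false;                 // overflow: sinking the result is not safe
		}
	}
	*result = static_cast<int32_t>(value);
	return true;
}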

Supporting Floating-Point Cases

While working on the integer version, the multiplication and division case spurred the idea that a similar technique could be used for the floating-point cases. The optimization can do two things: first, it can sink the result if fully calculated, meaning the compiler was able to evaluate the full calculation. Second, if the iteration count is high enough, the compiler sinks zero. That mechanism brought the idea that multiplying by a given value in floating-point might lead to infinity and dividing by a value might lead to 0.0.

However, unlike the integer case, where the compiler can determine that the result will reach 0 in certain cases regardless of the initial value, in the floating-point case the initial value is required to ensure that the calculation converges.

Impact on CF-Bench*

The Constant Calculation Sinking optimization is performed on small loops where the accumulation by a constant can be sunk, resulting in better performance. When considering the various existing Android benchmarks, the CF-Bench benchmark suite contained a few of those loop types.

Before showing the results of applying the Constant Calculation Sinking optimization, note that Intel’s ART team added a second small optimization that removes unused loops; that is, loops that have no side effects, no method invocations, no possible exceptions, and no calculations used by code after the loop. This is a common optimization that allows the compiler to essentially remove dead code. It allows the loop to be completely removed after all the constant calculations have been sunk, which would otherwise leave only the skeleton of the for loop.


Figure 1: Constant Calculation Sinking speedup on CF-Bench* on the TrekStor SurfTab xintron i 7.0

Figure 1 shows the speedup achieved on various subtests of the CF-Bench benchmark suite. There is a major speedup on the integer versions due to the compiler removing the whole loop of each case after sinking the various calculations. For the single-precision floating-point version, named Java MSFLOPs, the optimization does provide better performance, though it is only a 4x improvement.

Summary

Constant Calculation Sinking is an important optimization. When considered individually, it is not immediately clear whether it can be applied to real-world cases, but the game example shows how candidate loops are recognized. Integrating the optimization provides a means to optimize loops that emerge after method inlining, for example.

For the Constant Calculation Sinking optimization, the synthetic benchmark CF-Bench benefited in terms of score for a few of its subtests. In the compiler domain, it is rare to obtain speedups of the order of magnitude seen for this benchmark. It also shows that the impact on the MIPS subtest score overshadows any future optimization applicable to the other subtests.

Trivial Loop Evaluator

The second optimization is a generalization of Intel’s Constant Calculation Sinking optimization. Let’s look at another example from the Breakout-style game and assume the developer wants to compute the maximum number of points for the level:

public int CalculateMaximumPoints() {
	int sum = 0;
	// Starting points per brick is 10.
	int points_per_brick = 10;
	int num_colors = GetBrickColors();
	// Go through the number of colors.
	for (int i = 0; i < num_colors; i++) {
		int num_bricks = GetNumberOfBricks(i);
		int points_in_level = num_bricks * points_per_brick;
		points_per_brick += GetPointIncrementPerLevel(i);
		sum += points_in_level;
	}
	return sum;
}

Now the three helper functions are imagined to be:

public final int GetBrickColors() {
	// Six colors: red, blue, green, black, white, yellow.
	return 6;
}

public final int GetNumberOfBricks(int i) {
	// We want fewer and fewer bricks as the brick level goes up.
	// This provides us with: 84, 75, 64, 51, 36, 19.
	// We assume i is < 7; if it is not, we go negative :), call it a feature for this example.
	return 100 - (i + 4) * (i + 4);
}

public final int GetPointIncrementPerLevel(int i) {
	// 10 points per level, makes the progression a bit less linear.
	return 10 * i;
}

These are simple methods that perform a calculation, but it is not trivial to pre-calculate this loop by hand, and, in any case, as the developer tweaks the code of the helper functions, updating the maximum number of points is not manageable. That being said, let’s focus on what the compiler does with this.

Inliner

The first step lies once again in inlining the methods. Since the three methods are simple enough, they are inlined directly in the method, which means that the compiler then considers the method as if it were actually written as:

public int CalculateMaximumPoints() {
	int sum = 0;
	// Starting points per brick is 10.
	int points_per_brick = 10;
	int num_colors = 6;
	// Go through the number of colors.
	for (int i = 0; i < num_colors; i++) {
		int num_bricks = 100 - (i + 4) * (i + 4);
		int points_in_level = num_bricks * points_per_brick;
		points_per_brick += 10 * i;
		sum += points_in_level;
	}
	return sum;
}

Once this is done, the loop can actually be entirely computed at compile time, since it is using only local variables and does not rely on data from memory. Therefore, it can be evaluated and simplified at compile time into:

public int CalculateMaximumPoints() {
	int sum = 9520;
	return sum;
}

This is much simpler and much more efficient! The Trivial Loop Evaluator is implemented in a simple manner: first, calculate all the inputs and their initial constant values; then, during the loop evaluation, keep track of the state of all the registers, emulating instruction by instruction and updating the register state; finally, once the loop completes, calculate the outputs that require updating and emit the corresponding instructions.
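As a conceptual illustration only (the names and data structures below are ours, not ART's), such an evaluator can be thought of as a tiny interpreter that runs the loop body against a table of constant register values:

// Conceptual sketch only (not ART's data structures): evaluate a loop whose
// instructions all operate on compile-time constants by interpreting them
// against a map of virtual-register values.
#include <cstdint>
#include <map>
#include <vector>

enum class Op { Add, Sub, Mul };

struct Instruction {
	Op op;
	int dest;   // virtual register written
	int lhs;    // virtual register read
	int rhs;    // virtual register read
};

// Returns the final register state after running 'body' for 'iterations'
// passes, starting from the constant inputs gathered before the loop.
std::map<int, int64_t> EvaluateTrivialLoop(const std::vector<Instruction>& body,
                                           std::map<int, int64_t> registers,
                                           int iterations)
{
	for (int i = 0; i < iterations; i++) {
		for (const Instruction& insn : body) {
			int64_t a = registers[insn.lhs];
			int64_t b = registers[insn.rhs];
			switch (insn.op) {
				case Op::Add: registers[insn.dest] = a + b; break;
				case Op::Sub: registers[insn.dest] = a - b; break;
				case Op::Mul: registers[insn.dest] = a * b; break;
			}
		}
	}
	return registers;   // the compiler then emits constants for the live outputs
}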

Impact on Quadrant*

Figure 2 shows the benefit of the Trivial Loop Evaluator on the Quadrant benchmark suite. The suite contains a few subtests that benefit from it. Note that the scores are improved by a factor of 6 for the Double subtest and by up to 60x for the Long and Short subtests. Finally, the overall score improved by 26x.


Figure 2. Trivial Loop Evaluator’s impact on the Quadrant* benchmark on the TrekStor SurfTab xintron i 7.0

Inherently, when the optimization is used, the loop is removed. Therefore, when it is applied, the score of the benchmark jumps if that is the loop being timed.

Summary

The Trivial Loop Evaluator is a powerful optimization when every loop input is available to the compiler. For this optimization to apply realistically, other optimizations such as inlining must occur as well. This is paramount for the two optimizations discussed in this article: developers generally don’t write loops to which they are directly applicable.

Conclusion

This article described two optimizations—Constant Calculation Sinking and Trivial Loop Evaluation—that the Intel team worked on. These optimizations are a stepping-stone for more advanced optimizations for two reasons: they are used at the end of the code simplification process, and they also simplify loops, which in turn helps later optimizations simplify the code further.

The team’s future work lies in enabling more calculations and loops to be simplified and removed. More aggressive inlining is the first task that comes to mind. As the examples showed, the first step in simplifying the code is often inlining code and then optimizing the new version.

Cordova Whitelisting with Intel XDK for AJAX and launching external apps


Cordova CLI 5.1.1

Starting with Cordova CLI 5.1, the security model that uses domain whitelisting to restrict access to other domains from the app has changed. By default, Cordova apps are now configured to allow access to any site, but it is recommended that before you move your app to production you provide a whitelist of the domains that you want your app to have access to.

Starting from Cordova Android 4.0 and Cordova iOS 4.0, the security policy is extended through the Whitelist Plugin. For other platforms, Cordova uses the W3C Widget Access specification for domain whitelisting.

The Whitelist Plugin uses 3 different whitelists and Content Security Policy.

Navigation Whitelist:

The Navigation Whitelist controls which URLs the WebView can be navigated to. (Only top-level navigations are allowed; the exception is Android, where it also applies to iframes for non-http(s) schemes.) By default, you can only navigate to file:// URLs. To allow other URLs, the <allow-navigation> tag is used in the config.xml file. With the Intel XDK you need not specify this in config.xml; the Intel XDK automatically generates config.xml from the Build Settings.

In the Intel XDK you specify the URL that you would like the WebView to be navigated to under Build Settings > Android > Cordova CLI 5.1.1 > Whitelist > Cordova Whitelist > Navigation. For example: http://google.com


Intent Whitelist:

The Intent Whitelist controls which URLs the app is allowed to ask the system to open. By default, no external URLs are allowed. This applies only to hyperlinks and calls to window.open(). The app can open a browser (for http:// and https:// URLs) or other apps such as phone, SMS, email, maps, etc. To allow the app to launch external apps through a URL or launch the inAppBrowser through window.open(), the <allow-intent> tag is used in config.xml; but again, you need not specify this in config.xml, as the Intel XDK takes care of it through the Build Settings.

In the Intel XDK specify the URL you want to whitelist for external applications under Build Settings > Android > Cordova CLI 5.1.1 > Whitelist > Cordova Whitelist > Intent. For example: http://example.com, tel:*, or sms:*


Network Request Whitelist:

The Network Request Whitelist controls which network requests, such as content fetching or AJAX (XHR) requests, are allowed to be made from within the app. For web views that support CSP, it is recommended that you use CSP; this whitelist is for older web views that do not support CSP. The whitelist is defined in config.xml using the <access origin> tag, but once again, in the Intel XDK you provide the URL under Build Settings > Android > Cordova CLI 5.1.1 > Whitelist > Cordova Whitelist > Network Request. For example: http://mywebsite.com

By default, only requests to file:// URLs are allowed, but Cordova applications by default include access to all websites. It is recommended that you provide your whitelist before publishing your app.
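Put together, the config.xml that the Intel XDK generates from these Build Settings would contain entries along the following lines (illustrative values taken from the examples above; the actual file is produced for you by the Intel XDK):

<!-- Illustrative sketch only: the Intel XDK generates config.xml from Build Settings -->
<widget>
    <!-- Navigation whitelist: pages the WebView itself may load -->
    <allow-navigation href="http://google.com/*" />
    <!-- Intent whitelist: URLs the app may ask the system to open -->
    <allow-intent href="http://example.com/*" />
    <allow-intent href="tel:*" />
    <allow-intent href="sms:*" />
    <!-- Network request whitelist: origins reachable via AJAX or content fetches -->
    <access origin="http://mywebsite.com" />
</widget>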


Content Security Policy:

The Content Security Policy controls which network requests, such as images and AJAX (XHR) requests, are allowed to be made directly by the web view. It is specified through meta tags in your HTML file, and it is recommended that you use a CSP <meta> tag on all of your pages. Android supports CSP from KitKat onward, but the Crosswalk web view supports CSP on all Android versions.

For example, include this in your index.html file:

<meta http-equiv="Content-Security-Policy" content="default-src 'self' data: gap: https://ssl.gstatic.com; style-src 'self' 'unsafe-inline'; media-src *">
<meta http-equiv="Content-Security-Policy" content="default-src 'self' https:">

Important Note:

As of Intel XDK release 2496, Cordova iOS 4.0 has not been released yet, so for iOS the W3C Widget Access policy is used. The settings in the Intel XDK for whitelisting URLs are as follows.

iOS W3CWidgetAcess CLI 5.1.1

 

For Windows platforms also, W3C Widget Access standards are used and the build settings for whitelisting are as follows.

 


Cordova CLI 4.1.2

To use whitelisting with Cordova CLI 4.1.2, please follow this article.

Accelerating texture compression with Intel® Streaming SIMD Extensions


Improving ETC1 and ETC2 texture compression

 

What is texture compression?

Texture compression has been used for some time now in computer graphics to reduce the memory consumption and save bandwidth on the graphics pipeline. It is supported by modern graphics APIs, such as OpenGL* ES and DirectX*. The process of compressing a texture is lossy. Existing algorithms must not only achieve the best speedups but also preserve as much of the original information as possible.

Popular compression algorithms are DXT, PVRTC, ETC and ASTC. These algorithms are designed with more emphasis on decompression speed rather than compression. The reason for this is that in a real-world scenario, graphics artists and engineers are expected to perform compression offline and once, whereas decompression will be done during runtime each time the application starts using textures.

But compression is important too! It is not uncommon for applications to perform runtime compression to save storage space or bandwidth. An example of this can be found in the popular browsers, Opera* and Chrome*. Runtime compression allows the browsers to compress graphic tiles in circumstances where RAM is scarce and data transfers to GPU are expensive (for example, on mobile devices).

ETC textures

ETC stands for Ericsson* Texture Compression and is an open standard supported in OpenGL and OpenGL ES. The technique allows lossy compression of images at a ratio of 4:1 (depending on input format and compression method).

The original format was ETC1 (published as iPACKMAN), which was based on a previous project named PACKMAN [1]. The goal of the format was to enable high-quality and low-complexity texture compression for mobile devices. The format was able to handle opaque textures (discarding the alpha channel or encoding it separately). The format was later extended to ETC2 and included alpha channel compression as well as new methods for improving RGB-image quality.

Open source encoders for this are hard to find. Vendors prefer to keep their solutions proprietary or closed source.

ETC1 textures are supported in Android* and benefit from GPU hardware decompression.

As of API level 22, developers can use the ETC1Util class in the android.opengl package to perform runtime compression in their applications.

ETC works on 4x4 pixel blocks. This allows random access during decompression and good data alignment when it comes to SIMD operations. The pixel ordering in a block is different than what you would otherwise expect to find in memory:


Pixel layout for ETC blocks according to Khronos OpenGL ES Specification

Every pixel can be represented in RGBA color space, with each channel having values from 0 to 255. The total size of a block is 512 bits. The size is important because it allows the data to be unpacked and stored into 128 and 256-bit registers for computation.

The operations performed on the block by ETC encoders aim to determine two base colors around which to approximate the rest. Picking these colors is a complex problem, with an exhaustive search consisting of 2^34 possibilities for a single block.

Improve quality and speed with SSE and AVX instructions

The goal of this article is to present ways in which Intel® Streaming SIMD Extensions can be used to accelerate the encoding process for ETC textures. The speed benefit can be traded to try different combinations and approaches that will in turn improve the quality of the compressed image.

x86 and x86_64 CPUs offer SIMD extensions for operations on large registers:

  • 128-bit for SSE2 through 4.2
  • 256-bit for AVX and AVX2

The supported instructions consist of load/store, logical and arithmetical operations on integers (8-bit, 16-bit, 32-bit) and floating point numbers (single and double precision). These should be enough to cover most compute-intensive parts of the encoding algorithm.

ETC1

ETC1 operates on 4x4 pixel blocks: it splits the block either vertically or horizontally into 4x2 or 2x4 sub-blocks and computes the average color value for each sub-block. Once it does that, it uses a special modifier table to approximate the colors in the block as the average color +/- a value from the table. The table layout can be found in the official specification[2].

The approximation is done by generating all the possible color combinations from the average color and the table and calculating the deviation (or error) from the original pixel. The table index with the smallest error is then selected to encode the outgoing compressed pixel.

The two sub-block base colors (calculated as pixel averages in our case) are represented as 444 rather than 888. If their values are close, then the first color will be represented as 555, and the second will be computed from the first one by adding a 333 delta value. This mode of encoding the two base colors is called differential and is enabled by setting a special bit in the compressed data.

SSE registers can take advantage of ETC block data locality when performing compression. Values corresponding to each color channel can be stored and computed separately. For the current solution, to calculate the average base colors, we can use SSE to split the original 4x4 pixel block into four 2x2 blocks, process these subblocks, and then gather the data for the corresponding 2x4 or 4x2 ETC1 representation:


Computing four pixels at a time reduces the number of instructions and maps well to SSE registers. See code sample below.

This approach is able to match the quality of a generic non-SIMD implementation.

Below are some examples of test cases and results for 256x256 images.

PSNR (peak signal-to-noise ratio) for RGB888 images is defined in terms of the MSE, the mean square error, calculated from the per-pixel differences between the original image and the one resulting after decompression. RMSE represents the corresponding root-mean-square deviation.
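The formulas themselves appeared as images in the original article; a commonly used form, which we assume here (for an image of width $w$ and height $h$, with $I$ the original and $\hat{I}$ the decompressed image), is:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{3 \cdot 255^2}{\mathrm{MSE}_R + \mathrm{MSE}_G + \mathrm{MSE}_B}\right),
\qquad
\mathrm{MSE}_c = \frac{1}{w\,h}\sum_{x,y}\bigl(I_c(x,y) - \hat{I}_c(x,y)\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}.$$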


Top Left: Original image, Top Right: ETC Compressed,
Bottom Right: Stats, Bottom Left: ETC Compressed using SSE2


Above: Average performance improvement of SSE ETC1 implementation compared to regular code


For large images (2048x2048) the graphical deterioration is less visible. This image was compressed in 0.1 seconds (120 MB/sec). And has PSNR = 31 db.

Results for other images behave in a similar manner, with differences between implementations varying by +/-0.1 in PSNR and RMSE, and a speed improvement by a factor of 2 when using SIMD.


Difference between original image and compressed version.
Pixels were computed as 255 - |original - compressed|
The whiter the pixel the closer it is to the original value.

ETC2

ETC2 brings new improvements to ETC1’s compression: it adds both extra modes for RGB textures and alpha compression.

For RGB, it takes advantage of some invalid color combinations that can result during differential mode and introduces three novel compression techniques:

  • T
  • H
  • Planar

T and H mode require the block to be encoded using four RGB444 colors. The decompressed pixels are associated with one of the four colors. Converting a 444 color to 888 is done by duplicating the bits; for example, abcd efgh ijkl becomes abcdabcd efghefgh ijklijkl.
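In code, this bit duplication amounts to the following small helper (the function name is ours):

// Expand a 4-bit channel value to 8 bits by duplicating its bits,
// e.g. binary abcd becomes abcdabcd (0xF -> 0xFF, 0x8 -> 0x88).
#include <cstdint>

static inline uint8_t Expand4To8(uint8_t c4)
{
	return static_cast<uint8_t>((c4 << 4) | (c4 & 0x0F));
}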

In both modes the four colors (paint colors) are computed using two base colors.

The difference between T and H is the relation between the four paint colors. In T mode one base color determines the paint color while the other base color is used to determine the other three. A delta value is selected from a lookup table and subtracted/added to the second base color:

base0
base1 - delta
base1
base1 + delta

The delta is then encoded with the base colors into the compressed block.

In H mode the paint colors are computed as follows:

base0 - delta
base0 + delta
base1 - delta
base1 + delta

This encoding mode makes H suitable for color blocks that can be modulated in the intensity direction.

Last but not least, Planar mode provides a way to compress blocks that slowly vary in one direction (as seen in gradients). It uses three RGB676 colors to represent the rest of the block through a series of weighted averages based on each pixel’s position.


The three colors are used to interpolate the rest of the block on decompression.

Implementing ETC2 for RGB

The most complex task in implementing the T and H modes for ETC2 is determining the base colors in an efficient way (speed-wise and quality-wise). The original paper detailing ETC2 suggests using the Linde–Buzo–Gray (LBG) [3] algorithm and radius compression to identify the two base colors and then iterating through all the available options, finding the best combination of paint colors based on T, H, and deltas.

For our implementation we used a more straightforward approach. We iterated through each compression mode (ETC1, Planar, T, and H) and selected the one that best matched. ETC1 was implemented as described in the previous section.

Planar colors were selected based on the definition, and then the mean square error was computed for each pixel in the block. SIMD instructions allow computing the base colors and MSE up to 60% faster compared to normal code. This mode is the fastest compute-wise, and it makes sense to try it first in order to use its MSE as a reference or threshold for the other compression modes.

As a fast implementation of the T and H modes, we create a map containing the RGB444 colors corresponding to the RGB888 pixels in the block (T and H mode each approximate the block using four RGB444 colors). If the number of elements in the map is one, two, or three, we pass those colors as base colors in T mode. Otherwise, iterating through the delta table to select the best value for each pixel in T mode, we identify all pixels that match within an RGB444 radius and set that color as base0; for the rest of the pixels we compute an average, set that as base1, and determine the best delta.

In H mode we split the colors into two clusters using a k-means algorithm and then compute the base colors as cluster averages, with the delta being selected in order to create the smallest error.

SSE can benefit T and H implementations by computing the mean square error per pixel. When adding and multiplying pixel values at this stage, it makes sense to widen each byte in order to avoid overflows. This means we can only store four, or at most eight, values in an SSE register. The resulting speedup is between 1.5x and 2x.
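As an illustrative sketch of this widening approach (not the production encoder's code), the squared error over eight channel values can be computed with SSE2 as follows:

// Illustrative sketch: sum of squared differences for 8 channel values,
// widened to 16 bits so the products do not overflow.
#include <emmintrin.h>   // SSE2
#include <cstdint>

int SquaredError8(const uint8_t* original, const uint8_t* decoded)
{
	__m128i zero = _mm_setzero_si128();
	// Load 8 bytes from each buffer and widen them to 16-bit lanes.
	__m128i a = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i*)original), zero);
	__m128i b = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i*)decoded), zero);
	__m128i diff = _mm_sub_epi16(a, b);
	// _mm_madd_epi16 multiplies pairs of 16-bit lanes and adds adjacent products,
	// giving four 32-bit partial sums of squared differences.
	__m128i sq = _mm_madd_epi16(diff, diff);
	// Horizontal add of the four 32-bit lanes.
	__m128i sum = _mm_add_epi32(sq, _mm_srli_si128(sq, 8));
	sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
	return _mm_cvtsi128_si32(sum);
}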

AVX allows us to load an entire color channel in a single register (for example, 16 red values expanded to 16 bits each) and perform all the required operations on it. When using all these modes combined, we can gain up to 3 dB in image quality at the expense of a roughly 10-50x performance penalty over ETC1. For comparison, the radius compression proposed by the authors of ETC2 can be 729 to 15,625 times slower, and the gain is 3.5 dB higher than ETC1 and 1 dB higher than LBG for finding T and H base colors[4].

By tuning the way we select the base colors, we can improve the speed to reach about 3x the ETC1 time for an average improvement of 0.5 dB (up to 4 dB for gradients). This improvement makes ETC2 suitable for performing runtime compression in environments where speed is more important than quality.

Impact on the future

This solution has the potential to reduce memory consumption and save power on devices that use compressed textures. Using SIMD instructions to implement the texture encoders over regular code might further reduce the memory accesses required to perform the operation by storing most of the data in registers.

New encoding techniques such as Adaptive Scalable Texture Compression (ASTC) can be accelerated using Intel® Streaming SIMD Extensions and can make run-time compression a viable alternative on mobile and entry-level devices.

Instruction sets such as AVX-512 could allow more data to be handled at once and enable better precision on performance-oriented machines. While decoding takes place in the hardware, encoding can be achieved in a reasonable amount of time and with sufficient quality in software by taking advantage of CPU features such as SSE and AVX.

Code Samples


Computing average values for vertical and horizontal sub-blocks in ETC1
Input data is stored as __m128i registers: red[0], red[1], red[2], red[3]
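The original code sample was published as an image and is not reproduced here. A hedged reconstruction of the idea, assuming the register layout described in the caption (red[i] holds the four 32-bit red values of pixel row i of the 4x4 block), might look like this:

// Hedged reconstruction, not the original sample: compute per-sub-block
// averages of the red channel for both the horizontal (4x2) and vertical (2x4)
// splits of an ETC1 block.
#include <emmintrin.h>

void AverageRedSubBlocks(const __m128i red[4],
                         int* top, int* bottom, int* left, int* right)
{
	// Horizontal split: rows 0-1 form the top 4x2 sub-block, rows 2-3 the bottom.
	__m128i sum_top    = _mm_add_epi32(red[0], red[1]);
	__m128i sum_bottom = _mm_add_epi32(red[2], red[3]);

	// Vertical split: add all rows, then separate the left and right column pairs.
	__m128i sum_all = _mm_add_epi32(sum_top, sum_bottom);

	int lane[4];
	_mm_storeu_si128((__m128i*)lane, sum_top);
	*top = (lane[0] + lane[1] + lane[2] + lane[3]) / 8;      // 8 pixels per sub-block

	_mm_storeu_si128((__m128i*)lane, sum_bottom);
	*bottom = (lane[0] + lane[1] + lane[2] + lane[3]) / 8;

	_mm_storeu_si128((__m128i*)lane, sum_all);
	*left  = (lane[0] + lane[1]) / 8;                        // columns 0-1, all 4 rows
	*right = (lane[2] + lane[3]) / 8;                        // columns 2-3, all 4 rows
}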

References

[1] http://www.jacobstrom.com/publications/StromAkenineGH05.pdf
[2] https://www.khronos.org/registry/gles/specs/3.2/es_spec_3.2.pdf
[3] http://mlsp.cs.cmu.edu/courses/fall2010/class14/LBG.pdf
[4] http://www.ep.liu.se/ecp/016/002/ecp01602.pdf

Application Performance Profiling – When to use Intel® Graphics Performance Analyzers and Intel® VTune Amplifier


I am often asked the question “What is the difference between Intel® Graphics Performance Analyzers and VTune Amplifier?”  It’s a very good question and the answer boils down to use case and workflow.  Both tools are extremely useful and equally important when profiling applications.

Before we jump in and discuss their differences and use cases, let’s do a quick review of each of the tools. 

VTune Amplifier

VTune Amplifier is a trace-based analysis tool used for deep analysis of a given program’s runtime. VTune Amplifier collects a plethora of data, ranging from CPU, GPU, bandwidth, and threading data and more. You can use VTune Amplifier’s hotspot analysis to quickly identify the lines of code that are taking the most time. Many applications today are multi-threaded; VTune Amplifier offers “Locks and Waits” analysis to quickly find common causes of slow threaded code. In addition, VTune Amplifier also offers extensive GPU analysis for OpenCL workloads and GPU profiling. Finally, VTune Amplifier provides hardware-based profiling to help analyze your code’s efficient use of the microprocessor. This is an extremely brief summary of some of the tools within VTune Amplifier; for more detailed information, head on over to VTune Amplifier’s website.

Intel® Graphics Performance Analyzers (GPA)

Intel GPA is a suite of four analyzers and a graphics monitor.   Each tool is used to profile a different stage of the graphics optimization workflow.  System Analyzer is for live metrics analysis of a remote target.   Platform Analyzer is for analyzing CPU bound graphics applications.  Frame Analyzer for DirectX and Frame Analyzer for OpenGL are for analyzing GPU bound graphics applications.  GPA is designed for graphics analysis, providing a fast way to debug, analyze and optimize games and graphics applications.  For more information about GPA, check out the GPA website.  

When do I use each tool?

Since both VTune Amplifier and GPA profile the performance of applications, the questions often arises – “Which one do I use in my situation?”  The answer to this question leads to better understanding the differences between the two tools. 

The simple answer is, if it’s a graphics application, start with Graphics Performance Analyzer and then move to VTune Amplifier if your application ends up being CPU bound.   If your application is not a graphics application, start with VTune Amplifier’s basic hotspot analysis.  Below is a general decision chart to use when profiling a given application for performance. 

If the target application is a GPU bound graphics application, use Frame Analyzer.   If the application is a CPU bound graphics application, start with Platform Analyzer and move on to VTune Amplifier for a deeper dive if needed.   If the application is not a graphics application, VTune Amplifier is the best tool to start with. 

What are the differences?

Both tools are used when profiling a CPU bound graphics application, so what are the differences? GPA is designed primarily for game developers who create complex 3D environments.  Platform Analyzer is used when developers want to quickly make correlations between GPU and CPU frames, view durations and serialization of the DirectX or OpenGL ES render threads, and study the live metrics captured in System Analyzer in more detail.  Platform Analyzer is a GPU centric application, primarily showing CPU data to correlate with GPU data.  VTune Amplifier is used when that developer wants to go deeper and analyze the performance of the code line by line. 

For example, a developer finds that their game is CPU bound with System Analyzer, so they capture a trace and proceed to analyze that trace in Platform Analyzer. While in Platform Analyzer, the developer finds that the render thread is not the issue, but that the game business logic is not threaded correctly. At this point, the developer would use VTune Amplifier’s “Locks and Waits” analysis to help solve the threading issue.

Knowing when to use each tool is the best way to understand the differences between the tools.  For questions regarding either GPA or VTune Amplifier, head on over to the forums (GPA, VTune Amplifier).  We are always happy to help answer any questions you might have.

Asteroids and DirectX* 12: Performance and Power Savings


Download Code Sample

The asteroids sample that Intel developed is an example of how to use the Microsoft DirectX* 12 graphics API to achieve performance and power benefits over previous APIs and was initially shown at SIGGRAPH 2014. Now that DirectX 12 is public, we are releasing the source code for the sample. In it, we render a scene of 50,000 fully dynamic and unique asteroids in two modes: maximum performance and maximum power saving. The application can switch between using the DirectX 11 and DirectX 12 APIs at the tap of a button.

All of the results here were captured on a Microsoft Surface* Pro 3 when the tablet was running in a steady, thermally constrained state. This state represented the experience of playing a demanding game for more than 10–15 minutes.

Performance

In the performance mode the application is allowed to run as fast as possible within the thermal and power constraints of the platform. Using DirectX 11, we see the following:

frame rate and the distribution of power between the CPU and GPU

The image shows the frame rate (top left) and the distribution of power between the CPU and GPU. Toggling the demo to run on DirectX 12 shows a significant improvement.


Performance with DirectX 12 increases ~70 percent (from 19 FPS to 33 FPS). The power graph explains why this is happening. DirectX 12 is designed for low-overhead, multithreaded rendering. Using the new API we reduced the CPU power requirement, thus freeing that power for the GPU.

Power

To directly compare the power savings of DirectX 12 to another API, we also support a mode that locks the frame rate so that the demo does the same amount of work in each API. Toggling from DirectX 11 (on the left half of the power graph) to DirectX 12 (on the right) while keeping the workload fixed, we see the following:

DirectX 12 uses less than half the CPU power when compared to DirectX 11

Rendering the same scene with DirectX 12 uses less than half the CPU power when compared to DirectX 11, resulting in a cooler device with longer battery life.

These increases in power efficiency in DirectX 12 are due both to reduced graphics submission overhead and an increase in multithreaded efficiency. Spreading work across more CPU cores at a lower frequency is significantly more power efficient than running a single thread at high frequency.

Summary

With this demo, we have shown that DirectX 12 enables significant improvements in both power and performance. This demo was created to show the two extremes: fixed power and fixed workload. In reality developers can choose any desired blend of power and performance for their applications.

The main takeaway is that power and performance are inseparably linked. Conventional notions of "CPU versus GPU bound" are misleading on modern devices like the Surface Pro 3. An increase in CPU power efficiency can be used for more performance even if an application is not "CPU bound."

More Information

GitHub: https://github.com/GameTechDev/asteroids_d3d12

DirectX Developer Blog: http://blogs.msdn.com/b/directx/
DirectX 12 Twitter feed: https://twitter.com/DirectX12
Intel Software Twitter feed: https://twitter.com/IntelSoftware

Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

For more information go to http://www.intel.com/performance.

Intel® XDK FAQs - IoT

Q1: Can Intel XDK IoT edition run alongside the regular Intel XDK?

Yes and no. Since the IoT edition is a superset of the standard edition, it contains the same features as the standard edition plus some additional features for IoT development, and it is fine to have both installed. However, you cannot run the IoT edition at the same time as the regular edition. This is expected to be fixed in future releases.

Q2: Why isn't the Emulate tab visible?

The Emulate tab will only be visible for projects that are created from "Start with a Template", "Work with a Demo", "Import an Existing HTML5 Project", and "Start with App Designer" options under the App Developer Projects section.

Q3: How do I use the Web Service APIs in my IoT project from main.js?

The main.js file is no different from a typical JavaScript file, except that it runs in the context of Node.js. You can create a simple HTTP server that serves up an index.html. The index.html file should contain references to the JavaScript files that update the HTML DOM elements with the relevant Web Services data, as you would in a typical HTML5 project. The only difference is that you access index.html (the HTML5 application) through the HTTP server function in main.js. The Web Services-enabled application is then accessible through your browser at the IoT device's IP address.

You can find more information here

Q4: How do I connect Intel XDK to my board without an active internet connection?

You can connect to your board through a USB cable from your computer to establish a virtual Ethernet connection. For more information, visit https://software.intel.com/en-us/articles/intel-edison-connecting-ethernet-over-usb

Q5: Can the xdk-daemon run on other Linux distributions besides Yocto?

The Intel XDK IoT Edition xdk-daemon is ONLY supported on Yocto.

Q6: How do I update the MRAA library on my Intel IoT platforms (Intel Edison or Intel Galileo)?

Once you have completed the Yocto Linux image setup, you can update the current version of mraa by running the following commands in a ssh or serial terminal connection:

  • opkg update
  • opkg upgrade
Q7: Where can I download the Intel XDK IoT Edition?

The Intel XDK IoT Edition is available on the Intel IoT downloads page. You will also find a link to additional FAQ pages for the Intel XDK IoT Edition product.



Slash Your MSC Nastran* Simulation Runtimes by Up to 50 Percent


Doubling performance for your most complex MSC Nastran* simulations can deliver a range of high-value options for your engineering and design teams. You may be able to obtain results hours or even days sooner to shorten your design cycles. Or you may choose to test more design options to accelerate innovation, or explore a wider range of coupled variables to gain a deeper understanding of real-world product behavior.

Intel® Xeon Phi™ coprocessors offer a simple, cost-effective option for potentially doubling the performance of large and complex simulations running on MSC Nastran. Each coprocessor provides up to 61 cores and 244 threads, delivers up to 1.2 teraflops of double-precision peak performance,1 and can be added easily to existing clusters and workstations. With the 2016 Alpha release, MSC Nastran will be able to take full advantage of this processing power.

Download the complete PDF Solution Brief.

 

Intel® C++ Compiler Standard Edition for Embedded Systems with Bi-Endian Technology Version Release Notes


This page provides the current Release Notes for the Intel® C++ Compiler Standard Edition for Embedded Systems with Bi-Endian Technology product. All files are in PDF format - Adobe Reader* (or compatible) required.

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

Intel® C++ Compiler Standard Edition for Embedded Systems with Bi-Endian Technology Release Notes

Release_Notes_Compiler.pdf (Version 16.0 Beta)

Version Manager for Cordova* Software


If you are a Cordova* developer, you are familiar with the following command:

npm install -g cordova

This command installs Cordova for you globally on the command line.  But what if you want to install it in another location?  You have to configure your path settings to search for it correctly.  What if you want to use multiple versions (gasp)?  There are a few blog posts out there detailing how to install multiple versions and alias them.  

To the rescue

Inspired by tools like RVM* and NVM*, I created Version Manager for Cordova Software (cvm) to solve this problem. As part of the HTML5 tools team at Intel, I write and maintain a few Cordova plugins. Having to test them on different versions became cumbersome, especially with the Android* switch from Ant* to Gradle*. Based on that experience, the tool is designed to help plugin authors speed up the development and debugging cycle.

The tool

There are two parts to this CLI tool: Cordova version management and dispatching to the correct version of Cordova. The CLI tool allows you to install any valid version of Cordova, switch your current version, and uninstall versions of Cordova. The tool will create a .cvm directory in your home folder where it stores all the Cordova versions. It also creates a .cvmrc file that records the selected version of Cordova. You can always fall back to your system install instead of using a version supplied by cvm.

The second part intercepts calls to the "cordova" command and dispatches them to the correct version. This is accomplished by adding a new cordova file inside the .cvm folder and updating the PATH environment variable to include the .cvm folder ahead of the node modules.

Usage

The tool requires Node.js* and NPM*. To install it, simply run the command:

npm install -g version-manager-cordova-software

After you install, you will need to configure your environment variables for the new cordova script. Windows* users working from the command prompt need to modify the system PATH environment variable. Make sure the .cvm entry comes before the entry for node_modules:

c:\users\myaccount\.cvm;c:\users\myaccount\AppData\roaming\npm

If you are using a shell such as Cygwin* or Git Bash*, update your shell profile (e.g., .bash_profile) to export the PATH environment variable. After updating, restart your terminal or re-source your shell profile.

export PATH="$HOME/.cvm:$PATH"

You will now have a new command, cvm, to manage your Cordova versions. Below are example commands to install, switch, and remove versions of Cordova:

cvm install 5.1.1 - install version 5.1.1

cvm install 5.3.3 - install version 5.3.3

cvm use 5.3.3 - use version 5.3.3

cvm uninstall 5.1.1 - uninstall 5.1.1

cvm use system - use the system version of Cordova

 

MagLens*: A New Perspective on Information Discovery


By Benjamin A. Lieberman, PhD

One of the biggest challenges in a data-rich world is finding information of relevance, particularly when the information you’re seeking is visual in nature. Imagine looking for a specific image file but being limited to a text search of descriptions of the image or laboriously scanning thumbnail pictures one by one. How can you know if the image file was properly categorized? Or categorized at all? What if you need to pick out a single image from tens of thousands of other, similar images?

The engineers at Intel have developed the Intel® Magnifying Lens Tool, an innovative and exciting approach to solving this problem, and led the effort to develop the first web app based on this approach, MagLens. The MagLens technology shows great promise for changing the way individuals approach mass storage of personal information, including images, text, and video. The technology is part of an ongoing effort at Intel to change the way information is captured, explored, and used.

Images and Files Everywhere

Today, most people carry a camera multiple hours a day in the form of a smartphone. More and more of our personal content is online—text, video, pictures, books, movies, social media, and more. The list seems to grow with each passing day.

Users are increasingly storing their files and content in the cloud. Service providers like Amazon, Apple, and Google are all making this migration easier, safer, and less expensive. Personal content is available 24 hours a day, 7 days a week, 365 days a year from practically any mobile device.

Unfortunately, old-style storage structures based on hierarchical folders present a serious barrier to the optimized use of data. Specifically, the classical file storage structures are prone to poor design, mismanagement, neglect, and misuse. Unless users are well organized, these file shares rapidly become a dumping ground, much like books stacked in a pile (Figure 1). The books in the image are clearly organized but not terribly usable. As a result, it becomes increasingly difficult to locate relevant information in these “stacks,” particularly when the current techniques of textual search are applied to visual content.


Figure 1. I know it must be in here somewhere...

In the exploding cloud-storage space, it’s now possible to store tens of thousands of images and related files online. These files typically accumulate over time and are often as haphazardly stored as they are created. Unlike a well-regulated research library, little or no metadata (for example, date, location, subject matter, or topic) is created with the images, making a textual search all but impossible. How can you find the file you want when it could be anywhere, surrounded by virtually anything?

Rapidly scanning this vast forest of files for a single image is a daunting task. Consider the steps involved. First is the desire to locate information of interest: What differentiates that information from other information like it? Next, you attempt to remember where that information was stored. Was it on a local drive or a cloud server; if it’s in the cloud, which service? What was the folder structure? Or is there just a collection of multiple files all stored in one place? Now how do you recognize the file of interest? It’s doubtful that the file name will be helpful (it may be something automatically generated and cryptic, such as 73940-a-200.jpg), and thumbnail images are difficult to see clearly, even on a high-definition display. What’s required is some method to rapidly scan the stored images for specific patterns of shape and color by leveraging our highly evolved visual sense.

MagLens Expands the Options for Discovery

Many years of research in neurobiology and cognitive science have shown that human thinking and pattern recognition are geared toward the visual space. Our brains have evolved to handle the complex (and necessary-to-survival) problem of determining whether the fuzzy shape behind the next clump of bushes is a large, carnivorous animal that would like to invite us over for a quick bite. These days, we’re faced with choices that have somewhat less terminal outcomes, such as finding a set of photos from our most recent vacation. Nevertheless, we would be aided in our task if we could use our well-honed pattern-discovery and matching efficiently.

The Intel® Magnifying Lens Tool is both simple and profound. Like many successful metaphors in computing, the idea is that you should be able to rapidly scan across a visual field, with the focus of attention (that is, the focal point) matching the greatest detailed magnification of an image (Figure 2). Around the focus image, other images are magnified, as well, but in a decreasing amount as you move away from the focal point, similar to the way icon magnification on Mac OS X* works as you pass the cursor over the application Dock. With MagLens, the visual field is the full screen space (rather than a linear bar), allowing a rapid scan across thousands of images in a few seconds. All the images remain in view at all times, varying by size as the focus of attention scans across the field of vision.


Figure 2. MagLens* technology allows dynamic exploration of a visual space.

Contrast this technology with previous attempts at magnification, where the magnified portion of the image blocks out other vital information in the view (Figure 3). Even if the magnification moves around the screen, only the area directly under the cursor is visible: the remainder of the image is blocked from view, hampering the ability of your visual systems to recognize patterns. You have to engage your short-term memory to temporarily store one screen view, and then mentally compare it with the next. In contrast, the MagLens approach makes the entire view available for pattern matching.


Figure 3. Zooming a section of the image obscures large areas of the non-highlighted image.

This “elastic presentation space” approach differs from previous attempts at rapid scanning, such as by “flipping” pages of thumbnails, or dragging a scroll bar, in that it simultaneously gives you a natural scan of the information field (much like how your eyes normally scan a complex visual field), dynamically increasing the level of detail at the point of visual attention. Combined with the natural gesture recognition that 3D recognition technology (such as the Intel® RealSense™ technology) provides, this technique opens the visual computation space to a wide range of applications.  To explore this option, the development team integrated Intel RealSense technology into the prototype to optimize the application for a wide range of content.

Where We Have Been

The research on what would become MagLens began as four years of Intel-sponsored research (1997–2001) by Sheelagh Carpendale, who was working on her doctoral dissertation at the University of Calgary (see “For More Information”). Although the approach she devised has been discussed and written about extensively, there has to date been no successful adoption of a widespread technological approach.  John Light at Intel began to pursue a prototype using Dr. Carpendale’s “elastic presentation space” idea.

Light’s team created a prototype that took advantage of modern computing power to allow users to view hundreds of images at the same time.


Shardul Golwalkar, an Intel intern at the time, expanded this prototype into a more usable proof of concept. The expansion project began with text-heavy content such as textbooks and was later expanded to more visual content exploration through magazines and news publications. Employing a user-centric development technique called Design Thinking (see the sidebar “Design Thinking Overview”), Shardul broadened the prototype into a functional web-enabled space, where it would be possible to perform the visual scan through a standard web browser interface.

At the conclusion of his internship, Shardul continued with the idea as an undergraduate at Arizona State University. During this time, he continued to explore the optimization of the technology and supported project development. Together, Shardul and Light demonstrated that it was possible to model 40,000 simultaneous images to the “discovery space” and enable a multimodal understanding of the data space using both gestures and vision. At the end of this effort the team had succeeded in creating an interface that was intuitive, powerful, and empowering for user-driven self-discovery of new capabilities—a delight for the user.

Where We Are

When the initial development was complete, there was interest at Intel in moving to a more commercial product. Intel sponsored the company Empirical to develop the Intel® Magnifying Lens Tool and move it toward a 2015 product release. Developers at Empirical reworked the original development, building new workflows and making the overall experience more polished and performant. See “For More Information” for a link to the current product.

A major goal of the initial development was to allow users connected to Internet file shares (such as Google Drive*) to view cloud-based files and ultimately enjoy integration across multiple cloud file stores. The product was optimized for web use, especially for touch screen display devices such as All-in-One desktops, 2 in 1s, notebooks, and tablets. Using MagLens, users no longer need to know the file storage hierarchy to find materials. The MagLens site collects all the identified file stores and “flattens” the visualization to a single, scalable 2D space. Now, it’s possible to locate a file of interest regardless of where it resides.

Imagine the Possibilities

Intel selected Empirical to develop the MagLens concept into a viable product based on its years of experience with product design and development. The Intel collaboration with Empirical has discovered many possible applications for MagLens. Indeed, Intel is open to licensing the MagLens code to software vendors, original equipment manufacturers, cloud services providers, and operating system developers—to expand the concept beyond photos and images to applications involving magazines, photography, films, and visualization of complex multimedia information. The Intel contact for licensing inquiries is Mike.Premi@intel.com.

Research is also continuing on the core concept of browsing through additional metadata to enable exploration and sorting for likely conceptual matches, such as a filter or clustering algorithm that gathers similar images (i.e., a digital photo library). Other techniques include using algorithms for facial recognition and integration as a utility of core operating systems.

MagLens shows great promise for changing the ideas around information discovery, organization, and integration. The future of this technology is limited only by our ability to see the possibilities.

References

About the Author

Ben Lieberman holds a PhD in biophysics and genetics from the University of Colorado, Health Sciences Center. Dr. Lieberman serves as principal architect for BioLogic Software Consulting, bringing more than 20 years of software architecture and IT experience in various fields, including telecommunications, rocket engineering, airline travel, e-commerce, government, financial services, and the life sciences. Dr. Lieberman bases his consulting services on the best practices of software development, with specialization in object-oriented architectures and distributed computing—in particular, Java*-based systems and distributed website development, XML/XSLT, Perl, and C++-based client–server systems. He is also an accomplished professional writer with a book (The Art of Software Modeling, Auerbach Publications, 2007), numerous software-related articles, and a series of IBM corporate technology newsletters to his credit.

How I Made A 3D Video Maker


By Lee Bamber

Anyone familiar with perceptual computing technologies has probably heard about the USD 1 million Intel® RealSense™ App Challenge launched by Intel to encourage the creation of innovative and unique applications that leverage the Intel® RealSense™ SDK. The competition was split into two groups, one for newcomers and one for invitation-only ambassadors with previous experience in the field, with a separate prize fund for each group to keep things fair. My entry, “Virtual 3D Video Maker,” was entered into the Ambassador group with the higher prize fund, pitching me against the best Intel® RealSense™ app coders in the world. In early 2015, I learned that I had been declared First Place Winner and recipient of a prize of USD 50,000.

The Virtual 3D Video Maker lets you create your own 3D videos

This article explores my journey and reveals some lessons I learned along the way, along with a few deep dives into the techniques I used to create the final winning app.

The Vision

The idea for “Virtual 3D Video Maker” came while I was working on another app for a previous competition called ‘Perceptucam’. This app involved scanning users and rendering them in virtual 3D while conducting a videoconferencing call.

The Perceptucam application used a virtual board room as the setting for the call.

When creating 3D videoconferencing software, you first need to capture and store the actual 3D representation of the person in front of the camera—no small feat of engineering. About half-way into this ambitious project, it occurred to me that it might be possible to record the 3D data and send it to someone else to be played back at a later time. After seeing my image as a 3D person planted into a virtual scene, I knew that this would be great fun for other users who want to see their friends in 3D. When the Intel RealSense App Challenge was announced, it was an easy decision to enter.

It was also fortunate that my daily life consists of team calls over Skype* or Google Talk* and the rest of the day coding or making “how to” videos and demos. These types of tasks gave me an awareness of the kind of tools needed to record, store, and replay videos, which was a useful insight when you want to recreate such a tool in 3D.

A single piece of paper and a lot of scribbling later, I had the screen layouts, the buttons, and the extent of the functionality mapped out on paper and also in my mind. The mere act of writing it down made it real and once it existed in the real world, all that remained was to code it.

The Elements

Making an Intel RealSense app is not difficult, thanks to the Intel RealSense SDK and the examples that come with it. Circumventing the built-in helper functions and going directly to the raw data coming from the camera is another matter, but this is precisely what I did. At the time there was no command to generate 3D geometry from the output of the Intel® RealSense™ camera, nor could I find any advice on the Internet about it. I had to start with a set of depth values on a 320×240 grid of pixels and produce the 3D geometry myself, and then ensure it was fast enough to render in real time.

Thanks to my status as Ambassador and my previous experience with writing Intel RealSense apps, I merely had to get the 3D representation on the screen as quickly as possible, and then polish the visual a bit. It helped that I knew a programming language called Dark Basic Pro* (DBP), which was expressly designed to make prototyping 3D apps quick and easy. Of course being the author of said programming language, I was able to add a few more commands to make sure the conversion of depth data to an actual 3D object rendered to the screen was not a performance hog.

The primary functionality of the app was to represent the user as a rendered 3D object.

At this point I was intimately familiar with the data required to reproduce the 3D geometry and texture image. To keep the file size of the video small, I used the original (though compressed) depth and color information data. For example, to represent a single vertex in 3D space takes 12 bytes (three float values for X, Y, and Z, with each float taking 4 bytes). The actual depth data coming from the camera was a mere 2 bytes (one 16-bit value), making my potential final file six times smaller in this case. Other data was less optimal, but after choosing the right data to export during the real-time recording I could get about 30 seconds of footage recorded before I exceeded the 32-bit address space I allowed.

With DBP, users can code easily and quickly in BASIC* and then supplement the command set with their own commands written in C++. To work closely with the Intel RealSense SDK I created an internal module to add specific Intel RealSense app commands to the language, and most of the approach described above was coded in C++. For more information on DBP, visit the official website at: http://www.thegamecreators.com/?m=view_product&id=2000.

When I wrote the DBP side of the code, I created commands that triggered large chunks of C++ functionality, allowing me to code mainly in C++. A typical program might be:

MAKE OBJECT CUBE 1,100

RS INIT 1

DO

         RS UPDATE

         SYNC

LOOP

RS END

The DBP side is essentially reduced to an initialization call, an update function during the main loop, and a final clean-up call when you leave the app. The commands MAKE OBJECT CUBE and SYNC create a dummy 3D object and render it to the screen, but by passing the object number into the RS INIT command I can delete the contents that represent the cube and replace them with a larger mesh and texture that represents what the camera is viewing.

Although the complexities of storing and rendering geometry are beyond the scope of this case study, when storing 3D geometry, you would typically store your vertices like this:

struct
{
	float fX;
	float fY;
	float fZ;
};

However, the actual depth camera data that eventually produces this three float structure actually looks something like this:

unsigned short uPixelDepth;

The latter datatype is preferable when you want to save large amounts of real-time data coming from the camera. When playing back the recording, it’s a simple matter of feeding the depth data into the same 3D avatar generator that was used to represent the user when they made their recording. For more information on generating 3D avatars, read my paper: https://software.intel.com/en-us/articles/getting-realsense-3d-characters-into-your-game-engine
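To make that depth-to-geometry step more concrete, here is a minimal C-style sketch of the idea, assuming a 320×240 grid of 16-bit depth samples; the names, scale factor, and data layout are illustrative assumptions and are not taken from the actual app.

#define DEPTH_W 320
#define DEPTH_H 240

typedef struct { float fX; float fY; float fZ; } VertexXYZ;   /* hypothetical vertex layout */

/* Convert one frame of raw depth samples into a flat grid of vertices.
   fDepthScale maps the camera's 16-bit depth units into scene units. */
void DepthToVertices( const unsigned short* pDepth, VertexXYZ* pOut, float fDepthScale )
{
	for ( int y = 0; y < DEPTH_H; y++ )
	{
		for ( int x = 0; x < DEPTH_W; x++ )
		{
			int i = y * DEPTH_W + x;
			pOut[i].fX = (float)x;                  /* grid column */
			pOut[i].fY = (float)y;                  /* grid row */
			pOut[i].fZ = pDepth[i] * fDepthScale;   /* depth sample scaled into the scene */
		}
	}
}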

Once I had my real-time 3D avatar rendered and recorded to a file, I had to create an interface that would offer the end user a few buttons to control the experience. This was when I realized that I could write an app that had no buttons, instead opting for voice control in the truest sense of an Intel RealSense application. Having written many hundreds of apps that required a keyboard, mouse, and buttons, I was fascinated about what an app might look like if it had none of those things.

The act of looking down causes the app to slide up a selection of voice activated buttons.

Adding voice control was straightforward because I planned to use only five functions: record, stop, playback, export, and exit. I then discovered that I needed to create an interface because the end user would need to know what the keywords are in order to parrot them back to the app. By adding a system that could detect the direction of the end user’s head, I could trigger a panel with the word prompts that slid into view from the bottom of the screen when the user looks down—what I refer to as context control. This idea worked like a charm. After a few runs with the software I knew the words by heart and could bark orders quickly, which proved to be more intuitive and much faster than moving a mouse and clicking a button.

In college, my lecturer often accused me of creating “flash trash,” in reference to my predilection for adding color and character art into the simplest of database menus. An understandable remark perhaps, since we were using original Pascal* and I was supposed to be writing a serious banking tool. But even then I understood the need for a good visual experience. To that end I added buttons that would move smoothly into view when the user looks at the base of the screen, and would grow in size when certain words are spoken. I also added extra functionality just for fun. For example, shouting the words “Change backdrop” rotates through a choice of background images, or shouting “Light” adjusts the lighting of the scene around the virtual 3D representation of the user.

It did not all come at once, of course. I started with the most basic screen and the ability to record, and after many recordings and play testing it occurred to me that a real film studio would have lights, and it just so happened DBP had a whole suite of lighting commands. It took only about 10 minutes to tie the voice recognition system into the activation of a few lights, and the experience of barking commands into thin air and seeing the lights change instantly was so much fun, it just had to go into the final app.

Taken all together, with the functionality of recording yourself in 3D, playing it back through a separate player, using voice control instead of buttons, and using a sense of context to make the experience more intuitive and smarter, the app represented what I thought an Intel RealSense app should look like. It appears the judges agreed with me.

Lessons Learned

In a competition setting, the biggest restriction is time—the competition has a deadline. Your ideas can quickly expand, taking on a life of their own. They rarely fit neatly into a set of conditions and restrictions. So you need to understand the scope of your project and chop, shrink, shift, and refine your ideas, as necessary, into the shape that best fits the purpose. When you are sketching ideas on paper, ask yourself whether you can do the project in half the time you have allotted. If the answer is no, get a fresh piece of paper and refine your vision.

A critical phase in any project, be it for a competition or otherwise, is to flesh out the grey areas as you see them. Get comfortable with all the technical elements that your ideas will require, and create prototypes to ensure you know how those elements are going to work before you start the final project. This process gets much easier when you’ve been coding for a while, largely because you’ve probably coded a derivative of the concept for a past project. Even so, with over three decades of coding behind me, I’m much more comfortable working on a project when I’ve pre-empted all the grey areas and created prototypes to try out each technology.

If you want to develop Intel RealSense apps rapidly, use the built-in examples that come with the SDK. Covering almost every aspect of the functionality, these examples allow you to step through the code line by line, which helps you understand how it all links together. For more ambitious projects, you’ll certainly find holes in the commands available, but thanks to a well-documented set of low-level functions that access the raw data of the camera, you’ll be able to create workarounds.

The Future

My efforts to compress the 3D video data into a small file resulted in about 30 seconds of footage, and yet we now have a wealth of compression techniques that could condense this data to a fraction of the memory I used. For an overview of the subject of data compression, a great place to start is the Wiki page on the subject: https://en.wikipedia.org/wiki/Data_compression. For practical lessons and code you can visit the 3D Compression.com website: http://www.3dcompression.com/, which also has demos you can download and run. Such compression will be vital for the day when we have the ability to transmit not just a 2D camera image and sound across the world, but our entire 3D self and anything we happen to be holding at the time.

We’re now seeing early versions of hardware technology that one day could completely transform how we communicate with each other over the Internet. Imagine wearing a pair of next-generation augmented reality glasses in your office room, which has depth cameras in each corner pointing to the center of the room. Somewhere else in the world is another office room with a similar setup, with smart software connecting these two environments. The software scans you and your office, and also renders the “non-static” elements of the other office through the glasses.

When you enter your office wearing your glasses you can see someone sitting there, casting a realistic shadow, and being lit with the correct amount of ambient light. Only by taking off the glasses do you realize that this person isn’t physically in the same location as you, but in every other respect can talk and listen as if they were. If the software feels it necessary, it can enable the “static” parts to be rendered so you can look around the other person’s office. This scenario might seem like a futuristic gimmick, but at some level, when you can see a person in 3D, hear them talking from a specific direction, and know that they’re aware of your environment, you’ll feel they’re right next to you.

Summary

Ever since I first got my hands on an Intel RealSense camera and started analyzing the raw data produced, I realized the significance of what this technology represents. Up to this point in the evolution of the computer and its user, the interactivity has predominantly been one way. We use keyboard, mouse, buttons, and touch to tell the computer specifically what we want, and the response from the computer has been pretty linear as a result. You press this, the computer does that.

With the emergence of input methods that do not require the human to do anything, the computer suddenly has access to a stream of input data it never had before. Furthermore, this data is not a few isolated stabs of input, but a torrent of data pouring into the computer as soon as the user sits down and logs in. This data can be used in many ways, but perhaps the most significant application is to enable real human-to-human communication, even when that human is not in the same physical location.

As food for thought, I conclude with a question: how many of your daily trips out into the real world require you to be physically there? Allowing for normal changes in attitude to technology as we evolve as a society, how many activities can be substituted with “virtual presence” without undermining or discouraging that activity? I won’t accept “work” being one of those activities; that’s just too easy!

About The Author

When not writing articles, Lee Bamber is the CEO of The Game Creators (http://www.thegamecreators.com), a British company that specializes in the development and distribution of game creation tools. Established in 1999, the company and surrounding community of game makers are responsible for many popular brands including Dark Basic*, The 3D Game Maker*, FPS Creator*, App Game Kit* (AGK) and most recently, Game Guru*.

Freejam Builds Loyalty for Robocraft* Through a User-Friendly Development Strategy


By Karen Marcus

Download PDF

Robocraft* is an online Games as a Service title that requires player involvement to build aspects of the game content. Specifically, players build robots that act as the primary agents in the game. Because players are called upon to co-create the game—and because many of them are as young as 10 years old and play Robocraft on older technology—user focus has been at the heart of the creative process for developer Freejam. Key programming and design decisions made early in the development process were based on the desire to build loyalty with users and encourage access for as many of them as possible while at the same time giving them a game that looks good and is fun to play. These decisions raised challenges for the developer, but, two years into the process, Freejam has reached its goals of a loyal user following and a great-looking, fun game.

Initial Stages

Robocraft is an online third-person shooter (TPS) game first introduced by Freejam in 2013. Gameplay involves players using individual blocks to build robots that can then fight each other. Players can form teams, and the object of the game is to destroy as many opponent robots as possible and/or capture the enemy base.


Figure 1: Example of unique robot - plane

As with similar building games, user content generation is extremely important. Sebastiano Mandalà, cofounder and chief technology officer at Freejam, explains: “When we started, we knew the game was going to be creative, and we ended up having 300 different cubes players can use to create a unique robot that can both look beautiful and be deadly in battle.” Figures 1 and 2 show examples of unique robots.


Figure 2: Example of unique robot - ship

To ensure this level of user content generation, the original Robocraft team of five had to decide which technology would affordably enable them to do the programming. The team selected Unity*. Mandalà notes, “We don’t have an in-house engine, which means we haven’t developed platform-specific code. We just used what comes out of the box from Unity [Technologies], which is compatible with many platforms.”

Unity fit the criteria; however, the team faced challenges in implementing programming specifics. “For example,” says Mandalà, “all those cubes rendered on the screen were very difficult for most graphics cards, because thousands of cubes were specifically enabled. Eventually, we had to work out some tricks to batch the graphics and collider for the physics to make it run smoothly, especially on low-end machines.”

In addition to user generation, this consideration for low-end player technology has been important. Mandalà comments, “With Robocraft, we were targeting kids between 10 and 14 years of age as well as older teens and adults. But kids don’t usually have very powerful computers, so we were forced to design the game to work with older technology.”

Inclusive Programming

To develop a loyal base, Freejam wanted to make Robocraft accessible to a wide range of players on a variety of computing systems. Initial programming involved targeting older graphics cards as well as newer ones, including both integrated and nonintegrated cards. The team currently uses Unity 4 and Microsoft DirectX* 9 to program the game.

Mandalà notes, “We know that programming for more recent machines is ‘cooler,’ and we might target them later, once we have the lower-end aspects fully under control. For example, we’ll be able to define in more detail things like shading and soft shadows, so the game looks better. That would involve switching to Unity 5 and creating a special version for DirectX 11 or 12. We would be improving an already good game for faster machines. It’s an ongoing process, and that’s part of our plan for the future.” Currently, Robocraft adapts to each player’s computing resources, detecting system specifications such as CPU and GPU fill rate and running accordingly.

Graphic Design

Additional challenges came with the graphic design. Mandalà explains: “Because players build the robots, each robot has a unique shape and form. Initially, with Unity 4, we rendered each cube separately and assigned it its own collider, but both the rendering and the physics engines didn’t perform well on many machines. With thousands of  primitives, Unity 4 performance was just not enough, so we had to re-design the code in order to be able to handle our scenario. Being able to modify the shape of the robots during combat was another challenge. With our new algorithms, the draw call count was drastically reduced, while the colliders were clustered and simplified.  Finally, we incorporated those optimizations in the design process before the game went live.” Figure 3 shows the changing shape of a robot being destroyed during combat.


Figure 3: Robot being destroyed

To optimize the game for PCs that have embedded GPUs using DirectX, the team minimized the number of draw calls, batching all the cubes that form one robot into one draw call where possible. Fill-rate optimization is crucial for integrated cards, and, to optimize it, the team implemented an algorithm to dynamically skip the rendering of the faces of the cubes occluded by other cubes. Finally, the game takes advantage of the Unity LoD system to reduce the number of draw calls and polygons of the dynamic parts of a robot that are far from the camera.

Screen resolution posed a few problems. Mandalà says, “Screen resolution is linked to fill rate. Recent powerful graphics cards can handle high resolutions without problems, while legacy cards are heavily affected. To solve this problem, we implemented multiple graphic resolution settings, from ‘fastest’ to ‘fantastic’ and a few in between. Special optimizations are in place in the lightest modes, while more graphic details are enabled in the heaviest modes for more recent graphics cards.” Figure 4 shows the game in fast mode. Forward rendering and shadows are enabled. Deferred shading and relative post process effects are available in the highest settings.


Figure 4: Optimized screen resolution

Mandalà notes: “Thanks to Unity 5 and the awesome optimizations implemented in the new engine, we are already planning to add more features that could make the game more realistic, with respect to both physics and graphics.”

Intel’s Involvement

To ensure that the game runs smoothly on Intel® architecture, the Freejam team runs Robocraft on Unity middleware, which, observes Mandalà, is a great engine for Intel® platforms. “However,” he adds, “we introduced a number of custom optimizations in the game to improve performance on Intel platforms and we’ve been optimizing for 32-bit and 64-bit systems as well. In the near future, we’ll be focusing more on exploiting Intel® multicore technology for our players who have higher-end PCs.”

The team also uses many Intel tools, especially for analysis and quality assurance (QA). “Intel® Graphics Performance Analyzers [Intel® GPA] in particular,” says Mandalà, “have been essential in helping us optimize performance. Performance is crucial for us, as we have discovered that there is a strong correlation between frame rate and player retention. We have used Intel GPA to diagnose bottlenecks and optimize the game to increase the frame rate and, therefore, retention.”

Tools like Intel GPA inform the team how best to optimize the rendering code, but Robocraft is  CPU bound instead of GPU bound. Mandalà notes, “We optimize on the CPU while respecting what’s going on with the GPU.” He adds, “Intel support has always been first class! The tools are great, especially Intel GPA, although my ultimate wish would be to use VTune with Unity.”

Testing and Post-Development

The team tests the game through an internal QA process that consists of many iterations before an expansion goes live. The process includes an extensive test bed of Intel PCs, from those with Intel® Core™ Duo processors to those with Intel® Core™ i7 processors.

The Games as a Service nature of Robocraft allows the team to test some features in a closed user environment, gathering feedback from the community. Community feedback is monitored at all times, as it is an essential part of the development processes.

Freejam is an indie developer that self-publishes Robocraft. Mandalà says, “We’re making good progress growing the community as we evolve the game. Our market success is surely also due to our partnerships with Intel, the Steam* entertainment platform, and others we’ve worked with to reach millions of players worldwide.”

The team has plans to improve the game, including adding new game modes, creativity options, community and social facilities, and other elements. Virtual reality is also on the radar for exploration. Mandalà notes, “We’ll continue to rely on Intel’s platforms, support, and tools for the ongoing development, testing, and optimization and to keep the game in line with advances in CPU and GPU technology. The more processing power we have, the better the game performs and the better our ‘fantastic’ setting is. In the future,  for example, we will make the game less CPU bound to improve the speed and graphics even further by using the Intel multicore architecture with multithreaded code.”

Freejam has also been discussing with Intel how to work together to raise awareness for both brands and products within their respective communities.

Future Plans

The Robocraft user community has always been a key driver for Freejam. Mandalà remarks, “We always believed in the game, and our gut feeling was that it was going to be good. It’s been rewarding to get confirmation from our passionate community that they love the game. We communicate directly with them a lot—daily, in fact. Sometimes they hate us, and sometimes they love us. It can be exhausting to try and keep up with them, but also euphoric.”

Mandalà says the user community interaction is what drives the pace of game updates. “We have all these great ideas, and we share and discuss our theories with the community. That’s why we’re growing so fast.” The pace of technology plays a part in update speed, as well. Mandalà states, “The quicker higher-performance hardware goes down in price, the faster we can enable our community to play Robocraft with better graphics, because the new architecture supports higher quality settings for the game.”

The growth of the game has meant growth for Freejam. Starting with a team of five, the company has expanded to about 50 staff members in just two years. Mandalà says the scale-up has been necessary because of the number of features the team wants to implement, but the transition has been challenging. “It’s a good challenge. We’ve been forced to impose more structure on ourselves, which has been positive. We’ve also needed to ramp up our marketing efforts and invest more resources in that.”

All that growth has resulted in a solid game with a loyal following. The Freejam team has noticed that since Freejam released Robocraft, several games have come out with a similar concept. Mandalà notes that the game will continue to improve and hopefully add to its following: “As Robocraft is an online service, development is ongoing. We’ve been developing for more than two years now, but we’ve still got plenty to do. We’re not done yet!”

Mandalà advises developers looking to create similar online Games as a Service projects to expect a long, challenging journey. He says, “Expect to continue developing new content on an ongoing basis, improving features, encouraging a healthy community feedback loop, and engaging in partnerships with key platforms and technology. Keep analyzing the game and player behaviors, test ideas, and enhance or improve the service with minimal viable product iterations. Boost winning ideas, and rethink or remove the ones that don’t work as well. Finally, expect the unexpected, get as much data as you can, and fuse those data with creativity and intuition to drive the vision of the game you and your community are building.”

The Freejam team welcomes contact from other developers and is happy to discuss projects over coffee or Skype*. Mandalà can be reached via his Twitter account @sebify.

Summary

Using Unity and DirectX, and working in conjunction with Intel, Freejam created an online game—Robocraft—that appeals to players of all ages and works well across a wide range of computing technology. A focus on usability prompted the development team to make different decisions than they might have made if the focus were on appearance. For example, the game was designed to work with older technology and adapt to each player’s computing resources. Optimizations were performed to ensure that these older systems could handle the graphics. Robocraft is a work in progress, and Freejam anticipates many exciting future updates.

About Freejam

Freejam was founded in Portsmouth, United Kingdom, by five guys who had a solid belief in one simple game idea, Robocraft. They pooled their combined 52 years of traditional development experience and threw it out the window. With a great idea and new lean start-up philosophy, they’re free to innovate.

The development team was assembled to prove that truly amazing games can be created through experimentation and evolution. The five team members are skilled and organized professional game developers who have embraced the indie way. The company is small, so it can react fast, listen to its players, make quick decisions, try new stuff, and sculpt the game efficiently. In the same unrestrained way that musicians write songs, from a simple jam session with a few chords, the Freejam team has created a game that has grown into a masterpiece.

The developers’ approach relies on open and transparent interaction with their gamers, with the understanding that they can’t improve their game without input from everyone. They don’t know where Robocraft will end up, but that’s all part of the fun.

Porting Adobe AIR* Applications To The Intel® X86 Platform


Download PDF [PDF 436 KB]

If you have an Adobe AIR*-based application you can easily port it to the Intel® x86 platform. Support for porting Android AIR applications to the x86 platform started with the Adobe AIR SDK version 14.

Here are the steps:

  1. Download the latest Adobe AIR SDK.
  2. Extract the SDK and navigate to the bin folder.
  3. Set the system path for the bin folder. We are going to run the ADT command from the command prompt.
     

    Starting with Adobe AIR SDK version 14, an ADT command-line option (-arch) has been added to create a package for the x86 platform.

    The -arch option is optional; by default, an armv7 package is created.

  4. Once you have all the required files to build the AIR app, such as the HTML and SWF files, icons, any SWC libraries or action script files, the application descriptor file, and the certificate file to sign your Adobe AIR app, arrange everything in a folder.
  5. If your application uses any ANE files, follow the process in the given link to also package the ANEs specific to x86.
  6. Open the command prompt and navigate to the folder where the AIR application-specific files exist.
  7. At the command prompt, type the following command:

    adt -package -target apk-captive-runtime -arch x86 -storetype pkcs12 -keystore ../mycert.pfx sample.apk sample-app.xml sample.swf icons
     
    • adt – the AIR SDK packaging command.
    • -arch – to target x86, set this option to x86. If the option is omitted, an armv7a package is created by default.
    • -keystore – the path to your certificate file used to sign the AIR application.
    • sample-app.xml – your AIR application descriptor file.
    • sample.swf – your application SWF file(s), if any.

Porting an Adobe Flash* Professional CS6 Project (.FLA ) to Android on x86

If you are using Adobe Flash Professional CS6 and you want to port to x86 please follow these steps:

  1. Support for x86 processors has been available since the Flash Professional CC 2014.1 release. Please go through this article on how to publish an AIR application for Android on x86 if you are using Flash* CC 2014.
  2. Open the .FLA file using the Adobe Flash Professional CS6.
  3. Using the Publish feature, publish the application with the final output as a .SWF file. (Target – AIR 3.2 or greater for Android, Script – ActionScript 3.0, Output File - YourSWFname.swf).
  4. Copy the application descriptor file and the .SWF file into a folder.
  5. Download the latest Adobe AIR SDK.
  6. Navigate to the bin folder of the SDK.
  7. To create the final APK, run the command below with the application descriptor file, the .SWF file, icons, and any other necessary resources.
  8. Make sure you add the ICONS and other resources in the command line.

    adt -package -target apk-captive-runtime -arch x86 -storetype pkcs12 -keystore ../mycert.pfx sample.apk sample-app.xml sample.swf icons
     
    • adt – the AIR SDK packaging command.
    • -arch – to target x86, set this option to x86. If the option is omitted, an armv7a package is created by default.
    • -keystore – the path to your certificate file used to sign the AIR application.
    • sample-app.xml – your AIR application descriptor file.
    • sample.swf – your application SWF file(s), if any.

You cannot create a fat binary; you can only generate multiple APKs through this process. Once your APKs are ready, please go to this link to learn how to submit multiple APKs to the Google Play* store.

About the Author

Praveen Kundurthy works in the Intel® Software and Services Group. He has a master’s degree in Computer Engineering. His main interests are mobile technologies, Microsoft Windows*, and game development.


Parallel Noise and Random Functions for OpenCL™ Kernels


Download Now  Noise.zip

About the Sample

The Noise sample code associated with this paper includes an implementation of Perlin noise, which is useful for generating natural-looking textures, such as marble and clouds, for 3D graphics. A test that uses Perlin noise to generate a “cloud” image is included. (See the References section for more information on Perlin noise.) 2D and 3D versions are included, meaning that the functions take two or three inputs to generate a single Perlin noise output value.

The Noise sample also includes pseudo-random number generator (RNG) functions that yield fairly good results—sufficient that a generated image visually appears random. 1D, 2D, and 3D versions are included, again referring to the number of inputs to generate a single pseudo-random value.

Introduction and Motivation

Many applications require a degree of “randomness” — or actually, “pseudo-randomness.” That is, a series of values that would appear random or “noisy” to a human. However, for repeatability, applications also commonly require that the RNG be able to reliably generate exactly the same sequence of values, given the same input “seed” value or values.

Most RNG algorithms meet these requirements by making each generated value depend on the previous generated value, with the first value in the sequence generated directly from the seed value. That approach to RNG is problematic for highly parallel processing languages such as OpenCL. Forcing each of the many processing threads to wait on a single sequential RNG source would reduce or eliminate the parallelism of algorithms using it.

One approach to dealing with this issue is to pre-generate a large table of random values, with each of the parallel threads generating unique but deterministic indices into that table. For example, an OpenCL kernel processing an image might select an entry from the pre-generated table by calculating an index based upon the pixel coordinates that kernel is processing or generating.
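As a hedged illustration of this table-based approach (this kernel is not part of the Noise sample), each work item can derive a deterministic index into a pre-generated table from its pixel coordinates:

kernel void useRandTable( global const uint* randTable, uint tableSize, uint width, global uint* output )
{
	uint x = get_global_id(0);
	uint y = get_global_id(1);

	/* Deterministic index into the host-generated table of random values. */
	uint index = ( y * width + x ) % tableSize;

	output[ y * width + x ] = randTable[ index ];
}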

However, that approach requires a potentially time-consuming serial RNG process before the parallel algorithm can begin—limiting performance improvements due to parallelism. It also requires that the number of random numbers to be used be known at least approximately, in advance of running the parallel algorithm. That could be problematic for parallel algorithms that need to dynamically determine how many random values will be used by each thread.

The OpenCL kernel-level functions in the Noise sample code associated with this paper take an approach better suited to the way OpenCL divides work into parallel operations.

Noise and Random Number Generation for OpenCL

OpenCL defines a global workspace (array of work items) with one, two, or three dimensions. Each work item in that global space has a unique set of identifying integer values corresponding to the x, y, and z coordinates in the global space.

The Perlin noise and RNG functions in the Noise sample generate a random number or noise sequence based on up to three input values, which can be the global IDs for each work item. Alternatively, one or more of the values might be generated by a combination of the global IDs and some data value obtained or generated by the kernel.

For example, the following OpenCL kernel code fragment shows generation of random numbers based on the 2D global ID of the work item.

kernel void	genRand()
{
	uint	x = get_global_id(0);
	uint	y = get_global_id(1);

	uint	rand_num = ParallelRNG2( x, y );

	...

Figure 1. Example of random number use - two dimensions.

 

This approach allows for random or noise functions to run in parallel between work items, yet generate results that have a repeatable sequence of values that are “noisy” both between work items and sequentially within a work item. If multiple 2D sets of values need to be generated, the 3D generation functions can be used, with the first two inputs generated based upon the work item’s global ID, and the 3rd dimension generated by sequentially increasing some starting value for each additional value required. This could be extended to provide multiple sets of 3D random or noise values, as in the following example for Perlin noise:

kernel void multi2dNoise( float fScale, float offset )
{
	float	fX = fScale * get_global_id(0);
	float	fY = fScale * get_global_id(1);
	float	fZ = offset;

	float	randResult = Noise_3d( fX, fY, fZ );
	...

Figure 2. Example of Perlin noise use - three dimensions.

 

Limitations

The Noise_2d and Noise_3d functions follow the same basic Perlin noise algorithm but differ in implementation based on Perlin’s recommendations. (See reference 1.) In the Noise sample, only Noise_3d is exercised to implement the noise example, but a test kernel for Noise_2d is included in Noise.cl for the reader who wants to modify the sample to test that variation.

The Noise_2d and Noise_3d functions should be called with floating point input values. Values should span a range, such as (0.0, 128.0), to set the size of the “grid” (see Figure 3) of randomized values. Readers should look at the clouds example to understand how Perlin noise can be transformed into various “natural looking” images.

The default ParallelRNG function used in the random test provides visually random results but is not the fastest RNG algorithm. This function is based on the “Wang hash,” which was not designed for use as an RNG. However, some commonly used RNG functions (a commented out example is included in the Noise.cl file) showed visible regularities when filling a 2D image, particularly in the lower order bits of results. The reader may want to experiment with other, faster RNG functions.
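For reference, a widely used formulation of the Wang hash looks roughly like the following. This is a sketch of the general technique; the constants and shifts in the sample's actual ParallelRNG function may differ.

uint wang_hash( uint seed )
{
	/* Integer mixing steps typical of the Wang hash. */
	seed = ( seed ^ 61u ) ^ ( seed >> 16 );
	seed *= 9u;
	seed = seed ^ ( seed >> 4 );
	seed *= 0x27d4eb2du;
	seed = seed ^ ( seed >> 15 );
	return seed;
}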

The default ParallelRNG function generates only unsigned 32-bit integer results; if floating point values in a range such as (0.0, 1.0) are needed, the application must map the integers to that range. The random example maps the random unsigned integer result to the range (0, 255) to generate gray scale pixel values, simply using an AND binary operation to select 8 bits.
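As a minimal sketch of both mappings (the variable names here are ours, not identifiers from the sample):

uint  r       = ParallelRNG2( x, y );
uchar gray    = (uchar)( r & 0xFF );                    // low 8 bits -> gray scale value in (0, 255)
float uniform = (float)r * ( 1.0f / 4294967296.0f );    // scale to the range [0.0, 1.0)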

The default ParallelRNG function will not generate all 4,294,967,296 (2^32) unsigned integer values when called sequentially on its own previous output. Depending on the starting seed value, the pseudo-random cycles range from as few as roughly 7,000 unique values to about 2 billion values, and the function produces around 20 distinct cycles. The author believes it will be uncommon for any work item of an OpenCL kernel to require more sequentially generated random numbers than the smallest cycle can provide.

The 2D and 3D versions of the function—ParallelRNG2 and ParallelRNG3—use a “mixing” of cycles by applying an XOR binary operation between the result of a previous call to ParallelRNG and the next input value, which will change the cycle lengths. However, that altered behavior has not been characterized in detail, so it is recommended that the reader carefully validate that the ParallelRNG functions meet the needs of their application.
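For reference, a generator based on the published Wang hash, with the chained two- and three-input variants described above, might look like the following sketch; the exact constants and structure used in Noise.cl may differ (see reference 2):

uint ParallelRNG( uint x )
{
	// Wang hash: a few shifts, XORs, and multiplies mix the input bits
	uint value = x;
	value = ( value ^ 61u ) ^ ( value >> 16 );
	value *= 9u;
	value ^= value >> 4;
	value *= 0x27d4eb2du;
	value ^= value >> 15;
	return value;
}

uint ParallelRNG2( uint x, uint y )
{
	// second pass: XOR the previous result with the next input, then hash again
	return ParallelRNG( ParallelRNG( x ) ^ y );
}

uint ParallelRNG3( uint x, uint y, uint z )
{
	// third pass for three inputs
	return ParallelRNG( ParallelRNG2( x, y ) ^ z );
}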

Project Structure

This section lists only the key elements of the sample application source code.

NoiseMain.cpp:

main()
Main entry point function. After parsing command-line options, it initializes OpenCL, builds the OpenCL kernel program from the Noise.cl file, prepares one of the kernels to be run, and calls ExecuteNoiseKernel(), then ExecuteNoiseReference(). After validating that the two implementations produce the same results, main() prints out the timing information each returned and stores the resulting images from each.

ExecuteNoiseKernel()
Set up and run the selected Noise kernel with OpenCL.

ExecuteNoiseReference()
Set up and run the selected Noise reference C code.

Noise.cl:

defaut_perm[256]
Table of random values 0—255 for 3D Perlin noise kernel. Note that this could be generated and passed to the Perlin noise kernel, for an added degree of randomness.

grads2d[16]
16 uniformly spaced unit vectors, gradients for 2D Perlin noise kernel.

grads3d[16]
16 vector gradients for 3D Perlin noise kernel.

ParallelRNG()
Pseudo-Random Number Generator, one pass over 1 input. An alternative RNG function is commented out, in case the reader wants to test a faster function that yields poorer results.

ParallelRNG2()
RNG doing 2 passes for 2 inputs

ParallelRNG3()
RNG doing 3 passes for 3 inputs

weight_poly3() and weight_poly5() and WEIGHT()
These are alternative weight functions used by Perlin noise to ensure continuous gradients everywhere. The second (preferred) function provides a continuous second derivative everywhere as well. The WEIGHT macro selects which is used.

NORM256()
Macro converting range (0, 255) to (-1.0, 1.0)

interp()
Bilinear interpolation using an OpenCL built-in function.

hash_grad_dot2()
Selects a gradient and does dot product with input xy, part of Perlin Noise_2d function.

Noise_2d()
Perlin noise generator with 2 inputs.

hash_grad_dot3()
Selects a gradient and does dot product with input xyz, part of Perlin Noise_3d function.

Noise_3d()
Perlin noise generator with 3 inputs.

cloud()
Generates one pixel of a “cloud” output image for CloudTest using Noise_3d.

map256()
Converts from the Perlin noise output range (-1.0, 1.0) to the range (0, 255) needed for gray scale pixels.

CloudTest()
The cloud image generation test. The slice parameter is passed to cloud, to allow the host code to generate alternative cloud images.

Noise2dTest()
Test of Noise_2d; not used by default.

Noise3dTest()
Test of Noise_3d, the default Perlin noise function. Uses map256 to generate pixel values for a grayscale image.

RandomTest()
Test of ParallelRNG3, currently uses the low order byte of unsigned integer result to output a grayscale image.

Two Microsoft Visual Studio solution files are provided, for Visual Studio versions 2012 and 2013. These are “Noise_2012.sln” and “Noise_2013.sln”. If the reader has a newer version of Visual Studio, it should be possible to use the Visual Studio solution/project update to create a new solution derived from these.

Note that the solutions both assume that the Intel® OpenCL™ Code Builder has been installed.

Controlling the Sample

This sample can be run from a Microsoft Windows* command-line console, from a folder that contains the EXE file:

Noise.exe < Options >

Options:

-h or --help
Show command-line help. Does not run any of the demos.

-t or --type [ all | cpu | gpu | acc | default | <OpenCL constant for device type> ]
Select the device to run the OpenCL kernel upon by type of device. Default value: all

<OpenCL constant for device type>

CL_DEVICE_TYPE_ALL | CL_DEVICE_TYPE_CPU | CL_DEVICE_TYPE_GPU |
CL_DEVICE_TYPE_ACCELERATOR | CL_DEVICE_TYPE_DEFAULT

-p or --platform < number-or-string >
Selects platform to use. A list of all platform numbers and names is printed when a demo is run. The platform being used will have “[Selected]” printed to the right of it. If using string, provide enough letters of the platform name to uniquely identify it. Default value: Intel

-d or --device < number-or-string >
Select the device to run the OpenCL kernels upon by device number or name. Device numbers and names on the platform being used are printed when a demo is run. The current device will have “[Selected]” printed to the right of it. Default value: 0

-r or --run [ random | perlin | clouds ]
Select the function demonstration to run. Random number, Perlin noise, or cloud image generators each have demo kernels. Default value: random

-s or --seed < integer >
Provide an integer value to vary the algorithm output. Default value: 1
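For example, to run the Perlin noise demo on a GPU device with a non-default seed, using only the options documented above, the command line would look like this:

Noise.exe --type gpu --run perlin --seed 7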

Noise.exe prints the time the OpenCL kernel and reference C-coded equivalent each take to run, as well as the names of the respective output files for each. When the program has finished printing information, it waits for the user to press ENTER before exiting. Please note that no attempt was made to optimize performance of the C-coded reference code functions; they are intended only to validate correctness of the OpenCL kernel code.

Examining Results

After a Noise.exe run is complete, examine the generated BMP format image files OutputOpenCL.bmp and OutputReference.bmp in the working folder, to compare the OpenCL and reference C code results, respectively. The two images should be identical, though it is possible that there might be very small differences between the two Perlin noise or cloud images.

The (Perlin) noise output should appear similar to Figure 3:


Figure 3. Perlin noise output.

The random output should look similar to Figure 4:


Figure 4. Random noise output.

The clouds function output should look similar to Figure 5:


Figure 5. Generated cloud output.

References

  1. Perlin, K., “Improving Noise,” http://mrl.nyu.edu/~perlin/paper445.pdf
  2. “4-byte Integer Hashing,” http://burtleburtle.net/bob/hash/integer.html
  3. Overton, M. A., “Fast, High-Quality, Parallel Random Number Generators,” Dr. Dobb’s website (2011). http://www.drdobbs.com/tools/fast-high-quality-parallel-random-number/229625477
  4. Intel® Digital Random Number Generator (DRNG) Library Implementation and Uses, https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-library-implementation-and-uses?utm_source=Email&utm_medium=IDZ
  5. Intel Sample Source Code License Agreement, https://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/
  6. Intel® OpenCL™ Code Builder, https://software.intel.com/en-us/opencl-code-builder

 

Intel® RealSense™ Depth Enhanced Photography


By Sean Golden

Photography is in the midst of an exciting revolution. Intel, the corporation that has enabled and powered the digital age, has developed Intel® RealSense™ Depth Enhanced Photography (DEP), a transformative approach to enabling artistic photography through new optical and processing technology. The result is a user experience that merges a virtual world with the real world in powerful ways. In short, it’s a new and compelling way to create, present, and experience visual art.

What Is Depth Enhanced Photography (DEP)?

DEP expands existing digital camera technology into three dimensions by capturing depth information as part of the image. Adding a depth value to each pixel during image capture allows photographers to utilize exciting new use cases that support new editing and presentation opportunities like being able to refocus on a portion of the image after capture or being able to apply a filter to an object in the foreground while retaining the background. The initial capture of these images can be accomplished with cameras such as those using Intel’s RealSense technology, which are becoming available even on mobile devices. The capture of the images and additional metadata is enabled by a new file format – eXtended Device Metadata (XDM). This new file format allows for presentation use cases to be performed on a wide range of display options at any time after the photographs are captured and stored.

Reference Image

Depth Map

Extended Device Metadata (XDM)

The core of DEP is the XDM enhanced file format, which Intel has contributed to along with other leading technology companies with the goal of creating an open standard for the DEP ecosystem. The XDM specification, version 1.0, is a standard for storing device-related metadata in common image containers such as JPEG and PNG while maintaining compatibility with existing image viewers. XDM-enabled files expand on the image information captured and stored, adding a new three-dimensional (3D) depth value. One way to understand the new format is to consider a typical RGB image, which stores three color values for every pixel; XDM then adds an attribute for each pixel’s distance from the camera. This depth information is added as a separate “depth map” or “point cloud” within the file container. The addition of depth information enables new experiences and interactivity with captured photographs. XDM-supporting files can also contain data from multiple cameras. At a minimum, a single camera with depth information is required. The first camera is associated with the container image (usually in JPEG format), while other cameras can optionally provide additional images, point clouds, or depth maps that are processed in relation to the first image. For example, additional full-resolution images from slightly different perspectives are also supported.

In addition to the new depth attribute for each pixel, XDM-supporting files allow users to include additional metadata as desired, such as the camera’s orientation, location, or even manufacturer-specific sensor specifications. Intel is working with other technology companies to support and standardize this new file format, with the goal of creating a worldwide standard so that files are compatible across a wide range of platforms, from smartphones to desktop computers. A standardized image format also allows software developers to create applications that can recognize and support files created on different hardware platforms.

XDM Metadata in a JPG container

Depth Enhanced Photography in Action

Depth information can be used for a wide variety of functions that extend our idea of photography. Intel provides the Intel® RealSense™ SDK, which includes powerful use cases for photographers’ image-manipulation applications. Intel also encourages developers to create entirely new use cases that exploit or extend the enhanced data format. Examples of core use cases that are expected to become commonplace are:

  • XDM file reading and writing.
  • Artistic Filters/Background manipulation. By separating the image data into layers, image editing software can apply a wide variety of filters (masks) in real time. Users can apply filters to individual layers or can apply them while excluding specific layers. A simple example is maintaining color information for the foreground layer while converting the background layer to grayscale. Because the depth information allows for real-time separation of elements, it is even possible to substitute backgrounds, as shown in this Jim Parsons commercial demonstrating a virtual green screen in real time.
  • Depth-of-Field Change Effect. Depth of field (DoF) is a photographic concept originally controlled by the aperture and focal length of a camera lens. In simple terms, DoF is the distance between the nearest and farthest objects in the scene that appear sharp in the final photograph. Photographers use DoF to accentuate the elements in the photograph they want the viewer to notice. With traditional cameras, even modern digital SLRs, users must take multiple photographs to select different DoFs of the same image. Enhanced photography images allow the photographer to change the DoF after the photograph has been taken. For example, a single photograph of a large gathering at a family reunion can be taken one time and shared with as many people as desired, with the resulting image making each person in the image the focal point simply by touching his or her face in the photo, as long as everyone in the photo is in focus. This video featuring Jim Parsons from “The Big Bang Theory” demonstrates dynamic depth of field changing.
  • Motion Effects. Creative uses of depth data allow us to be able to create a two-dimensional (2D) image with the illusion of motion. XDM containing image files can be used to create motion effects like parallax and dolly zoom. Parallax is the slight difference in position between foreground and background objects in a scene based on the viewer’s line of sight. Simulating that difference in position by manipulating the foreground and background depth information in an XDM file allows a powerful illusion of depth from a single image. Dolly zoom is a more cinematic effect created by enlarging or shrinking either the foreground or background image in relation to the other. For example, enlarging the background while keeping the subject in the foreground the same size creates an illusion that the subject is either moving backwards or shrinking. In contrast, enlarging the subject in the foreground while keeping the background the same size creates an illusion that the subject is moving forward or growing rapidly.
  • Editing. Identifying layers by depth allows users to make a wide range of static or dynamic edits to files. Users can easily insert objects into an image between the foreground and background based on the depth information captured in the XDM file. Objects can be removed and replaced with new objects without complex and time-consuming pixel-by-pixel edits. Once placed, the new elements can be moved around or resized without having to redefine boundaries or spend more time on further detailed pixel editing.
  • In-Photo Measurements. Determining the size of objects can be a difficult task. Most of us have had the experience of needing to know the size of a box or a piece of furniture. Having the 3D data in an image allows users to get quick estimates of sizes of objects without needing a measuring tape. A photo of a room can allow the user to access measurements from their tablet while shopping for curtains or a new sofa.

What Does It All Mean?

DEP is transforming the fundamental concepts of image capture, processing, and viewing. It extends our idea of what an image is to three dimensions and provides dynamic interactivity at all stages of the image’s life cycle. In many respects, the XDM file enhancement is a new graphic medium, as different from traditional digital images as digital images are from chemical film-based images. Just as digital imaging created new opportunities for artists to create, manipulate, and display their art, Intel RealSense provides people with new ways to express themselves in a dynamic, interactive environment.

Other aspects of Intel RealSense technology are more applicable to science and technology. Intel RealSense cameras provide highly accurate, real-time 3D rendering of the environment, allowing drones or robots to maneuver through complex terrain more effectively. This has obvious potential in the ongoing development of self-driving cars and trucks as well as for greatly improving the efficiency of remote drones, such as the Mars rovers.

Where Does Intel® RealSense™ Go from Here?

Intel RealSense technology is opening the door into a new form of graphic expression. As with most transformative technologies, there’s no way to predict every way it will change how we interact with images.

Many manufacturers have already begun to include Intel RealSense cameras in their products. New applications are being built with the Intel RealSense SDK as you read this. Clearly, cameras on mobile devices will not be the same once DEP becomes mainstream.

For More Information

About the Author

Sean Golden is an author and technical writer. His most recent corporate job was as program director at a global financial services company. His background includes managing initiatives in financial product development, online media, Internet and traditional publishing, and data center consolidations for Fortune 500 companies. Prior to that, he managed application and enterprise software development for 15 years and served as publisher for a suite of personal computing monthly publications. He has a B.S. degree in physics.

Perceptual Drone Speech Recognition


Download Code Sample

Controlling Drones with Speech-Recognition Applications Using the Intel® RealSense™ SDK

Every day we hear about drones in the news. With applications ranging from spying and fighting operations, photography and video, and simply for fun, drone technology is on the ground floor and worth looking into.

As developers, we have the ability to create applications that can control them. A drone is ultimately just a programmable device, so connecting to one and sending commands to perform desired actions can be done from a regular PC or smartphone application. For this article, I have chosen to use one of the most “hackable” drones available on the market: Parrot’s AR.Drone* 2.0.

We will see how to interact with and control this drone with a library written in C#. Using this as our basis we will add speech commands to control the drone using the Intel® RealSense™ SDK.

PARROT AR.DRONE 2.0

Among the currently marketed drones for hobbyists, one of the most interesting is the AR.Drone 2.0 model from Parrot. It includes many features and incorporates a built-in help system that provides a stabilization and calibration interface. The drone’s sturdy Styrofoam protection helps to avoid damage to the propellers or moving parts in case of falls or collisions with fixed obstacles.

The AR.Drone* 2.0 from Parrot

The drone exposes its own Wi-Fi* network, over which it connects to an external device (smartphone, tablet, or PC). The communication protocol is based on AT-like messages (similar to those used to program and control telephone modems years ago).

Using this simple protocol, it is possible to send the drone all the commands needed to get it off the ground, raise or lower in altitude, and fly in different directions. It is also possible to read a stream of images taken from cameras (in HD) placed onboard the drone (one front and one facing down) to save pictures during flights or capture video.

The company provides several applications to fly the drone manually; however, it’s much more interesting to study how to autonomously control the flight. For this reason, I decided (with the help of my colleague Marco Minerva) to create an interface that would allow us to control it through different devices.

Controlling the Drone Programmatically

We said that the drone has its own Wi-Fi network, so we’ll connect to it to send control commands. The AR.Drone 2.0 developer guide gave us all the information we needed. For example, the guide says to send commands via UDP to the 192.168.1.1 address, on port 5556. These are simple strings in the AT format:

AT*REF for takeoff and landing control

AT*PCMD to move the drone (direction, speed, altitude)
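For illustration, the resulting wire format looks roughly like the lines below. The control-field constants shown here come from the AR.Drone developer guide rather than from the library code in this article, so verify them against your SDK version:

AT*REF=1,290718208     (takeoff: bit 9 of the control field set)
AT*REF=2,290717696     (land: bit 9 cleared)
AT*PCMD=3,1,0,0,1056964608,0     (climb at half speed: the fifth field is the bit pattern of the float 0.5, produced by the FloatConversion method discussed below)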

Once we connect to the drone, we’ll create a sort of “game” in which we send commands to the drone based on the inputs of our application. Let's see how to create a Class Library.

First, we must connect to the device:

public static async Task ConnectAsync(string hostName = HOST_NAME, string port = REMOTE_PORT)
        {
             // Set up the UDP connection.
             var droneIP = new HostName(hostName);

             udpSocket = new DatagramSocket();
             await udpSocket.BindServiceNameAsync(port);
             await udpSocket.ConnectAsync(droneIP, port);
             udpWriter = new DataWriter(udpSocket.OutputStream);

             udpWriter.WriteByte(1);
             await udpWriter.StoreAsync();

             var loop = Task.Run(() => DroneLoop());
        }

As mentioned, we must use the UDP protocol, so we need a DatagramSocket object. After connecting with the ConnectAsync method, we create a DataWriter on the output stream to send the commands themselves. Finally, we send the first byte via Wi-Fi. It will be discarded by the drone and is only meant to initialize the system.

Let's check the command sent to the drone:

        private static async Task DroneLoop()
        {
            while (true)
            {

                var commandToSend = DroneState.GetNextCommand(sequenceNumber);
                await SendCommandAsync(commandToSend);

                sequenceNumber++;
                await Task.Delay(30);
            }
        }

The DroneState.GetNextCommand method formats the AT command string that must be sent to the device. To do this, we need a sequence number, because the drone expects each command to be accompanied by a progressive number and ignores any command whose number is equal to or less than one already received.

Then we use WriteString to queue the command on the DataWriter and StoreAsync to write the buffer and submit it over the UDP socket. Finally, we increment the sequence number and use Task.Delay to introduce a 30-millisecond delay before the next iteration.
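The article does not list SendCommandAsync; a minimal sketch consistent with the description above might look like this (the method name matches the call in DroneLoop, but the body is an assumption):

private static async Task SendCommandAsync(string command)
{
    // Queue the AT command text on the DataWriter, then flush it to the UDP socket.
    udpWriter.WriteString(command);
    await udpWriter.StoreAsync();
}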

The DroneState class is the one that deals with determining which command to send:

    public static class DroneState
    {
       public static double StrafeX { get; set; }
       public static double StrafeY { get; set; }
       public static double AscendY { get; set; }
       public static double RollX { get; set; }
       public static bool Flying { get; set; }
       public static bool isFlying { get; set; }

        internal static string GetNextCommand(uint sequenceNumber)
        {
            // Determine if the drone needs to take off or land
            if (Flying && !isFlying)
            {
                isFlying = true;
                return DroneMovement.GetDroneTakeoff(sequenceNumber);
            }
            else if (!Flying && isFlying)
            {
                isFlying = false;
                return DroneMovement.GetDroneLand(sequenceNumber);
            }

            // If the drone is flying, sends movement commands to it.
            if (isFlying && (StrafeX != 0 || StrafeY != 0 || AscendY != 0 || RollX != 0))
                return DroneMovement.GetDroneMove(sequenceNumber, StrafeX, StrafeY, AscendY, RollX);

            return DroneMovement.GetHoveringCommand(sequenceNumber);
        }
    }

The properties StrafeX, StrafeY, AscendY, and RollX define the speed of navigation left/right, forward/backward, the altitude change, and the rotation of the drone, respectively. These properties are doubles and accept values between -1 and 1. For example, setting StrafeX to -0.5 moves the drone to the left at half of its maximum speed; setting it to 1 moves it to the right at full speed.

Flying is the property that triggers takeoff or landing. In the GetNextCommand method we check the values of these fields to decide which command to send to the drone. These commands are in turn built by the DroneMovement class.

Note that, if no command is specified, the last statement creates the so-called Hovering command, an empty command that keeps the communication channel open between the drone and the device. The drone needs to be constantly receiving messages from the controlling application, even when there’s no action to do and no status has changed.
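As a quick illustration of how a controlling application might drive these properties (a hypothetical fragment using only the members shown above, not code from the library):

DroneState.Flying = true;     // request takeoff on the next loop iteration
DroneState.StrafeX = -0.5;    // drift left at half speed
Thread.Sleep(500);
DroneState.StrafeX = 0;       // back to hovering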

The most interesting method of the DroneMovement class is definitely GetDroneMove, which composes the movement command string sent to the drone. For other methods related to movement, please refer to this sample.

public static string GetDroneMove(uint sequenceNumber, double velocityX, double velocityY, double velocityAscend, double velocityRoll)
    {
        var valueX = FloatConversion(velocityX);
        var valueY = FloatConversion(velocityY);
        var valueAscend = FloatConversion(velocityAscend);
        var valueRoll = FloatConversion(velocityRoll);

        var command = string.Format("{0},{1},{2},{3}", valueX, valueY, valueAscend, valueRoll);
        return CreateATPCMDCommand(sequenceNumber, command);
    }
private static string CreateATPCMDCommand(uint sequenceNumber, string command, int mode = 1)
    {
        return string.Format("AT*PCMD={0},{1},{2}{3}", sequenceNumber, mode, command, Environment.NewLine);
    }

The FloatConversion method is not listed here, but it converts a double value between -1 and 1 into a signed integer that can be used in the AT commands, such as the PCMD string that controls the movements.
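A possible implementation, consistent with the AR.Drone developer guide (this is an assumption, not the library's actual code), reinterprets the 32-bit IEEE 754 bit pattern of the value as a signed integer:

private static int FloatConversion(double value)
{
    // Take the single-precision bit pattern of the value and read it back as an int.
    return BitConverter.ToInt32(BitConverter.GetBytes((float)value), 0);
}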

The code shown here is available as a free library on NuGet, called AR.Drone 2.0 Interaction Library, which provides everything you need to control the device from takeoff to landing.

AR.Drone UI on NuGet

Thanks to this library, we can forget the implementation details and focus instead on delivering apps that, through different modes of interaction, allow us to pilot the drone.

Intel® RealSense™ SDK

Now let’s look at one of the greatest and easiest-to-use features (for me) of the Intel RealSense SDK — speech recognition.

The SDK offers two different approaches to speech:

  • Command recognition (from a given dictionary)
  • Free text recognition (dictation)

The first is essentially a list of commands, defined by the application, in a specified language for instructing the ‘recognizer’. Words not on the list are ignored.

The second is a sort of a recorder that “understands” any vocabulary in a free-form stream. It is ideal for transcriptions, automatic subtitling, etc.

For our project we will use the first option because we want to implement only a finite number of commands to send to the drone.

First, we need to define some variables to use:

        private PXCMSession Session;
        private PXCMSpeechRecognition SpeechRecognition;
        private PXCMAudioSource AudioSource;
        private PXCMSpeechRecognition.Handler RecognitionHandler;

Session is the object required to access the SDK’s I/O and algorithm modules, since all subsequent actions are inherited from this instance.

SpeechRecognition is the instance of the recognition module created with a CreateImpl function inside the Session environment.

AudioSource is the device interface to establish and select an input audio device (in our sample code we select the first audio device available to keep it simple).

RecognitionHandler is the handler instance to which we assign the event handler for the OnRecognition event.

Let’s now initialize the session, the AudioSource, and the SpeechRecognition instance.

            Session = PXCMSession.CreateInstance();
            if (Session != null)
            {
                // session is a PXCMSession instance.
                AudioSource = Session.CreateAudioSource();
                // Scan and Enumerate audio devices
                AudioSource.ScanDevices();

                PXCMAudioSource.DeviceInfo dinfo = null;

                for (int d = AudioSource.QueryDeviceNum() - 1; d >= 0; d--)
                {
                    AudioSource.QueryDeviceInfo(d, out dinfo);
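                    // dinfo is overwritten on each iteration; the loop ends at d = 0,
                    // so dinfo ends up holding the first device in the list.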
                }
                AudioSource.SetDevice(dinfo);

                Session.CreateImpl<PXCMSpeechRecognition>(out SpeechRecognition);

As noted before, to keep the code simple we select the first Audio device available.

PXCMSpeechRecognition.ProfileInfo pinfo;
              SpeechRecognition.QueryProfile(0, out pinfo);
              SpeechRecognition.SetProfile(pinfo);

Then we need to query the module for its current configuration profile and assign it to a variable (pinfo).

We could also set parameters in the profile info to change the recognized language, set the recognition confidence level (a higher value requires stronger recognition), adjust the end-of-recognition timeout, and so on.

In our case we keep the default parameters of profile 0 (the first returned by QueryProfile).

                String[] cmds = new String[] { "Takeoff", "Land", "Rotate Left", "Rotate Right", "Advance","Back", "Up", "Down", "Left", "Right", "Stop" , "Dance"};
                int[] labels = new int[] { 1, 2, 4, 5, 8, 16, 32, 64, 128, 256, 512, 1024 };
                // Build the grammar.
                SpeechRecognition.BuildGrammarFromStringList(1, cmds, labels);
                // Set the active grammar.
                SpeechRecognition.SetGrammar(1);

Next, we define the grammar dictionary for instructing the recognition system. Using BuildGrammarFromStringList, we create a simple list of verbs with corresponding return values, defining grammar number 1.

It is possible to define multiple grammars to use in our application and activate one at a time when needed, so we could create all the different command dictionaries for all the supported languages and provide a way for the user to switch between the different languages recognized by the SDK. In this case, you must install all the corresponding DLL files for the specific language support (the default SDK setup installs only the US English support assemblies). In this sample, we use only one grammar set with the default installation of US English.

We then select which grammar to make active in the SpeechRecognition instance.

                RecognitionHandler = new PXCMSpeechRecognition.Handler();

                RecognitionHandler.onRecognition = OnRecognition;

Those instructions define a new event handler for the OnRecognition event and assign it to a method defined below:

        public void OnRecognition(PXCMSpeechRecognition.RecognitionData data)
        {
            var RecognizedValue = data.scores[0].label;
            double movement = 0.3;
            TimeSpan duration = TimeSpan.FromMilliseconds(500); // 500 ms; new TimeSpan(0, 0, 0, 500) would mean 500 seconds
            switch (RecognizedValue)
            {
                case 1:
                    DroneState.TakeOff();
                    WriteInList("Takeoff");
                    break;
                case 2:
                    DroneState.Land();
                    WriteInList("Land");
                    break;
                case 4:
                    DroneState.RotateLeftForAsync(movement, duration);
                    WriteInList("Rotate Left");
                    break;
                case 5:
                    DroneState.RotateRightForAsync(movement, duration);
                    WriteInList("Rotate Right");
                    break;
                case 8:
                    DroneState.GoForward(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Advance");
                    break;
                case 16:
                    DroneState.GoBackward(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Back");
                    break;
                case 32:
                    DroneState.GoUp(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Up");
                    break;
                case 64:
                    DroneState.GoDown(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    WriteInList("Down");
                    break;
                case 128:
                    DroneState.StrafeX = .5;
                    Thread.Sleep(500);
                    DroneState.StrafeX = 0;
                    WriteInList("Left");
                    break;
                case 256:
                    DroneState.StrafeX = -.5;
                    Thread.Sleep(500);
                    DroneState.StrafeX = 0;
                    WriteInList("Right");
                    break;
                case 512:
                    DroneState.Stop();
                    WriteInList("Stop");
                    break;
                case 1024:
                    WriteInList("Dance");
                    DroneState.RotateLeft(movement);
                    Thread.Sleep(500);
                    DroneState.RotateRight(movement);
                    Thread.Sleep(500);
                    DroneState.RotateRight(movement);
                    Thread.Sleep(500);
                    DroneState.RotateLeft(movement);
                    Thread.Sleep(500);
                    DroneState.GoForward(movement);
                    Thread.Sleep(500);
                    DroneState.GoBackward(movement);
                    Thread.Sleep(500);
                    DroneState.Stop();
                    break;
                default:
                    break;

            }
            Debug.WriteLine(data.grammar.ToString());
            Debug.WriteLine(data.scores[0].label.ToString());
            Debug.WriteLine(data.scores[0].sentence);
            // Process Recognition Data
        }

This method takes the label value returned in the recognition data and executes the corresponding command (in our case, the corresponding flight instruction for the drone).

Every drone command calls DroneState with a specific method (TakeOff, GoUp, GoDown, etc.), in some cases with a parameter specifying the amount of movement or the duration of the action.

Some commands need an explicit call to the Stop method to interrupt the current action; otherwise the drone will continue to move as instructed (refer to the previous code for those commands).

In some cases it is necessary to insert a Thread.Sleep between two commands to allow the previous operation to complete before sending the new one.

To test the recognition even without a drone available, I added a variable (controlled by the checkbox in the main window) that enables a drone stub mode, which creates each command but doesn’t send it.

To close the application, call the OnClosing method to close and destroy all the instances and handlers, cleaning up the system.

In the code you can find some debug commands that print some helpful information in the Visual Studio* debug windows when testing the system.

Conclusion

In this article, we have shown how to interact with a device as complex as a drone using a natural-language interface. We defined a simple dictionary of verbs, instructed the system to understand it, and used it to control a complex device like a drone in flight. What I show in this article is only a small fraction of what is possible for operating the drone.

Photo of the flying demonstration at the .NET Campus event session in 2014

About the Author

Marco Dal Pino has worked in IT for more than 20 years and is a freelance consultant working on the .NET platform. He’s part of the staff of DotNetToscana, a community focused on Microsoft technologies, and he is a Microsoft MVP for Windows Platform Development. He develops mobile and embedded applications for the retail and enterprise sectors, and is also involved in developing Windows Phone and Windows 8 applications for a third-party company.

Marco has been Nokia Developer Champion for Windows Phone since 2013, and that same year Intel recognized him as an Intel Developer Zone Green Belt for the activity of developer support and evangelization about Perceptual and Intel RealSense technology. He’s also an Intel Software Innovator for Intel RealSense and IoT technologies.

He is a Trainer and speaks at major technical conferences.

Marco Minerva has been working on the .NET platform since its first introduction. He is mainly focused on designing and developing Windows Store and Windows Phone apps using Windows Azure as the back end. He is co-founder and president of DotNetToscana, the Tuscany .NET user group. He speaks at technical conferences and writes for magazines.

OS X* 10.11 not Supported by Intel® Parallel Studio XE 2015


 

The newly released OS X* 10.11 (El Capitan) has some changes related to directory permissions. OS X 10.11 introduced a new security policy called System Integrity Protection. With the new OS, the following directories can only be written to by the system:

  • /bin
  • /sbin
  • /usr
  • /System
  • /Applications/Utilities

In contrast the following directories are available to any process:

  • /usr/local
  • /Applications
  • ~/Library

The current releases, Intel Parallel Studio XE 2015 and the Intel Parallel Studio XE 2016 RTM, do not support OS X 10.11. If you upgrade from OS X 10.10 to 10.11, the Intel C++ Compiler's Xcode* integration will not work. There is no workaround for that, but the command-line compiler should continue to work.

Support for OS X 10.11 will come in a future Intel Parallel Studio XE 2016 update. Please see the Release Notes for details.

Platform Analyzer - Analyzing Healthy and not-so Healthy Applications


Recently my wife purchased a thick and expensive book. As an ultrasonic diagnostician for children, she purchases many books, but this one had me puzzled.  The book was titled Ultrasound Anatomy of the Healthy Child.  Why would she need a book that showed only healthy children?  I asked her and her answer was simple: to diagnose any disease, even one not yet discovered, you need to know what a healthy child looks like. 

In this article we will act like doctors, analyzing and comparing a healthy and a not-so-healthy application.

Knock – knock – knock.

The doctor says: “It’s open, please enter.”

In walks our patient,  Warrior Wave*, an awesome game in which your hand acts as the road for the warriors to cross. It’s extremely fun to play, innovative, and uses Intel® RealSense™ technology. 

While playing the game, though, something felt a little off.  Something that I hadn’t felt before in other games based on Intel® RealSense™ technology.  The problem could be caused by so many things, but what is it in this case?  

Like any good doctor who is equipped with the latest and greatest analysis tools to diagnose the problem, we have the perfect tools to analyze our patient.

Using Intel® Graphics Performance Analyzers (Intel® GPA) Platform Analyzer, we get a timeline view of our application’s CPU load, frame time, frames per second (FPS), and draw calls:

Let’s take a look.

Hmm… the first thing that catches our eye is the FPS surges that occur at regular intervals. All is relatively smooth for ~200 milliseconds, and then the frame rate jumps up and down severely.

For comparison, let’s look at a healthy FPS trace below. The game in this trace felt smooth and played well.

No pattern was evident within the frame time, just normal random deviations.

But in our case we see regular surges, about four times a second. Let’s investigate the problem more deeply by zooming in on one of the surges and seeing what is happening in the threads:

We can see that working thread 2780 spends most of the time in synchronization. The thread does almost nothing but wait for the next frame from the Intel® RealSense™ SDK:

At the same time, we see that rendering happens in another worker thread. If we scroll down, we find thread 2372.

Instead of “actively” waiting for the next frame from the Intel RealSense SDK, the game could be doing valuable work. Drawing and Intel® RealSense™ SDK work could be done in one worker thread instead of two, simplifying thread communication.

Excessive inter-thread communication can drastically slow down the execution and cause many problems.

Here is the example of a “healthy” game, where the Intel® RealSense™ SDK work and the DirectX* calls are in one thread. 

RealSense™ experts say: there is no point in waiting for the frames from the Intel® RealSense™ SDK. They won’t be ready any faster. 

But we can see that the main problem is at the top of the timeline.

On average, five out of six CPU frames did not result in a GPU frame. This is the cause of the slow and uneven GPU frame rate, which on average is less than 16 FPS.

Now let’s look at the pipeline to try to understand how the code is executing. Looking at the number of packets on “Engine 0,” the pipeline is filled to the brim, but the execution is almost empty.

The brain can process 10 to 12 separate images per second, perceiving them individually. This explains why the first movies were cut at a rate of 16 FPS: this is the average threshold at which the majority of people stop seeing a slide show and start seeing a movie.

Once again, let’s see the profile of the nice-looking game: 

Notice that the GPU frames follow the CPU frames with little shift. For every CPU frame, there is a corresponding GPU frame that starts execution after a small delay.

Let’s try to understand why our game doesn’t have this pattern.

First, let’s examine our DirectX* calls. The highlighted one with the tooltip is our “Present” call that sends the finished frame to the GPU. In the screenshot above, we see that it creates a “Present” packet on the GPU pipeline (marked with X’s). At around the 2215 ms mark, it has moved closer to execution, jumping over three positions, but at 2231 ms it just disappears without completing execution.

And if we look at each present call within the trace, not one call successfully makes it to execution.

Question: How does the game draw itself if all our DirectX* Present calls are ignored?! Good thing we have good tools so we can figure this out. Let’s take a look.

Can you see something curious inside the gray oval? We can see that this packet, not caused by any DirectX* call of our code, still gets to the execution, fast and out of order. Hey, wait a minute!!!

Let's look closely at our packet. 

And now to the packet that got executed. 

Wow! It came from an EXTERNAL thread. What could this mean? External threads are threads that don’t belong to the game.

Our own packets get ignored, but an external thread draws our game? What? Hey, this tool went nuts!

No, the image is quite right. The explanation is that on the Windows* system (starting with Windows Vista*), there is a program called Desktop Window Manager (DWM), which does the actual composition on the screen. Its packets are the ones we see executing at a fast rate with high priority.  And no, our packets aren’t lost—they are intercepted by DWM to create the final picture.

But why would DWM get involved in a full-screen game? After thinking a while, I realized that the answer is simple: I have a multi-monitor desktop configuration. Removing my second monitor from the display configuration made Warrior Wave behave like other games: normal GPU FPS, no glitches, and no DWM packets.

The patient will live! What a relief!

But other games still worked well even with a multi-monitor configuration, right (says the evil voice in the back of my head)?

To dig deeper, we need another tool. Intel® GPA Platform Analyzer allows you to see CPU and GPU execution over time, but it doesn’t give you lower-level details of each frame.

We would need to look more closely at the Direct3D* Device creation code. For this we could use Intel® GPA Frame Analyzer for DirectX*, but this is a topic for another article.

So let’s summarize what we have learned:

During this investigation we detected poor usage of threads that led to FPS surges, and a nasty DWM problem that was easily fixed by removing the second monitor from the desktop configuration.

Conclusion: Intel® GPA Platform Analyzer is a must-have tool for initial investigation of the problem. Get familiar with it and add it to your toolbox.

About the Author:

Alexander Raud works in the Intel® Graphics Performance Analyzers team in Russia and previously worked on the VTune Amplifier. Alex has dual citizenship in Russia and the EU, speaks Russian, English, some French, and is learning Spanish.  Alex has a wife and two children and still manages to play Progressive Metal professionally and head the International Ministry at Jesus Embassy Church.
