
Intel® Software Guard Extensions Developing a Sample Enclave Application


Developing a Sample Enclave Application

Step 1

In this topic, you will find a quick guide on how to develop an enclave application.

Assume that you have an application with the following code:

#include <stdio.h>
#include <string.h>

#define MAX_BUF_LEN 100

void foo(char *buf, size_t len)
{
	const char *secret = "Hello App!";
	if (len > strlen(secret))
	{
		memcpy(buf, secret, strlen(secret) + 1);
	}
}
int main()
{
	char buffer[MAX_BUF_LEN] = "Hello World!";

	foo(buffer, MAX_BUF_LEN);
	printf("%s", buffer);

	return 0;
}

The program displays the string Hello App!

Step 2: Create an Enclave

1. On the menu bar of Microsoft* Visual Studio*, choose File-->New-->Project.

The New Project dialog box opens.

2. Select Templates-->Visual C++-->Intel® SGX Enclave Project. Enter name, location, and solution name in the appropriate fields like any other Microsoft* Visual Studio* project.

3. Click OK and the welcome dialog appears.

4. Click Next to go to the Enclave Settings page.

5. Configure the enclave with the proper settings:

  • Project Type:
    • Enclave – Create an enclave project.
    • Enclave library – Create a static library for an enclave project.
  • Additional Libraries:
    • C++ STL – Link C++ STL with the enclave project.
  • Signing Key:
    • Import an existing signing key to the enclave project. A random key will be generated if no file is selected. The Enclave signer will sign the enclave with the key file.

When the enclave project is created, the wizard ensures that the enclave project has proper settings.

Step 3: Define Enclave Interface

Use an EDL file to define the enclave interface, which exposes a trusted function foo. The EDL file might look like the following:

// sample_enclave.edl
enclave {
	trusted {
		public void foo([out, size=len] char* buf, size_t len);
	};
};
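The [out, size=len] attribute tells the generated edge routines to copy len bytes from the enclave back to untrusted memory when the call returns. When the project builds, the sgx_edger8r tool generates sample_enclave_u.h/.c for the application and sample_enclave_t.h/.c for the enclave. The untrusted proxy declared for foo should look roughly like the sketch below (check the generated sample_enclave_u.h for the exact signature):

// sample_enclave_u.h (generated) -- expected shape of the untrusted proxy for foo
sgx_status_t foo(sgx_enclave_id_t eid, char* buf, size_t len);

The proxy keeps the original parameters, adds the enclave ID as the first argument, and returns an sgx_status_t describing whether the ECALL itself succeeded.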

Step 4: Import Enclave to Application

To call the enclave interface in the application, import the enclave to the application using Microsoft* Visual Studio* Intel® Software Guard Extensions Add-in.

1. Right click the application project and select Intel® SGX Configuration -> Import Enclave.

The Import Enclave dialog box opens.

2. Check the sample_enclave.edl box, and then press OK.

Step 5: Implement Application and Enclave Functions

To implement application and enclave functions, use the following code samples:

The enclave code

// sample_enclave.cpp
#include "sample_enclave_t.h"
#include <string.h>
void foo(char *buf, size_t len)
{
	const char *secret = "Hello Enclave!";
	if (len > strlen(secret))
	{
		memcpy(buf, secret, strlen(secret) + 1);
	}
}

The application code

#include <stdio.h>
#include <tchar.h>
#include "sgx_urts.h"
#include "sample_enclave_u.h"

#define ENCLAVE_FILE _T("sample_enclave.signed.dll")
#define MAX_BUF_LEN 100

int main()
{
	sgx_enclave_id_t   eid;
	sgx_status_t       ret   = SGX_SUCCESS;
	sgx_launch_token_t token = {0};
	int updated = 0;
	char buffer[MAX_BUF_LEN] = "Hello World!";

	// Create the Enclave with above launch token.
	ret = sgx_create_enclave(ENCLAVE_FILE, SGX_DEBUG_FLAG, &token, &updated, &eid, NULL);
	if (ret != SGX_SUCCESS) {
		printf("App: error %#x, failed to create enclave.\n", ret);
		return -1;
	}


	// A bunch of Enclave calls (ECALL) will happen here.
	foo(eid, buffer, MAX_BUF_LEN);
	printf("%s", buffer);

	// Destroy the enclave when all Enclave calls finished.
	if (SGX_SUCCESS != sgx_destroy_enclave(eid))
		return -1;

	return 0;
}
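Note that the foo called in main is the generated untrusted proxy, not the enclave function itself, so it also returns an sgx_status_t. A slightly more defensive version of the call (a sketch, not part of the original sample) checks that status as well:

	// Optional hardening: check the status returned by the generated ECALL proxy.
	sgx_status_t ecall_ret = foo(eid, buffer, MAX_BUF_LEN);
	if (ecall_ret != SGX_SUCCESS) {
		printf("App: error %#x, ECALL failed.\n", ecall_ret);
	}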

Step 6: Compilation and Execution

Now you can compile the application and enclave projects. After the compilation, set the working directory to the output directory and run the program. You should get the string Hello Enclave!


Using the Intel® RealSense™ Camera with TouchDesigner*: Part 3


The Intel® RealSense™ camera (R200) is a vital tool for creating VR and AR projects and real time performance interactivity. I found TouchDesigner*, created by Derivative*, to be an excellent program for utilizing the information provided by the Intel RealSense cameras.

This third article is written from the standpoint of creating real time interactivity in performances using the Intel RealSense camera in combination with TouchDesigner. Interactivity in a performance always adds a magical element. There are example photos and videos from an in-progress interactive dance piece I am directing and creating visuals for, as well as demos showing how you can create different interactive effects using the Intel RealSense camera (R200). The interactive performance dance demo takes place in the Vortex Immersion dome in Los Angeles, where I am a resident artist. The dancer and choreographer is Stevie Rae Gibbs; Tim Hicks assisted me with cinematography and VR live-action shooting. The music was created by Winter Lazerus. The movies embedded in this article were shot by Chris Casady and Winter Lazerus.

Things to Consider When Creating an Interactive Immersive Project

Just as in any performance, there needs to be a theme. The theme of this short interactive demo is simple: liberation from what is trapping the dancer, the box weighing her down. The interactivity contributed to this theme. The effects were linked to the skeletal movements of the dancer, and some were linked to the color and depth information provided by the Intel RealSense camera. The obvious linking of the effects to the dancer contributed to a sense of magic. The choreography and dancing had to work with the effects. Besides the use of theatrical lighting, care had to be taken that enough light was on the subject so that the Intel RealSense camera could properly register. The camera's distance from the dancer also had to be considered, taking into account the range of the camera and the effect wanted. The dancer also had to be careful to stay within the effective camera range.

The demo dance project is an immersive full dome performance piece so it had to be mapped to the dome. Having the effects mapped to the dome also influenced their look. For the Vortex Immersion dome, Jeff Smith of Eye Vapor has created a TouchDesigner interface for dome mapping. I used this interface as the base layer within which to put my TouchDesigner programming of the interactive effects.

Jeff Smith on Mapping the Dome Using TouchDesigner:

“There were several challenges in creating a real time mapping solution for a dome using TouchDesigner. One of the first things we had to work through was getting a perspective corrected image through each projector. The solution, which is well known now, is to place virtual projectors inside a mapped virtual dome and render out an image for each projector. Another challenge was to develop a set of alignment and blending tools to be able to perfectly calibrate and blend the projected image. And finally, we had to develop custom GLSL shaders to render real time fisheye imagery”.

Tim Hicks on Technical Aspects of Working with the RealSense Camera

“Working with the Intel RealSense camera was extremely efficient in creating a simple and stable workflow to connect our performer’s gestures through TouchDesigner, and then rendered out as interactive animations. Setup is quick and performance is reliable, even in low light, which is always an issue when working inside an immersive digital projection dome.”

Notes for Utilizing TouchDesigner and the Intel RealSense Camera

Like Part 1 and Part 2, Part 3 is aimed at those familiar with using TouchDesigner and its interface. If you are unfamiliar with TouchDesigner, before you follow the demos I recommend that you review some of the documentation and videos available here: Learning TouchDesigner. The Part 1 and Part 2 articles walk you through use of the TouchDesigner nodes described in this article, and provide sample .toe files to get you started.

A free non-commercial copy of TouchDesigner is available and is fully functional, except that the highest resolution is limited to 1280 x 1280.

Note: When using the Intel RealSense camera, it is important to pay attention to its range for best results.

Demo #1: Using the Depth Mapping of the R200 and SR300 Camera

This is a simple and effective way to create interactive colored lines that respond to the movement of the performer. In the case of this performance, the lines wrapped and animated around the entire dome in response to the movement of the dancer.

  1. Create the nodes you will need, arrange, and connect/wire them in a horizontal row in this order:
    • RealSense TOP node
    • Level TOP node
    • Luma Level TOP node
  2. Open the Setup parameters page of the RealSense TOP node and set the Image parameter to Depth.
  3. Set the parameters of the Level TOP and the Luma Level TOP to offset the brightness and contrast. Judge this by looking at the result you are getting in the effect.
    Figure 1. You are using the Depth setting in the RealSense TOP node for the R200 camera.
  4. Create a Blur TOP and a Displace TOP.
  5. Wire the Luma Level TOP to the Blur TOP and the top connector on the Displace TOP.
  6. Connect the Blur TOP to the bottom connector of the Displace TOP (Note: the filter size of the blur should be based on what you want your final effect to look like).
    ​Figure 2. Set the Filter for the Blur TOP at 100 as a starting point
  7. Create a Ramp TOP and a Composite TOP.
  8. Choose the colors you want your line to be in the Ramp TOP.
  9. Connect the Displace TOP to the top connector in the Composite TOP and the Ramp TOP to the bottom connector in the Displace TOP.
    ​Figure 3. You are using the Depth setting in the RealSense TOP node for the R200 camera.
     
    ​Figure 4. The complete network for the effect.
  10. Watch how the line reacts to the performer's motions.
    Figure 5. Video from the demo performance of the colored lines created from the depth mapping of the performer by the RealSense camera.

Demo #2: RealSense TOP Depth Mapping Second Effect

In this demo, we use TouchDesigner with the depth feature of the Intel RealSense R200 camera to project duplicates of the performer and offset them in time. I used it in the demo performance to project several images of the dancer moving at different times, creating the illusion of more than one dancer. Note that this effect did not make it into the final dance performance, but it is well worth using.

  1. Add a RealSense TOP node to your scene.
  2. On the Setup parameters page for the RealSense TOP node, for the Image parameter select Depth.

    Create two Level TOP nodes and connect the RealSense TOP node to each of them.
    Figure 6. You are using the Depth setting in the RealSense TOP node for the R200 camera.
  3. Adjust the level node parameters to give you the amount of contrast and brightness you want on your effect. You might go back after seeing the effect and readjust the parameters. As a starting point for both Level TOPS, in the Pre Parameters page, set the Brightness parameter to 2 and the Gamma parameter to 1.75.
  4. Create a Transform TOP and wire it to level2 TOP.
  5. In the Transform TOP parameters, on the Transform page, set the Translate x parameter to 0.2. Note that translating x by 1 would move the image fully off.
  6. Create two Cache TOP nodes and wire one to the Transform TOP and one to level1 TOP.
  7. On the cache1 TOP's Cache parameter page, set Cache Size to 32 and Output Index to -20.
  8. On the cache2 TOP's Cache parameter page, set Cache Size to 32 and Output Index to -40. I am using the Cache TOP to save and offset the timing of the images. Note that once you see how your effect is working with your performance, you may want to go back and readjust these settings.

    Notes on the Cache TOP: The Cache TOP can be used to freeze images in the TOP by turning the Active parameter to Off. (You can set the cache size to 1.) The Cache TOP acts as a delay if you set Output Index to a negative number and leave the Active parameter On. Once a sequence of images has been captured by turning the Active parameter on and off, it can be looped by animating the Output Index parameter.

    For more information, see the Cache TOP documentation.

    Figure 7. You could add in more Level TOPS to create more duplicates.
  9. Wire both Cache TOPS to a Difference TOP.
    Figure 8. The Cache TOPS are wired to the Diff TOP so that both images of the performer will be seen.
     
    Figure 9. The entire network for the effect. Look at the effect when projected in your performance, go back, and adjust the node parameters as necessary.

Demo #3: RealSense TOP Color Mapping For Texture Mapping

Use the RealSense TOP node to texture map the geometries, in this case the boxes, with the dancer's moving image.

  1. Create a Geometry COMP and go inside it, down one level (/project1/geo1) and create an In SOP.
  2. Go back up to project1 and create a Box SOP.
  3. In the Box SOP Parameters, set the Texture Coordinates to Face Outside. This will ensure that each face gets the full texture (zero to 1).
  4. Wire the Box SOP to the Geometry COMPs input.
  5. Create a RealSense TOP Node and in the Parameters Setup page, set the Model to R200 and the Image to Color.
  6. Create a Phong MAT and in the Parameters RGB page set the Color Map to realsense1 or alternatively drag the RealSense TOP node into the Color Map parameter.
  7. In the Geo COMP's Render parameter page, set the Material parameter to phong1.
  8. Create a Render TOP, a Camera COMP, and a Light COMP.
  9. Create a Reorder TOP and in the Reorder parameter page, set the Output Alpha, Input 1 to One using the drop down.
    Figure 10. The entire network to show how the Intel RealSense R200 Color mode can be used to texture all sides of a Box Geo.
     
    Figure 11. The dancer appears to be holding up the box, which is textured with her image.
     
    Figure 12. Multiple boxes with the image of the dancer animate around the dancer once she has lifted the box off herself.

Demo #4: RealSense CHOP Movement Control Over Large Particle Sphere

For this effect, I wanted the dancer to be able to interact playfully with a large particle ball. She moves towards the sphere and it moves away from her.

  1. Create a RealSense CHOP node. On the Setup parameters page, set the Model to R200 and the Mode to Finger/Face Tracking. Turn on the Person Center-Mass World Position and the Person Center-Mass Color Position.
  2. Connect the RealSense CHOP node to a Select CHOP node.
  3. In the Select CHOP, on the Select page, set the Channel Names parameter to person1_center_mass:tx.
  4. Create a Math CHOP node and leave the defaults for now (you can adjust them later as needed in your performance), then wire the Select CHOP node to the Math CHOP node.
  5. Create a Lag CHOP node and wire the Math CHOP node to that.
  6. Connect the Lag CHOP node to a Null CHOP node and connect the Null CHOP node to a Trail CHOP node.
    Figure 13. The entire network to show how the RealSense R200 CHOP can be hooked up. The Trail CHOP node is very useful for seeing if and how much the RealSense camera is working.
  7. Create a Torus SOP, connect it to a Transform SOP, and then connect the Transform SOP to a Material SOP.
  8. Create a Point Sprite MAT.
  9. In the Point Sprite MAT, Point Sprite parameters page, choose a yellow color.
  10. In the Material SOP, parameters page, set the Material to pointsprite1
  11. Create a Copy SOP, keep its default parameter settings, and wire the Material SOP to the bottom connection on it.
  12. Create a Sphere SOP, wire it to a Particle SOP.
  13. Wire the Particle SOP to the top connector in the Copy SOP.
  14. In the Particle SOP, on the State parameter page, set the Particle Type to Render as Point Sprites.
  15. Connect the Copy SOP to a Geo COMP. Go one level down project1/geo1. Delete the Torus SOP and create an In SOP.
    Figure 14. For the more advanced, a Point Sprite MAT can be used to change the look of the particles.
  16. Export the person1_center_mass:tx channel from the Null CHOP to the Transform SOP parameters, on the Transform page, Translate tx.
    Figure 15. Exporting the channel.
     
    Figure 16. The large particle ball assumes a personality as the dancer plays with it, trying to catch it.

Demo #5: Buttons to Control Effects

Turning on and off interactive effects is important. In this demo, I will show the simplest way to do this using a button.

  1. Create a Button COMP.
  2. Connect it to a Null CHOP.
  3. Activate and export the channel from the Null CHOP to the Parameters, Render Page of the Geo COMP from the previous Demo 4. Pressing the button will turn the render of the Geo COMP on and off.
    Figure 17. An elementary button setup.

Summary

This article is designed to give the reader some basic starting points, techniques and ideas as to how to use the RealSense camera to create interactivity in a performance. There are many more sophisticated effects to be explored using the RealSense camera in combination with TouchDesigner.

Related Applications

Many useful apps have been created for the Intel RealSense camera, for example:

https://appshowcase.intel.com/en-us/realsense/node/9167?cam=all-cam - drummer app for Intel RealSense Cameras.

https://appshowcase.intel.com/en-us/realsense?cam=all-cam - apps for all Intel RealSense cameras.

About the Author

Audri Phillips is a visualist/3d animator based out of Los Angeles, with a wide range of experience that includes over 25 years working in the visual effects/entertainment industry in studios such as Sony*, Rhythm and Hues*, Digital Domain*, Disney*, and Dreamworks* feature animation. Starting out as a painter she was quickly drawn to time based art. Always interested in using new tools she has been a pioneer of using computer animation/art in experimental film work including immersive performances. Now she has taken her talents into the creation of VR. Samsung* recently curated her work into their new Gear Indie Milk VR channel.

Her latest immersive works/animations include multimedia animations for "Implosion a Dance Festival" (2015) at the Los Angeles Theater Center and four fulldome concerts in the Vortex Immersion dome, one with the well-known composer/musician Steve Roach, the most recent being the fulldome concert "Relentless Universe". She also created animated content for the dome show for the TV series "Constantine*" shown at the 2014 Comic-Con convention. Several of her fulldome pieces, "Migrations" and "Relentless Beauty", have been juried into "Currents", The Santa Fe International New Media Festival, and the Jena FullDome Festival in Germany. She exhibits in the Young Projects gallery in Los Angeles.

She writes online content and a blog for Intel®. Audri is an Adjunct professor at Woodbury University, a founding member and leader of the Los Angeles Abstract Film Group, founder of the Hybrid Reality Studio (dedicated to creating VR content), a board member of the Iota Center, and an exhibiting member of the LA Art Lab. In 2011 Audri became a resident artist of Vortex Immersion Media and the c3: CreateLAB. A selection of her works is available on Vimeo, on creativeVJ, and on Vrideo.

Fine-Tuning Optimization for a Numerical Method for Hyperbolic Equations Applied to a Porous Media Flow Problem with Intel® Tools


Frederico L. Cabral – fcabral@lncc.br
Carla Osthoff – osthoff@lncc.br
Marcio Rentes Borges – marcio.rentes.borges@gmail.com

National Laboratory for Scientific Computing (LNCC)

 

Introduction

In order to exploit the capabilities of parallel programming in the many-core processor architecture present in the Intel® Xeon Phi™ coprocessor, each processor core must be used as efficiently as possible before going to a parallel OpenMP* approach. Otherwise, increasing the number of threads will not produce the expected speedup. In this sense, the so-called "top-down characterization of micro-architectural issues" should be applied [1].

In this paper, we present some analysis of potential optimizations for a Godunov-type semi-discrete central scheme, proposed in [2], for a particular hyperbolic problem arising in porous media flow, on the Intel® Xeon® processor architecture, in particular the Haswell processor family, which brings a new and more advanced instruction set to support vectorization. This instruction set is called Intel® Advanced Vector Extensions 2 (Intel® AVX2); it follows the same programming model as the previous Intel AVX family and at the same time extends it by offering most of the 128-bit SIMD integer instructions with 256-bit numeric processing capabilities [3].

Considering the details of this architecture, one can achieve optimization and obtain a considerable performance gain even in single-thread execution before moving to a parallelization approach. Auto-vectorization followed by semi-auto vectorization extracts the potential of a single thread, which is crucial for the next step, parallelization. Before using more than one thread, it's important to be sure that a single thread executes efficiently.

This paper is organized as follows: section 2 explains what a porous media application is and how it's important to oil and gas recovery studies; section 3 explains the numerical method used in this analysis for a hyperbolic problem; section 4 describes the hardware and software environment used in the paper; section 5 explains optimization opportunities and results; section 6 shows the results for some different sizes of the problem; section 7 analyzes the OpenMP parallel approach and Intel® Advisor analysis; section 8 presents the conclusion and future work.

Porous Media Flow Application

As mentioned in [2], in porous media applications flow and transport are strongly influenced by spatial variations in permeability and porosity fields. In such problems, the growth of the region where the macroscopic mixing of fluids occurs is dictated by rock heterogeneity acting in conjunction with fluid instabilities induced by an unfavorable viscosity ratio. Usually, one assumes heterogeneity solely manifested in the permeability field, with porosity commonly assumed homogeneous [1-3] or inhomogeneous but uncorrelated with the permeability [4]. However, as deeper reservoirs are detected and explored, geomechanical effects arising from rock compaction give rise to time-dependent porosity fields. Such a phenomenon, acting in conjunction with the strong correlation between permeability and porosity, somewhat affects finger growth, leading to the necessity of incorporating variability in porosity in the numerical modeling. In the limit of vanishing capillary pressure, the fluid-phase saturations satisfy a hyperbolic transport equation with porosity appearing in the storativity term. Accurate approximations of conservation laws in strongly heterogeneous geological formations are tied to the development of locally conservative numerical methods capable of capturing the precise location of the wave front, owing to the fact that the saturation entropy solutions admit nonsmooth composite waves.

Numerical Method for a Hyperbolic Problem

The numerical method used in this paper is devoted to solving the following partial differential equation:

∂(φ s)/∂t + ∇ · F(s, x, t) = 0

where φ is the storage coefficient, s is a scalar function, and the vector F represents a flux of the conserved quantity s, which may depend on other functions of position and time (for example, on the Darcy velocity in porous media applications) [2].

Experiment Environment

All tests presented herein were executed on a 14.04 Ubuntu* computer with an Intel® Core™ i5-4350U processor @ 1.40 GHz with two real cores and hyper-threading (a total of four simultaneous threads). This processor belongs to the Intel Haswell family, which has full support for Intel® Advanced Vector Extensions 2 (Intel® AVX2), the focus of this paper. This computer has 4 GB of RAM and 3 MB of cache. Intel® Parallel Studio XE 2015 was used to analyze performance and guide the optimization process.

Optimization Opportunities

This section presents the analysis and identifies the parts inside the code that permit optimization via vectorization and parallelization. To accomplish that, Intel tools such as Intel® VTune™ Amplifier, the reports generated by the compiler, and Intel® Advisor are used. Such tools perform several kinds of analysis on the source and binary (executable) code and give hints on how to optimize it. This task is mostly applied to performance bottleneck analysis inside the CPU pipeline. Modern CPUs implement a pipeline together with other techniques like hardware threads, out-of-order execution, and instruction-level parallelism to use hardware resources as efficiently as possible [1]. The main challenge for an application programmer is to take full advantage of these resources, because they rely on the microarchitecture level, which is "far distant" from the programming level available in modern and widely used programming languages such as C, C++, and Fortran*.

Intel tools can help by performing analysis at compile/link time and at execution time to identify how those resources are being used. First, the compiler can produce a report from which one can discover what was done automatically and what was not, for optimization purposes. Sometimes a loop is vectorized and other times it is not; in both cases the compiler reports it, and it is useful for the programmer to know whether an intervention is needed. Second, Intel VTune Amplifier shows the application behavior in the pipeline and its memory usage. It is also possible to know how much time is spent in each module and whether the execution is efficient. Intel VTune Amplifier provides high-level metrics to analyze those details, such as cycles-per-instruction, which is a measure of how fast instructions are being executed. Finally, Intel Advisor can estimate how much performance can be improved with thread parallelism. By combining all these tips and results, the programmer can make changes in the source code as well as in compiler options to achieve the optimization goal.

Auto Vectorization

Recent versions of compilers are able to vectorize the executable code when the optimization compiler option is set to O2 or higher. One can disable vectorization by using the -no-vec compiler option, which can be helpful for comparing results. Briefly, vectorization is the ability to perform an operation on multiple data elements with a single instruction [5]. Let's take the following loop as an example.

for (i=0; i<=MAX; i++) {
       c[i] = b[i] + a[i];
}

If this loop is not vectorized, the use of memory registers will look something like the image shown in Figure 1.

Figure 1: Non-vectorized piece of code.

The same loop can be vectorized (see Figure 2), and the compiler may use the additional registers to perform four additions in a single instruction. This approach is called Single Instruction Multiple Data (SIMD).

Figure 2: Vectorized loop.

This strategy often leads to an increase in performance, as one instruction operates simultaneously on more than one data element.
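Besides relying purely on auto-vectorization, the programmer can help the compiler. The short C sketch below (not taken from the flow application) shows two common hints: restrict promises that the arrays do not overlap, and the OpenMP SIMD directive requests explicit vectorization of the loop; on Intel compilers such directives are typically enabled with -qopenmp or -qopenmp-simd.

#include <stddef.h>

/* Sketch: helping the compiler vectorize the addition loop.
 * 'restrict' asserts the arrays do not alias each other, and
 * '#pragma omp simd' requests explicit SIMD vectorization. */
void add_arrays(float *restrict c, const float *restrict a,
                const float *restrict b, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}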

In the flow application studied in this paper, auto-vectorization yields a speedup of about two times on a Haswell processor, which has the Intel AVX2 instruction set. The -xHost option instructs the compiler to generate code according to the computer host architecture. In the case of a Haswell host, this option guarantees that Intel AVX2 instructions are generated. Figure 3 shows the Intel VTune Amplifier summary, where the elapsed time is 186.628 seconds, Clocks per Instruction (CPI) is about 0.5, and a total of 1,067,768,800,000 instructions were retired.

Figure 3: Intel® VTune™ Amplifier summary without xHost.

The main goal of optimization is to reduce the elapsed time and increase CPI, which ideally should be 0.75 (full pipeline utilization).

Figure 4 shows the top-down tree ordered by effective time by utilization for each module in the application. Note that the function "vizinhanca" is the most expensive, consuming about 20 percent of the total time. This function computes porosities, permeabilities, velocities, the flux on the element, and the flux on all neighboring elements, for all elements of the mesh, in the mass calculation step.

Figure 4: Top-down tree.

Notice that CPI is critically poor for the “fkt” function that computes flux at a given time step, for all steps. This function calls “vizinhanca”.

Figure 5 shows part of the “vizinhanca” source code, the time to execute each line, and the corresponding assembly code for the selected line, which computes the linear reconstruction. Notice that the instructions generated are not specifically the Intel AVX2 instructions.

Figure 5: Assembly code for linear reconstruction.

The instructions MULSD, MOVSDQ, SUB, IMUL, MOVSXD, MOVL, MOVQ, and ADDQ are all from the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instruction set. These “old” instructions were released with the Intel® Pentium® 4 processor in November 2000.

Compiling the same code with -xHost, which instructs the compiler to generate code for the host architecture, and executing via Intel VTune Amplifier, the total elapsed time is reduced to 65.657 seconds with CPI Rate equal to 0.628, which is much better and closer to the ideal 0.75 value (see Figure 6).

Figure 6: Intel® VTune™ Amplifier summary with xHost.

To understand why this happens, one can look at the assembly code and notice that Intel AVX2 instructions are generated. It's also noticeable that the Instructions Retired metric is much smaller than in the non-vectorized version. This occurs because the metric does not count instructions that were canceled because they were on a path that should not have been taken due to a mispredicted branch. Thus the total amount of misprediction is much smaller, which also explains the increase of CPI [7]. Figure 7 shows the same piece of code as shown in Figure 5. Notice that most of the assembly instructions are Intel AVX2 instead of Intel SSE2. The difference can be seen in the mnemonic for each instruction, where the prefix V, as in VMOV instead of MOV, means it's a vector instruction.

Figure 7: Assembly vectorized code for linear reconstruction.

Figure 8 shows the differences among the registers for the Intel SSE2, Intel AVX, and Intel AVX2 instruction sets [6].

Figure 8: Registers and instruction sets.

This simple auto-vectorization permitted a performance gain of up to three times.

Identifying Other Hotspots

Considering that four threads can be executed by two physical cores, and that the only optimizations made so far come from compiler auto-vectorization, there is room for more optimization; to achieve it, bottlenecks need to be identified. The original source code of the flow application has many lines with division instead of multiplication by the inverse.

The -no-prec-div option instructs the compiler to replace division with multiplication by the inverse, and its use has been shown to improve the performance of the flow application.

Figure 9 shows a piece of source code where division is widely used. The code is inside the "udd_rt" function, one of the bottlenecks of the code, according to the analysis presented in Figure 4.

Figure 9: Many divisions in a critical piece of code.

The use of -no-prec-div has one problem: according to [5], sometimes the value obtained is not as accurate as IEEE standard division. If it is important to have fully precise IEEE division, disable this floating-point division-to-multiplication optimization (that is, do not use -no-prec-div), which leads to a more accurate result with some loss of performance.
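The transformation itself is simple. The hypothetical C snippet below (the names sat, flux, and phi are made up, not from the application) illustrates the kind of change that -no-prec-div allows the compiler to make: one reciprocal is computed and the repeated divisions become multiplications, at the cost of a possible last-bit difference from IEEE division.

/* Illustration only: hoist one reciprocal and multiply instead of dividing. */
void scale_by_porosity(double *sat, const double *flux, double phi, int n)
{
    const double inv_phi = 1.0 / phi;   /* single division */
    for (int i = 0; i < n; i++) {
        sat[i] = flux[i] * inv_phi;     /* multiply replaces a per-iteration divide */
    }
}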

Another important issue is that the flow application has a significant number of Fortran modules, which can be a challenge for optimization. To optimize across procedures, the -ipo option instructs the compiler and linker to perform interprocedural optimization; part of the optimization is left to the link phase.
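As a reference, a compile-and-link line combining the options discussed so far might look like the sketch below (the Intel Fortran compiler, ifort, is assumed; -O2 and the -qopt-report option used to produce the vectorization report are assumptions, while -xHost, -ipo, and -no-prec-div are the options discussed in the text):

ifort -O2 -xHost -ipo -no-prec-div -qopt-report=2 -c ./fontes/transporte.F90
ifort -O2 -xHost -ipo -no-prec-div *.o -o flow_app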

With the use of -ipo, -xHost, and -no-prec-div, the total elapsed time is reduced even more, according to Figure 10.

Figure 10: ipo, xHost, and no-prec-div options.

To better understand the role of -no-prec-div, a test without it was executed: the total elapsed time increased to 84.324 seconds, revealing that this option is responsible for a gain of about 1.5 times. Another test without the -ipo option led to an elapsed time of 69.003 seconds, revealing a gain of about 1.3 times.

Using Intel® VTune™ Amplifier General Exploration to Identify Core and Memory Bounds

Intel VTune Amplifier does more than report elapsed time, CPI, and instructions retired. Using General Exploration, it's possible to obtain information about program behavior inside the CPU pipeline, revealing where performance can be improved by detecting core and memory bounds. Core bound refers to issues that are unrelated to memory transfer and that hurt performance through out-of-order execution limits or saturation of some execution units. Memory bound refers to the fraction of cycles where the pipeline could be stalled due to demand load/store instructions. Combined, these two metrics define the so-called back-end bound, which together with the front-end bound indicates the stalls that cause unfilled pipeline slots.
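The same collection can also be launched from the command line. A sketch using the VTune Amplifier command-line tool of that generation (amplxe-cl) is shown below; ./flow_app and the result directory name are placeholders, not names from the project:

amplxe-cl -collect general-exploration -result-dir ge_results -- ./flow_app
amplxe-cl -report summary -result-dir ge_results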

A typical modern out-of-order pipeline is shown in Figure 11, which identifies two different parts: front-end and back-end [1].

Figure 11: Out-of-order Intel CPU pipeline.

The back-end contains the ports that deliver instructions to the ALU, load, and store units. Unbalanced utilization of those ports will in general negatively affect overall performance, because some of the execution units can be idle or used less frequently while others are saturated.

For the application studied herein, this analysis showed that even with the auto-vectorization by the compiler, there's still some core bound due to unbalanced instruction dispatches among the ports.

Figure 12: Core bound.

Figure 12 shows a high rate of 0.203, which means 20.3 percent of clock cycles had no ports utilized (poor utilization). There was also a high rate of 0.311, meaning 31.1 percent of cycles had three or more ports utilized, which causes saturation. Clearly the port utilization is unbalanced.

In the bottom-up Intel VTune Amplifier analysis presented in Figure 13, it’s possible to identify the causes of core bound mentioned above. In this case the Intel VTune Amplifier indicates a high back-end bound in four functions (fkt, udd_rt, xlo, and xmax_speed) in the program and in the “transport” module. A high rate of bad speculation is also detected for the “xmax_speed” function.

The next step for using Intel VTune Amplifier as a tool to analyze those issues involves looking into each function in order to identify pieces of code where time (clock cycles) is being wasted.

Figure 13: Back-end bound.

By analyzing each function that appears in the back-end bound, it’s possible to identify where those bounds occur. As shown in Figures 14–17, Intel VTune Amplifier exhibits the source code with corresponding metrics for each line, such as retiring, bad speculation, clockticks, CPI, and back-end bound. In general, a high value of clockticks indicates the critical lines in the code, which Intel VTune Amplifier exhibits by highlighting the correspondent value.

In the “transport” function, there is a high bound as shown in Figure 14.

Figure 14: Back-end bound in the transport function.

The "fkt" function seems to be critical since it calls the "g" and "h" functions to compute flux and the "udd_rt" function to compute saturation on the previous element of the mesh, as can be observed in the piece of code analyzed in Figure 15.

Figure 15: Back-end bound in the fkt function.

Functions "g" and "h", called by "fkt", are listed below.

function g(uu,vv,xk,VDISPY,POROSITY)
!
!	computes f in a node or volume
!
use mPropGeoFisica, only: xlw,xlo,xlt,gf2,rhoo,rhow

implicit none
real(8) :: g,uu,vv,xk
REAL(8) :: VDISPY,POROSITY
real(8) :: fw
real(8) :: xlwu,xlou
real(8) :: xemp,xg

xlwu=xlw(uu)
xlou=xlo(uu)
fw=xlwu/(xlwu+xlou)
xg = gf2
xemp=fw*xg*xlou*xk*(rhoo-rhow)
g=fw*vv-xemp

end function

function h(uu,vv,xk,VDISPZ,POROSITY)
!
!	computes f in a node or volume
!
use mPropGeoFisica, only: xlw,xlo,xlt,gf3,rhoo,rhow

implicit none
real(8) :: h,uu,vv,xk
REAL(8) :: VDISPZ,POROSITY
real(8) :: fw
real(8) :: xlwu,xlou
real(8) :: xemp,xg

xlwu=xlw(uu)
xlou=xlo(uu)
fw=xlwu/(xlwu+xlou)
xg = gf3
xemp=fw*xg*xlou*xk*(rhoo-rhow)
h=fw*vv-xemp

end function

It's important to emphasize that "g" and "h" are components of the vector F, which represents the flux of the conserved quantity "s" in the transport equation of section 3.
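Reading the two listings above, both components have the same algebraic form. Written out explicitly (a reconstruction from the code, not from the original paper, with λw and λo the water and oil mobilities xlw and xlo, fw the fractional flow, k the permeability xk, g2 and g3 the gravity factors gf2 and gf3, and ρo, ρw the phase densities):

fw(u) = λw(u) / (λw(u) + λo(u))
g(u,v) = fw(u)·v − fw(u)·λo(u)·k·g2·(ρo − ρw)
h(u,v) = fw(u)·v − fw(u)·λo(u)·k·g3·(ρo − ρw)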

The function "xlo", which computes total mobility from the water and oil mobilities, also presents a high back-end bound, according to Figure 16.

Figure 16: Function xlo.

Figure 17 shows the function “xmax_speed” that computes the maximum speed in the x direction. “xmax_speed”  is the other function responsible for the high back-end bound.

Figure 17: Function xmax_speed.

Intel VTune Amplifier points to the four functions presented above ("transport", "fkt", "xlo", and "xmax_speed") as the sources of the back-end bound.

A back-end bound can be caused by a core or a memory bound. Since auto-vectorization has already been performed, it's highly probable that the results shown above are caused by memory issues (see Figure 18). Indeed, Intel VTune Amplifier indicates some L1 cache bound.

Figure 18: Memory bound.

Since this application has a great number of data structures, such as vectors, that are needed in many different functions, these structures must be studied in detail to identify actions that improve performance by reducing this memory latency. The issue of data alignment must also be considered in order to assist vectorization [4].

Using Compiler Reports to Identify More Hints and Opportunities

In order to identify more opportunities to optimize the code via vectorization, the compiler report should be used, because it gives hints about potentially vectorizable loops. The MTRANSPORT module becomes the focus of the analysis, since the computational effort is concentrated there.

Some outer loops are not vectorized because the corresponding inner loop is already vectorized by the auto-vectorization approach, which was explained previously. The compiler message for this case is:

remark #####: loop was not vectorized: inner loop was already vectorized

In some cases the compiler instructs the programmer to use a SIMD directive, but when this is done, the loop cannot be vectorized because it performs an assignment to a scalar variable, as shown in the following loop.

!DIR$ SIMD
      do n=1,nstep+1
      ss=smin+(n-1)*ds

The corresponding compiler message is

LOOP BEGIN at ./fontes/transporte.F90(143,7)
     remark #15336: simd loop was not vectorized: conditional
     assignment to a scalar   [ ./fontes/transporte.F90(166,23) ]
     remark #13379: loop was not vectorized with "simd"

Most of the loops were vectorized automatically by the compiler, and the potential speedup estimated by the compiler varies from 1.4 (for most of them) to 6.5 (for only one).

When the compiler detects that vectorization will produce a loss of performance (a speedup of less than 1.0), it doesn't perform vectorization. One example is shown below.

LOOP BEGIN at ./fontes/transporte.F90(476,7)
      remark #15335: loop was not vectorized: vectorization
		possible but seems inefficient. Use vector always directive
		or -vec-threshold0 to override
      remark #15451: unmasked unaligned unit stride stores: 1
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 3
      remark #15477: vector loop cost: 3.000
      remark #15478: estimated potential speedup: 0.660
      remark #15479: lightweight vector operations: 2
      remark #15481: heavy-overhead vector operations: 1
      remark #15488: --- end vector loop cost summary ---
LOOP END

The real speedup obtained with auto-vectorization is about three times, in line with the compiler's estimated mean value.

Analyzing the reasons that the compiler didn't vectorize some loops should be considered before performing more optimizations on the application.

Experiments with Other Grid Sizes

All tests made up to this point were executed with a 100×20×100 mesh that computes 641 transport steps with one thread.

In this section other sizes of mesh will be considered in order to compare the total speedup obtained via auto-vectorization. Table 1 shows this comparison.

Mesh          Time without Vectorization (s)    Time with Vectorization (s)    Speedup    Number of Transport Steps
100×20×100    186.628                           53.590                         3.4        641
50×10×50      17.789                            5.277                          3.4        321
25×5×25       0.773                             0.308                          2.5        161
10×2×10       0.039                             0.0156                         2.5        65

Table 1: Comparison of total speedup obtained using auto-vectorization.

These tests show that the speedup achieved with auto-vectorization tends to grow as the mesh gets bigger. This is an important conclusion because real problems require more refined (bigger) meshes.

In further phases of this project, more refined meshes need to be considered to validate this hypothesis.

Parallel OpenMP* Multithread Approach

Once semi-auto vectorization is achieved, it's possible to investigate how fast the application can become when executed with more than a single thread. For this purpose, this section presents an initial experiment comparing the analysis via Intel VTune Amplifier with one and two threads.

The tests described herein were performed on a different machine than the one described in section 4. For this section, seven tests were executed on a 14.04.3 LTS Ubuntu computer with an Intel® Core™ i7-5500U processor @ 2.40 GHz with two real cores and hyper-threading (a total of four simultaneous threads). This processor belongs to the Intel Broadwell family, which has full support for Intel AVX2. This computer has 7.7 GB of RAM and 4 MB of cache. Intel Parallel Studio XE 2016 was used to analyze performance and guide the optimization process. Broadwell is Intel's codename for the 14-nanometer die shrink of its Haswell microarchitecture. It is a "tick" in Intel's tick-tock model as the next step in semiconductor fabrication.

The first test was executed in a single thread and with the -ipo, -xHost, and -no-prec-div compiler options, just like in section 5.2. The same 100×20×100 mesh that executes 641 transport steps took 39.259 seconds, a gain of about 1.4 times over the result obtained with the Haswell processor as described in section 6, Table 1. Figure 19 shows part of the Intel VTune Amplifier summary for this case.

Figure 19: Intel® VTune™ Amplifier summary.

An important issue to consider is that with a processor that is 1.7 times faster, the gain obtained was just 1.4, which suggests that the microarchitecture resources are not being used efficiently. The Intel VTune Amplifier summary presented in Figure 19 shows a back-end bound equal to 0.375, suggesting that even by increasing the clock frequency the causes of some bounds may not be solved.

Figure 20: Intel® VTune™ Amplifier analysis.

Figure 20 shows that there is a high back-end bound due to a core bound that is caused by the divider unit latency. There are other core issues, unrelated to the divider unit, referred to as "Port Utilization" in VTune. This "Port Utilization" metric indicates how efficiently the execution ports inside the ALU are used. As cited in [8], core bound is a consequence of demand on the execution units or a lack of instruction-level parallelism in the code. Core bound stalls can manifest, among other causes, because of sub-optimal execution port utilization. For example, a long-latency divide operation might serialize execution, increasing the demand for an execution port that serves specific types of micro-ops; as a consequence, only a small number of ports is utilized in each cycle.

This indicates that the code must be improved in order to avoid data dependencies and to adopt a more efficient vectorization approach beyond the semi-auto vectorization applied so far.

Figure 21 shows the Intel VTune Amplifier bottom-up analysis, where the parts of the code identified as the cause of the back-end bound are the same ones identified in section 5.3, Figure 13.

Figure 21: Intel® VTune™ Amplifier bottom-up analysis.

It's possible to conclude that even by increasing the CPU clock frequency the bottlenecks tend to remain the same, or eventually appear in new regions of the code, because of the out-of-order execution and the faster demand for computational resources.

Two OpenMP Threads

It's crucial to analyze the impact of parallelism on the speedup of this application with two or more OpenMP threads. Since the hardware available has two real cores with hyper-threading, tests were made with two threads to see how a second real core can contribute to reducing the execution time.

Figure 22: Intel® VTune™ Amplifier summary analysis.

Figure 22 shows an Intel VTune Amplifier summary analysis for the same case as presented in the previous section, with the only difference being that the OMP_NUM_THREADS environment variable was set to 2. The gain is about 1.8 times, but there is still a high back-end bound for the reasons explained before.

Figure 23 shows the same bottlenecks and in addition a bad speculation in the “xlo” function and a high retiring rate in the “xlw” function.

Figure 23: Intel® VTune™ Amplifier bottom-up analysis.

According to [8], bad speculation indicates that slots are wasted due to wrong speculation, which can be caused by two different reasons: (1) slots used by micro-ops that eventually do not retire, and (2) slots in which the issue pipeline was blocked due to recovery from earlier mis-speculations. When the number of threads is increased, data dependencies between them start to appear, which can cause mis-speculations, that is, speculations discarded during the out-of-order pipeline flow. The increase in the retiring rate for the "xlw" function indicates that parallelism improved slot utilization, even if only slightly.

Figure 24: Finite Element Mesh.

Figure 24 shows the stencil used in the numerical method, which leads to a data dependence between threads in the parallel approach. Each element “O” in the mesh used by the numerical method depends on data coming from its four neighbors: left, right, up, and down. The light gray indicates the regions in the mesh that are shared between the element and the neighbors. The dark gray indicates the regions that are shared among the element itself and the other two neighbors. The points xl, xln, xlc, xr, xrn, xrc, yun, yu, yuc, ydc, yd and ydn are points shared among adjacent elements and here show the possible bottleneck for a parallel approach, since data must be exchanged among parallel threads, and synchronization methods need to be implemented via OpenMP barriers in the transport module.
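A schematic sketch of how such a transport sweep might be parallelized is shown below, written in C rather than the application's Fortran to keep it compact; none of the names come from the application. Reading from the previous-step array and writing to a new one removes races on the shared neighbor values, and the implicit barrier at the end of the parallel loop plays the role of the synchronization point between transport steps mentioned above.

/* Schematic only: one explicit transport step over an nx-by-ny mesh slice.
 * Each element reads its four neighbors (left, right, up, down) from s_old
 * and writes its updated value into s_new, so no thread writes data that
 * another thread reads during the same step. */
void transport_step(double *s_new, const double *s_old, int nx, int ny)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int j = 1; j < ny - 1; j++) {
        for (int i = 1; i < nx - 1; i++) {
            int c = j * nx + i;
            double neighbor_balance =
                0.25 * (s_old[c - 1] + s_old[c + 1] +
                        s_old[c - nx] + s_old[c + nx]) - s_old[c];
            s_new[c] = s_old[c] + neighbor_balance;  /* placeholder update rule */
        }
    }   /* implicit barrier: the next step starts only after s_new is complete */
}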

In order to verify whether the size of the problem affects the parallel speedup, other mesh sizes were tested and no significant differences were observed for smaller meshes (see Table 2), except for the 10×2×10 mesh which presents a very low workload. The OpenMP overhead for creating and destroying threads is also a limiting factor.

Mesh          Time with One Thread (s)    Time with Two Threads (s)    Speedup    Number of Transport Steps
100×20×100    39.259                      21.429                       1.83       641
50×10×50      2.493                       1.426                        1.74       321
25×5×25       0.185                       0.112                        1.65       161
10×2×10       0.014                       0.015                        0.94       65

Table 2: Comparison among meshes, threads, speedup, and number of transport steps.

Figure 25 shows an Intel VTune Amplifier summary analysis for a 10×2×10 mesh with one thread, while Figure 26 shows the summary analysis for two threads.

Figure 25: Intel® VTune™ Amplifier summary analysis.

Notice that the front-end bound is zero and the back-end bound is 1.0 for one thread and the exact inverse for two threads, due to the OpenMP overhead previously mentioned.

Figure 26: Intel® VTune™ Amplifier summary analysis.

Intel® Parallel Advisor Tips for Multithreading

Intel Parallel Advisor is an important tool that helps analyze the impact of parallelism before actually doing it. This tool identifies the most expensive pieces of code and estimates the speedup achieved by multithreading.

The use of Intel Parallel Advisor consists of four steps: (1) survey analysis, (2) annotation analysis, (3) suitability analysis, and (4) correctness analysis. The survey analysis shows the basic hotspots that can be candidates for parallelization. Figure 27 shows a top-down view, where the top of the figure indicates that the “transport” module is the most expensive candidate. The Intel Parallel Advisor screen exhibits information like the time spent in each function, the loop type (vector or scalar), and the instruction set (Intel AVX, Intel AVX2, Intel SSE, and so on). The same information is shown in the bottom part of the figure, but in a caller-callee tree structure. This gives a top-down view for a module’s behavior, providing information about microarchitecture issues as well as the identification of the most expensive loops that become candidates to parallelism. In this case, the main loop in the function “ktdd_rt” inside the “transport” module that takes 98 percent of total time is shown in the bottom half of the figure.

Figure 27: Intel® Parallel Advisor.

Once the loops with the most computational effort are identified, annotations are inserted in the source code [9]. Annotations are inserted to mark the places in the serial program where Intel Advisor should assume parallel execution and synchronization will occur. Later, the program is modified to prepare it for parallel execution by replacing the annotations with parallel framework code so that parts of the program can execute in parallel. The idea is to identify the parts of loops or functions that use significant CPU time, especially those whose work can be performed independently and distributed among multiple cores. Annotations are either subroutine calls or macro uses, depending on which language you are using, so they can be processed by your current compiler. The annotations do not change the computations of the program, so the application runs normally.

The annotations that are inserted mark the proposed parallel code regions for the Intel Advisor tools to examine in detail. The three main types of annotations mark the location of:

  • A parallel site. A parallel site encloses one or more tasks and defines the scope of parallel execution. When converted to a parallel code, a parallel site executes initially using a single thread.
  • One or more parallel tasks within the parallel site. Each task encountered during execution of a parallel site is modeled as being possibly executed in parallel with the other tasks and the remaining code in the parallel site. When converted to parallel code, the tasks will run in parallel. That is, each instance of a task's code may run in parallel on separate cores, and the multiple instances of that task's code also runs in parallel with multiple instances of any other tasks within the same parallel site.
  • Locking synchronization, where mutual exclusion of data access must occur in the parallel program.

For our flow media problem, annotations were inserted in the places shown in the code sample below. The gain estimate analyses performed by Intel Parallel Advisor are shown in Figures 28 and 29. The pieces of code where site and task annotations were inserted are shown below. The first site is around the function "maxder" and the second is around the loop that computes the velocity field for each finite element in the mesh.

call annotate_site_begin("principal")
	call annotate_iteration_task("maxder")
	call maxder(numLadosELem,ndofV,numelReserv,vxmax,vymax,vzmax,vmax,phi,permkx,v)
call annotate_site_end

call annotate_site_begin("loop2")
	do nel=1,numel
		call annotate_iteration_task("loop_nelem")
		[…]
	end do
call annotate_site_end

The next step is to execute the suitability report to obtain an estimate of how much faster the code can run with more than one thread. Figure 28 shows the performance gain estimate for the first annotated site (the "principal" site); it is linear up to 32 threads, as can be seen in the green region of the graph of gain versus CPU count (threads).

Figure 28: Intel® VTune™ Amplifier summary analysis.

Figure 29 shows that the gain is estimated to be linear up to 64 threads for the second annotated site (the "loop2" loop).

Figure 29: Intel® VTune™ Amplifier summary analysis.

In the Impact to Program Gain area on the screen, Intel Advisor suggests a performance gain of about 2.5 times (2.47 for the first site and 2.50 for the second site) with four OpenMP parallel threads (CPU Count set to four) in the places where the annotation sites were put. This value applies only to these two sites. For a total estimate of the performance gain, it's necessary to insert annotations in more of the expensive loops.

Parallelization in a Multithread Environment

In order to verify how fast the code can be executed, some tests were executed on a cluster node composed of four Intel® Xeon® E5-2698 v3 processors (40 MB cache, 2.30 GHz) with 72 total cores, including hyper-threading.

Two different meshes were executed, 100×20×100 and 200×40×200, in order to compare how the speedup varies as a function of the number of threads in each one. In the previous section an initial analysis was performed using Intel Parallel Advisor, which indicated that for the annotated sites one can get a linear speedup at up to 32 threads; in this section one can verify whether those predictions are confirmed.

The code could be executed with up to 47 threads without any kind of problem. With more threads, a segmentation fault occurs. The reason for this problem is left for further investigation.

For the first mesh (100×20×100), Figure 30 shows a linear speedup only up to 25 threads, where the speedup drops and then grows linearly again up to 32 threads, reaching the same value it had at 25 threads. For more than 32 threads, the speedup increases again, but below a linear rate. This result shows there are three different ranges where the gain of performance increases: (1) from 1 to 25, (2) from 26 to 32, and (3) from 33 to 47 threads. This apparently strange behavior can possibly be caused by two effects: hyper-threading and the so-called "NUMA effect."

Figure 30: Speedup x number of threads.

As mentioned in [10], Non-Uniform Memory Access (NUMA) is a solution to scalability issues in SMP architectures, but it is also a potential bottleneck for performance due to memory access issues.

“As a remedy to the scalability issues of SMP, the NUMA architecture uses a distributed memory design. In a NUMA platform, each processor has access to all available memory, as in an SMP platform. However, each processor is directly connected to a portion of the available physical memory.

In order to access other parts of the physical memory, it needs to use a fast interconnection link. Accessing the directly connected physical memory is faster than accessing the remote memory. Therefore, the OS and applications must be written having in mind this specific behavior that affects performance.”

Figure 31: Typical NUMA architecture.

Since the cluster node is composed of four CPUs (a multi-socket system), as shown in Figure 31, each one connected to its own local memory, the NUMA effect can occur as a consequence.

As shown in Figure 32, Intel® Hyper-Threading Technology permits one physical core to execute two simultaneous threads through the use of a pair of register sets. When one thread executes an operation that leaves the execution units idle, the second thread is scheduled to run, increasing CPU throughput (see Figure 33). However, there is only one physical core, so the performance gain will not be double for a highly CPU-bound application.

Figure 32: Intel® Hyper-Threading Technology.

As explained in reference 11, “In the diagram below, we see an example of how processing performance can be improved with Intel HT Technology. Each 64-bit Intel Xeon processor includes four execution units per core. With Intel HT Technology disabled, the core’s execution can only work on instructions from Thread 1 or from Thread 2. As expected, during many of the clock cycles, some execution units are idle. With Hyper-Threading enabled, the execution units can process instructions from both Thread 1 and Thread 2 simultaneously. In this example, hyper-threading reduces the required number of clock cycles from 10 to 7.”

Figure 33: Intel® Hyper-Threading Technology.

This reduction from 10 to 7 clock cycles represents a performance gain of roughly 1.4 times (10/7), and no more than that, for this particular example.

Figure 34 shows the results for the second mesh (200×40×200), which executes 1281 transport steps and is roughly eight times bigger than the previous (100×20×100) mesh, so it is a larger and more critical CPU-bound task. The results show that linear speedup is achieved only up to 16 threads.

Figure 34: Speedup x number of threads for the second mesh (200×40×200).

As with the previous mesh, this one presents a pattern of three different ranges where the speedup has a particular behavior. The first region (1 to 16 threads) presents a linear gain, as does the second (19 to 32 threads); the third region presents almost no gain.

Despite the differences in execution time between the two tests, the similarities are important to point out, because they indicate that the causes for not obtaining the linear speedup from 1 to 36 threads predicted by Intel Parallel Advisor are the same: the NUMA effect and hyper-threading.

Tables 3 and 4 show the execution time as a function of the number of threads.

Table 3: Threads × time in seconds - 100×20×100.

Table 4: Threads × time in seconds - 200×40×200.

In order to achieve a better gain, other tests can be performed with a higher workload. Amdahl's Law (reference 12) states that the speedup possible from parallelizing one part of a program is limited by the portion of the program that still runs serially, which means that regardless of the number of parallel regions in the program and the number of threads, the total gain is bounded by the serial portion of the program. In this case, for the tests presented below, the total number of transport steps is increased by decreasing the Courant number (Cr) from 0.125 to 0.0125, keeping the total simulation final time equal to 0.8 days, just like all the previous tests. For the 100×20×100 mesh, the number of transport steps increased from 641 to 6410 and for the 200×40×200 mesh, the number increased from 1281 to 12801. For both meshes the amount of work was 10 times that of the previous tests, because Cr was also reduced by a factor of 10.
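For reference, Amdahl's Law gives the maximum speedup S on n threads for a program whose parallelizable fraction is p (this is the standard textbook formulation, not notation taken from this paper):

S(n) = 1 / ((1 - p) + p/n)

As n grows, S(n) approaches 1/(1 - p); for example, if 95 percent of the runtime is parallelizable (p = 0.95), the speedup can never exceed 20 regardless of the thread count.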

Figure 35 shows the speedup for the 100×20×100 mesh with Cr = 0.0125, where one can notice that the gain increases practically linearly up to 34 threads, which is close to the number of physical cores in the cluster node, since it has 4 processors with 9 cores each plus hyper-threading. As the previous test showed in Figure 30, hyper-threading was not efficient enough to achieve more gain. Nevertheless, as opposed to the previous test, this one presented a near-linear gain until the number of threads matched the number of physical cores. This result indicates that for a higher workload, the code benefits from the multicore architecture, but one question still remains: even though the test with the 200×40×200 mesh (Figure 34) has a higher workload, its speedup is good only up to 16 threads. What kind of work can be increased so that a gain appears?

Basically, it is possible to increase the workload in two different ways: (1) a denser mesh, which has more points and therefore more finite elements inside a particular region, and (2) more transport steps, which can result from two different factors: a higher total simulation final time and a lower Cr. The test shown in Figure 35 was performed only with additional time steps; the total elapsed time (total wallclock time) for one thread was 403.16 seconds and for 36 threads was 14.22 seconds, which is about 28 times faster.

The other approach, which consists of increasing both the number of transport steps and the mesh density, is presented in Figure 36, where the total time for one thread is 6577.53 seconds and for 36 threads is 216.16 seconds, leading to a speedup of about 30 times. With these two ways of augmenting the workload, the gain still remained almost linear up to the number of physical cores, indicating that the factor that helps parallelism achieve a good speedup is the increase in time steps and not the mesh density.

Figure 35: Mesh 100×20×100 with Cr = 0.0125.

Figure 36: Mesh 200×40×200 with Cr = 0.0125.

Table 5 compares both strategies for increasing workload and whether the gain is linear (good or bad speedup).

 

                  100×20×100 mesh     200×40×200 mesh
Cr = 0.125        Bad speedup         Bad speedup
Cr = 0.0125       Good speedup        Good speedup

Table 5: Meshes x Different Courant Numbers.

Table 5 suggests that the factor that contributes to a good gain is the increase in total transport steps and not the mesh density. The execution with more transport steps decreases the percentage of time spent accessing the cache memory of other sockets, therefore reducing the NUMA effect. This result indicates which cases will probably benefit from many core processor architecture such as the Intel® Xeon Phi™ coprocessor.

Conclusion and Future Work

This paper presented an analysis of the performance gain obtained through the auto-vectorization provided by the compiler and also showed where the most significant bottlenecks in the code are, in order to vectorize and improve performance even more. Memory-bound stalls seem to be the cause of the back-end bound, creating some difficulties in achieving the optimization goal. To accomplish that, some changes should be made in the data structures and in the code that accesses them. Compiler reports were used to analyze the reasons why some parts of the code could not be vectorized automatically. Identifying those reasons is a crucial step toward further improvements. The application was tested with different problem sizes to compare the speedup behavior. It was clear that auto-vectorization reduced execution time by a factor of 2.5 for small cases and by more than 3.0 for bigger cases.

For future steps in this project, semi-automatic vectorization should be studied, and a refactoring of the code may be necessary to simplify the manipulation of data structures and the vectorization of the loops the compiler could not vectorize by itself. To extend the test cases, bigger meshes have to be used to investigate how much speedup is achievable for real problems.

Another important issue considered in this paper is the multithreaded (OpenMP) approach, which showed that a speedup of 1.8 times can be achieved with two threads, despite the issues still to be addressed. Analysis with Intel Advisor indicated that the potential speedup is about 32, but tests performed on a cluster node showed that a linear gain was achieved only up to 25 or 16 threads, depending on the mesh. This difference between what Intel Advisor predicted and what actually executed is suspected to be caused by the NUMA effect and hyper-threading. Tests that increased the workload in time instead of in space minimized this problem by reducing the percentage of time spent accessing the memory of other sockets, therefore reducing the NUMA effect, so that a linear speedup up to the number of physical cores should be accomplished.

References

1. “How to Tune Applications Using a Top-Down Characterization of Microarchitectural Issues.” Intel® Developer Zone, 2016.

2. Correa, M. R. and Borges, M. R., “A semi-discrete central scheme for scalar hyperbolic conservation laws with heterogeneous storage coefficient and its applications to porous media flow.” International Journal for Numerical Methods in Fluids. Wiley Online Library, 2013.

3. “Intel Architecture Instruction Set Extensions Programming References.” Intel® Developer Zone, 2012.

4. “Fortran Array Data and Arguments and Vectorization.” Intel® Developer Zone, 2012.

5. “A Guide to Vectorization with Intel Compilers.” Intel® Developer Zone, 2012.

6. “Introduction to Intel® Advanced Vector Extensions.” Intel® Developer Zone, 2011.

7. “Clockticks per Instructions Retired.” Intel® Developer Zone, 2012.

8. Optimization Manual.

9. “About Annotations” (https://software.intel.com/en-us/node/432316)

10. “NUMA effects on multicore, multi socket systems” (https://static.ph.ed.ac.uk/dissertations/hpc-msc/2010-2011/IakovosPanourgias.pdf)

11. https://www.dasher.com/will-hyper-threading-improve-processing-performance/

12. https://software.intel.com/en-us/node/527426

 

Intel® Software Guard Extensions Tutorial Series: Part 2, Application Design


The second part in the Intel® Software Guard Extensions (Intel® SGX) tutorial series is a high-level specification for the application we’ll be developing: a simple password manager. Since we’re building this application from the ground up, we have the luxury of designing for Intel SGX from the start. That means that in addition to laying out our application’s requirements, we’ll examine how Intel SGX design decisions and the overall application architecture influence one another.

Read the first tutorial in the series or find the list of all of the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

Password Managers At-A-Glance

Most people are probably familiar with password managers and what they do, but it’s a good idea to review the fundamentals before we get into the details of the application design itself.

The primary goals of a password manager are to:

  • Reduce the number of passwords that end users need to remember.
  • Enable end users to create stronger passwords than they would normally choose on their own.
  • Make it practical to use a different password for every account.

Password management is a growing problem for Internet users, and numerous studies have tried to quantify the problem over the years. A Microsoft study published in 2007—nearly a decade ago as of this writing—estimated that the average person had 25 accounts that required passwords. More recently, in 2014 Dashlane estimated that their US users had an average of 130 accounts, while the number of accounts for their worldwide users averaged in the 90s. And the problems don’t end there: people are notoriously bad at picking “good” passwords, frequently reusing the same password on multiple sites, which has led to some spectacular attacks. These problems boil down to two basic issues: passwords that are hard for hacking tools to guess are often difficult for people to remember, and a greater number of passwords compounds the problem because the user must also remember which password goes with which account.

With a password manager, you only need to remember one very strong passphrase in order to gain access to your password database or vault. Once you have authenticated to your password manager, you can look up any passwords you have stored, and copy and paste them into authentication fields as needed. Of course, the key vulnerability of the password manager is the password database itself: since it contains all of the user’s passwords it is an attractive target for attackers. For this reason, the password database is encrypted with strong encryption techniques, and the user’s master passphrase becomes the means for decrypting the data inside of it.

Our goal in this tutorial is to build a simple password manager that provides the same core functions as a commercial product while following good security practices and use that as a learning vehicle for designing for Intel SGX. The tutorial password manager, which we’ll name the “Tutorial Password Manager with Intel® Software Guard Extensions” (yes, that’s a mouthful, but it’s descriptive), is not intended to function as a commercial product and certainly won’t contain all the safeguards found in one, but that level of detail is not necessary.

Basic Application Requirements

Some basic application requirements will help narrow down the scope of the application so that we can focus on the Intel SGX integration rather than the minutiae of application design and development. Again, the goal is not to create a commercial product: the Tutorial Password Manager with Intel SGX does not need to run on multiple operating systems or on all possible CPU architectures. All we require is a reasonable starting point.

To that end, our basic application requirements are:

The first requirement may seem strange given that this tutorial series is about Intel SGX application development, but real-world applications need to consider the legacy installation base. For some applications it may be appropriate to restrict execution only to Intel SGX-capable platforms, but for the Tutorial Password Manager we’ll use a less rigid approach. An Intel SGX-capable platform will receive a hardened execution environment, but non-capable platforms will still function. This usage is appropriate for a password manager, where the user may need to synchronize his or her password database with other, older systems. It is also a learning opportunity for implementing dual code paths.

The second requirement gives us access to certain cryptographic algorithms in the non-Intel SGX code path and to some libraries that we’ll need. The 64-bit requirement simplifies application development by ensuring access to native 64-bit types and also provides a performance boost for certain cryptographic algorithms that have been optimized for 64-bit code.

The third requirement gives us access to the RDRAND instruction in the non-Intel SGX code path. This greatly simplifies random number generation and ensures access to a high-quality entropy source. Systems that support the RDSEED instruction will make use of that as well. (For information on the RDRAND and RDSEED instructions, see the Intel® Digital Random Number Generator Software Implementation Guide.)

The fourth requirement keeps the list of software required by the developer (and the end user) as short as possible. No third-party libraries, frameworks, applications, or utilities need to be downloaded and installed. However, this requirement has an unfortunate side effect: without third-party frameworks, there are only four options available to us for creating the user interface. Those options are:

  • Win32 APIs
  • Microsoft Foundation Classes (MFC)
  • Windows Presentation Foundation (WPF)
  • Windows Forms

The first two are implemented in native/unmanaged code while the latter two require .NET*.

The User Interface Framework

For the Tutorial Password Manager, we’re going to be developing the GUI using Windows Presentation Foundation in C#. This design decision impacts our requirements as follows:

Why use WPF? Mostly because it simplifies the UI design while introducing complexity that we actually want. Specifically, by relying on the .NET Framework, we have the opportunity to discuss mixing managed code, and specifically high-level languages, with enclave code. Note, though, that choosing WPF over Windows Forms was arbitrary: either environment would work.

As you might recall, enclaves must be written in native C or C++ code, and the bridge functions that interact with the enclave must be native C (not C++) functions. While both Win32 APIs and MFC provide an opportunity to develop the password manager with 100-percent native C/C++ code, the burden imposed by these two methods does nothing for those who want to learn Intel SGX application development. With a GUI based in managed code, we not only reap the benefits of the integrated design tools but also have the opportunity to discuss something that is of potential value to Intel SGX application developers. In short, you aren’t here to learn MFC or raw Win32, but you might want to know how to glue .NET to enclaves.

To bridge the managed and unmanaged code we’ll be using C++/CLI (C++ modified for Common Language Infrastructure). This greatly simplifies the data marshaling and is so convenient and easy to use that many developers refer to it as IJW (“It Just Works”).
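As a rough illustration of how thin this C++/CLI layer can be, the sketch below wraps a hypothetical native function exported by the enclave bridge DLL. The function name, class name, and signature are illustrative only and are not taken from the tutorial's actual code.

// VaultBridge.h -- hypothetical C++/CLI wrapper (illustrative only)
#include <string>
#include <msclr/marshal_cppstd.h>

// Hypothetical native function exported by the enclave bridge DLL.
extern "C" int eb_unlock_vault(const char* passphrase);

public ref class VaultBridge
{
public:
	// Called from the C# application layer.
	bool UnlockVault(System::String^ passphrase)
	{
		// Marshal the managed string into a native std::string.
		msclr::interop::marshal_context ctx;
		std::string native = ctx.marshal_as<std::string>(passphrase);

		// Forward the call to the native bridge DLL, which performs the ECALL.
		return eb_unlock_vault(native.c_str()) == 0;
	}
};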

Figure 1: Minimum component structures for native and C# Intel® Software Guard Extensions applications.

Figure 1 shows the impact to an Intel SGX application’s minimum component makeup when it is moved from native code to C#. In the fully native application, the application layer can interact directly with the enclave DLL since the enclave bridge functions can be incorporated into the application’s executable. In a mixed-mode application, however, the enclave bridge functions need to be isolated from the managed code block because they are required to be 100-percent native code. The C# application, on the other hand, can’t interact with the bridge functions directly, and in the C++/CLI model that means creating another intermediary: a DLL that marshals data between the managed C# application and the native, enclave bridge DLL.

Password Vault Requirements

At the core of the password manager is the password database, or what we’ll be referring to as the password vault. This is the encrypted file that will hold the end user’s account information and passwords. The basic requirements for our tutorial application are:

The requirement that the vault be portable means that we should be able to copy the vault file to another computer and still be able to access its contents, whether or not that computer supports Intel SGX. In other words, the user experience should be the same: the password manager should work seamlessly (as long as the system meets the base hardware and OS requirements, of course).

Encrypting the vault at rest means that the vault file should be encrypted when it is not actively in use. At a minimum, the vault must be encrypted on disk (without the portability requirement, we could potentially solve the encryption requirements by using the sealing feature of Intel SGX) and should not sit decrypted in memory longer than is necessary.

Authenticated encryption provides assurances that the encrypted vault has not been modified after the encryption has taken place. It also gives us a convenient means of validating the user’s passphrase: if the decryption key is incorrect, the decryption will fail when validating the authentication tag. That way, we don’t have to examine the decrypted data to see if it is correct.

Passwords

Any account information is sensitive information for a variety of reasons, not the least of which is that it tells an attacker exactly which logins and sites to target, but the passwords are arguably the most critical piece of the vault. Knowing what account to attack is not nearly as attractive as not needing to attack it at all. For this reason, we’ll introduce additional requirements on the passwords stored in the vault:

This is nesting the encryption. The passwords for each of the user’s accounts are encrypted when stored in the vault, and the entire vault is encrypted when written to disk. This approach allows us to limit the exposure of the passwords once the vault has been decrypted. It is reasonable to decrypt the vault as a whole so that the user can browse their account details, but displaying all of their passwords in clear text in this manner would be inappropriate.

An account password is only decrypted when a user asks to see it. This limits its exposure both in memory and on the user’s display.

Cryptographic Algorithms

With the encryption needs identified it is time to settle on the specific cryptographic algorithms, and it’s here that our existing application requirements impose some significant limits on our options. The Tutorial Password Manager must provide a seamless user experience on both Intel SGX and non-Intel SGX platforms, and it isn’t allowed to depend on third-party libraries. That means we have to choose an algorithm, and a supported key and authentication tag size, that is common to both the Windows CNG API and the Intel SGX trusted crypto library. Practically speaking, this leaves us with just one option: Advanced Encryption Standard-Galois Counter Mode (AES-GCM) with a 128-bit key. This is arguably not the best encryption mode to use in this application, especially since the effective authentication tag strength of 128-bit GCM is less than 128 bits, but it is sufficient for our purposes. Remember: the goal here is not to create a commercial product, but rather a useful learning vehicle for Intel SGX development.

With GCM come some other design decisions, namely the IV length (12 bytes is most efficient for the algorithm) and the authentication tag length.
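Inside the enclave, the Intel SGX trusted cryptography library exposes AES-GCM with a 128-bit key directly. The sketch below shows roughly how a buffer could be encrypted with a 12-byte IV and a 128-bit tag using sgx_rijndael128GCM_encrypt; the function and buffer names in the wrapper are placeholders, and this is only an illustration of the API shape, not the tutorial's actual vault code.

// Illustrative enclave-side encryption sketch (not the tutorial's vault code).
#include <sgx_tcrypto.h>

sgx_status_t encrypt_blob(const sgx_aes_gcm_128bit_key_t *key,
                          const uint8_t *plaintext, uint32_t len,
                          const uint8_t iv[12],          // 12-byte IV, as discussed above
                          uint8_t *ciphertext,           // same length as the plaintext
                          sgx_aes_gcm_128bit_tag_t *tag) // 128-bit authentication tag
{
	// No additional authenticated data (AAD) in this simple sketch.
	return sgx_rijndael128GCM_encrypt(key,
	                                  plaintext, len,
	                                  ciphertext,
	                                  iv, 12,
	                                  NULL, 0,
	                                  tag);
}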

Encryption Keys and User Authentication

With the encryption method chosen, we can turn our attention to the encryption key and user authentication. How will the user authenticate to the password manager in order to unlock their vault?

The simple approach would be to derive the encryption key directly from the user’s passphrase or password using a key derivation function (KDF). But while the simple approach is a valid one, it does have one significant drawback: if the user changes his or her password, the encryption key changes along with it. Instead, we’ll follow the more common practice of encrypting the encryption key.

In this method, the primary encryption key is randomly generated using a high-quality entropy source and it never changes. The user’s passphrase or password is used to derive a secondary encryption key, and the secondary key is used to encrypt the primary key. This approach has some key advantages:

  • The data does not have to be re-encrypted when the user’s password or passphrase changes
  • The encryption key never changes, so it could theoretically be written down in, say, hexadecimal notation and locked in a physically secure location. The data could thus still be decrypted even if the user forgot his or her password. Since the key never changes, it would only have to be written down once.
  • More than one user could, in theory, be granted access to the data. Each would encrypt a copy of the primary key with their own passphrase.

Not all of these are necessarily critical or relevant to the Tutorial Password Manager, but it’s a good security practice nonetheless.

Here the primary key is called the vault key, and the secondary key that is derived from the user’s passphrase is called the master key. The user authenticates by entering their passphrase, and the password manager derives a master key from it. If the master key successfully decrypts the vault key, the user is authenticated and the vault can be decrypted. If the passphrase is incorrect, the decryption of the vault key fails and that prevents the vault from being decrypted.

The final requirement, building the KDF around SHA-256, comes from the constraint that we find a hashing algorithm common to both the Windows CNG API and the Intel SGX trusted crypto library.
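Put together, the unlock flow looks roughly like the sketch below. The helper functions derive_master_key and aes_gcm_decrypt are hypothetical placeholders standing in for the SHA-256-based KDF and the AES-GCM routines discussed above; they are not real API names.

// Hypothetical unlock flow (illustrative sketch, not the tutorial's code).
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical helpers standing in for the SHA-256-based KDF and AES-GCM routines.
Bytes derive_master_key(const std::string &passphrase);
bool  aes_gcm_decrypt(const Bytes &key, const Bytes &ciphertext, Bytes &plaintext);

// Unlock flow: passphrase -> master key -> vault key -> decrypted vault.
bool unlock_vault(const std::string &passphrase,
                  const Bytes &wrapped_vault_key,   // vault key encrypted with the master key
                  const Bytes &encrypted_vault,     // the vault file contents
                  Bytes &decrypted_vault)
{
	// 1. Derive the master key from the user's passphrase.
	Bytes master_key = derive_master_key(passphrase);

	// 2. Try to unwrap the vault key. A wrong passphrase fails the GCM tag check.
	Bytes vault_key;
	if (!aes_gcm_decrypt(master_key, wrapped_vault_key, vault_key))
		return false;   // authentication failed: incorrect passphrase

	// 3. Decrypt the vault itself with the (unchanging) vault key.
	return aes_gcm_decrypt(vault_key, encrypted_vault, decrypted_vault);
}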

Account Details

The last of the high-level requirements is what actually gets stored in the vault. For this tutorial, we are going to keep things simple. Figure 2 shows an early mockup of the main UI screen.

Figure 2: Early mockup of the Tutorial Password Manager main screen.

The last requirement is all about simplifying the code. By fixing the number of accounts stored in the vault, we can more easily put an upper bound on how large the vault can be. This will be important when we start designing our enclave. Real-world password managers do not, of course, have this luxury, but it is one that can be afforded for the purposes of this tutorial.

Coming Up Next

In part 3 of the tutorial we’ll take a closer look at designing our Tutorial Password Manager for Intel SGX. We’ll identify our secrets, which portions of the application should be contained inside the enclave, how the enclave will interact with the core application, and how the enclave impacts the object model. Stay tuned!

Read the first tutorial in the series, Intel® Software Guard Extensions Tutorial Series: Part 1, Intel® SGX Foundation or find the list of all the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

 

Installing Visual Studio 2015 for Use with Intel Compilers


The Intel® C++ and Fortran compilers for Windows* require that Microsoft Visual C++* support be present. For Fortran, the C++ libraries and tools are required to build applications. Unlike prior versions, Microsoft Visual Studio 2015 does not install C++ support by default, so if you want to use the Intel compilers with Visual Studio 2015 you must customize the install and enable C++ support.

When you run the Visual Studio 2015 installer, you will see a screen that looks like this. (Community Edition shown here, but other editions are similar):

Select Custom and click Next. You will then be shown a screen where you can choose components to install (Community Edition shown here; other editions will have additional components):

Check the box for Visual C++. For Intel® C++, leave all the subsidiary boxes checked. For Fortran you need only "Common Tools", though if you will be building applications to run on Windows XP* you should check the box for that. Click Next and proceed with the install.

If you have already installed Visual Studio 2015 and did not select C++ support, rerun the Visual Studio installer and change the options.

If you need further assistance, please ask in our User Forums or use Intel® Premier Support.

Visual Studio Debugger: Cannot find or open the PDB file


When starting a program under the Microsoft Visual Studio* debugger, you will generally see in the Output pane a series of messages similar to the following:

'Project01.exe' (Win32): Loaded '<C:\Users\yourname\Documents\Visual Studio 2013\Projects\Project01\Debug\Project01.exe>'. Symbols loaded.
'Project01.exe' (Win32): Loaded 'C:\Windows\SysWOW64\ntdll.dll'. Cannot find or open the PDB file.
'Project01.exe' (Win32): Loaded 'C:\Windows\SysWOW64\kernel32.dll'. Cannot find or open the PDB file.
'Project01.exe' (Win32): Loaded 'C:\Windows\SysWOW64\KernelBase.dll'. Cannot find or open the PDB file.
'Project01.exe' (Win32): Loaded 'C:\Windows\SysWOW64\msvcp120d.dll'. Cannot find or open the PDB file.
'Project01.exe' (Win32): Loaded 'C:\Windows\SysWOW64\msvcr120d.dll'. Cannot find or open the PDB file.

These are not error messages - they are informational messages from the Visual Studio debugger listing the various executables and Dynamic Link Libraries (DLLs) that have been loaded, and whether or not the PDB (Program DataBase) file containing debug symbol information was found. Normally, the PDB file for your application will be found, as it was here, but PDBs for run-time and system DLLs will not. Unless you plan to debug the system DLLs you don't need their PDB files and should ignore these messages.

You may also see a message similar to:

The program '[11536] Project01.exe' has exited with code 0 (0x0).

This indicates that the program ran to completion and exited - here with a success status of zero. If you see this message immediately following the "Loaded" messages above, without having the program break into the debugger, you will want to set a breakpoint at the first executable line in the program and then run it again. For C++ applications this is usually not necessary, but it is required for Fortran applications.

If you need further assistance, please ask in our User Forums or use Intel® Premier Support.

Top Reasons Why You Should Invest in Mobile App Development


Daniel Kaufman
Co-Founder at Brooklyn Labs

With mobile apps made for mobile operating systems from Android, Apple, and others, you can build brand awareness and trust among a vast number of current and potential customers. Many customers now expect a business or brand to have its own reliable mobile app, so an app is no longer just a way to get a reasonable edge over other businesses. Having a dedicated mobile app adds to the credibility of the brand.

Bearing in mind the significance that mobile applications hold in society today, it is only sensible to make one for your business. Here are some reasons why you should invest in mobile app development.

1. Mobile Apps Deliver On-The-Go Advertising

With mobile apps, your current customers can reach your business from any place and at any time in a customer-friendly environment. Regular use of your app will reinforce your brand or business. This means that when customers want to buy something, chances are they will come to you. You have formed a relationship with them through the app, which is the equivalent of placing your business in your users’ pockets.

2. The World has gone Mobile

There is no question that the world has gone mobile and there is no turning back. Customers are using their phones to find local businesses, and your online branding efforts are increasingly viewed on mobile devices. Therefore, just having a site is not sufficient anymore. Consumers are turning away from desktop browsers and depend on mobile apps. Unlike traditional websites, which can overwhelm a six-inch mobile display, apps succeed as an intuitive purchasing and browsing alternative.

3. Apps Increase Interest

When you develop an app, it gives you a simple way to show your products or services to your present and future customers. Every time they need to buy something, they can simply use it as a one-stop point to obtain all the information they want.

4. A Larger, Younger Audience

Most young people went mobile a long time ago. Nearly 75 percent of the millennial age group will have smartphones by the end of the year. It is tough to engage the youth age group using old-fashioned techniques. Young people choose to rely on their mobile devices even when they have access to a traditional personal computer. Smartphones have become their primary tool for talking with family and friends and for browsing and buying goods and services online. To reach this audience, you want to have a mobile app.

5. It Can Be a Social Platform

It nearly goes without saying that people are passionate about social media, so you will want to be part of that as well. Adding social features such as likes, comments, and in-app messaging to your app can help your business increase its social standing. People spend a lot of time on social media, particularly Facebook and Twitter. Thus, having a mobile app that gives them the features they expect from social media means that they’ll spend more and more time in your mobile app.

Debug usages of Intel System Studio for Microcontroller – print debug message without using UART cable


This video tutorial shows you how to output your printf debug messages to the console window of Intel System Studio for Microcontrollers without using a UART cable. The UART cable is also known as a serial, RS-232, or COM port cable. This dynamic printf debugger feature lets you start debugging your code right away once you receive an Intel MCU developer kit, such as the D1000 or D2000 development board. We will use the built-in accel example program to demonstrate this feature.

The video tutorial link is listed below.


 


When to Use the Intel® Edison Board


The Internet is a powerful platform for many technological advancements. Now, you can connect embedded systems over the Internet to form large networks for monitoring and controlling operations from a centralized location. A lot of open source hardware has been designed to allow embedded system programmers to harness the power of the Internet of Things (IoT). The Intel® Edison Module is a good example of such hardware (Figure 1).


Figure 1. The Intel® Edison board is designed for wireless projects.

The Intel® Edison board is based on the Intel® Atom™ Z34XX chip. It’s designed to be the perfect hardware for your wireless projects, whether you’re using Bluetooth* or Wi-Fi. The Intel® Edison board has more processing power than most other open source boards, but is comparatively smaller in size. You can accomplish a lot with this board if you know when and where to use it. So, let’s look at a few instances where it’s ideal to use the Intel® Edison board.

Running Heavy Server-Side Applications

You can use any development board that has Wi-Fi capabilities for small-scale IoT projects because such projects don’t require heavy server-side applications, but if you want to build an elaborate system, such as a wireless image-recognition platform, or monitor operations at a huge facility, you need a powerful IoT board such as the Intel® Edison board. It can run server-side applications that require a lot of computing power. Even when devices can communicate with powerful servers in data centers, some tasks, like facial recognition, are best done locally on the board.

By default, the Intel® Edison board uses the Yocto Project* Linux* operating system, and can support powerful server-side applications, because it uses frameworks such as Node.js*. Node.js is a powerful runtime environment for developing server-side web applications. With Node.js and Intel® XDK, you can create applications to control and monitor the Intel® Edison board using a web interface. This ability to run as a lightweight client (to send data to a remote web server, for example) but also to function as a server that can run applications is the strength of the Intel® Edison board.

In addition to running Node.js, the Intel® Edison board has a powerful dual-core Intel® Atom™ Z34XX processor with a high clock speed. This gives it enough power to run complex applications efficiently and eliminates the issues that arise from timeout errors, such as loss of valuable data. The board also has ample space for data storage. The 4 GB of internal storage space is more than enough for web apps and other related files. Moreover, the board’s 1 GB of RAM is adequate for handling runtime data.

Space Constraints

The Intel® Edison board is small in size—approximately 35 mm by 25 mm by 3.9 mm—so you can easily fit it in your project without making huge changes to the design of the enclosure. The small form factor makes the Edison board ideal for portable projects and wearables.

Interacting With the Intel Family of Boards

Interacting with other Intel boards, like the Intel® Galileo board, is easy when you use the Intel® Edison board. No external hardware is needed because all the Intel boards operate with the same logic voltage of 1.8 V. The same does not apply when you’re using the Intel® Edison board with other boards and Arduino* shields, however, because most of them operate at a logic voltage of 3.3 V or 5 V. To use the Intel® Edison board with other boards, such as Arduino boards or the Raspberry Pi* board, you might have to use an interface board that has level shifters.

Wireless Communication

The Intel® Edison board is one of the few boards available that have both integrated Wi-Fi and Bluetooth low energy features. It has an on-board Broadcom* BCM43340 BT/Wi-Fi chip that allows the board to communicate over Wi-Fi and Bluetooth 4.0. Therefore, it’s the best board to use in applications that require wireless monitoring and control, such as wireless logging of sensor data.

Using Arduino* Shields

If you have a project on which you need to use an Arduino shield, don’t shy away from the Edison board. The Intel® Edison Kit for Arduino acts as an expansion board for the Intel® Edison board (Figure 2). The kit has a pin configuration compatible with the Arduino UNO R3 pin configuration and is fitted with level shifting circuits that convert the logic voltage of the Intel® Edison board to that of the shield mounted on the kit. You can mount any UNO R3 shield on the kit and control it directly from the board: The kit creates a direct connection between the board and the shield, increasing the scope of projects that you can create with the Edison board.


Figure 2. The Intel® Edison Kit for Arduino* makes the Edison board compatible with Arduino shields.

Programming Versatility

You can program the Intel® Edison board, like other Intel boards, using different development tools. Common tools include Intel® XDK, the Arduino integrated development environment, and Wind River Rocket*. This range is an important feature: Every tool has its own strengths, so you can choose the right tool to maximize the potential of the Intel® Edison board in your projects.

In addition, you can choose from a wide range of programming languages for the Intel® Edison board, including C, C++, Python*, JavaScript*, HTML5, and Perl. Which programming language you use depends on the development tool you select. Some tools allow programming in different languages, so you don’t necessarily have to learn a new language to program the board.

Conclusion

The success of your product or project will depend heavily on your choice of development hardware and software. Always make sure that you understand all the requirements and features of your project. This way, you will be able to make an informed choice. For instance, if your application involves a heavy server-side application and requires wireless communication, then the Intel® Edison board is what you need.


Intel® Quark™ Microcontroller D2000 - Accelerometer Tutorial


Intel® System Studio for Microcontrollers includes multiple sample applications to help you get up to speed with basic functionality and become familiar with the Intel® Quark™ Microcontroller Software Interface (QMSI).  This example reads and outputs accelerometer data to the console. The example (optionally) uses the Intel® Integrated Performance Primitives (Intel IPP) to compute root mean square, variance, and mean for the last 15 Z axis readings.

Author: Richard, A.

Hardware Required

  • Intel® Quark™ Microcontroller D2000 developer board and USB cable

Instructions

  1. Create a project with the "Accelerometer" sample project file
    1. From the File menu, select New and then select Intel® Project
    2. Follow the setup wizard to create your project:
      1. Developer board: Select the D2000 developer board
      2. Project type: Intel® QMSI (1.1)
      3. Selection Connection: USB-Onboard
      4. Project example: Accelerometer
    3. Click Finish.
  2. Select the "Accelerometer" project in the Project Explorer window
  3. Click the Build button to compile your project
  4. Click the Debug   drop down and select "Accelerometer (flashing)"
    • Note: this option will prompt you to switch to the debug perspective and will write the application to the board.  It also automatically places a breakpoint in the first line of your main function
  5. Setting up output
    1. In the source window (select main.c) locate the “static void print_accel_callback (void *data)” function and copy the contents of the print function QM_PRINTF as seen below: 
      “x %d y %d z %d\n”, accel.x, accel.y, accel.z
    2. Right click on the left hand side of window and select “Add Dynamic Printf…”
    3. Replace the printf( content with the contents you copied in 5a:
    • “x %d y %d z %d\n”, accel.x, accel.y, accel.z
    • Note: This is the data that is computed by the accelerometer sensor and will be displayed in the Console once completing these steps
  6. Click on the Resume [icon] button to continue running the Accelerometer application
  7. View X, Y and Z values from the accelerometer in the Console window

Note: Move around your board to see the x, y, z data change accordingly
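For reference, the statistics the sample (optionally) reports for the last 15 Z-axis readings can be computed with a few lines of plain C++. The standalone sketch below only illustrates the math; it is not the QMSI or Intel IPP code used by the sample, and the sample values are made up.

// Standalone illustration of mean, variance, and RMS over 15 readings.
#include <cmath>
#include <cstdio>

int main()
{
	const int N = 15;
	// Hypothetical Z-axis samples; on the board these come from the accelerometer.
	double z[N] = { 980, 985, 990, 1002, 998, 995, 1001, 990,
	                987, 993, 999, 1005, 996, 988, 992 };

	double sum = 0.0, sum_sq = 0.0;
	for (int i = 0; i < N; ++i) {
		sum    += z[i];
		sum_sq += z[i] * z[i];
	}

	double mean     = sum / N;
	double rms      = std::sqrt(sum_sq / N);
	double variance = sum_sq / N - mean * mean;   // population variance

	std::printf("mean %f  variance %f  rms %f\n", mean, variance, rms);
	return 0;
}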

Combining Language and Ecosystem for Desired Results


The Internet of Things (IoT) ecosystem is made up of a variety of elements that perform different tasks in the collection, aggregation, and analysis of data. In a simplified ecosystem, these tasks are performed at three stages: at the edge devices, at the gateway, and in the cloud (see Figure 1). Each element differs in the resources available to it and the constraints placed on it; for this reason, the development approach and assets can differ, as well.


Figure 1. A simplified Internet of Things ecosystem

This article offers an example IoT ecosystem. It shows the constraints and challenges for each element and how the elements work together to meet the desired requirements.

A Sample Use Case

Imagine the following scenario: Monitoring patients in a residential care setting for heart abnormalities requires that those patients be connected to a variety of equipment, effectively rendering them immobile. The IoT changes this requirement through the use of wearable technology and real-time communication of data for immediate discovery. Wearable heart monitors (shown in Figure 2) communicate their data to the cloud through on-premises IoT gateways (distributed through the residential care facility).


Figure 2. Wearable heart monitors communicating data through Internet of Things gateways to the cloud

Let’s explore this scenario along with the individual elements and their development approaches.

Edge Device

The edge device in this scenario is a wristband that measures heart rate. At the edge, typically less processing capability is available, so the patient’s heart signal is captured and communicated to a local IoT gateway through the wristband’s Bluetooth* interface. This wearable device requires onboard processing of sensor data and minimal analysis of the signal, and then communication through a wireless protocol to the gateway.

An ideal device here is the Intel® Edison board (see Figure 3), which is capable of complex processing (it has a dual-core Intel® Atom™ processor) but in a small package and with minimal power requirements. The board is capable of processing the signal data and performing an initial assessment of the data and its meaning to the patient. Communication between the edge device and the gateway can use traditional TCP/IP over Bluetooth, a standard networking protocol used worldwide. With TCP/IP, a simple sockets-based interface permits communication of data through streams.


Figure 3. The Intel® Edison module

You can write firmware for your Intel® Edison board in a variety of languages, but the most common is the lingua franca of embedded development: C. Because the Intel® Edison board is capable of running Linux*, C is a simple choice and a staple language for this platform.
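To make the stream-based communication concrete, the sketch below shows a minimal POSIX sockets client in C++ that opens a TCP connection and sends one reading toward a gateway. The address, port, and payload are placeholders; this is only an illustration of the sockets interface mentioned above, not the firmware described in this scenario.

// Minimal TCP client sketch (placeholder address, port, and payload).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
	// Create a TCP stream socket.
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) { perror("socket"); return 1; }

	// Gateway address and port are placeholders for this sketch.
	sockaddr_in gw{};
	gw.sin_family = AF_INET;
	gw.sin_port   = htons(5000);
	inet_pton(AF_INET, "192.168.1.10", &gw.sin_addr);

	if (connect(fd, reinterpret_cast<sockaddr*>(&gw), sizeof(gw)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}

	// Stream one heart-rate sample; real firmware would loop and batch readings.
	const char *sample = "{\"patient\":42,\"bpm\":71}\n";
	send(fd, sample, std::strlen(sample), 0);

	close(fd);
	return 0;
}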

The Gateway

The IoT gateway is the aggregation point for collecting data from nearby edge devices—in this case, the wearable heart monitors. When the edge device has sufficient data to communicate or an event that requires immediate attention occurs, the device opens a stream socket over Bluetooth to the IoT gateway. When the connection is established, the data is streamed to the gateway (as a function of the size of data to communicate, battery level, and so on). The data could also be streamed in parts as the patient moves in and out of the IoT gateway’s range. If the data indicate an emergency, the gateway’s Wi-Fi interface communicates immediately to an attendant in the facility. The IoT gateway can also identify the patient’s location.

Nonemergency data is compressed and communicated to the cloud using Message Queuing Telemetry Transport (MQTT). This protocol was designed with the IoT in mind and is ideal for communicating data (organized into topics within MQTT) using a subscribe–publish model.
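As one concrete possibility (the article does not prescribe a specific client library), publishing a topic from the gateway could look like the sketch below, which uses the Eclipse Paho MQTT C client. The broker address, client ID, topic, and payload are placeholders.

// Hypothetical publish sketch using the Eclipse Paho MQTT C client.
#include <cstring>
#include "MQTTClient.h"

int main()
{
	MQTTClient client;
	MQTTClient_create(&client, "tcp://cloud.example.com:1883", "gateway-01",
	                  MQTTCLIENT_PERSISTENCE_NONE, NULL);

	MQTTClient_connectOptions opts = MQTTClient_connectOptions_initializer;
	if (MQTTClient_connect(client, &opts) != MQTTCLIENT_SUCCESS)
		return 1;

	// Publish one patient reading under a per-patient topic (subscribe-publish model).
	const char *payload = "{\"bpm\":71}";
	MQTTClient_message msg = MQTTClient_message_initializer;
	msg.payload    = const_cast<char*>(payload);
	msg.payloadlen = static_cast<int>(std::strlen(payload));
	msg.qos        = 1;
	msg.retained   = 0;

	MQTTClient_deliveryToken token;
	MQTTClient_publishMessage(client, "patients/42/heart", &msg, &token);
	MQTTClient_waitForCompletion(client, token, 1000L);

	MQTTClient_disconnect(client, 1000);
	MQTTClient_destroy(&client);
	return 0;
}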

Intel® IoT Gateway technology supports all these requirements (see Figure 4). In addition, Intel® IoT Gateways provide management protocols and security capabilities to ensure that communication is secure and the device is manageable. The gateway is also capable of supporting several operating systems, from Wind River* Linux to Windows® 10 IoT Core and Snappy Ubuntu* Core (Linux).


Figure 4. An Intel® IoT Gateway

The gateway software is most commonly developed in C or C++ and relies on interfaces such as the sockets library (for TCP/IP streams) and numerical libraries for signal analysis and compression. You could also use interpreted languages such as Python* in this scenario, but depending on the number of edge devices you have to manage, C/C++ would provide better performance.

The Cloud

In the cloud, patient data is received through MQTT publish messages (based on a previous subscribe message for each patient from the cloud application). This data is then ingested in the Apache Hadoop* Distributed File System (HDFS), which is a specialized file system for large datasets. Once here, the data can easily be processed in one of two ways: batch or real time.

Similar to the early days of computing, early big data processing systems were batch oriented. You’d create an application to process your data, and then unleash it on a system to have your results delivered later. Batch-oriented big data processing is similar, but like computing, big data has grown up. In addition to batch, there’s also stream-oriented big data processing, which supports real-time processing of data as it arrives.

But batch and stream big data processing aren’t distinct: They can work together on a cluster. As Figure 5 shows, HDFS supports Yet Another Resource Negotiator (YARN), a resource manager that allows multiple big data frameworks to operate on the same cluster and data. Above YARN are two separate frameworks: the batch side, supported by traditional MapReduce functionality, and the real-time side, supported by Apache Spark*. Clusters that support these use models commonly use powerful CPUs such as the Intel® Xeon® processor family.


Figure 5. Batch and real-time processing with Apache Hadoop* and Apache Spark*

For immediate processing of patient data, you rely on the Spark side of the cluster. Spark enables you to build applications that process data as it arrives as a stream. In this way, patient data can immediately be analyzed for irregularities. You can write Spark applications in the Scala, Java*, or Python language.

The patient data can also be batch-processed to look for patterns in the data that multiple patients share. Hadoop supports processing of data using the MapReduce paradigm but simplifies it by allowing higher-level scripts to generate the MapReduce applications, such as Apache Hive* and Apache Pig*. Hive and Pig allow you to generate pipelines of queries over large datasets. For machine learning applications, you can also apply Apache Mahout*, which is a set of algorithm libraries for Hadoop. Mahout includes collaborative filtering, clustering algorithms, and classification algorithms.

Using Mahout, you can analyze your patient data to first group your patients who have similar characteristics into clusters, and then search for patterns within these clusters to better understand them using collaborative filtering.

Summary

The benefits of this type of application go beyond protecting individual lives. Using the patient data collected from a large population permits predictive analysis in the cloud to optimize the search for signals that could precede an event. The data, coupled with information about the user, could also help tune the analysis of patient data as a function of the user’s race, age, and other factors. The IoT has the potential to make contributions at the individual level and, with cloud-based analytics, the overall population. Intel processors and gateways help simplify the development of IoT ecosystems.


The Basics of Inputs and Outputs, Part 2: Understanding Protocols


The Arduino 101*/Genuino 101* board is the first development board to be powered by Intel® Curie™ technology. It has the same peripheral list and controllers as the Arduino* UNO board. Sharing the same physical shape and pin layout as the UNO board, the Arduino 101 board accepts many of the shields that fit its predecessor; you can also interchange cases and mounts between the two boards.

The Arduino board has a dual-core x86 32-bit Intel® Quark™ processor, which allows for multithreaded applications and enables the use of powerful peripherals that the UNO board did not support. The board’s operating voltage is 3.3 V input/output (I/O), with 5 V tolerance, meaning that you can also connect 5 V components. It has 20 general-purpose I/O (GPIO) pins: 14 digital I/O pins and 6 pins for analog-to-digital conversion (ADC). It is equipped with four pulse width modulation (PWM) pins and three-channel logic converters connected to the GPIO pins. The Arduino 101 board has an in-circuit serial programming header with serial peripheral interface (SPI) signals that you can use to access microcontrollers, as well as dedicated I2C pins. With a built-in six-axis accelerometer and gyroscope and Bluetooth* low energy (Bluetooth* LE), you can easily create Internet of Things (IoT) apps that allow you to use your smartphone to control your board (Figure 1).

Figure 1. Arduino 101* board with Intel® Curie™ technology

ADC and PWM

So, what does all this mean? To understand ADC and PWM, you first need to take a closer look at analog and digital signals. For components to communicate, they must send signals—think of them as messages—that reflect a change of state in a given time. In electronics, the quantity being measured is voltage. How the change takes place is what differentiates these two types of signals. You can represent signals (change in voltage over time) in graphs: Analog signals are shown as smooth, continuous waves given that the transition in voltage is gentle, while digital signals are represented in square waves where there is an “on” (5 V) or “off” (0 V) state only (Figure 2).

Figure 2. Digital (squared) and analog (smooth) signals

Most communication between integrated circuits is digital, and ADC allows microcontrollers to read signals from analog devices. Conversely, digital-to-analog conversion allows the microcontroller to generate an analog signal and communicate with the analog interface. PWM comes into play when you want to control the time variable in a digital signal, allowing for a longer on or off state. The amount of time the signal is on (5 V) is known as duty cycle.

Figure 3 provides a graph representation of various duty cycles. You can alter the time unit to keep the on state for a longer or shorter period. The bigger the duty cycle percentage, the longer the LED stays lit, reducing the time the LED is off (0 V). PWM can be demonstrated using an LED; a common example is the fade effect, where, by switching between the on and off states very fast, you can mimic analog behavior.

Remember that in analog waves, the transition between on and off is smoother; therefore, to the observer, the LED will seem to fade. In a digital setup, the change is more abrupt, and you will only see the LED rapidly switching between on and off. PWM allows the LED to trick your eye by making it seem as if it’s fading even when using digital signals. On an RGB LED (composed of one red, one green, and one blue LED), each color can be represented at a different percentage, which affects the light’s overall color. If you have green and blue at a 50 percent duty cycle (equal quantities) while red is at 0 percent, the light generated will be teal.

Figure 3. LEDs whose pulse width modulation is set at 50 percent, 25 percent, and 100 percent duty cycles, respectively
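The fade effect described above takes only a few lines in an Arduino sketch. The pin number below is an assumption and should be adjusted to match a PWM-capable pin in your wiring.

// PWM fade sketch: sweep the duty cycle up and down on a PWM-capable pin.
const int ledPin = 6;   // assumed PWM pin; adjust for your wiring

void setup() {
	pinMode(ledPin, OUTPUT);
}

void loop() {
	// Ramp the duty cycle from 0 percent to 100 percent (0-255) and back down.
	for (int duty = 0; duty <= 255; duty++) {
		analogWrite(ledPin, duty);
		delay(5);
	}
	for (int duty = 255; duty >= 0; duty--) {
		analogWrite(ledPin, duty);
		delay(5);
	}
}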

Serial Communication: Interintegrated Circuit, SPI, and Universal Asynchronous Receiver/Transmitter

The Arduino 101 board uses serial communication between its microcontrollers and peripherals. In serial systems, you can send only 1 bit at a time, while in parallel communications, you send multiple bits. You can think of serial as a single-lane road and parallel as a multilane highway. SPI and interintegrated circuit (I2C) are serial buses that communicate through digital signals. Think of a bus as a road for data transportation: Its purpose is to allow communication between multiple master–slave chips, where one device acts as the controller (master) while others are being controlled (slaves; see Figure 4). The board uses I2C for communication with slower, on-board peripherals that it accesses occasionally. It only requires 2 wires to have up to 127 devices, where each receives a unique identifier that the master uses to recognize its slaves. I2C can read and write data from and to many sensors. Synchronous systems require a clock to ensure that the data is written and read at the correct tempo. The master defines which device is performing which action (read/write). The serial clock (SCL) wire provides the clock signal that the master generates, which synchronizes the data transfer between the devices on the I2C bus. The serial data (SDA) is the second wire, and it carries the data flowing through the bus. Because of its simplicity and low bandwidth requirements, it’s a great option for infrequently accessed devices, but it’s slower than SPI.

Figure 4. Interintegrated circuit master–slave bus with serial data and serial clock lines.
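In an Arduino sketch, this master role is handled by the Wire library. The snippet below shows the basic read pattern; the slave address and register are placeholders for whatever sensor you attach.

// Minimal I2C master read using the Wire library (placeholder address and register).
#include <Wire.h>

const uint8_t SLAVE_ADDR = 0x48;  // assumed 7-bit slave address
const uint8_t REG_ADDR   = 0x00;  // assumed register to read

void setup() {
	Wire.begin();          // join the I2C bus as the master
	Serial.begin(9600);
}

void loop() {
	// Tell the slave which register we want.
	Wire.beginTransmission(SLAVE_ADDR);
	Wire.write(REG_ADDR);
	Wire.endTransmission();

	// Clock one byte back from the slave over the SDA line.
	Wire.requestFrom(SLAVE_ADDR, (uint8_t)1);
	if (Wire.available()) {
		uint8_t value = Wire.read();
		Serial.println(value);
	}
	delay(500);
}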

SPI allows one master and one or multiple slaves using four signal types. Think of them as four channels (lines) of communications between the masters and slaves:

  • Master-out-slave-in (MOSI). Carries data from master to slave (data line)
  • Master-in-slave-out (MISO). Carries data from the slave to the master (data line)
  • Serial-clock (SCLK). Clock signal that synchronizes the system
  • Slave-select line (SS_). Used to select and enable the slave on the bus; notifies the slave that it should wake up and send or receive data

You can set up SPI in two ways. If all slaves share the same slave-select line, SPI requires a minimum of four wires (MOSI, MISO, SCLK, SS_). Another option is to give each slave its own SS_ line to connect to the master; here, you need an additional wire per slave. Figure 5 summarizes these two setups.

Figure 5. A serial peripheral interface with one shared slave-select line (SS_) versus each slave with its own SS_ line

When contrasting I2C and SPI, you should also understand the concept of full duplex and half-duplex. Full-duplex communication is where two parties (slave and master in this context) can send data at the same time. In half-duplex communication, one party has to wait until the other is done sending data. Taking a closer look, given that SPI is full duplex, it has an additional data wire compared to half-duplex I2C, where there is only one data line and it can’t send and receive data at the same time (Figure 6).

Figure 6. Serial peripheral interface versus interintegrated circuit communication

In asynchronous systems, there is no clock to dictate when the information is sent. Instead, the sender and receiver must agree on a speed (bit rate) at which the information will be communicated. If the sender is delivering at a faster rate, the receiver won’t be able to understand the data being sent because it will be processing it at a lower tempo. Therefore, it’s essential that you establish a bit rate to effectively use asynchronous communication. The Universal Asynchronous Receiver/Transmitter (UART) is an asynchronous unit now commonly included in microcontrollers, and it acts as an intermediary between parallel and serial interfaces. It’s intended to have two connected devices, where the transmitter’s output, labeled Tx, is connected through a unidirectional data line to the receiver (Rx input; Figure 7). In this fashion, one UART takes bytes of data (1 byte = 8 bits) and transmits each individual bit through the Tx line to the receiving UART (Rx line). On the receiver end, the UART reassembles the bits, generating the byte that was originally sent.

Figure 7. Universal Asynchronous Receiver/Transmitter send and receive bus
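
The sketch below shows the same idea over a UART, assuming a peer device wired to the board's hardware serial port (Serial1 on the Arduino 101, pins 0 and 1); the 9,600 bit rate is an example of the speed both ends must agree on.

void setup() {
  Serial.begin(9600);     // USB serial link back to the PC
  Serial1.begin(9600);    // hardware UART; both UARTs must use the same bit rate
}

void loop() {
  // Transmit: the UART serializes the byte, bit by bit, onto the Tx line.
  Serial1.write('A');

  // Receive: the UART reassembles incoming bits from the Rx line into bytes.
  while (Serial1.available() > 0) {
    Serial.write(Serial1.read());
  }
  delay(1000);
}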

Summary

The Arduino 101 board is equipped with a broad array of pins, chips, and other devices that communicate by using synchronous and asynchronous methods. Complex, time-sensitive solutions have been designed to ensure that data is sent, received, and processed correctly. This article explored how PWM, ADC, I2C, SPI, and UART work together to enable communication between microcontrollers and peripherals, helping your IoT projects come to life.


Do You Need a Website for Your App?


When people start a new business or product, one of the first things they usually do is put up a website. But does your app need one? Your app will primarily be downloaded from an app store and interacted with on devices, and websites cost money to build and maintain. On the other hand, a website allows you to communicate directly with your customers, providing more in-depth information in a space that's completely under your control.

Read on as you consider the pros and cons of an app website, and how to include it in your roll-out plan.

Get More Visual

One of the benefits of building a website is that it gives you a place to bring your app to life in a very visual way. On an app store, space is limited and graphics need to adhere to certain specs, but you're free from those limitations on your own website. With high-quality graphics and professional web design, you can really bring the experience of your app to life in a different and engaging way. That includes both content to demonstrate the gameplay or functionality, such as screenshots and video, as well as the look and feel of the site itself, which should tie back to the app's overall branding.

Go Deep!

On your own website, you have the time and space to explain all your app's features and benefits in detail. Some customers will be happy to download your app based on the basic app description, but others need more information, and want the ability to dig much deeper. The best way to engage with those consumers is to provide rich content and extended info, such as:

  • Detailed app features
  • Detailed app benefits
  • How it compares to other apps
  • How it compares to other (non-app) solutions
  • Company bios and profile

Engage Directly with Consumers

Many of the things that happen on the app store itself remain a mystery to app creators. You can't access customer contact information or engagement analytics, so you can't use that data to improve the app experience or increase engagement with the app. Additionally, although public reviews are visible and can be acted on, they aren't as valuable as having a direct conversation. The ability to communicate directly with customers is one of the best reasons for building a website.

Here are some ways to connect with customers on your website:

Sign-up for future contact:

Whether your app is already in the market, or you're just beginning your pre-launch efforts, your website is the best place to build your audience. Ask users to sign up to receive information about your app in the future, and when possible, provide them with incentives to do so.

Feedback form:

You should always engage with reviews on the app store, of course, but it's also a really good idea to have a feedback form directly on your own site. Not only does it signal to your customers that you're listening to them, it also gives you an opportunity to address any issues early on, and build an ongoing relationship with your customers.

Modify messages over time:

Because you own your site, you're able to update the content and messaging at will, making it really easy to quickly respond to consumers and to demonstrate your responsiveness.

Modify messages for particular audiences:

If an app has several distinct audiences—for example, a photo-sharing tool that appeals to both grandparents and youth sports coaches—it might make sense to set up unique websites, or sections of your website, that address them directly. Because it's your own space, you're able to position the message in a unique way, and maximize those individual contacts.

Access to your own analytics:

This one is really key. When it comes to activity on the app store, you don't have any way of knowing about overall traffic, or how many potential customers are watching videos, but then dropping off instead of downloading. You can set up custom, in-depth analytics on your own website in order to understand how people are interacting with your content—and how to improve it.

Add new content, products, and news:

Your website is also, of course, a great way to share all kinds of new content directly with your audience. Post important updates and news about the app, and share new use cases. Your site can also be your launching point for new apps or products, especially ones that relate to or extend the offering of your current app.

What Does Your Website Need? Key Elements:

1. App Icon/Name
This should be clear and visible throughout the site, and it should be identical to your icon and name in the app store.

2. Pitch/Tagline
One line that provides the key selling point. Why should they download your app? What problem does it solve in their life?

3. Call to Action
Don't forget to be clear about what you'd like visitors to do! Provide a link to app stores/download/sign-up.

4. Photo of the App in a Device
Visually, this makes it clear right away that this product is an app. It also quickly lets people know which platforms it's available on, based on the type of device displayed.

5. Testimonials
Real customers who can speak to the benefits of downloading or purchasing your app. Ratings can also be a powerful way of demonstrating buy-in by like-minded peers.

6. Social Mentions
Tap into social media mentions and display those on your site to let customers do the talking directly for you. This is most successful when it's tied to a social media campaign which encourages interaction.

7. Blogs or Reviews
The more professional write-ups, the better. This is another reason to make sure that your app is sent early to reviewers.

8. Contact Info
If they aren't ready to buy quite yet, or have some urgent questions, make sure there's a clear way for them to reach out.

9. Social Media Links
Your website exists as part of a portfolio of digital pieces, and it's important to cross-promote them. This allows customers to interact with you in the places they're most comfortable.

10. Blog
Keep in mind that blogs require ongoing content in order to avoid looking stale, so only include a blog if you can continuously provide content for it.

When a Website Doesn't Make Sense

Although there are a lot of good reasons to build a website for your app, it doesn't always make sense, and it definitely isn't a requirement. What are some reasons you might choose not to build? If you have a really limited budget, that money might be better spent on product development. A website that's not up to par can have a negative impact on your brand, so it may make sense to wait until you can do it right. If your audience doesn't spend much time online, or is only engaged in specific social media sites, then you might not want or need a site—remember to customize your plan based on your own unique audience and where they're most likely to be.

What About a Landing Page Instead?

If you only have one app, and the functionality of your app is simple, you might only need a landing page. A landing page is really just a one-page overview with branding, a link to download, and contact info. When you're just getting started, this is a great way to get something out there. You can have more in-depth information elsewhere, and provide that only to the investors or customers who need more info.

Timing! When Should You Launch?

When is the right time to build your website? Like most things related to the business and marketing of your app, earlier is better. If you're able to build your website while your product is still under development, then you have the added benefit of being able to ask for early sign-ups from people who are interested, and even incorporate some of their requests and feedback into the app itself. Especially before the product launches, it can be a good idea to offer an incentive or a premium in exchange for providing their info and being an early adopter. How can you make sure they feel special and rewarded for their early faith in your product? This can come in many forms, such as an elevated status in the game, or additional premium services or content.

How Should You Build It?

But wait—once you've decided to build a website for your app, how should you go about doing it? That depends on your skill level and your bandwidth.

1.Do it yourself.
If you have the expertise and bandwidth to build your own website, go for it! This gives you firsthand control over all the code and content, and the ability to make it as custom and unique as you want it to be. 

2.Hire an agency.
On the other hand, if this isn't something you've done before, and not something you're interested in, it can be a great thing to hire out. Even if you have the skills, your time might be better spent on some other aspect of the business or product development. There are a lot of great digital agencies out there, so take the time to find the one that best fits your vision and working style.

3.Use a do-it-yourself tool.
If you want to control and launch the content on your own, but don't want or need a custom experience, then a do-it-yourself tool might be the perfect solution for you. Products such as Squarespace and Wix allow you to quickly build a simple site without needing to do any custom coding.

Don't Forget to Keep it Fresh!

Once you've launched your website, it's important to remember that it's there! Have an active, ongoing plan for maintaining and updating content. Even outside of a blog, customers notice whether or not there's any fresh content, and that has an impact on whether they download or continue engaging with your product. Make sure to post any updated news, especially industry write-ups or reviews, and once you have an active social presence, you can also incorporate it directly into the site so that new social mentions, such as Twitter posts, show up automatically.

Recipe: Building and Running YASK (Yet Another Stencil Kernel) on Intel® Processors


Overview

YASK, Yet Another Stencil Kernel, is a framework that facilitates design exploration and tuning of HPC stencil kernels using techniques including vector folding, cache blocking, memory layout, loop construction, temporal wave-front blocking, and others. YASK contains a specialized source-to-source translator that converts scalar C++ stencil code to SIMD-optimized code. Proper tuning of a stencil kernel can deliver up to 2.8 times the performance on the Intel® Xeon Phi™ processor compared to the same program on an Intel® Xeon® processor. The performance advantage of the Intel Xeon Phi processor can be attributed to its high-bandwidth memory and 512-bit SIMD instructions.

Introduction

A very important subset of HPC is the use of stencil computations to update temporal and spatial values of data. Conceptually, the kernel of a typical 3D iterative Jacobi stencil computation can be expressed by the following pseudo-code, which iterates over the points in a 3D grid:

for t = 1 to T do
  for i = 1 to Dx do
    for j = 1 to Dy do
      for k = 1 to Dz do
        u(t + 1, i, j, k) ← S(t, i, j, k)
      end for
    end for
  end for
end for

where T is the number of time-steps; Dx, Dy, and Dz are the problem-size dimensions; and S(t, i, j, k) is the stencil function.  For very simple 1D and 2D stencils, modern compilers can often recognize the data access patterns and optimize code generation to take advantage of vector registers and cache lines, but for more complicated stencils, combined with modern multi-core processors with shared caches and memories, the task of producing optimal code is beyond the scope of most compilers.
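
To make this concrete, a naive scalar C++ version of one time-step of a simple 7-point averaging stencil is sketched below. This is only an illustration of the kind of loop nest such tools start from; the grid layout, the 1-cell halo, and the averaging weights are assumptions, not YASK's actual stencil definitions.

#include <cstddef>
#include <vector>

// One time-step of a 7-point averaging stencil over a Dx x Dy x Dz grid stored
// as a linearized 1-D array. Interior points only (a 1-cell halo is skipped).
void stencil_step(const std::vector<float>& u, std::vector<float>& u_next,
                  std::size_t Dx, std::size_t Dy, std::size_t Dz)
{
    auto idx = [=](std::size_t i, std::size_t j, std::size_t k) {
        return (i * Dy + j) * Dz + k;                  // row-major 3D index
    };
    for (std::size_t i = 1; i + 1 < Dx; ++i)
        for (std::size_t j = 1; j + 1 < Dy; ++j)
            for (std::size_t k = 1; k + 1 < Dz; ++k)
                u_next[idx(i, j, k)] =
                    (u[idx(i, j, k)] +
                     u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)] +
                     u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)] +
                     u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 7.0f;
}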

YASK is a tool that lets a user experiment with different types of data distribution, including vector folding and loop structures, which may yield better-performing code than straight compiler optimizations. YASK is currently focused on single-node OpenMP optimizations.

The following graphic shows the typical YASK usage model:

High-level components

Introductory Tutorial

A complete introductory tutorial can be found in the documentation section of the YASK website. This tutorial walks a user through the necessary steps to build and execute YASK jobs.

Vector Folding Customization

Vector folding, otherwise known as multi-dimensional vectorization, is the process of packing vector registers with blocks of data that are not necessarily contiguous in order to optimize data and cache reuse. For a complete discussion of vector folding, please refer to the document titled "Vector Folding: improving stencil performance via multi-dimensional SIMD-vector representation." Vector folding by hand is complicated and error-prone, so YASK provides a software tool for translating standard sequential code into new code that can then be compiled to produce faster, more efficient executables.

Download detailed Vector Folding paper [PDF 330 KB]

Loop Structure Customization

In combination with vector folding, executing loops across multiple threads gains additional performance. By allowing a user to experiment with loop structure via OpenMP constructs, YASK offers yet another avenue for code optimization. There are three main loop-control customizations: 'Rank' loops break the problem into OpenMP regions, 'Region' loops break each OpenMP region into cache blocks, and 'Block' loops iterate over each vector cluster in a cache block.

Performance

AWP-ODC: One of the stencils included in YASK is awp-odc, a staggered-grid finite difference scheme used to approximate the 3D velocity-stress elastodynamic equations: http://hpgeoc.sdsc.edu/AWPODC. Applications using this stencil simulate the effect of earthquakes to help evaluate designs for buildings and other at-risk structures. Using a problem size of 1024*384*768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.8x compared to the Intel® Xeon® processor E5-2697 v4.

AWP-ODC

Configuration details: YASK HPC Stencils, AWP-ODC kernel

Intel® Xeon® processor E5-2697 v4: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz (Turbo ON) , 18 Cores/Socket, 36 Cores, 72 Threads (HT on), DDR4 128GB, 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=awp arch=hsw cluster=x=2,y=2,z=2 fold=y=8 omp_schedule=guided mpi=1
  • ./stencil-run.sh -arch hsw -ranks 2 -bx 74 -by 192 -bz 20 -pz 2 -dx 512 -dy 384 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96GB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=awp arch=knl INNER_BLOCK_LOOP_OPTS='prefetch(L1,L2)'
  • ./stencil-run.sh -arch knl -bx 128 -by 32 -bz 32 -dx 1024 -dy 384 -dz 768

ISO3DFD: Another of the stencils included in YASK is iso3dfd, a 16th-order in space, 2nd-order in time, finite-difference code found in seismic imaging software used by energy-exploration companies to predict the location of oil and gas deposits. Using a problem size of 1536*1024*768 grid points, the Intel® Xeon Phi™ processor 7250 improved performance by up to 2.6x compared to the Intel® Xeon® processor E5-2697 v4.

ISO3DFD

Configuration details: YASK HPC Stencils, iso3dfd kernel

Intel® Xeon® processor E5-2697 v4: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz (Turbo ON) , 18 Cores/Socket, 36 Cores, 72 Threads (HT on), DDR4 128GB, 2400 MHz, Red Hat Enterprise Linux Server release 7.2

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=iso3dfd arch=hsw mpi=1
  • ./stencil-run.sh -arch hsw -ranks 2 -bx 256 -by 64 -bz 64 -dx 768 -dy 1024 -dz 768

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250, 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 86B.0010.R00, DDR4 96GB 2400 MHz, quad cluster mode, MCDRAM flat memory mode, Red Hat Enterprise Linux Server release 6.7

Recipe:

  • Download code from https://github.com/01org/yask and install per included directions
  • make stencil=iso3dfd arch=knl
  • ./stencil-run.sh -arch knl -bx 192 -by 96 -bz 96 -dx 1536 -dy 1024 -dz 768

CoreCLR JIT Profiling with Intel® VTune™ Amplifier on Windows*



Intel® VTune™ Amplifier supports JIT profiling on Windows for the full Common Language Runtime (CLR) environment out of the box. But to enable JIT profiling support for CoreCLR, which is a subset of CLR, you need to manually configure CoreCLR environment variables.

On February 3, 2015, Microsoft announced open source CoreCLR on GitHub. CoreCLR is a .NET* execution engine in .NET Core. However, if you install CoreCLR and run VTune Amplifier performance profiling on top of it on Windows, you cannot see any JITted code in the collected result.

The example below shows the Advanced Hotspots analysis result collected with VTune Amplifier 2016 update 3 on CoreCLR. No JITted code is displayed; instead, you see only the "Unknown" module:

CoreCLR VTune Amplifier "Unknown" Module

To enable CoreCLR JIT profiling with VTune Amplifier on Windows, do the following:

  1. Start the registry editor (search for regedit from the Start menu), right-click Computer > HKEY_LOCAL_MACHINE, and select Find... to look up the GUID for amplxe_samplingmrte_clrprof_1.0.dll.
    The Find dialog box opens.
  2. Type amplxe_samplingmrte_clrprof_1.0.dll and click the Find Next button:
    CoreCLR VTune Amplifier Registry Editor
    For example, for the VTune Amplifier XE 2016, this file by default is installed at C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE 2016\bin64\amplxe_samplingmrte_clrprof_1.0.dll. So, the registry editor finds the file and displays the GUID# as follows:
    CoreCLR VTune Amplifier Registry Editor Example
  3. Copy the GUID (in the red circle). For the example above, the GUID is {AA5E4821-E3B1-479c-B7FF-5AD047D22CED}.
  4. From Control Panel, select System and click the Advanced system settings. From the Advanced system settings, select the Advanced tab and click the Environment Variables... button.
  5. Set the CoreCLR system variables as follows:
    CORECLR_ENABLE_PROFILING=1
    CORECLR_PROFILER={AA5E4821-E3B1-479c-B7FF-5AD047D22CED}

    CoreCLR VTune Amplifier Set Registry Variable
  6. Restart the VTune Amplifier and run the analysis (for example, Advanced Hotspots).
    Outcome: JIT profiling is enabled and all JITted code is displayed in the viewpoint:
    CoreCLR VTune Amplifier JIT Profiling Enabled

For more details on .NET code analysis with the VTune Amplifier for Windows, see the product online help.


IoT Security Made Simple with Wind River* Linux*


A security breach of an Internet of Things (IoT) device can lead to serious hazards. The always-on, always-connected nature of IoT devices exposes a larger threat surface that requires stronger security capabilities. That’s why integrating security into these devices is critical not just during the build phase but throughout the product life cycle.

When building IoT products, your priority as a developer is to make the development life cycle fast, simple, and cost-effective. This approach may not be feasible, however, when you use an embedded Linux platform if the onus of addressing individual security vulnerabilities remains on you. You need a platform that supports robust security integration, monitoring, and upgrades throughout the product life cycle.

The open source collaboration called Yocto Project* has simplified IoT embedded Linux life-cycle support with a broad set of user-space packages; encapsulation and separation of software and middleware; a standardized cross-build methodology for kernel, packages, patches, and other standardized tools; and infrastructure. Wind River, a leading contributor to the Yocto Project, developed its embedded Linux platform by integrating the Yocto* 2.0 open source framework. Features such as built-in certifiable security capability, portability across hardware platforms, and life-cycle maintenance support make this Wind River Linux ideal for simplifying your overall development life cycle.

A Smart Platform on Which to Build Smart Appliances

Wind River Linux supports a wide range of IoT devices in terms of size, architecture, and industry. It gives you access to vital security as well as faster build and deployment capabilities, with key features such as:

  • Off-the-Shelf Run Times: Preconfigured profiles for various device types, sizes, latency, and power management let you jump-start device development.
  • Minimal Installation Provisions: Begin with a smaller initial download, and incrementally download packages as you need them during the build process.
  • Project-Specific, Tailored Environment: The certified binary distribution allows you to customize your middleware development environment. Using this customized environment, you can develop and integrate security, connectivity, rich networking options, and cloud management easily. Certified binary feeds let you directly integrate updates and security fixes.
  • Advanced Built-In Security: Secure boot authenticates all stages of the boot process. A secure runtime loader prevents code tampering and unauthorized execution, and an advanced user management system protects against unauthorized access and allows the use and enforcement of user-based policies and permissions. Secure network communications are accomplished through security protocols such as Wind River SSL (Secure Sockets Layer), Wind River SSH (Secure Shell), Wind River IPsec (IP security), and Wind River IKE (Internet Key Exchange).
  • Life-Cycle Support: Using an innovative patch management technology, Wind River delivers bug and security fixes with predictability.
  • Automatic Updates: Updated binaries are automatically detected and downloaded.
  • Portability: Support for a broad spectrum of silicon architectures and board support packages, including Freescale, Intel, LSI, Texas Instruments, and Xilinx platforms adds flexibility to your security implementations.
  • Reduced Build Time: Fifteen-second quick boot, rapid installation, updates without the need for extra compilation and reboot, and centralized cloud orchestration access help reduce overall development time.
  • Development Tools: Wind River Workbench and Wind River Simics* are included with the platform. The Workbench integrated development environment allows you to configure, analyze, debug, and optimize your system. With Simics, you can accelerate every phase of your development life cycle by extensively using simulations, reducing cost and the risks of shipping late or sacrificing quality.

Figure 1 shows the architecture of Wind River Linux 8.


Figure 1. Wind River Linux 8 Architecture

Wind River Linux is interoperable with other open source projects, such as Toaster and Eclipse*, which can reduce the learning curve for new developers. With Wind River, you can use a simple, common framework to manage all your projects.

Build Secure Devices with Wind River* Linux*

With an increasing number of connected devices, security threats for Linux-based systems are on the rise. Threats include unauthorized access, data and device destruction, tampering with functionality, information disclosure, information modification, and denial-of-service attacks. Wind River Linux comes with Security Profile, a commercial off-the-shelf product that has the features necessary to fight these threats. Security Profile is in compliance with the Common Criteria scheme, an International Organization for Standardization/International Electrotechnical Commission 15408 standard for cybersecurity certification.

Because of increased cybersecurity concerns with connected products, the Common Criteria scheme, mostly used in regulated industries, has now expanded to other industries as well. As a result, embedded software development should factor in these compliance criteria. Use case–specific security requirements in the Common Criteria scheme are documented using Protection Profiles and used for evaluation (with various Evaluation Assurance Levels) and certification.

Certification of new products as compliant with these requirements can be both costly and time consuming. Wind River Linux Security Profile simplifies this process (Figure 2) by enabling multiple Protection Profiles feature templates, including:

  • Operating System Protection Profile (OSPP) and Validation Tools;
  • General-Purpose OSPP;
  • Labelled Security Protection Profile; and
  • Role-Based Access Control Protection Profile.


Figure 2. Security Profile for Wind River Linux architecture

The Security Profile architecture includes a security-focused, hardened kernel that uses PaX to implement least-privilege protections for memory pages. This approach protects systems from security exploits such as buffer overflows and return-to-libc attacks. Features such as enhanced Address Space Layout Randomization, memory corruption–based exploit prevention, memory sanitizing, and path-based security policy with zero run time memory allocation harden your product at no extra cost.

Secure-core and platform options built into the secure user space prevent run time buffer overflow. The operating system also includes a suite of tools with which you can lock down, monitor, and audit a system, all of which give you greater insights and control for maintenance and troubleshooting. The Security Profile also supports multitier security with stacked Linux Security Modules. Trusted computing with root of trust is built in.

Constant scrutiny over all software update releases is critical for always-on devices. Continuous regression tests and bug-fix patch releases for such devices can add significant time and cost to your development life cycle, but when you use Wind River Linux, Wind River’s team of specialists handle these tasks for you.

The Wind River Linux security response team monitors security threats 24x7. The team provides hot patches for the market's best-known threats with 24-hour response times. You get security mitigation, detailed system monitoring, forensics, and zero-day protection support from an experienced, highly reliable team. In the long term, this support can save you significant time, money, and hassle compared to free Linux options.

Ship Faster, With Reduced Overhead and Simplified Workflow

As the embedded Linux footprint continues to expand in the IoT space, conformance to stronger security requirements is no longer optional. Today, security readiness is mandatory not just for regulated markets like aerospace and defense but in all commercial verticals. When you build secure IoT products, efficient handling of security compliance–related overhead is critical for reducing cost and meeting sensitive deadlines.

Wind River Linux as a commercial development platform provides you with a support team that understands your application, your hardware, and your unique environment, allowing you to stay focused on critical applications and product-specific features while the Wind River team of experts supports your Linux platform tools and security infrastructure.

Security Profile simplifies the Common Criteria scheme–based security certification for devices developed using preintegrated Protection Profiles. The feature guides you through the certification process with its Evaluation Configuration guide and documentation.

As security requirements harden, using a commercial embedded platform such as Wind River Linux helps you simplify your workflow to efficiently build smart, secure, and resilient connected products that last for years.

Selecting Programming Languages for the IoT


The Internet of Things Ecosystem

To understand which languages you can apply to your IoT projects, you must first understand the IoT ecosystem. This knowledge is important because the processor architectures and resources available to your software or firmware will differ greatly at each level.

Let’s start by defining a simple taxonomy of IoT devices for which you’re developing software (see Figure 1).


Figure 1. Taxonomy for Internet of Things devices

At the bottom are the edge devices. These devices interact with the world and represent things like wearables and other connected devices. The devices source and create data and interact with the world through actuators.

Next are the gateways. These can be intermediary devices that move data to other systems for processing. Gateways can also aggregate data from many edge devices and provide a control path back to those devices.

Finally, there’s the cloud. The cloud is a scalable set of compute, network, and storage resources that provide the ability to store, analyze, and visualize data from edge devices and gateways.

Examples of these levels include the Intel® Galileo board and the Intel® Curie™ Compute Module for edge devices, Intel® IoT Gateways for the gateway level, and the Wind River* Helix* Lab Cloud for the cloud level.

With the IoT ecosystem divided into its layers, let’s look at which languages you can apply at each level.

Edge Device

Edge devices, such as wearables, are typically constrained-resource, embedded systems because of the space and power constraints in which they function. Devices like the Intel® Curie™ module are the size of a button, and can be powered by a small, coin-sized battery (see Figure 2). Given the minimal resources of the Intel® Curie™ module, the typical languages suitable for its use include Assembly and C. Although C is the lingua franca of embedded firmware development, there are times when you need to squeeze as many instructions as possible into a device. In such cases, Assembly is your best choice. The downside is that development time can be longer depending on your expertise with the language.


Figure 2. The Intel® Curie™ Compute Module

Another example of an edge device is the Intel® Edison board (see Figure 3), which you can use in the wearables space or in general IoT products. Differing from the Intel® Curie™ module, which incorporates a microcontroller, the Intel® Edison board includes a dual-core Intel® Atom™ processor with considerably more computing power in a package the size of an SD card. Because the Intel® Edison board runs Linux*, the C language is an ideal choice here, but you could use other languages as well, including Python* and Node.js*. Python* is ideal for quick prototyping and production deployment but can lag in performance compared to natively compiled C. If you use the Intel® XDK, you can also run Node.js* (JavaScript*) with Node-RED*. Node-RED* makes it easy to build and run data flows, offering a graphical approach to development. Knowledge of the JavaScript* language makes this environment even more powerful.


Figure 3. The Intel® Edison board

Gateway

At the gateway level, the compute capability rises greatly because gateways are responsible for communication and analysis of data from many devices through several different buses. With this additional computing capability comes the ability to run more powerful languages or interpreted languages with greater performance.

Intel® IoT Gateways provide a variety of designs that scale from a single-core Intel® Quark™ system on a chip to a quad-core Intel® Atom™ or Intel® Core™ processor (see Figure 4). These platforms support either Wind River Linux* 7 or the Snappy Ubuntu* Core (Linux*).


Figure 4. An Intel® IoT gateway

In addition to support for C and C++ (which is also ideal for higher-performance devices), you can use Python*, but at greater execution speeds. Node.js* with JavaScript* is also available, which is ideal for creating or connecting to web-services as well as to cloud services.

Cloud

When you reach the cloud, the computing capability increases drastically, as do the language choices. In the cloud, you’ll find servers enabled through power-efficient Intel Atom and Intel Core processors as well as highest-compute-density Intel® Xeon® processors. Apps written in the cloud serve a variety of needs, and as such, the languages you use here can differ greatly. To process the massive amounts of data that IoT edge devices create, you can use big-data frameworks like Apache Hadoop*. On top of Hadoop, query languages such as Apache Hive* enable computation over massive data sets with Structured Query Language (SQL)-like queries. Apache Pig* is also useful for experimentation with large data sets in the Pig Latin scripting language.

Data analytics and visualization are other key applications within the cloud that many programming languages enable. R is a popular language and environment for statistical computing and visualization, and it has grown rapidly in popularity in recent years. The Julia language is another option here: Julia is a high-performance, dynamic language designed with cloud computing in mind.

You can use several languages to build web services that monetize data in the cloud, including JavaScript*, Node.js*, and server-side and client-side Java*. Given the many frameworks and supporting languages (such as Ruby with Rails), the language opportunities in this space are large and growing.

Summary

Choosing a programming language for a project requires consideration of the target environment (including the processor) as well as the resources available to it. Developing software in the cloud opens many possibilities, given the scale of resources available, but developing embedded firmware for smaller microcontrollers requires greater control to minimize instruction count and maximize execution speed and resource management.

The following list briefly summarizes the languages discussed in this article, and identifies their key use models:

  • Assembly: Developing firmware in the native instruction set provides the greatest control in resource-constrained systems such as edge-devices.
  • C/C++: One step above Assembly, C and C++ enable construction of resource-constrained code but with readability and maintainability. C’s prominence allows you to find it in every use model.
  • Python*: An interpreted language that simplifies prototyping, you can also use it for production. Python* supports a massive number of libraries and modules so that you can get more done with less code. It’s useful in more powerful edge devices, gateways, and even the cloud.
  • JavaScript*/Node.js*: A popular language and runtime, each enables the development of scalable network applications and can be applied across use models.
  • Node-RED*: Developing visually with Node-RED* makes it easy to build data flows that include sensors and actuators. If you know JavaScript*, these flows can be even more powerful. You can apply this language at gateways and within more powerful edge devices, such as those that the Edison board powers.
  • HiveQL: If you’re using Hive (built on Hadoop), you can use HiveQL to process massive data sets in cloud environments.
  • Pig: Using Pig Latin to process big data enables quick script development and simple experimentation of data sets within the cloud.
  • R: An increasingly popular statistical computing language for the processing and visualization of data, you can apply data mining with R in cloud environments. R is also open source.
  • Julia: Another high-performance language for processing and visualizing data in cloud environments, Julia is also open source. It was designed with parallelism in mind and includes key elements for distributed computation.
  • Java*: A popular server-side and client-side language for the Web, making it useful in the cloud for both web and web services development.

Understanding the IoT Ecosystem


Anatomy of the IoT Ecosystem

Wearables and home automation devices dominate the IoT market today, but the overall ecosystem for the IoT will evolve. Figure 1 illustrates a simplified IoT ecosystem:

  • At the left side are the edge devices. These are the IoT end points, and they provide the means to sense and control their environment through sensors and actuators.
  • Gateways aggregate data from edge devices and present it to the cloud (while providing control from the cloud). In some cases, gateways may process data to add value into the ecosystem.
  • The cloud provides the means to store data and perform analytics. The importance of the cloud lies in its set of resources that can scale up or down in an elastic fashion as a function of need.
  • The cloud enables monetization of the data and control through application programming interfaces and apps (which may or may not reside in the cloud, as well).
  • At the top is management and monitoring across all levels of the ecosystem.
  • At the bottom are technologies that enable development, test, and other critical capabilities, such as end-to-end security (for the data and control planes).


Figure 1. Simplified Internet of Things Ecosystem

Now, let’s look at each part of the IoT ecosystem and which technologies are applied.

Edge

At the edge of the IoT ecosystem are connected devices that can sense and actuate at various levels of complexity. In the wearables space, you’ll find smart wristbands and watches that include biometric sensing. In the automotive space, you’ll find networks of smart devices that cooperatively create a safer and more enjoyable driving experience (through sensors that improve drivetrain efficiency or tune automotive parameters based on altitude or temperature).

In the space of power-conscious wearable devices, you’ll find processors like Intel® Quark™ SoC (in the tiny Intel® Curie™ Compute Module), which can operate from a coin-sized battery, and includes a six-axis combination sensor (accelerometer and gyroscope). For greater processing power, the Intel® Edison compute module supports both single-core and dual-core Intel® Atom™ CPUs. The Intel® Edison board can run Yocto Linux*, so the software ecosystem enabled there creates endless development opportunities (see Figure 2).


Figure 2. The Intel® Curie™ Computer Module and Intel® Edison board

Gateway

When we talk about the IoT, the emphasis is on things—lots of them. For this reason, the gateway is an integral part of the IoT ecosystem, bridging small edge devices that may include little intelligence to the cloud (where the data may be monetized). The gateway can serve one or two primary functions (sometimes both): It can be the bridge that migrates collected edge data to the cloud (and provides control from the cloud), and it can also be a data processor, reducing the mass of available data on its way to the cloud or making decisions immediately based on the data available. For this reason, gateways tend to be more powerful than the edge devices.

The Intel® IoT Gateway is a platform for development of IoT gateway apps (see Figure 3). What makes this a platform is the integration of key communication technologies (including Ethernet, Wi-Fi, Bluetooth*, and ZigBee*, as well as 2G, 3G, and Long-Term Evolution) and sensor/actuator interfaces (RS-232, analog/digital input/output), with processing capability from single-core Intel® Quark SoCs to dual-core and quad-core Intel® Atom™ and Intel® Core™ processors. To simplify development, the Intel IoT Gateway supports Wind River Linux* 7, Windows® 10, or the Snappy Ubuntu* Core (with integrated driver support for the various interfaces, allowing you to focus on your app).


Figure 3. The Intel® IoT Gateway Platform

You can simplify development further by using the Wind River* Intelligent Device Platform XT. Intelligent Device Platform XT is a customizable middleware development environment that provides, among other things, security and management technologies. Although these features are commonly developed as an afterthought, bringing security and manageability in at the start enables a world-class IoT gateway that protects your data and minimizes downtime.

Cloud

Think of the cloud as an essential part of the IoT ecosystem, given its attributes of scaling and elasticity. As data grows from edge devices, the ability to scale the storage and networking assets as well as compute resources becomes a key enabler for the development of IoT systems.

What makes elastic compute possible is a technology called virtualization. Using virtualization, you can carve up a processor to represent two or more virtual processors. Each virtual processor time-shares the physical processor such that when one needs less computing power, another virtual processor (and the software that occupies it) can exploit those physical resources.

Virtualization has been around for some time, but you can find extensions in modern processors that make the technology more efficient. As you’d expect, you can find these virtualization extensions in powerful Intel® Xeon® processors for data centers, but you can also find them in lower-power Intel® Atom™ processors.

Virtualization means that when more IoT data flows from edge devices, physical processors can be carved up and associated with these data flows. When the flow of data subsides, these resources can be idled or reassigned to other tasks to save power and cost.

Management and Monitoring

A complexity that the IoT creates is the monitoring and management of gateways and edge-devices. Considering that an IoT system could contain thousands of gateways with many millions of sensor and actuator endpoints, management and monitoring present new challenges.

Although it’s possible to build a custom cloud-based set of apps to meet this challenge, you must also consider time-to-market constraints. This is one of the reasons Wind River created the Wind River Helix* Device Cloud. Device Cloud is a cloud-based IoT platform that provides device management, end-to-end security, and telemetry and analytics. Device Cloud is a technology stack that operates from the edge device into the cloud and offers data capture, data analysis, and overall monitoring and management for IoT systems at their scale. Device Cloud is also fully integrated with Intel® IoT Gateway Technology, as well as a portfolio of operating systems, such as Wind River Linux and VxWorks*.

Analytics

The key behind IoT is data, and this is where you create value. IoT data can come in many forms, but it commonly has two attributes: its scale and its relationship with time.

Realization of the IoT was partially enabled by big data processing systems. These systems were designed for datasets that require nontraditional processing methods. The datasets that massive numbers of edge devices create in an IoT ecosystem are a perfect match. The other aspect of IoT data is that it tends to be time-series data. Its storage and analysis are better suited to big data processing systems and NoSQL databases than traditional approaches.

Apache Hadoop* (such as provided through Cloudera) remains the key big-data processing system that includes an ecosystem in itself of technologies to address a range of needs. Apache NiFi*, for example, is a dataflow system that permits flow-based programming through directed graphs (perfect for streams of time-series data). Apache Cassandra*, which differs from the batch-oriented Hadoop Distributed File System (HDFS), is a NoSQL database distributed across nodes and supports clusters spanning geographically distributed data centers. The Cassandra data model is also ideal for real-time processing of time-series data (using a hybrid key-value and column-oriented database). Figure 4 illustrates these components’ relationships.


Figure 4. Relationship of big data processing systems and their file systems

Analytics is an ideal match for the cloud. The ability to scale compute resources as a function of dataset size or processing speed requirements makes the cloud a perfect platform for analyzing IoT data with systems like NiFi. Elastically expanding the compute capabilities for processing a dataset, and then gracefully decreasing those resources when no longer needed minimizes infrastructure cost.

Enabling Technologies

The IoT ecosystem is enabled by a collection of other technologies that are important to understand. Let’s focus on some of the technologies for development and test and a few technologies that live inside devices within the IoT ecosystem:

  • The Wind River Helix App Cloud is a browser-based development environment for IoT apps. Using App Cloud, you can develop code, build on top of Wind River operating systems, and simplify app testing using devices such as the Edison board. Because it’s a browser-based development environment, you can attach to your environment from anywhere with everything you’d expect from a world-class integrated development environment.
  • Wind River Helix Lab Cloud, fully integrated with App Cloud, allows broad testing of apps over a range of virtualized devices. Using Lab Cloud, you can create a device configuration that represents a physical device, and then virtualize that device in the cloud. Using App Cloud, you can then load your code onto the device for validation. As a set of virtualized resources, you could create thousands of devices for testing, allowing you to find bugs more quickly. Lab Cloud helps you make reliable IoT apps at the edge device or gateway.
  • Wind River Rocket* is a best-in-class real-time operating system (RTOS) designed for the IoT, using hardware such as the Intel® Edison board. Rocket was designed to be scalable, running on as little as 4 KB of memory for power- and memory-constrained systems. Rocket provides all the services you'd expect from an RTOS, including multithreading, and is preintegrated with App Cloud, making it simple to build gateway or edge device apps in minimal time.
  • Wind River Pulsar* Linux is a Linux distribution for small, high-performance IoT systems that require security and manageability. Pulsar supports kernel reconfiguration so that you can tailor it to your needs and includes capabilities like virtualization to build complex IoT apps. You'll also take advantage of continuous updates to ensure a reliable and secure platform. You can use Pulsar on a variety of hardware solutions, such as the MinnowBoard MAX* board with an Intel® Atom™ CPU.

Summary

The IoT ecosystem is created from a broad set of technologies but with a common thread of manageability and security. To build an end-to-end platform for the IoT, you need practical knowledge of many disciplines, but leveraging pre-validated and pre-integrated assets that work together makes this task not only simple but enjoyable.

Recipe: ROME1.0/SML for the Intel® Xeon Phi™ Processor 7250


Overview

This article provides a recipe for how to obtain, compile, and run ROME1.0 SML on Intel® Xeon® processors and Intel® Xeon Phi™ processors. You need to run the MAP processing phase before SML, because SML uses the output of MAP, so this document describes how to run MAP as well as SML. Please follow the instructions below to run the MAP and SML workloads.

The source and test workloads for this version of ROME can be downloaded from: http://ipccsb.dfci.harvard.edu/rome/download.html.

Introduction

ROME (Refinement and Optimization via Machine lEarning for cryo-EM) is one of the major research software packages from the Dana-Farber Cancer Institute. ROME is a parallel computing software system dedicated to high-resolution cryo-EM structure determination and data analysis, implementing advanced machine learning approaches optimized for HPC clusters. ROME 1.0 introduces SML (statistical manifold learning)-based deep classification, following MAP-based (maximum a posteriori) image alignment. More information about ROME can found at http://ipccsb.dfci.harvard.edu/rome/index.html.

The ROME system has been optimized for both Intel® Xeon® processors and Intel® Xeon Phi™ processors. Detailed information about the underlying algorithms and optimizations can be found at http://arxiv.org/abs/1604.04539.

In this document, we used three workloads: Inflammasome, RP-a and RP-b. The workload descriptions are as follows:

  • Inflammasome data: 16306 images of NLRC4/NAIP2 inflammasome with a size of 250 × 250 pixels
  • RP-a: 57001 images of proteasome regulatory particles (RP) with a size of 160 × 160 pixels
  • RP-b: 35407 images of proteasome regulatory particles (RP) with a size of 160 × 160 pixels

In this document, we use “ring11_all” to refer to the Inflammasome workload, “data6” to refer to the RP-a workload, and “data8” to refer to the RP-b workload.

Preliminaries

  1. To match these results, the Intel Xeon Phi processor machine needs to be booted with BIOS settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information. The Intel Xeon processor system does not need to be started in any special manner.
  2. To build this package, install the Intel® MPI Library for Linux* 5.1(Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3) or higher products on your systems.
  3. Download the source ROME1.0a.tar.gz from http://ipccsb.dfci.harvard.edu/rome/download.html
  4. Unpack the source code to /home/users.

    > cp ROME1.0a.tar.gz /home/users
    > tar –xzvf ROME1.0a.tar.gz

     
  5. The workloads are provided by the Intel® Parallel Computing Center for Structural Biology (http://ipccsb.dfci.harvard.edu/). As noted above, the workloads can be downloaded from http://ipccsb.dfci.harvard.edu/rome/download.html. Following the EMPIAR-10069 link, download Inf_data1.* (Set 1) and rename them ring11_all.*. Download RP_data2.* (Set 2) and rename them data8.*. Download RP_data4.* (Set 4) and rename them data6.*. The scripts referred to below can be obtained by pulling the file KNL_LAUNCH.tgz from http://ipccsb.dfci.harvard.edu/rome/download.html
  6. Copy the workloads and run scripts to your home directory. You should have the following files:

    >cp ring11_all.star /home/users
    >cp ring11_all.mrcs /home/users
    >cp data6.star /home/users
    >cp data6.mrcs /home/users
    >cp data8.star /home/users
    >cp data8.mrcs /home/users
    >cp run_ring11_all_map_XEON.sh /home/users
    >cp run_ring11_all_sml_XEON.sh /home/users
    >cp run_ring11_all_map_XEONPHI.sh /home/users
    >cp run_ring11_all_sml_XEONPHI.sh /home/users
    >cp run_data6_map_XEON.sh /home/users
    >cp run_data6_sml_XEON.sh /home/users
    >cp run_data6_map_XEONPHI.sh /home/users
    >cp run_data6_sml_XEONPHI.sh /home/users
    >cp run_data8_map_XEON.sh /home/users
    >cp run_data8_sml_XEON.sh /home/users
    >cp run_data8_map_XEONPHI.sh /home/users
    >cp run_data8_sml_XEONPHI.sh /home/users

Prepare the binaries for the Intel Xeon processor and the Xeon Phi processor

  1. Set up the Intel® MPI Library and Intel® C++ Compiler environments:

    > source /opt/intel/impi/<version>/bin64/mpivars.sh
    > source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
    > source /opt/intel/mkl/<version>/bin/mklvars.sh intel64

     
  2. Set environment variables for compilation of ROME:

    >export ROME_CC=mpiicpc
     
  3. Build the binaries for the Intel Xeon processor.

    >cd /home/users/ROME1.0a
    >make
    >mkdir bin
    >mv rome_map bin/rome_map
    >mv rome_sml bin/rome_sml

     
  4. Build the binaries for the Intel Xeon Phi processor.

    >cd /home/users/ROME1.0a
    >vi makefile
    Modify FLAGS to below:
    FLAGS := -mkl -fopenmp -O3 -xMIC-AVX512 -DNDEBUG -std=c++11
    >make
    >mkdir bin_knl
    >mv rome_map bin_knl/rome_map
    >mv rome_sml bin_knl/rome_sml

Run the test workloads on the Intel Xeon processor (an Intel® Xeon® processor E5-2697 v4 is assumed by the scripts)

  1. Running the ROME MAP phase for these workloads:

    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_map_XEON.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_map_XEON.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_map_XEON.sh

     
  2. Running the ROME SML phase for these workloads:

    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_sml_XEON.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_sml_XEON.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_sml_XEON.sh

Run the test workloads on the Intel Xeon Phi processor

  1. Running the ROME MAP phase for these workloads:

    >cd /home/users/
    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_map_XEONPHI.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_map_XEONPHI.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_map_XEONPHI.sh

     
  2. Running ROME SML phase for these workloads:

    Running workload1: ring11_all
    >cd /home/users/
    >sh run_ring11_all_sml_XEONPHI.sh


    Running workload2: data6
    >cd /home/users/
    >sh run_data6_sml_XEONPHI.sh


    Running workload3: data8
    >cd /home/users/
    >sh run_data8_sml_XEONPHI.sh

Performance gain seen with ROME SML

For the workloads we described above, the following graph shows the speedups achieved from running this code on the Intel Xeon Phi processor. As you can see, up to a 2.37x speedup for the ring11_all workload can be achieved when running this code on one Intel® Xeon Phi™ processor 7250 versus one two-socket Intel Xeon processor E5-2697 v4. The data used below were stored on a Lustre* file system.

Speedups achieved from running this code on the Intel Xeon Phi processor

Testing platform configuration:

Intel Xeon processor E5-2697 v4: BDW-EP node with dual sockets, 18 cores/socket HT enabled @2.3 GHz 145W (Intel Xeon processor E5-2697 v4 w/128 GB RAM), Red Hat Enterprise Linux Server release 6.7 (Santiago)

Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250 68 core, 272 threads, 1400 MHz core freq. MCDRAM 16 GB 7.2 GT/s, DDR4 96 GB 2400 MHz, Red Hat Enterprise Linux Server release 6.7 (Santiago), quad cluster mode, MCDRAM cache mode.

Heterogeneous Computing Implementation via OpenCL™


1. Abstract

OpenCL™ is the open standard for programming across multiple computing devices, such as CPUs, GPUs, and FPGAs, and is an ideal programming language for heterogeneous computing implementations. This article is a step-by-step guide to the methodology of dispatching a workload to all OpenCL devices in the platform with the same kernel to jointly achieve a computing task. Although the article focuses only on Intel processors and Intel® HD Graphics, Iris™ graphics, and Iris™ Pro graphics, it should in theory work on all OpenCL-compliant computing devices. Readers are assumed to have a basic understanding of OpenCL programming. The OpenCL framework, platform model, execution model, and memory model [1] are not discussed here.

2. Concept of Heterogeneous Computing Implementation

In an OpenCL platform, the host contains one or more compute devices. Each device has one or more compute units, and each compute unit has one or more processing elements that can execute the kernel code (Figure 1).


Figure 1: OpenCL™ platform model [2].

From the software implementation perspective, one normally starts an OpenCL program by querying the platform. A list of devices can then be retrieved, and the programmer can choose among those devices. The next step is creating a context. The chosen device is associated with the context, and a command queue is created for the device.

Since one context can be associated with multiple devices, the idea is to associate both the CPU and the GPU with the context and create a command queue for each targeted device (Figure 2).


Figure 2: Topology of multiple devices from a programming perspective.

The workload is enqueued to the context (either in buffer or image object form). It is thus accessible to all devices associated with the context. The host program can distribute a different amount of the workload to each of those devices.

Assuming XX% of the workload is offloaded to the CPU and YY% to the GPU, the values of XX and YY can be chosen arbitrarily as long as XX% + YY% = 100% (Figure 3). A small worked split computation is sketched after Figure 3.


Figure 3: Workload dispatch of the sample implementation.
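As a concrete illustration of the split (a minimal sketch, not part of the original implementation; the helper name and variables are hypothetical), the following computes how many Z-slices of the 100 × 100 × 130 grid each device would receive. For (XX, YY) = (50, 50), the 130 slices split into 65 for the CPU and 65 for the GPU.

#include <math.h>

// Hypothetical helper: split totalSlices Z-slices between the CPU and the GPU
// according to the (XX, YY) percentage pair, where YY = 100 - XX.
void split_workload(int totalSlices, int xxPercent, int *cpuSlices, int *gpuSlices)
{
	*cpuSlices = (int)round(totalSlices * xxPercent / 100.0);	// Slices assigned to the CPU
	*gpuSlices = totalSlices - *cpuSlices;				// Remaining slices go to the GPU
}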

3. Result

In a sample Lattice-Boltzmann Method (LBM) OpenCL heterogeneous computing implementation with a 100 × 100 × 130 floating point workload, normalized performance for different (XX, YY) combinations (the percentages of workload sent to the CPU and GPU, respectively) is illustrated in Figure 4. The performance was evaluated on a 5th generation Intel® Core™ i7 processor with Iris™ Pro graphics. Note that although the combination (XX, YY) = (50, 50) has the maximum performance gain (around 30%), this is not the general case. Different kernels might fit better on either the CPU or the GPU, so the best (XX, YY) combination must be evaluated case by case.


Figure 4: Normalized (XX, YY) combination performance statistics.

4. Implementation Detail

To make the discussion concrete, the following sections assume that the workload is a 100 × 100 × 130 floating point 3D array and that the OpenCL devices are an Intel processor and Intel HD Graphics (or Iris graphics or Iris Pro graphics). Since the implementation involves only the host-side program, the OpenCL kernel implementation and optimization are not discussed here. The pseudocode in this section omits error checking; readers are encouraged to add error-checking code when adapting it.

4.1 Workload

The workload assumes a 100 × 100 × 130 floating point three-dimensional (3D) array, declared in the following form:

const int iGridSize = 100 * 100 * 130;

float srcGrid[iGridSize], dstGrid[iGridSize];	// srcGrid and dstGrid represent the source and
						// the destination of the workload, respectively

Although the workload assumes a 3D floating point array, the memory is declared as a one-dimensional array so that the data can be easily fitted into a cl_mem object, which is easier for data manipulation.
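For completeness, a hedged sketch of how srcGrid and dstGrid might be wrapped in cl_mem objects once the context from Section 4.3.2 exists is shown below. The buffer flags are an assumption, and error checking is omitted like the rest of the pseudocode.

cl_mem clSrcGrid, clDstGrid;

clSrcGrid = clCreateBuffer ( prm.clContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
			sizeof(float) * iGridSize, srcGrid, NULL );
	// Copy the source grid into a buffer readable by both devices.
clDstGrid = clCreateBuffer ( prm.clContext, CL_MEM_WRITE_ONLY,
			sizeof(float) * iGridSize, NULL, NULL );
	// Destination buffer; each kernel writes its portion of the result here.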

4.2 Data structures to represent the OpenCL platform

To implement the concept in Figure 2 programmatically, the OpenCL data structure must contain at least a cl_platform_id, a cl_context, and a cl_program object. To feed them to the OpenCL API calls, the cl_device_id, cl_command_queue, and cl_kernel objects are declared as pointers; they can be instantiated via dynamic memory allocation according to the number of compute devices used.

typedef struct {
	cl_platform_id clPlatform;		// OpenCL platform ID
	cl_context clContext;			// OpenCL context
	cl_program clProgram;			// OpenCL program object built from the kernel source
	cl_uint clNumDevices;			// The number of OpenCL devices to use
	cl_device_id* clDevices;		// OpenCL device IDs
	cl_device_type* clDeviceTypes;		// OpenCL device type info: CPU, GPU, or ACCELERATOR
	cl_command_queue* clCommandQueues;	// Command queues for the OpenCL devices
	cl_kernel* clKernels;			// OpenCL kernel objects
} OpenCL_Param;

OpenCL_Param prm;

4.3 Constructing the OpenCL devices

The implementation discussed here considers a single machine with two devices (a CPU and a GPU) so that readers can easily follow the methodology.

4.3.1 Detecting OpenCL devices

Detecting the devices is the first step of OpenCL programming. The devices can be retrieved through the following code snippet.

clGetPlatformIDs ( 1, &(prm.clPlatform), NULL );
	// Get the OpenCL platform ID and store it in prm.clPlatform.

clGetDeviceIDs ( prm.clPlatform, CL_DEVICE_TYPE_ALL, 0, NULL, &(prm.clNumDevices) );
prm.clDevices = (cl_device_id*)malloc ( sizeof(cl_device_id) * prm.clNumDevices );
	// Query how many OpenCL devices are available in the platform; the number of
	// devices is stored in prm.clNumDevices. The proper amount of memory is then
	// allocated for prm.clDevices according to prm.clNumDevices.

clGetDeviceIDs ( prm.clPlatform, CL_DEVICE_TYPE_ALL, prm.clNumDevices, prm.clDevices, NULL );
	// Query the OpenCL device IDs and store them in prm.clDevices.

In heterogeneous computing usage, it is important to know which device is which in order to distribute the correct amount of workload to the designated compute device. clGetDeviceInfo() can be used to query the device type information.

cl_device_type DeviceType;

prm.clDeviceTypes = (cl_device_type*)malloc ( sizeof(cl_device_type) * prm.clNumDevices );
	// Allocate the proper amount of memory for prm.clDeviceTypes.

for (cl_uint i = 0; i < prm.clNumDevices; i++) {

	clGetDeviceInfo ( prm.clDevices[i], CL_DEVICE_TYPE,
			sizeof(cl_device_type), &DeviceType, NULL );
		// Query the device type of each OpenCL device and store it in
		// prm.clDeviceTypes one by one.
	prm.clDeviceTypes[i] = DeviceType;
}
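As an optional sanity check (a sketch, not part of the original flow), CL_DEVICE_NAME can be queried alongside CL_DEVICE_TYPE to confirm which index corresponds to the CPU and which to the GPU. printf() assumes <stdio.h> is included.

char deviceName[256];

for (cl_uint i = 0; i < prm.clNumDevices; i++) {
	clGetDeviceInfo ( prm.clDevices[i], CL_DEVICE_NAME,
			sizeof(deviceName), deviceName, NULL );
		// Retrieve the human-readable device name.
	printf ( "Device %u: %s (%s)\n", i, deviceName,
		(CL_DEVICE_TYPE_CPU == prm.clDeviceTypes[i]) ? "CPU" : "GPU/other" );
}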

4.3.2 Preparing the OpenCL context

Once the OpenCL devices are located, the next step is to prepare the OpenCL context that manages those devices. This step is straightforward; it is the same as creating a context in any other OpenCL program.

cl_context_properties clCPs[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)prm.clPlatform, 0 };

prm.clContext = clCreateContext ( clCPs, 2, prm.clDevices, NULL, NULL, NULL );
	// Create one context associated with both devices (2 in this example).

4.3.3 Create command queues

The command queue is the channel through which kernels, kernel arguments, and the workload are submitted to an OpenCL device. One command queue is created per OpenCL device; in this example, two command queues are created, one for the CPU and one for the GPU.

prm.clCommandQueues = (cl_command_queue*)malloc ( prm.clNumDevices * sizeof(cl_command_queue) );
	// Allocate the proper amount of memory for prm.clCommandQueues.

for (cl_uint i = 0; i < prm.clNumDevices; i++) {

	prm.clCommandQueues[i] = clCreateCommandQueue ( prm.clContext,
				prm.clDevices[i], CL_QUEUE_PROFILING_ENABLE, NULL );
		// Create a command queue for each OpenCL device.
}

4.4 Compiling OpenCL kernels

At this point, the topology shown in Figure 2 has been implemented. The kernel source file should then be loaded and built so the OpenCL devices can execute it. Note that there are two OpenCL devices in the platform; both device IDs must be fed to the clBuildProgram() call so that the compiler can build the proper binary for each device. The following snippet assumes that the kernel source code has been loaded into a buffer, clSource, via file I/O calls, which are not shown.

char* clSource;

// Insert kernel source file read code here. The following code assumes clSource is
// properly allocated and loaded with the kernel source.

prm.clProgram = clCreateProgramWithSource ( prm.clContext, 1, (const char**)&clSource, NULL, NULL );
clBuildProgram ( prm.clProgram, 2, prm.clDevices, NULL, NULL, NULL );
	// Build the program executable for the CPU and GPU by feeding clBuildProgram() with
	// "2" (the number of target devices) and the device ID list.

prm.clKernels = (cl_kernel*)malloc ( prm.clNumDevices * sizeof(cl_kernel) );
for (cl_uint i = 0; i < prm.clNumDevices; i++) {
	prm.clKernels[i] = clCreateKernel ( prm.clProgram, "<the kernel name>", NULL );
}
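If clBuildProgram() returns an error, the per-device build log can help diagnose kernel compilation problems. A minimal sketch (error handling kept simple, printf() and the standard headers assumed available) follows:

for (cl_uint i = 0; i < prm.clNumDevices; i++) {
	size_t logSize = 0;
	char* log;

	clGetProgramBuildInfo ( prm.clProgram, prm.clDevices[i], CL_PROGRAM_BUILD_LOG,
				0, NULL, &logSize );		// Query the log size first
	log = (char*)malloc ( logSize + 1 );
	clGetProgramBuildInfo ( prm.clProgram, prm.clDevices[i], CL_PROGRAM_BUILD_LOG,
				logSize, log, NULL );		// Retrieve the log text
	log[logSize] = '\0';
	printf ( "Build log for device %u:\n%s\n", i, log );
	free ( log );
}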

4.5 Distributing the workload

After the kernel has been built, the workload can be distributed to the devices. The following code snippet demonstrates how to dispatch the designated portion of the workload to each OpenCL device. Note that the kernel-argument setup call, clSetKernelArg(), is not demonstrated here; different kernel implementations need different arguments, and the argument-setting code adds little to this example.

// Put the kernel argument setting code, clSetKernelArg(), here. Note that the same
// arguments must be set on both kernel objects.

size_t dimBlock[3] = { 100, 1, 1 };			// Work-group dimension and size
size_t dimGrid[2][3] = { {100, 100, 130}, {100, 100, 130} };	// Work-item dimension
							// and size for each OpenCL device
dimGrid[0][0] = dimGrid[1][0] = (int)ceil ( dimGrid[0][0] / (double)dimBlock[0] ) * dimBlock[0];
dimGrid[0][1] = dimGrid[1][1] = (int)ceil ( dimGrid[0][1] / (double)dimBlock[1] ) * dimBlock[1];
	// Make sure the work-item size is a multiple of the work-group size in each dimension
dimGrid[0][2] = (int)ceil ( round(dimGrid[0][2] * (double)<XX> / 100.0) / (double)dimBlock[2] )
			* dimBlock[2];			// Work-items for the CPU
dimGrid[1][2] = (int)ceil ( round(dimGrid[1][2] * (double)<YY> / 100.0) / (double)dimBlock[2] )
			* dimBlock[2];			// Work-items for the GPU
	// Assume <XX>% of the workload goes to the CPU and <YY>% to the GPU

size_t dimOffset[3] = { 0, 0, dimGrid[0][2] };	// The global work offset; it is
						// the GPU workload starting point

for (int i = 0; i < 2; i++) {

	if ( CL_DEVICE_TYPE_CPU == prm.clDeviceTypes[i] )
		clEnqueueNDRangeKernel ( prm.clCommandQueues[i], prm.clKernels[i],
					3, NULL, dimGrid[0], dimBlock, 0, NULL, NULL );
	else					// The other device is CL_DEVICE_TYPE_GPU
		clEnqueueNDRangeKernel ( prm.clCommandQueues[i], prm.clKernels[i],
					3, dimOffset, dimGrid[1], dimBlock, 0, NULL, NULL );
		// Offload the proper portion of the workload to the CPU and GPU, respectively
}
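Once both kernels have been enqueued, the host still needs to synchronize and gather the results. A hedged sketch is shown below; it assumes the output lives in a cl_mem object named clDstGrid (as in the buffer sketch in Section 4.1) and that each Z-slice holds 100 × 100 floats.

size_t cpuBytes  = dimGrid[0][2] * 100 * 100 * sizeof(float);	// Size of the CPU portion in bytes
size_t gpuOffset = cpuBytes;					// The GPU portion starts right after it
size_t gpuBytes  = iGridSize * sizeof(float) - cpuBytes;	// Remaining bytes belong to the GPU

for (int i = 0; i < 2; i++) {

	if ( CL_DEVICE_TYPE_CPU == prm.clDeviceTypes[i] )
		clEnqueueReadBuffer ( prm.clCommandQueues[i], clDstGrid, CL_TRUE,
				0, cpuBytes, dstGrid, 0, NULL, NULL );
	else
		clEnqueueReadBuffer ( prm.clCommandQueues[i], clDstGrid, CL_TRUE,
				gpuOffset, gpuBytes, (char*)dstGrid + gpuOffset,
				0, NULL, NULL );
		// Blocking reads; each in-order queue waits for its own kernel to finish
}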

5. References

[1] OpenCL 2.1 specification. https://www.khronos.org/registry/cl/

[2] Image courtesy of Khronos group.
