Channel: Intel Developer Zone Articles

Intel® Math Kernel Library (Intel® MKL) 2017 Bug Fixes list

Intel® MKL 2017 (6 September 2016)

DPD200584193 Parallel Direct Sparse Solver for Clusters crashes or returns wrong results when solving complex linear systems with Hermitian matrices if the number of MPI processes is greater than 2
DPD200583633 Fixed the problem where Intel MKL produces the error message "mismatch detected for '_MSC_VER' .... " when statically linked with mkl_tbb_thread.lib
DPD200583000 Added a description of the LAPACKE_xerbla routine to the Intel MKL documentation
DPD200409897 Fixed the Intel MKL Pardiso failure when multiple simultaneous (single-threaded) instances are used in an OpenMP loop
DPD200582626 Fixed the Intel MKL Pardiso failures with METIS reordering and a huge block on the diagonal
DPD200578518 df?InterpolateEx1D function returns incorrect index of the cell that contains the right boundary of the interpolation interval
DPD200243632 Extended version of the Integrate function, df?IntegrateEx1D, does not pass the type of integration limit, left or right, into the callback function
DPD200582431 Fixed CP2K code crash in Intel MKL routines after moving from MKL 11.3.1 to 11.3.2
DPD200584978 Fixed Intel MKL ERROR: Parameter 3 was incorrect on entry to DSYTRF
DPD200571698 Optimized performance of ?getrf, ?getrs and ?getri for very small cases.
DPD200584139 Fixed mkl_lapack.h: the const modifier had to be removed (?lacrm, ?larcm)
DPD200575462 Added support of partial SVD functionality in MKL
DPD200582751 Fixed low Automatic Offload performance of mkl_?getrfnpi
DPD200584255 Fixed issue with calling LAPACK functions with a workspace query so that unused parameters could be passed in as nullptr.


Intel® MKL 2017 Beta (17 Feb 2016)

DPD200577117 SP2DP custom library sources created and published in online article
DPD200575663 Fixed the problem of SVD computation of a very wide matrix performing significantly slower than SVD computation of its transposition
DPD200574694 Introduced support for column-major layout in the results returned by the df?Interpolate1D routine
DPD200576806 Fixed the Intel MKL Pardiso hang problem when called from an OMP critical section
DPD200580172 Fixed the problem with Intel MKL Pardiso hanging on the reordering stage for real and symmetric indefinite matrices
DPD200576441 Fixed the issue where Intel MKL Cluster FFTW and the original FFTW produce different results in the case of out-of-place computation
DPD200574978 Resolved the segmentation fault resulting from a call to pthread_exit(0)
DPD200571078 Introduced mkl_progress support for the Parallel Direct Sparse Solver for Clusters and fixed the incorrect behavior of the mkl_progress routine in Intel MKL Pardiso for SMP
DPD200372223 Improved performance of spline interpolation on multiple threads
DPD200374978 Improved performance of GEMM family routines for skinny matrix A(n,k) for n/k > 100 for Intel® Xeon Phi™ (aka KNL)

Monte-Carlo simulation on Asian Options Pricing

This is an exercise in performance optimization on heterogeneous Intel architecture systems based on multi-core processors and manycore (MIC) coprocessors.

NOTE: this lab follows the discussion in Sections 4.7.1 and 4.7.2 of the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", second edition (2015). The book can be obtained at xeonphi.com/book

In this step, we will look at how to load-balance an MPI application running on a heterogeneous cluster. The provided source code is a Monte-Carlo simulation of Asian options pricing. For the purposes of this exercise, the actual implementation of the simulation is not important; however, if you are interested in learning more about the simulation itself, refer to the Colfax Research website.

Asian Options Code Sample link: https://github.com/ColfaxResearch/Asian-Options

  1. Study "workdistribution.cc" and compile it. Then run the MPI application across all the nodes available to you (including MICs), with one process on each node.

    You should see that there is load-imbalance, where one node finished faster than others.

  2. A simple solution to this load imbalance is to distribute work unevenly depending on the target system. Implement a tuning variable "alpha" (of type float or double) such that the workload a MIC receives is alpha times the workload a CPU receives. Each node should calculate which options to work on. To do this, use the function input "rankTypes", which stores the type (CPU or MIC) of all nodes in the world. "rankTypes[i]" is "1" if the rank "i" node is on a coprocessor, and "0" if it is on a CPU. Make sure every option is accounted for.

    Compile and run the application. Then try to find the "alpha" value that provides the best performance.

  3. The previous implementation, although simple, has the drawback that the best alpha value depends on the cluster. To make the application independent of the cluster it runs on, implement a boss-worker model in which the boss assigns work to the workers as they complete it.

    Compile and run the code to see the performance. Remember that the node that hosts the boss process should have 2 processes.

    Hint: To implement the boss-worker model, you will need an if statement with two while loops in it. The worker loop should send its rank to the boss, and receive the index that it needs to calculate. The boss should use MPI_ANY_SOURCE in its receive for the rank, and send the next index to the worker rank that it received. When there are no more options to be simulated, the boss should send a "terminate" index (say, an index of -1). When the worker receives this "terminate" index, it should exit the while loop. The boss should exit the while loop when "terminate" has been sent to every other process. Finally, don't forget to call MPI_Barrier before MPI_Reduce to make sure all processes are done before the reduction happens.
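
The message protocol described in the hint for step 3 can be tried out without a cluster. The following is only a sketch of that protocol, not the lab's solution: it swaps MPI for Python threads and queues, and the names `boss`, `worker`, `run` and `TERMINATE` are made up for the illustration. The shared request queue stands in for a receive with MPI_ANY_SOURCE, and the per-worker reply queues stand in for sends to a specific rank.

```python
import queue
import threading

TERMINATE = -1  # the "terminate" index suggested in the hint


def boss(num_options, num_workers, requests, replies):
    """Hand out option indices on demand, then send TERMINATE to every worker."""
    next_index = 0
    terminated = 0
    while terminated < num_workers:
        rank = requests.get()              # stands in for a receive with MPI_ANY_SOURCE
        if next_index < num_options:
            replies[rank].put(next_index)  # send the next index to the rank that asked
            next_index += 1
        else:
            replies[rank].put(TERMINATE)   # no work left for this worker
            terminated += 1


def worker(rank, requests, reply, done):
    """Ask the boss for work until the terminate index arrives."""
    while True:
        requests.put(rank)                 # send own rank to the boss
        index = reply.get()                # receive the index to calculate
        if index == TERMINATE:
            break
        done.append((rank, index))         # placeholder for simulating option `index`


def run(num_options, num_workers):
    """Run one boss and `num_workers` workers; return the (rank, index) pairs processed."""
    requests = queue.Queue()
    replies = {rank: queue.Queue() for rank in range(1, num_workers + 1)}
    done = []
    threads = [
        threading.Thread(target=worker, args=(rank, requests, replies[rank], done))
        for rank in replies
    ]
    for t in threads:
        t.start()
    boss(num_options, num_workers, requests, replies)
    for t in threads:
        t.join()                           # plays the role of the barrier before the reduction
    return done
```

Calling `run(20, 3)` returns each of the 20 indices exactly once; which worker handled which index varies from run to run, which is the point of the model: faster workers simply come back for more work sooner.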

Asian-Options code GitHub link: https://github.com/ColfaxResearch/Asian-Options
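
For step 2 above, one possible way to turn "alpha" into a range of options per rank is sketched below. This is an illustration under assumptions, not the repository's code: the function name `option_range` and its signature are hypothetical, and `rank_types` simply mirrors the "rankTypes" input described in the text. Each rank is given a weight (alpha for a coprocessor, 1 for a CPU), and the range boundaries are the rounded cumulative weights, so consecutive ranges meet exactly and no option is skipped or counted twice.

```python
def option_range(rank, rank_types, num_options, alpha):
    """Return the half-open range [start, end) of option indices for `rank`.

    rank_types[i] is 1 if rank i runs on a coprocessor and 0 if it runs on a CPU.
    A coprocessor rank receives alpha times the workload of a CPU rank.
    """
    weights = [alpha if t == 1 else 1.0 for t in rank_types]
    total = sum(weights)

    def boundary(k):
        # Index of the first option that belongs to rank k (or num_options for k == len).
        return round(num_options * sum(weights[:k]) / total)

    # The end of one rank's range is computed exactly like the start of the next,
    # so the ranges tile [0, num_options) without gaps or overlaps.
    return boundary(rank), boundary(rank + 1)
```

With two CPU ranks and two coprocessor ranks, `[option_range(r, [0, 1, 0, 1], 1000, 2.5) for r in range(4)]` gives four contiguous ranges covering all 1000 options, and the coprocessor ranges are about 2.5 times as long as the CPU ranges.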

Direct N-body Simulation

Exercise in performance optimization on Intel Architecture, including Intel® Xeon Phi processors

NOTE: this lab is an overview of various optimizations discussed in Chapter 4 in the book "Parallel Programming and Optimization with Intel Xeon Phi Coprocessors", second edition (2015). The book can be obtained at xeonphi.com/book

In this step we will look at how to modernize a piece of code through an example application. The provided source code is an N-body simulation, which is a simulation of many particles that interact with each other gravitationally or electrostatically. We keep track of the position and the velocity of each particle in the structure "Particle". The simulation is discretized into timesteps. In each timestep, first, the force on each particle (stored in the structure) is calculated with a direct all-to-all algorithm (O(n^2) complexity). Next, the velocity of each particle is modified using the explicit Euler method. Finally, the positions of the particles are updated using the explicit Euler method.

N-body simulations are used in astrophysics to model galaxy evolution, colliding galaxies, dark matter distribution in the Universe, and planetary systems. They are also used in simulations of molecular structures. Real astrophysical N-body simulations, targeted to systems with billions of particles, use simplifications to reduce the complexity of the method to O(n log n). However, our toy model is the basis on which the more complex models are built.

In this lab, you will mostly be modifying the function MoveParticles().

  1. Study the code, then compile and run the application to get the baseline performance. To run the application on the host, use the command "make run-cpu" and for coprocessor, use "make run-mic".

  2. Parallelize MoveParticles() by using OpenMP. Remember that there are two loops that need to be parallelized. You only need to parallelize the outer-most loop.

    Also modify the print statement, which is hardwired to print "1 thread", so that it prints the actual number of threads used.

    Compile and run the application to see if you got an improvement.

  3. Apply strength reduction for the calculation of force (the j-loop). You should be able to limit the use of expensive operations to one sqrtf() and one division, with the rest being multiplications. Also make sure to control the precision of constants and functions.

    Compile and run the application to see if you got an improvement.

  4. In the current implementation the particle data is stored in an Array of Structures (AoS), namely an array of "ParticleType" structures. Although this is great for readability and abstraction, it is sub-optimal for performance because the coordinates of consecutive particles are not adjacent. Thus when the positions and the velocities are accessed in the loop and vectorized, the data has a non-unit stride access, which hampers performance. Therefore it is often beneficial to implement a Structure of Arrays (SoA) instead, where a single structure holds coordinate arrays.

    Implement SoA by replacing "ParticleType" with "ParticleSet". "ParticleSet" should have 6 arrays of size "n", one for each dimension in the coordinates (x, y, z) and velocities (vx, vy, vz). The i-th element of each array is the coordinate or velocity of the i-th particle. Be sure to also modify the initialization in main(), and modify the access to the arrays in "MoveParticles()". Compile, then run to see if you get a performance improvement.

  5. Let's analyze this application in terms of arithmetic intensity. Currently, the vectorized inner j-loop iterates through all particles for each i-th element. Since the cache line length and the vector length are the same, arithmetic intensity is simply the number of instructions in the inner-most loop. Not counting the reduction at the bottom, the number of operations per iteration is ~20, which is less than the ~30 that the roofline model calls for.

    To fix this, we can use tiling to increase cache re-use. By tiling in "i" or "j" with Tile=16 (we chose 16 because it is the cache line length as well as the vector length), we can increase the number of operations to ~16*20=~320. This is more than enough to be in the compute-bound region of the roofline model.

    Although the loop can be tiled in "i" or "j" (if we allow loop swap), it is more beneficial to tile in "i" and therefore vectorize in "i". If we have "j" as the inner-most loop, each iteration requires three reductions of the vector register (for Fx, Fy, Fz). This is costly as it is not vectorizable. On the other hand, if we vectorize in "i" with tile = 16, it does not require reduction. Note, though, that you will need to create three buffers of length 16 where you can store Fx, Fy and Fz for the "i"th element.

    Implement tiling in "i". Then compile and run to see the performance.

  6. Using MPI, parallelize the simulation across multiple processes (or compute nodes). To make this work doable in a short time span, keep the entire data set in each process. However, each MPI process should execute only a portion of the loop in the MoveParticles() function. Try to minimize the amount of communication between the nodes. You may find the MPI function MPI_Allgather() useful. Compile and run the code to see if you get a performance improvement.
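
The strength reduction asked for in step 3, applied to coordinates stored as separate arrays as in step 4, might look like the sketch below. It is written in Python only to show the arithmetic; the function name `accumulate_force` and the `softening` term (a small constant that keeps the i == j term finite) are assumptions of this sketch rather than details taken from the lab code. Python has no single-precision `sqrtf()`, so the advice about controlling the precision of constants and functions is not illustrated here.

```python
import math


def accumulate_force(i, x, y, z, softening=1e-20):
    """Sum the force contributions on particle i from all particles.

    x, y and z are separate coordinate arrays (Structure of Arrays layout).
    """
    fx = fy = fz = 0.0
    for j in range(len(x)):
        dx = x[j] - x[i]
        dy = y[j] - y[i]
        dz = z[j] - z[i]
        r2 = dx * dx + dy * dy + dz * dz + softening
        inv_r = 1.0 / math.sqrt(r2)       # the only square root and the only division
        inv_r3 = inv_r * inv_r * inv_r    # 1 / r^3 from multiplications alone
        fx += dx * inv_r3
        fy += dy * inv_r3
        fz += dz * inv_r3
    return fx, fy, fz
```

A direct translation of the formula would evaluate something like `dx / r2 ** 1.5` three times per pair, which costs a power function and three divisions; the version above replaces these with one reciprocal square root and a handful of multiplications.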

Direct N-Body simulation code GitHub link: https://github.com/ColfaxResearch/N-body
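
The loop structure that step 5 describes (tiling in "i", with three tile-sized buffers for Fx, Fy and Fz) is sketched below. Python will not vectorize anything, so this only shows the ordering of the loops; the function name `forces_tiled`, the `softening` constant, and the handling of a particle count that is not a multiple of the tile size are assumptions of the sketch.

```python
import math

TILE = 16  # tile size used in the text


def forces_tiled(x, y, z, softening=1e-20):
    """Compute the force sums for all particles with the i-loop tiled."""
    n = len(x)
    fx, fy, fz = [0.0] * n, [0.0] * n, [0.0] * n
    for ii in range(0, n, TILE):
        width = min(TILE, n - ii)          # the last tile may be shorter
        # Three buffers hold the partial sums for particles ii .. ii + width - 1.
        bx, by, bz = [0.0] * width, [0.0] * width, [0.0] * width
        for j in range(n):
            # In compiled code this loop over the tile is the vectorized one:
            # each lane updates its own buffer slot, so no reduction is needed.
            for k in range(width):
                i = ii + k
                dx = x[j] - x[i]
                dy = y[j] - y[i]
                dz = z[j] - z[i]
                inv_r = 1.0 / math.sqrt(dx * dx + dy * dy + dz * dz + softening)
                inv_r3 = inv_r * inv_r * inv_r
                bx[k] += dx * inv_r3
                by[k] += dy * inv_r3
                bz[k] += dz * inv_r3
        fx[ii:ii + width] = bx
        fy[ii:ii + width] = by
        fz[ii:ii + width] = bz
    return fx, fy, fz
```

Each particle "j" is loaded once per tile and reused for all 16 particles of the tile, which is where the increase in operations per memory access comes from.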

Multithreaded Transposition of Square Matrices with Common Code for Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of the transposition algorithm efficiency.

This paper demonstrates and discusses an efficient C language implementation of parallel in-place square matrix transposition. For large matrices, it achieves a transposition rate of 49 GB/s (82% efficiency) on Intel® Xeon® Processors and 113 GB/s (67% efficiency) on Intel® Xeon Phi™ coprocessors. The code is tuned with pragma-based compiler hints and compiler arguments. Thread parallelism in the code is handled by OpenMP*, and vectorization is automatically implemented by the Intel compiler. This approach allows the same C code to be used for both the CPU and the MIC architecture executables, both demonstrating high efficiency. For benchmarks, an Intel Xeon Phi 7110P coprocessor is used.
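
To make the operation concrete, here is a minimal single-threaded sketch of in-place transposition of a square matrix, processed tile by tile. It is not the paper's C implementation and makes no performance claim: the function name `transpose_in_place`, the flat row-major storage, and the tile size of 4 are choices made for this illustration. Tiling is used here only as one common way to limit the non-contiguous accesses mentioned above.

```python
def transpose_in_place(a, n, tile=4):
    """Transpose the n-by-n matrix stored row-major in the flat list `a`."""
    for ii in range(0, n, tile):
        for jj in range(0, ii + 1, tile):        # tiles on or below the diagonal only
            for i in range(ii, min(ii + tile, n)):
                # In a diagonal tile, stop before the diagonal so that
                # every pair (i, j), j < i, is swapped exactly once.
                j_end = min(jj + tile, n) if jj < ii else i
                for j in range(jj, j_end):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
```

After the call, element (i, j) of the list holds what was originally at (j, i). In a threaded version the tiles are natural units of work, since distinct tile pairs touch disjoint parts of the matrix.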

To run the benchmark, execute the script ./benchmark.sh

The included Makefile and script ./benchmark.sh are designed for Linux.

In order for the CPU code to compile, you must have the Intel C++ compiler installed in the system.

In order to compile and run the MIC platform code, you must have an Intel Xeon Phi coprocessor in the system and the MIC Platform Software Stack (MPSS) installed and running.

Multithreaded Transposition of Square Matrices code GitHub link: https://github.com/ColfaxResearch/Transposition

 

Tencent In-game Purchase Machine Learning Recommendation System on Intel® Xeon® Processors

Online gaming is very popular nowadays, especially with young people. They play games during their leisure time, with family members or with friends. In many cases, players need to buy items to equip their characters in order to have an advantage over other players.

To enhance the user experience when playing online games, Tencent installed an in-game purchase recommendation system that employs machine learning [2] to help users decide what equipment they would want to buy for their games.

Tencent [1] is an Internet company. It offers many services including social networks, web portals, e-commerce, and multiplayer online games.

The next sections discuss what a recommendation system [9] is, which machine learning algorithm Tencent uses for its recommendation system, and how the Intel® Xeon® processor family helps improve the performance of that system.

Recommendation System

A recommendation system is a mechanism that produces a list of recommended items for users to choose from. Recommendation systems are used extensively to help users decide which item to select among many; they can be used to select songs, movies, research articles, and so on.

In Tencent’s case, one application is to use the recommendation system to suggest appropriate equipment in online games.

A recommendation system generates a list of items using one of the following approaches: collaborative [10], content-based [11], or hybrid.

A collaborative algorithm makes recommendations based on the ratings or behavior of other users in the system. It analyzes activities or preferences and predicts what users will like based on their similarity to other users.

A content-based algorithm recommends an item to a user based upon a description of the item and a profile of the user’s interests.

A hybrid algorithm combines the collaborative and content-based approaches.

Tencent uses the machine learning algorithm called logistic regression [3] for its in-game purchase recommendation system. The next section briefly discusses what logistic regression is and gives its formula.

Logistic Regression

Logistic regression is a predictive analysis. It is one of the most popular machine learning algorithms for binary classification. Binary classification means the result is dichotomous or, in other words, has only two classes, like win and lose, yes and no, true and false, 1 and 0. For example, consider betting on whether a horse is going to win or lose a race. Here we have two classes, win and lose. The target (dependent) variable here is the bet. It has a value of 1 if the horse wins the race and 0 otherwise.

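Logistic regression models the log odds [5] of an event as a linear function of the independent variables, using the following equation:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

p: The probability of presence of the event

1 – p: The probability of absence of the event

β: Weights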

x: Independent variables

Logistic regression generates the coefficients β of the above formula to predict the probability of presence of an event.
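
Once the coefficients are known, turning them into a probability is a short calculation. The sketch below assumes the usual form of the model, in which the log odds ln(p / (1 - p)) equal β0 + β1·x1 + ... + βn·xn; the function name `predict_probability` and the example weights are hypothetical and have nothing to do with Tencent's actual model.

```python
import math


def predict_probability(beta, x):
    """Return p, the predicted probability that the event is present.

    beta[0] is the intercept; beta[1:] are the weights of the independent
    variables in x, in the same order.
    """
    log_odds = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    # Inverting ln(p / (1 - p)) = log_odds gives the logistic function.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical weights for two independent variables.
p = predict_probability([-1.0, 0.8, 0.3], [2.0, 1.0])
```

A log odds value of 0 corresponds to p = 0.5; positive values push the prediction toward 1 and negative values toward 0, which is what makes the output usable for binary classification.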

Tencent In-game Purchase recommendation system and Intel® Xeon® Processor E5 v4

Tencent’s machine learning engine analyzes a huge amount of online gaming user behavior to suggest what equipment users should use in their games. Therefore, it needs a lot of computing power to reduce the model training time. It uses DGEMM [6] extensively in the module that computes the coefficients for the logistic regression machine learning algorithm. DGEMM is the matrix multiplication function for double-precision floating-point numbers.

Tencent’s machine learning engine uses the DGEMM function through the Intel® Math Kernel Library (Intel® MKL) [7]. The Intel Xeon processor E5 v4 family supports Intel® Advanced Vector Extensions 2 (Intel® AVX2) [8], and Intel MKL is highly optimized for performance using Intel AVX2. Applications using Intel MKL only need to link to the latest version of Intel MKL to take advantage of new features in future Intel® Xeon® processors, since Intel MKL auto-detects new features and makes use of them, if applicable.

Performance Test Procedure

To see how much performance improves between the previous and the current generation of Intel® Xeon® processors, we performed tests on two platforms. One system was equipped with the Intel® Xeon® processor E5-2699 v3 and the other with the Intel® Xeon® processor E5-2699 v4.

Test Configuration

System equipped with the dual-socket Intel Xeon processor E5-2699 v4

  • System: Preproduction
  • Processors: Intel Xeon processor E5-2699 v4 @2.2GHz
  • Cache: 55 MB
  • Cores: 22
  • Memory: 128 GB DDR4-2133MT/s

System equipped with the dual-socket Intel Xeon processor E5-2699 v3

  • System: Preproduction
  • Processors: Intel Xeon processor E5-2699 v3 @2.3GHz
  • Cache: 45 MB
  • Cores: 18
  • Memory: 128 GB DDR4-2133 MT/s

Operating System: Red Hat Enterprise Linux* 7.2-kernel 3.10.0-327

Software:

  • GNU* C Compiler Collection 4.8.2
  • OpenJDK* 7
  • Spark* 1.5.2
  • Intel® MKL 11.3

Application: Tencent Machine learning training workload

Test Results

The following test results show the performance improvement at the application level and at the coefficient computing module, respectively.


Figure 1: Comparison between the application using Intel® Xeon® processor E5-2699 v3 and the Intel® Xeon® processor E5-2699 v4.

Figure 1 shows the results between the application using Intel® Xeon® processor E5-2699 v3 and the Intel® Xeon® processor E5-2699 v4. Since the application is scalable, it can schedule more tasks to run in parallel in v4 than in v3 resulting in reducing the training time of the machine learning model. 


Figure 2: Comparison between the coefficient computing module using Intel® Xeon® processor E5-2699 v4 with Intel® AVX2 enabled

Figure 2 shows the performance improvement of the computing coefficient module when Intel AVX2 is enabled on a system equipped with the Intel Xeon processor E5-2699 v4. The performance is improved by 44%.

Note: Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Conclusion

The in-game purchase machine learning recommendation system is embedded inside Tencent games. Therefore, optimizing it helps speed up the decision-making process, allowing the system to suggest better game equipment to players of Tencent games. Intel MKL makes use of Intel AVX2, which improves the performance of applications running on systems equipped with the Intel Xeon processor.

References

  1. Tencent company information
  2. Machine Learning
  3. Logistic regression
  4. What is logit
  5. Odds and log odds
  6. DGEMM
  7. Intel® Math Kernel Library
  8. Intel® AVX2
  9. Machine learning for recommendation system
  10. Collaborative recommendation
  11. Content Based Recommendation
  12. Hybrid recommendation

Fine-Tuning Vectorization and Memory Traffic on Intel® Xeon Phi™ Coprocessors: LU Decomposition of Small Matrices

by Andrey Vladimirov, Colfax International

LU-decomposition

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel® Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned data hint, and pointer disambiguation. In addition, the loop tiling technique of memory traffic tuning is shown. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single precision matrix of size 128×128.
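
For readers who want the starting point in front of them, a plain LU decomposition can be sketched as follows. This is an illustration, not the paper's code: it assumes no pivoting and a flat row-major array, the function name `lu_decompose_in_place` is made up, and whether the paper applies strength reduction in exactly this place is an assumption. The sketch does show one simple instance of the idea, replacing the repeated division by the pivot with a single reciprocal followed by multiplications.

```python
def lu_decompose_in_place(a, n):
    """Overwrite the n-by-n row-major matrix `a` with its LU factors.

    On return, the part of `a` below the diagonal holds L (whose diagonal is
    implicitly 1) and the diagonal and the part above it hold U.
    No pivoting is performed, so every pivot must be non-zero.
    """
    for k in range(n):
        pivot_inv = 1.0 / a[k * n + k]        # one division per column
        for i in range(k + 1, n):
            a[i * n + k] *= pivot_inv         # multiplier l_ik, by multiplication
            factor = a[i * n + k]
            for j in range(k + 1, n):         # unit-stride update of row i
                a[i * n + j] -= factor * a[k * n + j]
```

The innermost j-loop walks both rows with unit stride, which is the loop a compiler would vectorize; the techniques listed above are ways of helping it do so efficiently.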

Benchmarks show that the discussed optimizations improve the performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor.

The code discussed in the paper can be freely downloaded from https://github.com/ColfaxResearch/LU-decomposition

Intel® OpenCL™ Graphics Extensions

OpenCL Extensions available in Intel® SDK for OpenCL™ Applications

The following tables contain information about extensions to the Khronos Group OpenCL™ standard available for Intel processors.   

Notice: Not all extensions are available in all versions of the OpenCL drivers for each OS. Some features are only available on certain hardware platforms or in certain driver baselines. 
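
Because availability varies, applications usually check for an extension before relying on it. An OpenCL device reports its extensions as a single space-separated string (the CL_DEVICE_EXTENSIONS query of clGetDeviceInfo). The sketch below shows only the string check; the helper name `has_extension` and the sample string are made up for the illustration, and obtaining the real string requires an OpenCL binding that is not shown here.

```python
def has_extension(extension_string, name):
    """Return True if `name` is one of the space-separated extension names."""
    # Splitting first avoids false positives on prefixes, for example
    # "cl_khr_subgroups" inside "cl_khr_subgroups_something".
    return name in extension_string.split()


# Hypothetical string, shaped like what a device might report.
sample = "cl_khr_fp64 cl_intel_subgroups cl_khr_gl_sharing cl_intel_motion_estimation"

use_subgroups = has_extension(sample, "cl_intel_subgroups")
use_advanced_vme = has_extension(sample, "cl_intel_advanced_motion_estimation")
```

A code path that needs an extension can then be selected only when the check succeeds, with a fallback otherwise.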

 

Media Extensions

These extensions enable video processing applications to access hardware features in Intel processors.

Extension Name: cl_intel_advanced_motion_estimation, cl_intel_motion_estimation
Description: Custom video motion estimation (VME) extensions enable encoding solutions that enhance the capabilities of Intel Media SDK. This includes custom bitrate control algorithms, frame-based motion estimation enhancements, bi-directional skip checks, MV costing, and intraframe prediction.
Notes:
  • Version 2 of this spec was introduced in early 2016.
  • Advanced VME allows access to a superset of the features of the original cl_intel_motion_estimation extension.
For more info: https://software.intel.com/en-us/articles/intro-to-advanced-motion-estimation-extension-for-opencl
Motion estimation samples are available in the Media Server Studio samples.
Links: Spec

Extension Name: cl_intel_packed_yuv
Description: YUV is usually a planar format. This extension provides support for a few specific formats of packed YUV images.
Links: Spec

 

Sharing Extensions

This group of extensions enables interoperability between OpenCL and other APIs using Intel GPUs.

Extension Name: cl_intel_simultaneous_sharing
Description: The OpenCL 1.2 Extension Spec forbids interoperability with multiple graphics APIs at clCreateContext or clCreateContextFromType time. It defines that CL_INVALID_OPERATION should be returned in such cases. The goal of this extension is to relax the restrictions and allow simultaneous use of API combinations as supported by a given OpenCL device.
Links: Spec

Extension Name: cl_intel_va_api_media_sharing
Description: Linux/Android Media Sharing. Used in Media Server Studio samples.
See https://software.intel.com/en-us/articles/tutorial-opencl-interoperability-with-video-acceleration-api-on-linux-os
Links: Spec

Extension Name: cl_intel_d3d11_nv12_media_sharing, cl_intel_dx9_media_sharing
Description: Windows sharing APIs (created before the Khronos extensions below). Used in Media Server Studio samples.
See https://software.intel.com/en-us/articles/d3d9-media-surface-sharing-between-intel-quick-sync-video-and-opencl-on-intel-hd-graphics
Links: d3d11 Spec, dx9 Spec

Extension Name: cl_khr_dx9_media_sharing, cl_khr_d3d10_sharing, cl_khr_d3d11_sharing
Description: Sharing for DirectX 9, 10, 11
See https://software.intel.com/en-us/articles/opencl-and-intel-media-sdk
Links: dx9 Spec, d3d10 Spec, d3d11 Spec

Extension Name: cl_khr_gl_sharing, cl_khr_gl_msaa_sharing, cl_khr_gl_depth_images, cl_khr_gl_event
Sample: https://software.intel.com/sites/default/files/managed/2c/79/intel_ocl_ogl_interop_win.zip
Related Pages: https://software.intel.com/en-us/articles/opencl-and-opengl-interoperability-tutorial
Links: gl_sharing Spec, gl_msaa_sharing Spec, gl_depth_images Spec, gl_event Spec

 

 

Subgroups Extensions

Work items in a subgroup can share data without implementing shared local memory or using barriers. This extends the work group concept to allow more efficient data sharing.

Extension Name: cl_intel_subgroups
Description: Enables work items in a workgroup to work together by sharing data without local memory and global barriers. Similar to OpenCL 2.0 Subgroups.
https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics
https://software.intel.com/en-us/articles/box-blur-filter-using-intel-subgroup-extensions-in-opencl
Links: Spec

Extension Name: cl_intel_required_subgroup_size
Description: The goal of this extension is to allow programmers to optionally specify the required subgroup size for a kernel function. This information is important for the correctness of many subgroup algorithms, and in some cases may be used by the compiler to generate more optimal code.
Links: Spec

Extension Name: cl_khr_subgroups
Description: Implementation-controlled division of a workgroup allowing independent forward progress within the workgroup. This feature was promoted to Core in OpenCL 2.1.
Links: Spec

 

 

Other Extensions

Extension Name: cl_intel_accelerator
Description: Basic accelerator support. The accelerator extension consists of a unified set of OpenCL runtime APIs to create, query, and manage the lifetime of objects which represent acceleration processors, engines, or algorithms.
Links: cl_intel_accelerator.txt, Spec

Extension Name: cl_intel_driver_diagnostics
Description: This extension allows the driver to pass additional strings containing diagnostic information. The diagnostic messages can help to understand how the driver works and can provide guidance to modify an application to improve performance.
Related Pages:
Links: Spec

Extension Name: cl_khr_3d_image_writes
Description: Enables writes to 3D image objects
Links: Spec

Extension Name: cl_khr_byte_addressable_store
Description: Removes restrictions of built-in types (Core feature of 1.1 maintained for backward compatibility). Needed to write to elements of a pointer or struct of type char, uchar, char2, uchar2, short, ushort, and half.
Links: Spec

Extension Name: cl_khr_spir
Description: OpenCL Standard Portable Intermediate Representation (SPIR), a non-source representation of OpenCL.
https://software.intel.com/en-us/articles/using-spir-for-fun-and-profit-with-intel-opencl-code-builder
Links: Spec

Extension Name: cl_khr_fp16
Description: Half-precision floating-point
Links: Spec

Extension Name: cl_khr_fp64
Description: IEEE-754 double-precision floating-point support
Links: Spec

Extension Name: cl_khr_global_int32_base_atomics
Description: 32-bit integer base atomic operations in global memory
Links: Spec

Extension Name: cl_khr_global_int32_extended_atomics
Description: 32-bit integer extended atomic operations in global memory
Links: Spec

Extension Name: cl_khr_icd
Description: Access Khronos OpenCL installable client driver loader (ICD Loader)
Links: Spec

Extension Name: cl_khr_image2d_from_buffer
Description: 2D image from buffer creation support
https://software.intel.com/en-us/articles/using-image2d-from-buffer-extension
Links: Spec

Extension Name: cl_khr_mipmap_image, cl_khr_mipmap_image_writes
Description: cl_khr_mipmap_image gives the ability to create and read mipmapped images. cl_khr_mipmap_image_writes adds the ability to write mipmapped images and requires cl_khr_mipmap_image.
Links: Spec

Extension Name: cl_khr_depth_images
Description: Depth Images
Links: Spec

Extension Name: cl_khr_throttle_hints
Description: Extension to OpenCL 2.1 API which allows the driver to implement throttling behavior. Throttling behavior is implementation specific.
Links: Spec

 

Deprecated Extensions

Extension Name: cl_intel_ctz
Description: Built-in count trailing zeroes

 

[Infographic] What's All The Buzz About Machine Learning


Optimizations Enhance Just Cause 3 on Systems with Intel® Iris™ Graphics

High-end PCs continue to drive desktop gaming with amazing graphics. Powered by CPUs such as the 6th Generation Intel® Core™ brand i7-6700K, a state-of-the-art “dream machine” usually gets paired with a high-end discrete video card to run the most demanding games. One such title is Just Cause 3, developed by Avalanche Studios* and published by Square Enix*. Released in late 2015 to much acclaim, JC3 offers fiery explosions, lush terrain, and amazing scenery, as secret agent Rico Rodriguez fights, soars, glides, and grapples through an expanded open world of breathtaking beauty.

While console play was a big target audience, Avalanche wanted to ensure the game worked on the widest possible range of PC hardware, including systems with integrated graphics. Intel worked with Avalanche to complete a range of general optimizations that benefited all PC configurations, but didn’t stop there. They also formed a small independent team (mostly from the Engine and Research group at Avalanche Studios) in a separate joint effort to optimize for the Intel® Iris™ and Intel® Iris™ Pro graphics chips. That work included harnessing the new graphics features of the 6th Generation Intel® Core™ brand CPUs.

The teams also worked with new graphics features for Microsoft* DirectX* 12-class hardware, exposed under the DirectX 11.3 API. Using additional resources from Intel engineers in R&D across numerous specialties, the result was a game that looked fabulous on the latest consoles and high-end gaming PCs, and also engaged players on well-equipped laptops with Intel Iris graphics.

Intel integrated graphics essentially come in three levels–the mainstream level is HD graphics, and the next step up is Iris graphics, which is high-end mainstream quality. The highest level of integrated graphics is Iris Pro graphics, and the latest version is found on 6th Generation Intel Core processors. In this case study, you’ll learn how Intel optimization tools yielded multiple avenues for improvements. We’ll drill down into the world of shaders, instancing, and low-level Arithmetic and Logic Unit (ALU) optimizations, and explain how Intel and Avalanche Studios searched for every last performance gain.


Figure 1. In Just Cause 3, hero Rico Rodriguez flies over terrain covered with intricate foliage.

Intel® Graphics Performance Analyzers (GPA) proved an excellent tool for helping optimize the complex world of Just Cause 3. This is the third installment of the tremendously popular franchise, which has sold an estimated 7 million copies across all platforms as of mid-2016. The game offers 400 square kilometers of stunning terrain, from sunny beaches to snowy peaks, and reviewers have called it a wide-open playground “primed for explosive action.” Intel and Avalanche focused their efforts on multiple optimizations, pushing the advanced rendering technologies to the limit.

Intel® Graphics Performance Analyzers Help the Cause

The Intel GPA tool suite enables game developers to utilize the full performance potential of their gaming platform. The tools visualize performance data from your application, enabling you to understand system-level and individual frame-performance issues, as well as allowing you to perform ‘what-if’ experiments to estimate potential performance gains from optimizations.

Intel graphics application engineer Antoine Cohade led the JC3 optimization effort from the Intel side. He traveled frequently from his office in Munich to Stockholm, Sweden, to meet with Avalanche developers–including Christian Nilsendahl, project lead, and Emil Persson, head of research. “We did a lot of work remotely, but we did some heavy profiling on site. At times, we were on the phone or live-chatting on a daily basis,” Cohade said. “Basically, we proved that it's possible to take the highest, most demanding game and have it playable on Intel hardware. Developers can use these tools and techniques to make their games playable on a wide range of mobile systems.”

In addition to other tools, the teams used three key parts of Intel GPA:

  1. System Analyzer – Analyze CPU, graphics API, and GPU performance and power metrics.
  2. Graphics Frame Analyzer – Perform single-frame analysis and optimization for Microsoft DirectX, OpenGL*, and OpenGL ES* game workloads.
  3. Platform Analyzer – View where your application is spending time across the CPU and GPU.


Figure 2. The optimization team started by running the System Analyzer, to determine if the system was CPU-limited or GPU-limited.

Over six months–and among many other optimizations–the team fine-tuned these areas:

  • Low-level ALU optimizations
  • Instancing
  • Vegetation
  • Shadowing
  • Dynamic Resolution Rendering
  • DirectX optimizations, for DX11.3 API

Initially, Intel GPA revealed some key challenges. There was a high number of draw calls in multiple frames, and some individual draw calls were very expensive in terms of processing power. The clustered lighting shader and vegetation were key areas to improve.


Figure 3. High-level view of the Graphics Frame Analyzer screen, showing where information can be found for various components.

The team determined that they should implement multiple ALU optimizations to streamline expensive individual shaders. They also investigated areas where high draw counts were frequently making the game CPU-limited, due to the cost of the individual API calls.

Start at the Bottom: Tweaking the Arithmetic and Logic Unit (ALU)

Much of the low-level ALU optimization effort involved re-working the math to generate fewer instructions. Emil Persson, who has published his work and presented two key papers on shading and optimization at the Game Developers Conference, was the leader of this effort. Some of Persson’s lessons are fairly simple: don’t rely on the compiler optimizing for you, separate scalar and vector work, and remember that low-level and high-level optimizations are not mutually exclusive–try to do both.

The investigation showed that the shader compiler wasn’t producing optimal code, which is not intuitive. Persson was able to make very small changes to force the compiler to create incrementally better output. “I think it’s pretty rare that someone goes that low-level and takes the time to document it,” Cohade said of Persson’s work. “You rarely see that kind of effort. You see the optimization, but you don’t see it documented.”

Figure 4 shows a “before and after” example from the code, along with the number of micro-ops generated by the compiler.


Figure 4. Code snippet showing before (left) and after (right) where the team pulled computations out of the loop, saving 30 operations.

In their 2016 GDC talk, Cohade and Persson noted that reducing the utilization of the ALU meant they were able to save about 2 ms (out of 6 ms). Persson also noted in one of his previous optimization talks that reducing the ALU utilization from 50% to 25% while still bound by something else probably doesn't improve performance; it lets the GPU run cooler, however, which can still be a big benefit.
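The general idea of pulling loop-invariant math out of a loop can be illustrated outside of shader code. The following Python sketch is not the shader from Figure 4; the function names, the lighting formula, and the inputs are hypothetical and chosen only to show the transformation.

```python
import math

def shade_naive(lights, view_dir, roughness):
    """Recomputes view-dependent terms for every light (hypothetical formula)."""
    total = 0.0
    for intensity, n_dot_l in lights:
        # Neither of these terms depends on the current light, yet both
        # are evaluated once per loop iteration.
        spec_power = 2.0 / (roughness * roughness) - 2.0
        view_len = math.sqrt(sum(c * c for c in view_dir))
        total += intensity * n_dot_l * (spec_power + 2.0) / view_len
    return total

def shade_hoisted(lights, view_dir, roughness):
    """Same result, with the loop-invariant terms computed once."""
    spec_power = 2.0 / (roughness * roughness) - 2.0
    view_len = math.sqrt(sum(c * c for c in view_dir))
    scale = (spec_power + 2.0) / view_len
    total = 0.0
    for intensity, n_dot_l in lights:
        # Only the per-light work remains inside the loop.
        total += intensity * n_dot_l * scale
    return total

lights = [(1.0, 0.5), (2.0, 0.25), (0.5, 0.9)]
print(shade_naive(lights, (0.0, 3.0, 4.0), 0.5))
print(shade_hoisted(lights, (0.0, 3.0, 4.0), 0.5))
```

Both functions return the same value up to floating-point rounding; the second performs the square root and division once instead of once per light.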

Instancing and Tune-up for Foliage Meshes

After completing some GPU optimizations, the team ran tests that revealed some scenes were now CPU limited. For example, in the GPU view screenshot below, you can see that one CPU thread is almost 100% active, while there are gaps in the GPU execution.


Figure 5. GPU gaps showing that the CPU became the main bottleneck for the current scene.

When the team found out that the CPU limit was due to the high number of draw calls, their next move was to try instancing. Instancing refers to simultaneously rendering multiple copies of the same mesh in a scene. Instancing is particularly helpful when creating foliage, which Just Cause 3 relies on heavily, but instancing also helps with characters and common objects.

By altering different parameters, such as color or skeletal pose, objects and foliage can be represented as repeated geometries without appearing unduly repetitive.

Implementing instancing did help significantly in the scenes that were CPU limited, as Figure 6 shows below. When compared to Figure 5, the CPU thread isn’t always active, which returned the team to the point where the GPU was the limiting factor (GPU queue always full).


Figure 6. Same section of the game after removing the CPU bottleneck.

In one case, for a tiny piece of foliage (4 vertices and 6 indices), with many instances, standard instancing resulted in poor wavefront occupancy. To solve this issue, the team decided to implement manual instancing. They repeated the geometry inside a larger index buffer, allowing the shader to manually fetch data to draw many copies of the trees at once. The amount of vertex data stayed the same. The data was read from a texture to modify the trees and create multiple appearances, reducing redundancy. The result was a dramatic reduction in the number of draw calls, and a reduced number of buffer updates. The team calculated that calls took 2.4 ms before, and 0.7 ms after–a big improvement.


Figure 7. Example of changes from instancing to manual instancing.
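The manual instancing idea described above (repeating the geometry inside a larger index buffer and letting the shader work out which copy it is drawing) can be sketched in Python. This is a simplified illustration, not the engine's code: the quad layout, the function names, and the per-instance offset table (standing in for the texture the shader reads from) are assumptions.

```python
# One small foliage quad: 4 vertices and 6 indices, as in the text.
QUAD_VERTICES = [(-0.5, 0.0), (0.5, 0.0), (0.5, 1.0), (-0.5, 1.0)]
QUAD_INDICES = [0, 1, 2, 0, 2, 3]

def build_repeated_indices(instance_count):
    """Repeat the quad's indices once per instance in one large index buffer."""
    indices = []
    for instance in range(instance_count):
        base = instance * len(QUAD_VERTICES)
        indices.extend(base + i for i in QUAD_INDICES)
    return indices

def fetch_vertex(vertex_id, instance_offsets):
    """What a vertex shader would do: derive instance and corner from the ID.

    The vertex data itself is not duplicated; only 4 vertices exist.
    instance_offsets stands in for per-instance data read from a texture.
    """
    instance = vertex_id // len(QUAD_VERTICES)
    corner = vertex_id % len(QUAD_VERTICES)
    x, y = QUAD_VERTICES[corner]
    ox, oy = instance_offsets[instance]
    return (x + ox, y + oy)

offsets = [(0.0, 0.0), (10.0, 0.0)]
indices = build_repeated_indices(len(offsets))
print(indices)
print([fetch_vertex(i, offsets) for i in indices])
```

All copies are then submitted in a single draw call over the large index buffer, rather than one call (or one hardware instance) per tiny mesh.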

Vegetation Optimizations: Stalls and Forest Layers

While optimizing for Intel Iris graphics, the team discovered that vegetation rendering seemed to take way too long, so they hit upon the idea of disabling stencil writes, changing the state within the Intel GPA Graphics Frame Analyzer and measuring the impact. They found that rendering speeds improved dramatically (up to 80% savings!) in some instances. The reason behind this is that on Intel hardware, writing in the stencil buffer prevents the hardware from using early-Z rejection, creating pipeline stalls, and starving the Execution Units.


Figure 8. The Graphics Frame Analyzer helped identify "stalls" where rendering seemed to take abnormally long.

Optimizing forest layers also proved to be a challenge. At the lowest level of detail (LOD), trees were rendered using a dense grid mesh, at 129x129 per patch, that has an alpha texture mapped onto it, with the forest silhouettes using the alpha texture to fill in the detail. Rendering this took up to 5 ms in some scenes. Disabling stencils also didn’t help.

The team found through the Intel GPA tool that they were vertex bound, so they looked at mesh optimizations. They added 65x65 and 33x33 LOD, resulting in a large reduction in total vertices shaded. There was a small visual difference, but at the highest settings, the visuals stayed identical to before.
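A quick calculation shows why the coarser grids matter for a vertex-bound pass. The sketch below only counts vertices per patch for the grid sizes named above; the helper name is hypothetical.

```python
def grid_vertex_count(side):
    """Number of vertices in a square grid mesh with `side` vertices per edge."""
    return side * side

full = grid_vertex_count(129)
for side in (129, 65, 33):
    count = grid_vertex_count(side)
    print(f"{side}x{side}: {count} vertices per patch "
          f"({full / count:.1f}x reduction relative to 129x129)")
```

Halving the grid resolution cuts the vertex count per patch by roughly a factor of four, so distant patches that switch to 65x65 or 33x33 shade far fewer vertices.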

In addition, the team pursued several shader optimizations. They added a simpler “no-fade” vertex shader, and they added pre-computations to “bake-in” scaling into the world matrix. They recalculated some constants, simplified some math, and measured a performance gain from 5.0 ms to 2.5 ms. They then revisited disabling stencil writes, and reduced their rendering time to 0.5 ms. This is a good example of checking after each fix in order to see if a new bottleneck is hiding behind another issue.

But they didn’t stop there–they also revisited their triangle strips, and cut the time down to 0.4 ms. In all, they achieved a performance gain of an order of magnitude, by resolutely chasing down every single opportunity.

Optimizing Shadows

Cascaded shadow maps (CSMs) are often used to combat perspective aliasing, a common problem in shadowing. The basic problem is that objects nearest the eye require a higher resolution than distant objects, so partitioning the shadow coverage into multiple maps allows for different resolutions. JC3 uses four sun shadow cascades in a scattered update pattern, updating only two cascades per frame, which saves many milliseconds. However, this optimization caused problems when the camera flipped, as the shadows popped in over a few frames.

To reduce popping, the outer shadow cascade was centered around the camera, instead of in front of it.

This technique forced Avalanche Studios to disable frustum culling, and to use a different cascade to increase shadow range. This cost more time in terms of culling and size, so they looked for an alternative approach. They didn’t find a technique that solved the problem, so they reverted to their original approach, this time with a tweak: forcing the outer cascade to be updated in the first frame after the camera flip.

The team also disabled cloud shadows for low shadow settings.
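The scattered update with the camera-flip tweak can be sketched as a small scheduling function. The article does not specify which cascades are paired on which frame, so the alternating pattern below, the function name, and the choice to add the outer cascade to the normal set are all assumptions made for illustration.

```python
CASCADE_COUNT = 4
OUTER_CASCADE = CASCADE_COUNT - 1

def cascades_to_update(frame_index, camera_flipped):
    """Pick which shadow cascades to re-render this frame.

    Assumed pattern: cascades 0 and 1 on even frames, 2 and 3 on odd
    frames. On the first frame after a camera flip the outer cascade is
    forced into the update set so distant shadows do not pop in late.
    """
    selected = [0, 1] if frame_index % 2 == 0 else [2, 3]
    if camera_flipped and OUTER_CASCADE not in selected:
        selected.append(OUTER_CASCADE)
    return selected

for frame, flipped in [(0, False), (1, False), (2, True), (3, False)]:
    print(frame, cascades_to_update(frame, flipped))
```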

Terrain, and Other Tune-ups

To optimize the lush terrain in JC3, the team continuously developed the terrain system, and went from a coarse terrain scheme (with only three levels of detail) to a much finer patch system. This traded more draw calls for less off-screen wastage, and much finer LOD tweaking, saving 1-2 ms, depending on the scene.

The team also looked at which dependencies existed between rendering passes, and they were careful to not add new ones, in order to be as efficient as possible. This allowed them to find the following three optimizations:

  1. Better culling of waterboxes, which saved a complete rendering pass when no water was visible.
  2. Disabling the velocity pass when motion-blur and temporal anti-aliasing were disabled.
  3. When they detected that screen-space reflection was enabled, they disabled a planar reflection pass.

The team also tried to remove as many “clears” as they could, and used the hardware’s fast clear path (0,0,0,0, or 1,1,1,1 when the clear color is not set up during the resource creation) for passes where clears were required.

Because PC hardware usually has a different CPU/GPU balance than consoles, the team investigated rebalancing the work between CPU and GPU, and moved some work back to the CPU. That led to shorter shading time, with more computations for the CPU, but also more efficiency.


Figure 9. Performance gains before and after optimization.

Overall, the team doubled the performance of the title on Intel Iris graphics, using focused optimizations.

DirectX Features

In 2015, a wide range of new video cards came to market, with several launched to coincide with the new DirectX 12 (DX12) Application Programming Interface (API). These cards supported new features such as Conservative Rasterization and Rasterizer Ordered Views. Although designed for the DX12 API, these features were exposed in DX11.3, and thus available to Just Cause 3 engineers. Avalanche then adjusted their engine to match certain new features, which are exclusive to the PC version of the game. As 6th Generation Intel Core graphics have full DX12 feature support, the team decided to use these–either for additional performance gains, or for graphical improvements.

At GDC 2015, Avalanche and Intel discussed the PC features implemented thanks to DX12, such as Order-Independent Transparency, and Conservative Rasterization for Light Assignment.

Conservative Rasterization for Light Assignment

Conservative rasterization is an alternative to “classic” rasterization, where the pixel is rasterized when the triangle covers the center of the pixel. Conservative rasterization means that all pixels that are at least partially covered by a rendered primitive are rasterized. This can be useful in a number of situations, including collision detection, occlusion culling, and visibility detection.
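The difference between the two coverage rules can be shown with a small software sketch. This is an illustration of the definitions above, not how the hardware implements them; the function names are hypothetical, pixels are unit squares with centers at (x + 0.5, y + 0.5), and the triangle is assumed to be non-degenerate.

```python
def edge(a, b, p):
    """Signed area test: positive if p lies to the left of the edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covers_center(tri, px, py):
    """Classic rule: the pixel is rasterized if its center is inside."""
    c = (px + 0.5, py + 0.5)
    w = [edge(tri[i], tri[(i + 1) % 3], c) for i in range(3)]
    return all(v >= 0 for v in w) or all(v <= 0 for v in w)

def overlaps_pixel(tri, px, py):
    """Conservative rule: rasterize if the triangle touches the pixel at all."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    # Separating-axis test against the pixel's own edges (bounding box).
    if max(xs) < px or min(xs) > px + 1 or max(ys) < py or min(ys) > py + 1:
        return False
    # Separating-axis test against each triangle edge: the pixel is outside
    # if all four of its corners lie strictly on the outer side of one edge.
    corners = [(px, py), (px + 1, py), (px, py + 1), (px + 1, py + 1)]
    sign = 1 if edge(tri[0], tri[1], tri[2]) > 0 else -1
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        if all(sign * edge(a, b, c) < 0 for c in corners):
            return False
    return True

def rasterize(tri, width, height, covered):
    return [(x, y) for y in range(height) for x in range(width)
            if covered(tri, x, y)]

small = [(0.1, 0.1), (0.4, 0.1), (0.1, 0.4)]
print(rasterize(small, 2, 2, covers_center))   # no pixel center is covered
print(rasterize(small, 2, 2, overlaps_pixel))  # the touched pixel is kept
```

A triangle smaller than a pixel produces no fragments under the classic rule but still produces one under the conservative rule, which is the property that makes the technique suitable for assigning light volumes to clusters without missing any.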

Avalanche Studios’ engine is a deferred renderer using clustered shading–as Persson presented in his 2013 Siggraph talk. Clustered shading typically performs as well as or better than traditional deferred renderers, improves the worst-case scenario, and solves the depth discontinuity issue that traditional deferred renderers have.

Much of the work here is referenced in Kevin Örtegren and Emil Persson’s GPU Pro 7 article and Örtegren’s master’s thesis, entitled “Clustered Shading: Assigning Arbitrarily Shaped Convex Light Volumes Using Conservative Rasterization,” published by the Blekinge Institute of Technology. Basically, this new approach involves replacing the light assignment pass traditionally done on the CPU, with a GPU version using conservative rasterization. This allows perfect clustering for different light shapes. The entire source code of the GPU Pro 7 article is posted on GitHub. The biggest difference between Örtegren’s thesis and the practical approach is to use a bitfield instead of a linked list, which is doable in JC3, as the maximum number of lights is known to be 256 per type. The bitfield approach has proven to be faster under heavy load, and only slightly slower under light load. Overall, the better light culling of the conservative rasterization approach brought an improvement of up to 30%.
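The bitfield idea can be sketched as follows. With at most 256 lights per type, each cluster needs 256 bits per light type; the split into eight 32-bit words and all names below are assumptions for illustration, not the layout used in the engine or in the GPU Pro 7 source.

```python
MAX_LIGHTS = 256
WORD_BITS = 32
WORDS_PER_CLUSTER = MAX_LIGHTS // WORD_BITS  # 8 words of 32 bits

def new_cluster():
    """An empty light bitfield for one cluster."""
    return [0] * WORDS_PER_CLUSTER

def assign_light(cluster, light_index):
    """Mark a light as affecting this cluster by setting its bit."""
    cluster[light_index // WORD_BITS] |= 1 << (light_index % WORD_BITS)

def lights_in_cluster(cluster):
    """Iterate over set bits, lowest first, as a shading loop would."""
    result = []
    for word_index, word in enumerate(cluster):
        while word:
            lowest = word & -word  # isolate the lowest set bit
            result.append(word_index * WORD_BITS + lowest.bit_length() - 1)
            word ^= lowest         # clear that bit and continue
    return result

cluster = new_cluster()
for light in (40, 3, 255):
    assign_light(cluster, light)
print(lights_in_cluster(cluster))
```

Unlike a linked list, the storage per cluster is fixed and setting a bit needs no allocation, which is only practical because the maximum light count is known in advance.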

Order Independent Transparency

Order Independent Transparency is an approach to solve one of the most fundamental challenges in real-time rendering, and has proven to be quite successful in a number of titles over the past few years. The topic has already been covered in detail, by Marco Salvi’s original approach, and Leigh Davies’ article on Grid 2.

Avalanche Studios’ engine traditionally used alpha testing for fast vegetation rendering, and the team saw an opportunity to improve the visual quality of the title even further by implementing a similar approach on JC3. The OIT code was integrated in the Avalanche engine in a few days, did not require any asset changes, and improved the quality of the vegetation rendering tremendously.


Figure 10. Two scenes from JC3, with no OIT (left) and using OIT (right).

As the approach has already been covered in detail in the above links, this article will describe the differences between the original and the Just Cause 3 approach:

  • The original approach–using HDR–consisted of using one 32-bit buffer to store depth nodes, and another for color and alpha. HDR rendering could not be used in this case, however. This was solved by packing the eight bits of alpha in the depth buffer, leaving the other buffer for an R11G11B10F HDR buffer.
  • As an additional optimization, the team switched to using a Texture2DArray to store each node (instead of a structured buffer), which brought some performance benefit when using fewer than four AOIT nodes (JC3 uses two).
  • With JC3 being an open world, with a lot of far vegetation, Intel and Avalanche engineers realized that Salvi’s approach would be extremely costly. In order to scale the performance from high-end PCs to mainstream systems with integrated graphics, the developers decided to add quality levels for the OIT, by simply using OIT on the first LOD at low settings, versus all levels of detail at high settings.
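Packing an alpha value alongside depth in a single 32-bit word can be sketched as below. The article does not give the bit layout, so the split used here (24 bits of quantized depth in the high bits, 8 bits of alpha in the low bits) and the function names are assumptions.

```python
DEPTH_MAX = 0xFFFFFF  # 24 bits of quantized depth (assumed layout)
ALPHA_MAX = 0xFF      # 8 bits of alpha

def pack_depth_alpha(depth, alpha):
    """Pack depth and alpha, both in [0, 1], into one 32-bit integer."""
    depth_bits = min(int(depth * DEPTH_MAX), DEPTH_MAX)
    alpha_bits = min(int(alpha * ALPHA_MAX + 0.5), ALPHA_MAX)
    # Depth goes in the high bits so packed values still sort by depth.
    return (depth_bits << 8) | alpha_bits

def unpack_depth_alpha(packed):
    """Recover the quantized depth and alpha from a packed word."""
    return (packed >> 8) / DEPTH_MAX, (packed & ALPHA_MAX) / ALPHA_MAX

packed = pack_depth_alpha(0.5, 1.0)
print(hex(packed), unpack_depth_alpha(packed))
```

With alpha stored next to depth, the second 32-bit buffer is free to hold color alone, for example in a format without an alpha channel.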

Conclusion

As a result of the collaboration between Intel and Avalanche, multiple optimizations throughout the rendering pipeline made laptop gameplay on systems with Intel Iris graphics almost on a par with leading platforms such as PS4* and Xbox One*.

Many of the optimizations described here benefit not just systems with integrated graphics, but also high-end systems with discrete graphics cards. All systems benefit from better balancing CPU and GPU workloads, after all. Still, it’s exciting to open up a top title such as Just Cause 3 to a wider audience. Market share is expected to rise for mid-level and high-end mobile devices such as the Microsoft Surface Pro* 4, and the Intel® NUC kits should also offer interesting options for gamers. Code-named “Skull Canyon,” the NUC6i7KYK kit incorporates a very fast quad-core Intel Core i7 processor complete with Intel Iris Pro graphics. These mobile devices are inviting targets for game producers, and the optimizations described here will help reach those devices–no matter how demanding the graphics output.

Additional Resources

Intel GPA Documentation: https://software.intel.com/en-us/gpa-support/documentation

GRID Autosport article: https://software.intel.com/en-us/articles/codemasters-leads-the-pack-in-pc-to-tablet-optimization-with-grid-autosport

Just Cause 3: https://justcause.com

Clustered Shading: http://www.cse.chalmers.se/~uffe/clustered_shading_preprint.pdf

Configuration for Python Benchmarks


Configuration Info

Software

  • apt/atlas: installed with apt-get, Ubuntu 16.10, python 3.5.2, numpy 1.11.0, scipy 0.17.0
  • pip/openblas: installed with pip, Ubuntu 16.10, python 3.5.2, numpy 1.11.1, scipy 0.18.0;
  • Intel Python: Intel Distribution for Python 2017

Hardware

  • Xeon: Intel Xeon CPU E5-2698 v3 @ 2.30 GHz (2 sockets, 16 cores each, HT=off), 64 GB of RAM, 8 DIMMS of 8GB@2133MHz
  • Xeon Phi: Intel Intel(R) Xeon Phi(TM) CPU 7210 1.30 GHz, 96 GB of RAM, 6 DIMMS of 16GB@1200MHz

​Problem sizes

 

Hardware/problem size | dot | lu | det | inv | cholesky | fft
Xeon (32 core) & Xeon Phi (64 core) | (20k, 10k) & (10k, 20k) | (35k, 35k) | (15k, 15k) | (25k, 25k) | (40k, 40k) | 520k
Xeon (1 core) | (20k, 5k) & (5k, 20k) | (20k, 20k) | (10k, 10k) | (10k, 10k)
Xeon Phi (1 core) | (20k, 300) & (300, 20k) | (6k, 6k) | (4k, 4k) | (2k, 2k)

 

 

Improving Support Vector Machine with Intel® Data Analytics Acceleration Library


Improving the Performance of Support Vector Machine with Intel® Data Analytics Acceleration Library

Introduction

With the wide availability of the internet, text categorization has become an important way to handle and organize text data. Text categorization is used to classify news stories and find information on the Web. Also, in order to search for a photo on the web or be able to distinguish a horse from a lion, for example, there must be some kind of mechanism to recognize and classify the pictures. Classifying text or pictures is time consuming. This type of classification is a good candidate for machine learning. [1]

This article describes a classification machine learning algorithm called support vector machine [2] and how the Intel® Data Analytics Acceleration Library (Intel® DAAL) [3] helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.

What is a Support Vector Machine?

A support vector machine (SVM) is a supervised machine learning algorithm. It can be used for classification and regression.

An SVM performs classification by finding the hyperplane [4] that separates a set of objects belonging to different classes. This hyperplane is chosen so that it maximizes the margin between the two classes, which reduces noise and increases the accuracy of the results. The data points that lie on the margins are called support vectors.

Figure 1 shows how an SVM classifies objects:

Figure 1: Classifying objects with a support vector machine.

There are two classes: green and purple. The hyperplane separates the two classes. If an object lies on the left side of the hyperplane, it is classified as belonging to the green class. Similarly, an object lying on the right side of the hyperplane belongs to the purple class.

As mentioned above, we need to maximize the margin H (the distance between the two margins) to reduce noise, thus improving the accuracy of the prediction.

$$H = \frac{2}{\lVert w \rVert}$$

In order to maximize the margin H, we need to minimize $\lVert w \rVert$.

We also need to make sure that there are no data points lying between the two margins. To do that, the following conditions need to be met:

$$x_i \cdot w + b \ge +1 \quad \text{when } y_i = +1$$

$$x_i \cdot w + b \le -1 \quad \text{when } y_i = -1$$

The above conditions can be combined into a single condition:

$$y_i (x_i \cdot w + b) \ge 1$$
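As a small numeric illustration, the sketch below checks the margin conditions for a few hand-picked two-dimensional points and computes the margin width using the standard result that the distance between the two margins is 2 divided by the length of w. The function names and the data are hypothetical.

```python
import math

def margin_width(w):
    """Distance between the two margins: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(c * c for c in w))

def satisfies_margin(points, labels, w, b):
    """True if every point meets y_i * (x_i . w + b) >= 1."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in zip(points, labels)
    )

points = [(2.0, 0.0), (1.0, 3.0), (-1.0, 1.0), (-3.0, 0.0)]
labels = [+1, +1, -1, -1]

for w in [(2.0, 0.0), (1.0, 0.0), (0.5, 0.0)]:
    print(w, margin_width(w), satisfies_margin(points, labels, w, 0.0))
```

Shrinking w widens the margin, but only as long as no point ends up between the two margins; for this data the widest valid margin among the three candidates is the one for w = (1, 0).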

So far we have talked about the hyperplane as being a flat plane or as a line in a two-dimensional space. However, in real-life situations, that is not always the case. Most of the time, the hyperplane will be curved, not straight, as shown in Figure 2.

Figure 2: The hyperplane as a curved line.

For simplicity, assume that we are working in a two-dimensional space. In this case, the hyperplane is a curved line. To transform the curved line into a straight line, we can raise the whole thing into higher dimensions. How about lifting into a three-dimensional space by introducing a third dimension, called z?

Figure 3: Introducing a third dimension, z.

The technique of raising the data to a higher dimensional space so that we can create a straight line or a flat plane in a higher dimension is called a kernel trick [5].

Figure 4: Using a kernel trick to create a straight line or flat plane in a higher dimension.
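The lifting idea can be tried on a toy example. In the sketch below, two classes that are separated by a circle in two dimensions become separable by a flat plane once a third coordinate z = x² + y² is added. The mapping, the data points, and the threshold are chosen for illustration only; in practice an SVM library evaluates a kernel function rather than computing the lifted coordinates explicitly.

```python
def lift(point):
    """Map a 2D point to 3D by adding z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

# Class "inner" lies inside the unit circle, class "outer" outside it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

THRESHOLD = 1.0  # the flat plane z = 1 in the lifted space

def classify(point):
    """A linear decision in 3D: compare the lifted z against the plane."""
    return "inner" if lift(point)[2] < THRESHOLD else "outer"

print([classify(p) for p in inner])
print([classify(p) for p in outer])
```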

Applications of SVM

SVMs can be used to:

  • Categorize text and hypertext.
  • Classify images.
  • Recognize hand-written characters.

Advantages and Disadvantages of SVM

Using SVM has advantages and disadvantages:

  • Advantages
    • Works well with a clear margin of separation.
    • Effective in high-dimensional spaces.
    • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Disadvantages
    • Takes a huge amount of time to train a large dataset.
    • Doesn’t work well when there is no clear separation between target classes.

Intel® Data Analytics Acceleration Library

Intel DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These basic building blocks are highly optimized for the latest features of the latest Intel® processors. More about Intel DAAL can be found at [6]. The SVM classifier is one of the classification algorithms that Intel DAAL provides. In this article, we use the Python* API of Intel DAAL to build a basic SVM classifier. To install it, follow the instructions in [9].

Using Support Vector Machine Algorithm in Intel Data Analytics Acceleration Library

This section shows how to invoke the SVM algorithm [7] in Python [8] using Intel DAAL.

Do the following steps to invoke the SVM algorithm from Intel DAAL:

  1. Import the necessary packages using the commands from and import

     
    1. Import Numpy by issuing the command:
      import numpy as np
    2. Import the Intel DAAL numeric table by issuing the following command:
      from daal.data_management import HomogenNumericTable
    3. Import the SVM algorithm using the following commands:
      from daal.algorithms.svm import training, prediction
      from daal.algorithms import classifier, kernel_function
      import daal.algorithms.kernel_function.linear
  2. Create a function to split the input dataset into the training data, label, and test data.

    Basically, split the input data set array into two arrays. For example, for a dataset with 100 rows, split it into 80/20: 80 percent of the data for training and 20 percent for testing. The training data will contain the first 80 lines of the array input dataset, and the testing data will contain the remaining 20 lines of the input dataset.

  3. Restructure the training and testing dataset so Intel DAAL can read it.

    Use the commands to reformat the data as follows (We treat trainLabels and testLabels as n-by-1 tables, where n is the number of lines in the corresponding datasets):

    trainInput = HomogenNumericTable(trainingData)
    trainLabels = HomogenNumericTable(trainGroundTruth.reshape(trainGroundTruth.shape[0],1))

    testInput = HomogenNumericTable(testingData)
    testLabels = HomogenNumericTable(testGroundTruth.reshape(testGroundTruth.shape[0],1))

    where

    trainInput: Training data reformatted as an Intel DAAL numeric table.

    trainLabels: Training labels reformatted as an Intel DAAL numeric table.

    testInput: Testing data reformatted as an Intel DAAL numeric table.

    testLabels: Testing labels reformatted as an Intel DAAL numeric table.

  4. Create a function to train the model.

     
    1. First create an algorithm object to train the model using the following command:
      algorithm = training.Batch_Float64DefaultDense(nClasses)

    2. Pass the training data and label to the algorithm using the following commands:
      algorithm.input.set(classifier.training.data,trainInput)
      algorithm.input.set(classifier.training.labels,trainLabels)

      where
      algorithm: The algorithm object as defined in step a above.
      trainInput: Training data.
      trainLabels: Training labels.
    3. Train the model using the following command:
      model = algorithm.compute()
      where
      algorithm: The algorithm object as defined in step a above.
  5. Create a function to test the model.

     
    1. First create an algorithm object to test/predict the model using the following command:
      algorithm = prediction.Batch_Float64DefaultDense(nClasses)
    2. Pass the testing data and the trained model to the algorithm using the following commands:
      algorithm.input.setTable(classifier.prediction.data, testInput)
      algorithm.input.setModel(classifier.prediction.model, model.get(classifier.training.model))
      where
      algorithm: The algorithm object as defined in step a above.
      testInput: Testing data.
      model: Name of the model object.

    3. Test/predict the model using the following command:
      predictionResult = algorithm.compute()
      where
      algorithm: The algorithm object as defined in step a above.
      predictionResult: Prediction result that contains predicted labels for test data.
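The splitting function described in step 2 is not shown in the steps above. A minimal sketch, assuming the rows are already in the desired order and using a hypothetical function name, could look like this. It relies only on slicing, so it works on Python lists as well as on NumPy arrays.

```python
def split_dataset(data, labels, train_fraction=0.8):
    """Split rows into training and testing parts by position.

    The first `train_fraction` of the rows becomes the training set and
    the remaining rows become the testing set, as described in step 2.
    """
    n_train = int(len(data) * train_fraction)
    training_data, testing_data = data[:n_train], data[n_train:]
    train_ground_truth, test_ground_truth = labels[:n_train], labels[n_train:]
    return training_data, train_ground_truth, testing_data, test_ground_truth

rows = [[float(i), float(i) * 2.0] for i in range(10)]
classes = [0] * 5 + [1] * 5
trainingData, trainGroundTruth, testingData, testGroundTruth = split_dataset(rows, classes)
print(len(trainingData), len(testingData))
```

A positional split like this assumes the input rows are not sorted by class; if they are, shuffling the rows and labels together before splitting is the usual precaution.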

Conclusion

SVM is a powerful classification algorithm. It works well with a clear margin of separation. Intel DAAL optimized the SVM algorithm. By using Intel DAAL, developers can take advantage of new features in future generations of Intel® Xeon® processors without having to modify their applications. They only need to link their applications to the latest version of Intel DAAL.

References

  1. Wikipedia – machine learning

  2. Support Vector Machine

  3. Introduction to Intel DAAL

  4. What is Hyperplane

  5. Kernel trick

  6. Introduction to Intel Data Analytics Acceleration Library

  7. Support Vector Machine Algorithm

  8. Python website

  9. How to install the Python Version of Intel DAAL in Linux*

Intel® XDK FAQs - General


How can I get started with Intel XDK?

There are plenty of videos and articles that you can go through here to get started. You could also start with some of our demo apps. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

Having prior understanding of how to program using HTML, CSS and JavaScript* is crucial to using the Intel XDK. The Intel XDK is primarily a tool for visualizing, debugging and building an app package for distribution.

You can do the following to access our demo apps:

  • Select Project tab
  • Select "Start a New Project"
  • Select "Samples and Demos"
  • Create a new project from a demo

If you have specific questions following that, please post them to our forums.

How do I convert my web app or web site into a mobile app?

The Intel XDK creates Cordova mobile apps (aka PhoneGap apps). Cordova web apps are driven by HTML5 code (HTML, CSS and JavaScript). There is no web server in the mobile device to "serve" the HTML pages in your Cordova web app; the main program resources required by your Cordova web app are file-based, meaning all of your web app resources are located within the mobile app package and reside on the mobile device. Your app may also require resources from a server. In that case, you will need to connect with that server using AJAX or similar techniques, usually via a collection of RESTful APIs provided by that server. However, your app is not integrated into that server; the two entities are independent and separate.

Many web developers believe they should be able to include PHP or Java code or other "server-based" code as an integral part of their Cordova app, just as they do in a "dynamic web app." This technique does not work in a Cordova web app because your app does not reside on a server and there is no "backend"; your Cordova web app is a "front-end" HTML5 web app that runs independent of any servers. See the following articles for more information on how to move from writing "multi-page dynamic web apps" to "single-page Cordova web apps":

Can I use an external editor for development in Intel® XDK?

Yes, you can open your files and edit them in your favorite editor. However, note that you must use Brackets* to use the "Live Layout Editing" feature. Also, if you are using App Designer (the UI layout tool in Intel XDK) it will make many automatic changes to your index.html file, so it is best not to edit that file externally at the same time you have App Designer open.

Some popular editors among our users include:

  • Sublime Text* (Refer to this article for information on the Intel XDK plugin for Sublime Text*)
  • Notepad++* for a lightweight editor
  • Jetbrains* editors (Webstorm*)
  • Vim* the editor

How do I get code refactoring capability in Brackets* (the Intel XDK code editor)?

...to be written...

Why doesn’t my app show up in Google Play* for tablets?

...to be written...

What is the global-settings.xdk file and how do I locate it?

global-settings.xdk contains information about all your projects in the Intel XDK, along with many of the settings related to panels under each tab (Emulate, Debug etc). For example, you can set the emulator to auto-refresh or no-auto-refresh. Modify this file at your own risk and always keep a backup of the original!

You can locate global-settings.xdk here:

  • Mac OS X*
    ~/Library/Application Support/XDK/global-settings.xdk
  • Microsoft Windows*
    %LocalAppData%\XDK
  • Linux*
    ~/.config/XDK/global-settings.xdk

If you are having trouble locating this file, you can search for it on your system using something like the following:

  • Windows:
    > cd /
    > dir /s global-settings.xdk
  • Mac and Linux:
    $ sudo find / -name global-settings.xdk
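The default locations listed above can also be computed programmatically. The following Python sketch is only a convenience built from those paths; the function name is hypothetical, and on Windows it assumes the file sits directly inside the %LocalAppData%\XDK directory.

```python
import os
import sys

def global_settings_path(platform=None, env=None):
    """Return the expected location of global-settings.xdk for a platform."""
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = os.path.expanduser("~")
    if platform == "darwin":
        # Mac OS X*
        return os.path.join(home, "Library", "Application Support",
                            "XDK", "global-settings.xdk")
    if platform.startswith("win"):
        # Microsoft Windows* (file name appended as an assumption)
        return env.get("LOCALAPPDATA", "") + "\\XDK\\global-settings.xdk"
    # Linux*
    return os.path.join(home, ".config", "XDK", "global-settings.xdk")

path = global_settings_path()
print(path, "(found)" if os.path.exists(path) else "(not found)")
```

If the file is not at the computed location, the search commands above remain the fallback.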

When do I use the intelxdk.js, xhr.js and cordova.js libraries?

The intelxdk.js and xhr.js libraries were only required for use with the Intel XDK legacy build tiles (which have been retired). The cordova.js library is needed for all Cordova builds. When building with the Cordova tiles, any references to intelxdk.js and xhr.js libraries in your index.html file are ignored.

How do I get my Android (and Crosswalk) keystore file?

New with release 3088 of the Intel XDK, you may now download your build certificates (aka keystore) using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Convert a Legacy Android Certificate" in that document, for details regarding how to do this.

It may also help to review this short, quick overview video (there is no audio) that shows how you convert your existing "legacy" certificates to the "new" format that allows you to directly manage your certificates using the certificate management tool that is built into the Intel XDK. This conversion process is done only once.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I rename my project that is a duplicate of an existing project?

See this FAQ: How do I make a copy of an existing Intel XDK project?

How do I recover when the Intel XDK hangs or won't start?

  • If you are running Intel XDK on Windows* it must be Windows* 7 or higher. It will not run reliably on earlier versions.
  • Delete the "project-name.xdk" file from the project directory that Intel XDK is trying to open when it starts (it will try to open the project that was open during your last session), then try starting Intel XDK. You will have to "import" your project into Intel XDK again. Importing merely creates the "project-name.xdk" file in your project directory and adds that project to the "global-settings.xdk" file.
  • Rename the project directory Intel XDK is trying to open when it starts. Create a new project based on one of the demo apps. Test Intel XDK using that demo app. If everything works, restart Intel XDK and try it again. If it still works, rename your problem project folder back to its original name and open Intel XDK again (it should now open the sample project you previously opened). You may have to re-select your problem project (Intel XDK should have forgotten that project during the previous session).
  • Clear Intel XDK's program cache directories and files.

    On a Windows machine this can be done using the following on a standard command prompt (administrator is not required):

    > cd %AppData%\..\Local\XDK
    > del *.* /s/q

    To locate the "XDK cache" directory on [OS X*] and [Linux*] systems, do the following:

    $ sudo find / -name global-settings.xdk
    $ cd <dir found above>
    $ sudo rm -rf *

    You might want to save a copy of the "global-settings.xdk" file before you delete that cache directory and copy it back before you restart Intel XDK. Doing so will save you the effort of rebuilding your list of projects. Please refer to this question for information on how to locate the global-settings.xdk file.
  • If you saved the "global-settings.xdk" file and restored it in the step above and you are still having hang troubles, try deleting the directories and files above, along with the "global-settings.xdk" file, and try again.
  • Do not store your project directories on a network share (Intel XDK currently has issues with network shares that have not yet been resolved). This includes folders shared between a Virtual machine (VM) guest and its host machine (for example, if you are running Windows* in a VM running on a Mac* host). This network share issue is a known issue with a fix request in place.
  • There have also been issues with running behind a corporate network proxy or firewall. To check them try running Intel XDK from your home network where, presumably, you have a simple NAT router and no proxy or firewall. If things work correctly there then your corporate firewall or proxy may be the source of the problem.
  • Issues with Intel XDK account logins can also cause Intel XDK to hang. To confirm that your login is working correctly, go to the Intel XDK App Center and confirm that you can login with your Intel XDK account. While you are there you might also try deleting the offending project(s) from the App Center.
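The cache-clearing step in the list above (save "global-settings.xdk", delete everything else, restore the file) can be sketched as a small script. This is only an illustration, not an Intel-provided tool: the function name clear_cache_dir is hypothetical, and the cache directory must be supplied by you after locating it as described above. The demo at the bottom runs against a throwaway directory.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache_dir(cache_dir, keep="global-settings.xdk"):
    """Delete everything inside cache_dir except the file named `keep`.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.name == keep:
            continue  # preserve the saved project list
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real cache.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp)
        (demo / "global-settings.xdk").write_text("{}")
        (demo / "xdk.log").write_text("log")
        (demo / "cache").mkdir()
        print(clear_cache_dir(demo))
```

Keeping the file in place, rather than copying it out and back, avoids a window in which the project list exists only as a temporary copy.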

If you can reliably reproduce the problem, please send us a copy of the "xdk.log" file that is stored in the same directory as the "global-settings.xdk" file to html5tools@intel.com.

Is Intel XDK an open source project? How can I contribute to the Intel XDK community?

No, it is not an open source project. However, it utilizes many open source components that are then assembled into Intel XDK. While you cannot contribute directly to the Intel XDK integration effort, you can contribute to the many open source components that make up Intel XDK.

The following open source components are the major elements that are being used by Intel XDK:

  • Node-Webkit
  • Chromium
  • Ripple* emulator
  • Brackets* editor
  • Weinre* remote debugger
  • Crosswalk*
  • Cordova*
  • App Framework*

How do I configure Intel XDK to use 9 patch png for Android* apps splash screen?

Intel XDK does support the use of 9 patch png for Android* apps splash screen. You can read up more at https://software.intel.com/en-us/xdk/articles/android-splash-screens-using-nine-patch-png on how to create a 9 patch png image and link to an Intel XDK sample using 9 patch png images.

How do I stop AVG from popping up the "General Behavioral Detection" window when Intel XDK is launched?

You can try adding nw.exe as the app that needs an exception in AVG.

What do I specify for "App ID" in Intel XDK under Build Settings?

Your app ID uniquely identifies your app. For example, it can be used to identify your app within Apple’s application services allowing you to use things like in-app purchasing and push notifications.

Here are some useful articles on how to create an App ID:

Is it possible to modify the Android Manifest or iOS plist file with the Intel XDK?

You cannot modify the AndroidManifest.xml file directly with our build system, as it only exists in the cloud. However, you may do so by creating a dummy plugin that only contains a plugin.xml file containing directives that can be used to add lines to the AndroidManifest.xml file during the build process. In essence, you add lines to the AndroidManifest.xml file via a local plugin.xml file. Here is an example of a plugin that does just that:

<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-intents-plugin"
    version="1.0.0">
    <name>My Custom Intents Plugin</name>
    <description>Add Intents to the AndroidManifest.xml</description>
    <license>MIT</license>
    <engines>
        <engine name="cordova" version=">=3.0.0" />
    </engines>
    <!-- android -->
    <platform name="android">
        <config-file target="AndroidManifest.xml" parent="/manifest/application">
            <activity android:configChanges="orientation|keyboardHidden|keyboard|screenSize|locale"
                android:label="@string/app_name"
                android:launchMode="singleTop"
                android:name="testa"
                android:theme="@android:style/Theme.Black.NoTitleBar">
                <intent-filter>
                    <action android:name="android.intent.action.SEND" />
                    <category android:name="android.intent.category.DEFAULT" />
                    <data android:mimeType="*/*" />
                </intent-filter>
            </activity>
        </config-file>
    </platform>
</plugin>

You can inspect the AndroidManifest.xml created in an APK, using apktool with the following command line:

$ apktool d my-app.apk
$ cd my-app
$ more AndroidManifest.xml

This technique exploits the config-file element that is described in the Cordova Plugin Specification docs and can also be used to add lines to iOS plist files. See the Cordova plugin documentation link for additional details.

Here is an example of such a plugin for modifying the iOS plist file, specifically for adding a BIS key to the plist file:

<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-bis-plugin"
    version="0.0.2">
    <name>My Custom BIS Plugin</name>
    <description>Add BIS info to iOS plist file.</description>
    <license>BSD-3</license>
    <preference name="BIS_KEY" />
    <engines>
        <engine name="cordova" version=">=3.0.0" />
    </engines>
    <!-- ios -->
    <platform name="ios">
        <config-file target="*-Info.plist" parent="CFBundleURLTypes">
            <array>
                <dict>
                    <key>ITSAppUsesNonExemptEncryption</key>
                    <true/>
                    <key>ITSEncryptionExportComplianceCode</key>
                    <string>$BIS_KEY</string>
                </dict>
            </array>
        </config-file>
    </platform>
</plugin>
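Before adding a dummy plugin like the ones above to a project, it can help to confirm which files its config-file elements will touch. The following is a minimal sketch using only the Python standard library. The function name config_file_targets and the trimmed SAMPLE plugin are made up for illustration, and this is not part of the Intel XDK or Cordova tooling.

```python
import xml.etree.ElementTree as ET

# Namespace used by the <plugin> root element in the examples above.
PLUGIN_NS = {"p": "http://apache.org/cordova/ns/plugins/1.0"}


def config_file_targets(plugin_xml):
    """Return (platform, target, parent) for every config-file element."""
    root = ET.fromstring(plugin_xml)
    targets = []
    for platform in root.findall("p:platform", PLUGIN_NS):
        for cfg in platform.findall("p:config-file", PLUGIN_NS):
            targets.append(
                (platform.get("name"), cfg.get("target"), cfg.get("parent"))
            )
    return targets


# Trimmed-down stand-in for a plugin.xml file (illustrative only).
SAMPLE = """<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-bis-plugin" version="0.0.2">
  <platform name="ios">
    <config-file target="*-Info.plist" parent="CFBundleURLTypes">
      <array />
    </config-file>
  </platform>
</plugin>"""

if __name__ == "__main__":
    for platform, target, parent in config_file_targets(SAMPLE):
        print(platform, target, parent)
```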

How can I share my Intel XDK app build?

You can send a link to your project via an email invite from your project settings page. However, a login to your account is required to access the file behind the link. Alternatively, you can download the build from the build page, onto your workstation, and push that built image to some location from which you can send a link to that image.

Why does my iOS build fail when I am able to test it successfully on a device and the emulator?

Common reasons include:

  • The App ID specified in your project settings does not match the one you specified in Apple's developer portal.
  • The provisioning profile does not match the cert you uploaded. Double check with Apple's developer site that you are using the correct and current distribution cert and that the provisioning profile is still active. Download the provisioning profile again and add it to your project to confirm.
  • In Project Build Settings, your App Name is invalid. It should include only letters, spaces and numbers.

How do I add multiple domains in Domain Access?

Here is the primary doc source for that feature.

If you need to insert multiple domain references, then you will need to add the extra references in the intelxdk.config.additions.xml file. This StackOverflow entry provides a basic idea and you can see the intelxdk.config.*.xml files that are automatically generated with each build for the <access origin="xxx" /> line that is generated based on what you provide in the "Domain Access" field of the "Build Settings" panel on the Project Tab.
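As a rough illustration of the extra references described above, the following sketch prints one access line per domain, ready to paste into intelxdk.config.additions.xml. The helper name access_lines and the example domains are hypothetical. Only the <access origin="xxx" /> form itself comes from the text above.

```python
from xml.sax.saxutils import quoteattr


def access_lines(origins):
    """Build one <access origin="..." /> line per origin, with escaping."""
    return [f"<access origin={quoteattr(origin)} />" for origin in origins]


if __name__ == "__main__":
    # Placeholder domains; substitute the ones your app actually contacts.
    for line in access_lines(["https://api.example.com", "https://cdn.example.com"]):
        print(line)
```

Using quoteattr keeps the output well-formed even if an origin contains characters such as an ampersand.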

How do I build more than one app using the same Apple developer account?

On Apple developer, create a distribution certificate using the "iOS* Certificate Signing Request" key downloaded from Intel XDK Build tab only for the first app. For subsequent apps, reuse the same certificate and import this certificate into the Build tab like you usually would.

How do I include search and spotlight icons as part of my app?

Please refer to this article in the Intel XDK documentation. Create an intelxdk.config.additions.xml file in your top level directory (same location as the other intelxdk.config.*.xml files) and add the following lines for supporting icons in Settings and other areas in iOS*.

<!-- Spotlight Icon -->
<icon platform="ios" src="res/ios/icon-40.png" width="40" height="40" />
<icon platform="ios" src="res/ios/icon-40@2x.png" width="80" height="80" />
<icon platform="ios" src="res/ios/icon-40@3x.png" width="120" height="120" />
<!-- iPhone Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-small.png" width="29" height="29" />
<icon platform="ios" src="res/ios/icon-small@2x.png" width="58" height="58" />
<icon platform="ios" src="res/ios/icon-small@3x.png" width="87" height="87" />
<!-- iPad Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-50.png" width="50" height="50" />
<icon platform="ios" src="res/ios/icon-50@2x.png" width="100" height="100" />

For more information related to these configurations, visit http://cordova.apache.org/docs/en/3.5.0/config_ref_images.md.html#Icons%20and%20Splash%20Screens.

For accurate information related to iOS icon sizes, visit https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHIG/IconMatrix.html
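The icon lines above follow a simple pattern: each @2x or @3x file is the base size multiplied by its scale factor. The sketch below generates such lines so the declared width and height stay consistent with the file names. The helper icon_entries is hypothetical, and the res/ios/ path and file names simply mirror the example above.

```python
def icon_entries(stem, base_size, scales=(1, 2, 3)):
    """Build <icon> lines for one icon family at the given scale factors."""
    lines = []
    for scale in scales:
        suffix = "" if scale == 1 else f"@{scale}x"
        size = base_size * scale  # e.g. 40 -> 80 -> 120
        lines.append(
            f'<icon platform="ios" src="res/ios/{stem}{suffix}.png" '
            f'width="{size}" height="{size}" />'
        )
    return lines


if __name__ == "__main__":
    for line in icon_entries("icon-40", 40):
        print(line)
    for line in icon_entries("icon-50", 50, scales=(1, 2)):
        print(line)
```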

NOTE: The iPhone 6 icons will only be available if iOS* 7 or 8 is the target.

Cordova iOS* 8 support JIRA tracker: https://issues.apache.org/jira/browse/CB-7043

Does Intel XDK support Modbus TCP communication?

No, since Modbus is a specialized protocol, you need to write either some JavaScript* or native code (in the form of a plugin) to handle the Modbus transactions and protocol.

How do I sign an Android* app using an existing keystore?

New with release 3088 of the Intel XDK, you may now import your existing keystore into Intel XDK using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Import an Android Certificate Keystore" in that document, for details regarding how to do this.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I build separately for different Android* versions?

Under the Projects Panel, you can select the Target Android* version under the Build Settings collapsible panel. You can change this value and build your application multiple times to create numerous versions of your application that are targeted for multiple versions of Android*.

How do I display the 'Build App Now' button if my display language is not English?

If your display language is not English and the 'Build App Now' button is proving to be troublesome, you can change your display language to English. The English language pack can be downloaded through Windows* Update. Once you have installed it, go to Control Panel > Clock, Language and Region > Region and Language > Change Display Language.

How do I update my Intel XDK version?

When an Intel XDK update is available, an Update Version dialog box lets you download the update. After the download completes, a similar dialog lets you install it. If you did not download or install an update when prompted (or on older versions), click the package icon next to the orange (?) icon in the upper-right to download or install the update. The installation removes the previous Intel XDK version.

How do I import my existing HTML5 app into the Intel XDK?

If your project contains an Intel XDK project file (<project-name>.xdk) you should use the "Open an Intel XDK Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round green "eject" icon, on the Projects tab). This would be the case if you copied an existing Intel XDK project from another system or used a tool that exported a complete Intel XDK project.

If your project does not contain an Intel XDK project file (<project-name>.xdk) you must "import" your code into a new Intel XDK project. To import your project, use the "Start a New Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round blue "plus" icon, on the Projects tab). This will open the "Samples, Demos and Templates" page, which includes an option to "Import Your HTML5 Code Base." Point to the root directory of your project. The Intel XDK will attempt to locate a file named index.html in your project and will set the "Source Directory" on the Projects tab to point to the directory that contains this file.

If your imported project did not contain an index.html file, your project may be unstable. In that case, it is best to delete the imported project from the Intel XDK Projects tab ("x" icon in the upper right corner of the screen), rename your "root" or "main" html file to index.html and import the project again. Several components in the Intel XDK depend on the assumption that the main HTML file in your project is named index.html. See Introducing Intel® XDK Development Tools for more details.

It is highly recommended that your "source directory" be located as a sub-directory inside your "project directory." This ensures that non-source files are not included as part of your build package when building your application. If the "source directory" and "project directory" are the same, the result is longer upload times to the build server and unnecessarily large application executable files returned by the build system. See the following images for the recommended project file layout.

I am unable to login to App Preview with my Intel XDK password.

On some devices you may have trouble entering your Intel XDK login password directly on the device in the App Preview login screen. In particular, sometimes you may have trouble with the first one or two letters getting lost when entering your password.

Try the following if you are having such difficulties:

  • Reset your password, using the Intel XDK, to something short and simple.

  • Confirm that this new short and simple password works with the XDK (logout and login to the Intel XDK).

  • Confirm that this new password works with the Intel Developer Zone login.

  • Make sure you have the most recent version of Intel App Preview installed on your devices. Go to the store on each device to confirm you have the most recent copy of App Preview installed.

  • Try logging into Intel App Preview on each device with this short and simple password. Check the "show password" box so you can see your password as you type it.

If the above works, it confirms that you can log into your Intel XDK account from App Preview (because App Preview and the Intel XDK go to the same place to authenticate your login). When the above works, you can go back to the Intel XDK and reset your password to something else, if you do not like the short and simple password you used for the test.

If you are having trouble logging into any pages on the Intel web site (including the Intel XDK forum), please see the Intel Sign In FAQ for suggestions and contact info. That login system is the backend for the Intel XDK login screen.

How do I completely uninstall the Intel XDK from my system?

Take the following steps to completely uninstall the XDK from your Windows system:

  • From the Windows Control Panel, remove the Intel XDK, using the Windows uninstall tool.

  • Then:
    > cd %LocalAppData%\Intel\XDK
    > del *.* /s/q

  • Then:
    > cd %LocalAppData%\XDK
    > copy global-settings.xdk %UserProfile%
    > del *.* /s/q
    > copy %UserProfile%\global-settings.xdk .

  • Then:
    -- Go to xdk.intel.com and select the download link.
    -- Download and install the new XDK.

To do the same on a Linux or Mac system:

  • On a Linux machine, run the uninstall script, typically /opt/intel/XDK/uninstall.sh.
     
  • Remove the directory into which the Intel XDK was installed.
    -- Typically /opt/intel or your home (~) directory on a Linux machine.
    -- Typically in the /Applications/Intel XDK.app directory on a Mac.
     
  • Then:
    $ find ~ -name global-settings.xdk
    $ cd <result-from-above> (for example ~/Library/Application Support/XDK/ on a Mac)
    $ cp global-settings.xdk ~
    $ rm -Rf *
    $ mv ~/global-settings.xdk .

     
  • Then:
    -- Go to xdk.intel.com and select the download link.
    -- Download and install the new XDK.

Is there a tool that can help me highlight syntax issues in Intel XDK?

Yes, you can use the various linting tools that can be added to the Brackets editor to review any syntax issues in your HTML, CSS and JS files. Go to the "File > Extension Manager..." menu item and add the following extensions: JSHint, CSSLint, HTMLHint, XLint for Intel XDK. Then, review your source files by monitoring the small yellow triangle at the bottom of the edit window (a green check mark indicates no issues).

How do I delete built apps and test apps from the Intel XDK build servers?

You can manage them by logging into: https://appcenter.html5tools-software.intel.com/csd/controlpanel.aspx. This functionality will eventually be available within Intel XDK after which access to app center will be removed.

I need help with the App Security API plugin; where do I find it?

Visit the primary documentation book for the App Security API and see this forum post for some additional details.

When I install my app or use the Debug tab Avast antivirus flags a possible virus, why?

If you are receiving a "Suspicious file detected - APK:CloudRep [Susp]" message from Avast anti-virus installed on your Android device it is due to the fact that you are side-loading the app (or the Intel XDK Debug modules) onto your device (using a download link after building or by using the Debug tab to debug your app), or your app has been installed from an "untrusted" Android store. See the following official explanation from Avast:

Your application was flagged by our cloud reputation system. "Cloud rep" is a new feature of Avast Mobile Security, which flags apks when the following conditions are met:

  1. The file is not prevalent enough; meaning not enough users of Avast Mobile Security have installed your APK.
  2. The source is not an established market (Google Play is an example of an established market).

If you distribute your app using Google Play (or any other trusted market) your users should not see any warning from Avast.

Following are some of the Avast anti-virus notification screens you might see on your device. All of these are perfectly normal, they are due to the fact that you must enable the installation of "non-market" apps in order to use your device for debug and the App IDs associated with your never published app or the custom debug modules that the Debug tab in the Intel XDK builds and installs on your device will not be found in a "established" (aka "trusted") market, such as Google Play.

If you choose to ignore the "Suspicious app activity!" threat you will not receive a threat for that debug module any longer. It will show up in the Avast 'ignored issues' list. Updates to an existing, ignored, custom debug module should continue to be ignored by Avast. However, new custom debug modules (due to a new project App ID or a new version of Crosswalk selected in your project's Build Settings) will result in a new warning from the Avast anti-virus tool.

  

  

How do I add a Brackets extension to the editor that is part of the Intel XDK?

The number of Brackets extensions provided in the built-in edition of the Brackets editor is limited to ensure stability of the Intel XDK product. Not all extensions are compatible with the edition of Brackets that is embedded within the Intel XDK. Adding incompatible extensions can cause the Intel XDK to quit working.

Despite this warning, there are useful extensions that have not been included in the editor and which can be added to the Intel XDK. Adding them is temporary: each time you update the Intel XDK (or if you reinstall the Intel XDK) you will have to "re-add" your Brackets extension. To add a Brackets extension, use the following procedure:

  • exit the Intel XDK
  • download a ZIP file of the extension you wish to add
  • on Windows, unzip the extension here:
    %LocalAppData%\Intel\XDK\xdk\brackets\b\extensions\dev
  • on Mac OS X, unzip the extension here:
    /Applications/Intel\ XDK.app/Contents/Resources/app.nw/brackets/b/extensions/dev
  • start the Intel XDK

Note that the locations given above are subject to change with new releases of the Intel XDK.

Why does my app or game require so many permissions on Android when built with the Intel XDK?

When you build your HTML5 app using the Intel XDK for Android or Android-Crosswalk you are creating a Cordova app. It may seem like you're not building a Cordova app, but you are. In order to package your app so it can be distributed via an Android store and installed on an Android device, it needs to be built as a hybrid app. The Intel XDK uses Cordova to create that hybrid app.

A pure Cordova app requires the NETWORK permission (android.permission.INTERNET); it is needed to "jump" between your HTML5 environment and the native Android environment. Additional permissions will be added by any Cordova plugins you include with your application; which permissions are included depends on what each plugin does and requires.

Crosswalk for Android builds also require the NETWORK permission, because the Crosswalk image built by the Intel XDK includes support for Cordova. In addition, current versions of Crosswalk (12 and 14 at the time this FAQ was written) also require NETWORK STATE and WIFI STATE. There is an extra permission in some versions of Crosswalk (WRITE EXTERNAL STORAGE) that is only needed by the shared model library of Crosswalk; we have asked the Crosswalk project to remove this permission in a future Crosswalk version.

If you are seeing more than the following four permissions in your XDK-built Crosswalk app:

  • android.permission.INTERNET
  • android.permission.ACCESS_NETWORK_STATE
  • android.permission.ACCESS_WIFI_STATE
  • android.permission.WRITE_EXTERNAL_STORAGE

then you are seeing permissions that have been added by some plugins. Each plugin is different, so there is no hard rule of thumb. The two "default" core Cordova plugins that are added by the Intel XDK blank templates (device and splash screen) do not require any Android permissions.

BTW: the permission list above comes from a Crosswalk 14 build. Crosswalk 12 builds do not include the last permission; it was added when the Crosswalk project introduced the shared model library option, which started with Crosswalk 13 (the Intel XDK does not support 13 builds).
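To see which permissions in a built app go beyond the baseline list above, you can decode the APK with apktool and inspect the resulting AndroidManifest.xml. The sketch below assumes the decoded manifest text is already available as a string. The function name extra_permissions and the sample manifest are illustrative only.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Baseline taken from the permission list above (Crosswalk 14 build).
BASELINE = {
    "android.permission.INTERNET",
    "android.permission.ACCESS_NETWORK_STATE",
    "android.permission.ACCESS_WIFI_STATE",
    "android.permission.WRITE_EXTERNAL_STORAGE",
}


def extra_permissions(manifest_xml):
    """Return the sorted permissions that are not part of the baseline."""
    root = ET.fromstring(manifest_xml)
    names = {
        element.get(f"{{{ANDROID_NS}}}name")
        for element in root.iter("uses-permission")
    }
    names.discard(None)  # ignore malformed entries without a name
    return sorted(names - BASELINE)


# Made-up manifest for demonstration purposes.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET" />
  <uses-permission android:name="android.permission.CAMERA" />
</manifest>"""

if __name__ == "__main__":
    print(extra_permissions(SAMPLE_MANIFEST))
```

Any permission this reports was most likely contributed by a plugin, so the next step is to check each plugin's plugin.xml for the matching uses-permission entry.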

How do I make a copy of an existing Intel XDK project?

If you just need to make a backup copy of an existing project, and do not plan to open that backup copy as a project in the Intel XDK, do the following:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)

If you want to use an existing project as the starting point of a new project in the Intel XDK, follow the process described below. It ensures that the build system does not confuse the ID in your old project with that stored in your new project. If you do not follow the procedure below you will have multiple projects using the same project ID (a special GUID that is stored inside the Intel XDK <project-name>.xdk file in the root directory of your project). Each project in your account must have a unique project ID.

  • Exit the Intel XDK.
  • Make a copy of your existing project using the process described above.
  • Inside the new project that you made (that is, your new copy of your old project), make copies of the <project-name>.xdk file and <project-name>.xdke files and rename those copies to something like project-new.xdk and project-new.xdke (anything you like, just something different than the original project name, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open your new "project-new.xdk" file (whatever you named it) and find the projectGuid line, it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-0000-000000000000"
  • Save the modified "project-new.xdk" file.
  • Open the Intel XDK.
  • Go to the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-new.xdk" file inside the new project folder you copied above.
  • Don't forget to change the App ID in your new project. This is necessary to avoid conflicts with the project you copied from, in the store and when side-loading onto a device.
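The GUID-reset step in the procedure above can also be done with a short script instead of a text editor. The following is a minimal sketch that treats the .xdk file as plain text and only rewrites the projectGuid value. It assumes the line has the form shown above. The function name reset_project_guid is hypothetical, and the zero GUID uses the standard 8-4-4-4-12 layout of the example GUID.

```python
import re

ZERO_GUID = "00000000-0000-0000-0000-000000000000"

# Matches: "projectGuid": "<hex digits and dashes>"
_GUID_PATTERN = re.compile(r'("projectGuid"\s*:\s*")[0-9a-fA-F-]+(")')


def reset_project_guid(xdk_text):
    """Return xdk_text with the projectGuid value replaced by zeroes."""
    new_text, count = _GUID_PATTERN.subn(
        lambda match: match.group(1) + ZERO_GUID + match.group(2), xdk_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one projectGuid line, found {count}")
    return new_text


if __name__ == "__main__":
    sample = '    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",\n'
    print(reset_project_guid(sample), end="")
```

The function refuses to guess when the file contains no projectGuid line or more than one, which is safer than silently writing a modified project file.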

My project does not include a www folder. How do I fix it so it includes a www or source directory?

The Intel XDK HTML5 and Cordova project file structures are meant to mimic a standard Cordova project. In a Cordova (or PhoneGap) project there is a subdirectory (or folder) named www that contains all of the HTML5 source code and asset files that make up your application. For best results, it is advised that you follow this convention, of putting your source inside a "source directory" inside of your project folder.

A missing www folder is most commonly the result of exporting a project from an external tool, such as Construct2, or of importing an existing HTML5 web app that you are converting into a hybrid mobile application (e.g., an Intel XDK Cordova app). If you would like to convert an existing Intel XDK project into this format, follow the steps below:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)
  • Create a "www" directory inside the new duplicate project you just created above.
  • Move your index.html and other source and asset files to the "www" directory you just created -- this is now your "source" directory, located inside your "project" directory (do not move the <project-name>.xdk and xdke files and any intelxdk.config.*.xml files, those must stay in the root of the project directory)
  • Inside the new project that you made above (by making a copy of the old project), rename the <project-name>.xdk file and <project-name>.xdke files to something like project-copy.xdk and project-copy.xdke (anything you like, just something different than the original project, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open the new "project-copy.xdk" file (whatever you named it) and find the line named projectGuid, it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-0000-000000000000"
  • A few lines down find: "sourceDirectory": "",
  • Change it to this: "sourceDirectory": "www",
  • Save the modified "project-copy.xdk" file.
  • Open the Intel XDK.
  • Go to the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-copy.xdk" file inside the new project folder you copied above.
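The "move your sources into www" step from the list above can be sketched as follows. This is an illustration under stated assumptions, not an Intel-provided tool: the helper name move_sources_into_www is hypothetical, it should be run on the duplicate project only, and it leaves the .xdk, .xdke and intelxdk.config.*.xml files in the project root as the steps require.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path

# Files that must stay in the root of the project directory.
KEEP_IN_ROOT = ("*.xdk", "*.xdke", "intelxdk.config.*.xml")


def move_sources_into_www(project_dir):
    """Move everything except project/config files into <project_dir>/www."""
    project_dir = Path(project_dir)
    www = project_dir / "www"
    www.mkdir(exist_ok=True)
    moved = []
    # sorted() snapshots the listing before any entry is moved.
    for entry in sorted(project_dir.iterdir()):
        if entry == www:
            continue
        if any(fnmatch.fnmatch(entry.name, pattern) for pattern in KEEP_IN_ROOT):
            continue
        shutil.move(str(entry), str(www / entry.name))
        moved.append(entry.name)
    return moved


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("index.html", "project-copy.xdk", "intelxdk.config.android.xml"):
            (root / name).write_text("")
        print(move_sources_into_www(root))
```

After moving the files you still need to set "sourceDirectory" to "www" in the .xdk file, as described in the steps above.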

Can I install more than one copy of the Intel XDK onto my development system?

Yes, you can install more than one version onto your development system. However, you cannot run multiple instances of the Intel XDK at the same time. Be aware that new releases sometimes change the project file format, so it is a good idea, in these cases, to make a copy of your project if you need to experiment with a different version of the Intel XDK. See the instructions in a FAQ entry above regarding how to make a copy of your Intel XDK project.

Follow the instructions in this forum post to install more than one copy of the Intel XDK onto your development system.

On Apple OS X* and Linux* systems, does the Intel XDK need the OpenSSL* library installed?

Yes. Several features of the Intel XDK require the OpenSSL library, which typically comes pre-installed on Linux and OS X systems. If the Intel XDK reports that it could not find libssl, go to https://www.openssl.org to download and install it.

I have a web application that I would like to distribute in app stores without major modifications. Is this possible using the Intel XDK?

Yes, if you have a true web app or “client app” that only uses HTML, CSS and JavaScript, it is usually not too difficult to convert it to a Cordova hybrid application (this is what the Intel XDK builds when you create an HTML5 app). If you rely heavily on PHP or other server scripting languages embedded in your pages you will have more work to do. Because your Cordova app is not associated with a server, you cannot rely on server-based programming techniques; instead, you must rewrite any such code to use RESTful APIs that your app interacts with using, for example, AJAX calls.

What is the best training approach to using the Intel XDK for a newbie?

First, become well-versed in the art of client web apps, apps that rely only on HTML, CSS and JavaScript and utilize RESTful APIs to talk to network services. With that you will have mastered 80% of the problem. After that, it is simply a matter of understanding how Cordova plugins are able to extend the JavaScript API for access to features of the platform. For HTML5 training there are many sites providing tutorials. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

What is the best platform to start building an app with the Intel XDK? And what are the important differences between the Android, iOS and other mobile platforms?

There is no one most important difference between the Android, iOS and other platforms. It is important to understand that the HTML5 runtime engine that executes your app on each platform will vary as a function of the platform. Just as there are differences between Chrome and Firefox and Safari and Internet Explorer, there are differences between iOS 9 and iOS 8 and Android 4 and Android 5, etc. Android has the most significant differences between vendors and versions of Android. This is one of the reasons the Intel XDK offers the Crosswalk for Android build option, to normalize and update the Android issues.

If you can get your app working well on Android (or Crosswalk for Android) first, you will generally have fewer issues to deal with when you start to work on the iOS and Windows platforms. In addition, the Android platform has the most flexible and useful debug options available, so it is the easiest platform to use for debugging and testing your app.

Is my password encrypted and why is it limited to fifteen characters?

Yes, your password is stored encrypted and is managed by https://signin.intel.com. Your Intel XDK userid and password can also be used to log into the Intel XDK forum as well as the Intel Developer Zone. The Intel XDK neither stores nor manages your userid and password.

The rules regarding allowed userids and passwords are described on this Sign In FAQ page, where you can also find help on recovering and changing your password.

Why does the Intel XDK take a long time to start on Linux or Mac?

...and why am I getting this error message? "Attempt to contact authentication server is taking a long time. You can wait, or check your network connection and try again."

At startup, the Intel XDK attempts to automatically determine the proxy settings for your machine. Unfortunately, on some system configurations it is unable to reliably detect your system proxy settings. As an example, you might see something like this image when starting the Intel XDK.

On some systems you can get around this problem by setting some proxy environment variables and then starting the Intel XDK from a command-line that includes those configured environment variables. Set those environment variables with commands similar to the following:

$ export no_proxy="localhost,127.0.0.1/8,::1"
$ export NO_PROXY="localhost,127.0.0.1/8,::1"
$ export http_proxy=http://proxy.mydomain.com:123/
$ export HTTP_PROXY=http://proxy.mydomain.com:123/
$ export https_proxy=http://proxy.mydomain.com:123/
$ export HTTPS_PROXY=http://proxy.mydomain.com:123/

IMPORTANT! The name of your proxy server and the port (or ports) that your proxy server requires will be different than those shown in the example above. Please consult with your IT department to find out what values are appropriate for your site. Intel has no way of knowing what configuration is appropriate for your network.

If you use the Intel XDK in multiple locations (at work and at home), you may have to change the proxy settings before starting the Intel XDK after switching to a new network location. For example, many work networks use a proxy server, but most home networks do not require such a configuration. In that case, you need to be sure to "unset" the proxy environment variables before starting the Intel XDK on a non-proxy network.
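When you move to a network that has no proxy, the variables from the earlier example can be cleared in one step. This is a minimal sketch; it assumes a POSIX-style shell such as bash and that you start the Intel XDK from the same terminal window afterward.

```shell
# Clear every proxy variable set in the earlier example so that
# programs started from this terminal connect directly.
unset no_proxy NO_PROXY http_proxy HTTP_PROXY https_proxy HTTPS_PROXY

# Optional check: list any proxy variables that are still set
# (prints nothing if none remain).
env | grep -i '_proxy=' || true
```

If the optional check still prints a line, that variable was set under a different name (for example by a login script) and needs to be unset separately.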

After you have successfully configured your proxy environment variables, you can start the Intel XDK manually, from the command-line.

On a Mac, where the Intel XDK is installed in the default location, type the following (from a terminal window that has the above environment variables set):

$ open /Applications/Intel\ XDK.app/

On a Linux machine, assuming the Intel XDK has been installed in the ~/intel/XDK directory, type the following (from a terminal window that has the above environment variables set):

$ ~/intel/XDK/xdk.sh &

The example above assumes a local install into the ~/intel/XDK directory. Because Linux installations offer more choice of installation directory, adjust the path to the xdk.sh file to suit your particular system.

How do I generate a P12 file on a Windows machine?

See these articles:

How do I change the default dir for creating new projects in the Intel XDK?

You can change the default new project location manually by modifying a field in the global-settings.xdk file. Locate the global-settings.xdk file on your system (the precise location varies by OS) and find this JSON object inside that file:

"projects-tab": {
    "defaultPath": "/Users/paul/Documents/XDK",
    "LastSortType": "descending|Name",
    "lastSortType": "descending|Opened",
    "thirdPartyDisclaimerAcked": true
},

The example above came from a Mac. On a Mac the global-settings.xdk file is located in the "~/Library/Application Support/XDK" directory.

On a Windows machine the global-settings.xdk file is normally found in the "%LocalAppData%\XDK" directory. The part you are looking for will look something like this:

"projects-tab": {
    "thirdPartyDisclaimerAcked": false,
    "LastSortType": "descending|Name",
    "lastSortType": "descending|Opened",
    "defaultPath": "C:\\Users\\paul/Documents"
},

Obviously, it's the defaultPath part you want to change.

BE CAREFUL WHEN YOU EDIT THE GLOBAL-SETTINGS.XDK FILE!! You've been warned...

Make sure the result is proper JSON when you are done, or it may cause your XDK to cough and hack loudly. Make a backup copy of global-settings.xdk before you start, just in case.
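The backup and the JSON check can both be done from a terminal on a Mac or Linux machine. The sketch below is only an illustration: the function names are made up for this example, it assumes the whole global-settings.xdk file is a single JSON document, and the validation step assumes python3 is installed (any JSON validator would serve).

```shell
# Hypothetical helper: copy the settings file to <name>.bak,
# then print the defaultPath entry it currently contains.
backup_settings() {
    settings_file="$1"
    cp -p "$settings_file" "$settings_file.bak" || return 1
    grep -o '"defaultPath": *"[^"]*"' "$settings_file"
}

# Hypothetical helper: run after editing; succeeds only if the file
# still parses as JSON. Assumes python3 is available.
check_settings() {
    python3 -m json.tool "$1" > /dev/null && echo "valid JSON"
}

# Example on a Mac, using the location given above:
# backup_settings "$HOME/Library/Application Support/XDK/global-settings.xdk"
# check_settings  "$HOME/Library/Application Support/XDK/global-settings.xdk"
```

If the check fails after your edit, copy the .bak file back over the original and try again.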

Where can I find a list of recent and upcoming webinars?

How can I change the email address associated with my Intel XDK login?

Log in to the Intel Developer Zone with your Intel XDK account userid and password and then locate your "account dashboard." Click the "pencil icon" next to your name to open the "Personal Profile" section of your account, where you can edit your "Name & Contact Info," including the email address associated with your account, under the "Private" section of your profile.

What network addresses must I enable in my firewall to ensure the Intel XDK will work on my restricted network?

Normally, access to the external servers that the Intel XDK uses is handled automatically by your proxy server. However, if you are working in an environment that has restricted Internet access and you need to provide your IT department with a list of URLs that you need access to in order to use the Intel XDK, then please provide them with the following list of domain names:

  • appcenter.html5tools-software.intel.com (for communication with the build servers)
  • s3.amazonaws.com (for downloading sample apps and built apps)
  • download.xdk.intel.com (for getting XDK updates)
  • debug-software.intel.com (for using the Test tab weinre debug feature)
  • xdk-feed-proxy.html5tools-software.intel.com (for receiving the tweets in the upper right corner of the XDK)
  • signin.intel.com (for logging into the XDK)
  • sfederation.intel.com (for logging into the XDK)

Normally this should be handled by your network proxy (if you're on a corporate network) or should not be an issue if you are working on a typical home network.
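Before contacting your IT department, it may help to see which of these domain names are reachable from your machine. The following is a rough sketch, not an official diagnostic: the function name is made up, it assumes curl is installed, and it assumes each service answers HTTPS requests on the default port, which the list above does not state.

```shell
# Domain names copied from the list above.
XDK_HOSTS="appcenter.html5tools-software.intel.com
s3.amazonaws.com
download.xdk.intel.com
debug-software.intel.com
xdk-feed-proxy.html5tools-software.intel.com
signin.intel.com
sfederation.intel.com"

# Hypothetical helper: print one line per host saying whether an HTTPS
# request received any response within ten seconds. Any failure
# (blocked connection, DNS error, certificate error) counts as unreachable.
check_xdk_hosts() {
    for host in $XDK_HOSTS; do
        if curl --silent --head --max-time 10 --output /dev/null "https://$host/"; then
            echo "reachable    $host"
        else
            echo "unreachable  $host"
        fi
    done
}

# Run the check with:
# check_xdk_hosts
```

curl honors the proxy environment variables described earlier, so run the check from a terminal configured the same way you start the Intel XDK.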

I cannot create a login for the Intel XDK, how do I create a userid and password to use the Intel XDK?

If you have downloaded and installed the Intel XDK but are having trouble creating a login, you can create the login outside the Intel XDK. To do this, go to the Intel Developer Zone and push the "Join Today" button. After you have created your Intel Developer Zone login you can return to the Intel XDK and use that userid and password to log in to the Intel XDK. This same userid and password can also be used to log in to the Intel XDK forum.

Installing the Intel XDK on Windows fails with a "Package signature verification failed." message.

If you receive a "Package signature verification failed" message (see image below) when installing the Intel XDK on your system, it is likely due to one of the following two reasons:

  • Your system does not have a properly installed "root certificate" file, which is needed to confirm that the install package is good.
  • The install package is corrupt and failed the verification step.

The first case can happen if you are attempting to install the Intel XDK on an unsupported version of Windows. The Intel XDK is only supported on Microsoft Windows 7 and higher. If you attempt to install on Windows Vista (or earlier) you may see this verification error. The workaround is to install the Intel XDK on a Windows 7 or greater machine.

The second case is likely due to a corruption of the install package during download or due to tampering. The workaround is to re-download the install package and attempt another install.

If you are installing on a Windows 7 (or greater) machine and you see this message it is likely due to a missing or bad root certificate on your system. To fix this you may need to start the "Certificate Propagation" service. Open the Windows "services.msc" panel and then start the "Certificate Propagation" service. Additional links related to this problem can be found here > https://technet.microsoft.com/en-us/library/cc754841.aspx

See this forum thread for additional help regarding this issue > https://software.intel.com/en-us/forums/intel-xdk/topic/603992

Troubles installing the Intel XDK on a Linux or Ubuntu system, which option should I choose?

Choose the local user option, not root or sudo, when installing the Intel XDK on your Linux or Ubuntu system. This is the most reliable and trouble-free option and is the default installation option. This will ensure that the Intel XDK has all the proper permissions necessary to execute properly on your Linux system. The Intel XDK will be installed in a subdirectory of your home (~) directory.

Inactive account/ login issue/ problem updating an APK in store, How do I request account transfer?

As of June 26, 2015 we migrated all Intel XDK accounts to the more secure intel.com login system (the same login system you use to access this forum).

We have migrated nearly all active users to the new login system. Unfortunately, there are a few active user accounts that we could not automatically migrate to intel.com, primarily because the intel.com login system does not allow the use of some characters in userids that were allowed in the old login system.

If you have not used the Intel XDK for a long time prior to June 2015, your account may not have been automatically migrated. If you own an "inactive" account it will have to be manually migrated -- please try logging into the Intel XDK with your old userid and password, to determine whether it still works. If you find that you cannot log in to your existing Intel XDK account, and still need access to your old account, please send a message to html5tools@intel.com and include your userid and the email address associated with that userid, so we can guide you through the steps required to reactivate your old account.

Alternatively, you can create a new Intel XDK account. If you have submitted an app to the Android store from your old account you will need access to that old account to retrieve the Android signing certificates in order to upgrade that app on the Android store; in that case, send an email to html5tools@intel.com with your old account username and email and new account information.

Connection Problems? -- Intel XDK SSL certificates update

On January 26, 2016 we updated the SSL certificates on our back-end systems to SHA2 certificates. The existing certificates were due to expire in February of 2016. We have also disabled support for obsolete protocols.

If you are experiencing persistent connection issues (since Jan 26, 2016), please post a problem report on the forum and include in your problem report:

  • the operation that failed
  • the version of your XDK
  • the version of your operating system
  • your geographic region
  • and a screen capture

How do I resolve build failure: "libpng error: Not a PNG file"?  

If you are experiencing build failures with CLI 5 Android builds, and the detailed error log includes a message similar to the following:

Execution failed for task ':mergeArmv7ReleaseResources'.> Error: Failed to run command: /Developer/android-sdk-linux/build-tools/22.0.1/aapt s -i .../platforms/android/res/drawable-land-hdpi/screen.png -o .../platforms/android/build/intermediates/res/armv7/release/drawable-land-hdpi-v4/screen.png

Error Code: 42

Output: libpng error: Not a PNG file

You need to change the format of your icon and/or splash screen images to PNG format.

The error message refers to a file named "screen.png" -- which is what each of your splash screens was renamed to before being moved into the build project resource directories. Unfortunately, JPG images were supplied for use as splash screen images, not PNG images. So the files were renamed and found by the build system to be invalid.

Convert your splash screen images to PNG format. Renaming JPG images to PNG will not work! You must convert your JPG images into PNG format images using an appropriate image editing tool. The Intel XDK does not provide any such conversion tool.
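A renamed JPG can be spotted before you submit a build, because every real PNG file begins with the same eight bytes. The sketch below checks for that signature; the function name and the directory in the usage comment are made up for this example, and it assumes the standard head, od and tr utilities found on Mac and Linux systems.

```shell
# Hypothetical helper: succeeds only if the file begins with the
# 8-byte PNG signature (hex 89 50 4e 47 0d 0a 1a 0a).
is_png() {
    sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "89504e470d0a1a0a" ]
}

# Example: report every file in a (hypothetical) splash screen
# directory that has a .png name but is not a real PNG.
# for f in splash/*.png; do
#     is_png "$f" || echo "not a real PNG: $f"
# done
```

Any file the loop reports needs to be converted with an image editing tool, not renamed.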

Beginning with Cordova CLI 5, all icons and splash screen images must be supplied in PNG format. This applies to all supported platforms. This is an undocumented "new feature" of the Cordova CLI 5 build system that was implemented by the Apache Cordova project.

Why do I get a "Parse Error" when I try to install my built APK on my Android device?

Because you have built an "unsigned" Android APK. You must click the "signed" box in the Android Build Settings section of the Projects tab if you want to install an APK on your device. The only reason you would choose to create an "unsigned" APK is if you need to sign it manually. This is very rare and not the normal situation.

My converted legacy keystore does not work. Google Play is rejecting my updated app.

The keystore you converted when you updated to 3088 (now 3240 or later) is the same keystore you were using in 2893. When you upgraded to 3088 (or later) and "converted" your legacy keystore, you re-signed and renamed your legacy keystore and it was transferred into a database to be used with the Intel XDK certificate management tool. It is still the same keystore, but with an alias name and password assigned by you and accessible directly by you through the Intel XDK.

If you kept the converted legacy keystore in your account following the conversion you can download that keystore from the Intel XDK for safe keeping (do not delete it from your account or from your system). Make sure you keep track of the new password(s) you assigned to the converted keystore.

There are two problems we have experienced with converted legacy keystores at the time of the 3088 release (April, 2016):

  • Foreign (non-ASCII) characters used in the new alias name and passwords were being corrupted.
  • Final signing of your APK by the build system was being done with RSA256 rather than SHA1.

Both of the above items have been resolved and should no longer be an issue.

If you are currently unable to complete a build with your converted legacy keystore (i.e., builds fail when you use the converted legacy keystore but they succeed when you use a new keystore) the first bullet above is likely the reason your converted keystore is not working. In that case we can reset your converted keystore and give you the option to convert it again. You do this by requesting that your legacy keystore be "reset" by filling out this form. For 100% surety during that second conversion, use only 7-bit ASCII characters in the alias name you assign and for the password(s) you assign.

IMPORTANT: using the legacy certificate to build your Android app is ONLY necessary if you have already published an app to an Android store and need to update that app. If you have never published an app to an Android store using the legacy certificate you do not need to concern yourself with resetting and reconverting your legacy keystore. It is easier, in that case, to create a new Android keystore and use that new keystore.

If you ARE able to successfully build your app with the converted legacy keystore, but your updated app (in the Google store) does not install on some older Android 4.x devices (typically a subset of Android 4.0-4.2 devices), the second bullet cited above is likely the reason for the problem. The solution, in that case, is to rebuild your app and resubmit it to the store (that problem was a build-system problem that has been resolved).

How can I have others beta test my app using Intel App Preview?

Apps that you sync to your Intel XDK account, using the Test tab's green "Push Files" button, can only be accessed by logging into Intel App Preview with the same Intel XDK account credentials that you used to push the files to the cloud. In other words, you can only download and run your app for testing with Intel App Preview if you log into the same account that you used to upload that test app. This restriction applies to downloading your app into Intel App Preview via the "Server Apps" tab, at the bottom of the Intel App Preview screen, or by scanning the QR code displayed on the Intel XDK Test tab using the camera icon in the upper right corner of Intel App Preview.

If you want to allow others to test your app, using Intel App Preview, it means you must use one of two options:

  • give them your Intel XDK userid and password
  • create an Intel XDK "test account" and provide your testers with that userid and password

For security's sake, we highly recommend you use the second option (create an Intel XDK "test account").

A "test account" is simply a second Intel XDK account that you do not plan to use for development or builds. Do not use the same email address for your "test account" as you are using for your main development account. You should use a "throw away" email address for that "test account" (an email address that you do not care about).

Assuming you have created an Intel XDK "test account" and have instructed your testers to download and install Intel App Preview; have provided them with your "test account" userid and password; and you are ready to have them test:

  • sign out of your Intel XDK "development account" (using the little "man" icon in the upper right)
  • sign into your "test account" (again, using the little "man" icon in the Intel XDK toolbar)
  • make sure you have selected the project that you want users to test, on the Projects tab
  • go to the Test tab
  • make sure "MOBILE" is selected (upper left of the Test tab)
  • push the green "PUSH FILES" button on the Test tab
  • log out of your "test account"
  • log into your development account

Then, tell your beta testers to log into Intel App Preview with your "test account" credentials and instruct them to choose the "Server Apps" tab at the bottom of the Intel App Preview screen. From there they should see the name of the app you synced using the Test tab and can simply start it by touching the app name (followed by the big blue and white "Launch This App" button). Starting the app this way is actually easier than sending them a copy of the QR code. The QR code is very dense and is hard to read with some devices, depending on the quality of the camera in their device.

Note that when running your test app inside of Intel App Preview they cannot test any features associated with third-party plugins, only core Cordova plugins. Thus, you need to ensure that those parts of your app that depend on non-core Cordova plugins have been disabled or have exception handlers to prevent your app from either crashing or freezing.

I'm having trouble making Google Maps work with my Intel XDK app. What can I do?

There are many reasons that can cause your attempt to use Google Maps to fail. Mostly it is due to the fact that you need to download the Google Maps API (JavaScript library) at runtime to make things work. However, there is no guarantee that you will have a good network connection, so if you do it the way you are used to doing it, in a browser...

<script src="https://maps.googleapis.com/maps/api/js?key=API_KEY&sensor=true"></script>

...you may get yourself into trouble, in an Intel XDK Cordova app. See Loading Google Maps in Cordova the Right Way for an excellent tutorial on why this is a problem and how to deal with it. Also, it may help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, especially item #3, to get a better understanding of why you shouldn't use the "browser technique" you're familiar with.

An alternative is to use a mapping tool that allows you to include the JavaScript directly in your app, rather than downloading it over the network each time your app starts. Several Intel XDK developers have reported very good luck with the open-source JavaScript library named LeafletJS, which uses OpenStreetMap as its map data source.

You can also search the Cordova Plugin Database for Cordova plugins that implement mapping features, in some cases using native SDKs and libraries.

How do I fix "Cannot find the Intel XDK. Make sure your device and intel XDK are on the same wireless network." error messages?

You can either disable your firewall or allow access through the firewall for the Intel XDK. To allow access through the Windows firewall, go to the Windows Control Panel and search for the Firewall (Control Panel > System and Security > Windows Firewall > Allowed Apps) and enable Node Webkit (nw or nw.exe) through the firewall.

See the image below (this image is from a Windows 8.1 system).

Google Services needs my SHA1 fingerprint. Where do I get my app's SHA fingerprint?

Your app's SHA fingerprint is part of your build signing certificate. Specifically, it is part of the signing certificate that you used to build your app. The Intel XDK provides a way to download your build certificates directly from within the Intel XDK application (see the Intel XDK documentation for details on how to manage your build certificates). Once you have downloaded your build certificate you can use these instructions provided by Google, to extract the fingerprint, or simply search the Internet for "extract fingerprint from android build certificate" to find many articles detailing this process.
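One common way to extract the fingerprint is the keytool utility that ships with the Java JDK. The following is a sketch under stated assumptions: the function name, the keystore file name and the alias are placeholders, keytool must be on your PATH, and the downloaded certificate must be in a keystore format that keytool can read.

```shell
# Hypothetical helper: print the SHA1 fingerprint line for one key.
#   $1 = path to the keystore file you downloaded
#   $2 = alias name you assigned to the key
# keytool prompts for the keystore password.
show_sha1_fingerprint() {
    keytool -list -v -keystore "$1" -alias "$2" | grep 'SHA1:'
}

# Example (placeholder file and alias names):
# show_sha1_fingerprint my-app.keystore my-alias
```

The value printed after "SHA1:" is the colon-separated fingerprint that Google services ask for.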

Why am I unable to test or build or connect to the old build server with Intel XDK version 2893?

This is an Important Note Regarding the use of Intel XDK Versions 2893 and Older!!

As of June 13, 2016, versions of the Intel XDK released prior to March 2016 (2893 and older) can no longer use the Build tab, the Test tab or Intel App Preview; and can no longer create custom debug modules for use with the Debug and Profile tabs. This change was necessary to improve the security and performance of our Intel XDK cloud-based build system. If you are using version 2893 or older, of the Intel XDK, you must upgrade to version 3088 or greater to continue to develop, debug and build Intel XDK Cordova apps.

The error message you see below, "NOTICE: Internet Connection and Login Required," when trying to use the Build tab is due to the fact that the cloud-based component that was used by those older versions of the Intel XDK has been retired and is no longer present. The error message appears to be misleading, but is the easiest way to identify this condition.

How do I run the Intel XDK on Fedora Linux?

See the instructions below, copied from this forum post:

$ sudo find xdk/install/dir -name libudev.so.0
$ cd dir/found/above
$ sudo rm libudev.so.0
$ sudo ln -s /lib64/libudev.so.1 libudev.so.0

Note the "xdk/install/dir" is the name of the directory where you installed the Intel XDK. This might be "/opt/intel/xdk" or "~/intel/xdk" or something similar. Since the Linux install is flexible regarding the precise installation location you may have to search to find it on your system.

Once you find that libudev.so file in the Intel XDK install directory you must "cd" to that directory to finish the operations as written above.

Additional instructions have been provided in the related forum thread; please see that thread for the latest information regarding hints on how to make the Intel XDK run on a Fedora Linux system.


Intel® Curie™ Module Datasheet

The Intel® Curie™ module is an advanced device built around the Intel® Quark™ SE microcontroller integrating compute, sense, awareness, connectivity and a programmable input/output controller within a common package.

Intel® Curie™ Module Design Guide

This document provides design recommendations for the Intel® Curie™ module, which is based on the Intel® Quark™ SE system on a chip. Technical implementation examples provided are derived from the functional reference circuits.

Superior Performance Commits Kyoto University to CPUs Over GPUs

The Kyoto University Graduate School of Medicine determined that a dual-socket Intel® Xeon® E5-2699v3 (Haswell architecture) system delivers better performance than an NVIDIA K40 GPU when training deep learning neural networks for computational drug discovery using the Theano framework. Theano is a Python* library that lets researchers transparently run deep learning models on CPUs and GPUs. It does so by generating C++ code from the Python* script for the destination architecture. The generated C++ code can also call optimized math libraries.

The Kyoto University team recognized that the performance of the open source Theano C++ multi-core code could be significantly improved. They worked with Intel to improve Theano multicore performance using a dual-socket Intel® Xeon® processor based system as the next generation Intel® Xeon Phi™ processors were not available at that time. The optimized performance improvement turned out to be significant and demonstrated that a dual-socket Haswell processor chipset can outperform an NVIDIA K40 GPU on deep learning training tasks¹.

On the basis of the Intel® Xeon® processor benchmark results presented by Masatoshi Hamanaka (Research Fellow) at the 2015 Annual conference of the Japanese Society for Bioinformatics (JSBI 2015) and the consistency of the multi- and many-core runtime environment, GPUs were eliminated from consideration as they added needless cost, complexity, and memory limitations without delivering a deep learning performance benefit.

A summary slide from the presentation is shown below.


Figure 2: Speedup of optimized Theano relative to GPU plus impact of the larger Intel® Xeon® memory capacity. (Results courtesy Kyoto University)

The Kyoto deep learning cluster procurement will act as a bellwether as it is the first prominent system to select many-core CPU over GPU technology. According to all expectations, the Theano software will run much faster on the next generation Intel® Xeon Phi™ processors.


Figure 3: At ISC’16, Intel provided details on the superior performance of Intel® Xeon Phi™ compared to GPUs for deep learning

Importance of the science

The Kyoto University Graduate School of Medicine is applying various machine learning and deep learning algorithms to problems in life sciences including drug discovery, medicine, and health care. As with other fields, the Kyoto researchers are faced with vast amounts of data. For example, the Kyoto team wishes to apply machine learning to data produced by experimental technologies such as high-throughput screening (HTS) and next-generation sequencing (NGS). In addition, electronic health records (EHR) from daily clinical practice can be analyzed. The Kyoto team believes their use of big-data machine-learning technology will allow a more thorough analysis than previous approaches.


Figure 4: Illustration showing how deep learning differs from conventional approaches. (Image courtesy Kyoto University)

Kyoto has two goals for their machine learning and deep-learning study: (1) Make knowledge discoveries from the rapidly increasing data generated by the experiments and electronic data that is now being collected at the patient’s bedside, and (2) improve drug discovery and patient health care by returning relevant information from their knowledge discoveries to both experimentalists and physicians.

“Many clinical applications during the next decade will adopt machine learning technology,” said Professor Yasushi Okuno. “Our application of machine learning and deep-learning will become increasingly important over the next ten years.”

The Kyoto drug discovery workload

Part of the Kyoto workload will apply computational virtual screening to the field of drug discovery. Virtual screening is used in the early stage of the drug discovery process, a process which usually takes ten years or more. The purpose of virtual screening is to computationally screen huge numbers of chemical compounds to find new drug candidates.

“Currently, this early stage of drug discovery takes several years and a few hundred million dollars,” explained Professor Okuno. “But we believe our study will significantly decrease both time and cost.”


Figure 5: The case for virtual drug discovery lies in speed and volume. (Image courtesy Kyoto University)

“Since the DBN [deep belief network] learns from the data it is possible that it can find drug candidates that do not resemble the structure of existing drug-like compounds,” Professor Okuno continued. “For this reason, we also think that deep learning can help find such de-novo drug candidates.”


Figure 6: DBN can ‘learn’ features of the data that are important to drug-like activity. These DBNs can then be used to predict, or ‘score’, drug candidates. (Image courtesy Kyoto University)

Big data is key to accurately training neural networks to solve complex problems. (In their paper, “How Neural Networks Work”, Lapedes and Farber showed that the neural network is actually fitting a ‘bumpy’ multi-dimensional surface, which means the training data needs to specify the hills and valleys, or points of inflection, on the surface. This explains why more data is required to fit complex surfaces.)


Figure 7: Proposed method to find drug candidates using deep learning (Image courtesy Kyoto University)

The Kyoto team evaluated Theano scaling behavior on data sets of up to four million rows and 2,000 features. Results are validated using a 20% held-out validation set. In the future, the Kyoto team intends to use Theano to train on data sets with 200 million rows and 380 thousand features – a 130x increase in data!

“Experimental results are increasing day-by-day,” Professor Okuno said. “So we will always be looking to increase their computing performance.”

As can be seen below, the optimized multicore Theano code delivers excellent scaling as data set sizes increase, which allows training with much more data. The expectation is that the new Intel® Xeon Phi™ processor-based system should scale similarly and deliver faster time-to-model performance.


Figure 8: Scaling of optimized DBN Theano code according to data size. (Image courtesy Kyoto University)

Fixing poorly optimized multicore code compared to GPU code paths

The Kyoto results demonstrate that modern multicore processing technology now matches or exceeds GPU machine-learning performance, but equivalently optimized software is required to perform a fair benchmark comparison. For historical reasons, many software packages like Theano lacked optimized multicore code as all the open source effort had been put into optimizing the GPU code paths.

To assist others in performing fair benchmarks and to realize the benefits of multi- and many-core performance, Intel announced several optimized libraries for deep and machine learning at ISC’16, such as the high-level Intel® Data Analytics Acceleration Library (Intel® DAAL) and the lower-level Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), which provides optimized deep learning primitives. The ISC’16 MKL-DNN announcement also noted that the library is open source, has no restrictions, and is royalty free. Even the well-established Intel® Math Kernel Library (Intel® MKL) is getting a machine learning refresh with the addition of optimized primitives to speed machine and deep learning on Intel® architectures.

For example, the SGEMM operation is very important to machine learning algorithms, which heavily utilize single-precision arithmetic. The new libraries provide improved SGEMM parallelism.


Figure 9: Improved SGEMM parallelism by AVX instruction along the contiguous address. (Image courtesy Intel)

The new vector and multicore optimized libraries announced at ISC’16 will speed machine learning efforts and assist others so they too – just like Kyoto University – can make fair comparisons using optimized multicore codes when evaluating hardware platforms for machine learning.

Expectations for the new Intel® Xeon Phi™ processor-based cluster

The Academic Center for Computing and Media Studies (ACCMS) at Kyoto University will be standing up a new Intel® Xeon Phi™ processor-based cluster designed to support training on the larger data sets. Specifically, the expectations are:

  1. To deliver higher performance so the team can train on bigger data in less time than on other CPU and GPU platforms.
  2. To facilitate advanced algorithm development. Many deep-learning algorithms are complex, which means the Kyoto team wants to eliminate as many architecture limitations as possible. The consistent multi- to many-core programming environment is very attractive as it eliminates the complexities, memory limitations, and hardware variations of a GPU environment. Further, Intel has proven to be very responsive in providing optimized libraries that provide access to the Intel® Xeon® and Intel® Xeon Phi™ capabilities.

Teaching people that multicore processors outperform GPUs

To help data scientists and the HPC community understand and use the multi- and many-core software and hardware technology, Intel has created a machine learning portal at http://intel.com/machinelearning. Content on this portal will teach readers how multi- and many-core processors outperform GPUs and deliver superior training and prediction (also called inference or scoring) performance as well as better scalability on a variety of machine learning frameworks. Through this portal, Intel hopes to train 100,000 developers in the benefits of their machine learning technology and optimized libraries. They are backing this up by giving early technology access to top research academics.

To help bring machine-learning and HPC computing into the exascale era, Intel has also created Intel® Scalable System Framework (Intel® SSF). Intel SSF incorporates a host of software and hardware technologies including Intel® Omni-Path Architecture (Intel® OPA), Intel® Optane™ SSDs built on 3D XPoint™ technology, and new Intel® Silicon Photonics, plus Intel’s existing and upcoming compute and storage products, including Intel® Xeon® processors, Intel® Xeon Phi™ processors, and Intel® Enterprise Edition for Lustre* software.

About the Author: Rob Farber is a global technology consultant and author with an extensive background in HPC and machine learning technology that he applies at national labs and commercial organizations throughout the world. He can be reached at info@techenablement.com.

1 Broadwell microarchitecture improvements – especially to the FMA (Fused Multiply-Add) instruction – should increase performance even further. See http://www.nextplatform.com/2016/03/31/examining-potential-hpc-benefits-new-intel-xeon-processors for more information.


Offload over Fabric to Intel® Xeon Phi™ Processor: Tutorial


The OpenMP* 4.0 device constructs supported by the Intel® C++ Compiler can be used to offload a workload from an Intel® Xeon® processor-based host machine to Intel® Xeon Phi™ coprocessors over Peripheral Component Interconnect Express* (PCIe*). Offload over Fabric (OoF) extends this offload programming model to support the 2nd generation Intel® Xeon Phi™ processor; that is, the Intel® Xeon® processor-based host machine uses OoF to offload a workload to 2nd generation Intel Xeon Phi processors over high-speed networks such as Intel® Omni-Path Architecture (Intel® OPA) or Mellanox InfiniBand*.

This tutorial shows how to install OoF software, configure the hardware, test the basic configuration, and enable OoF. A sample source code is provided to illustrate how the OoF works.

Hardware Installation

In this tutorial, two machines are used: an Intel® Xeon® processor E5-2670 2.6 GHz serves as the host machine and an Intel® Xeon Phi™ processor serves as the target machine. Both host and target machines are running Red Hat Enterprise Linux* 7.2, and each has Gigabit Ethernet adapters to enable remote log in. Note that the hostnames of the host and target machines are host-device and knl-sb2 respectively.

First we need to set up a high-speed network. We used InfiniBand in our lab due to the hardware availability, but Intel OPA is also supported.

Prior to the test, both host and target machines are powered off to set up a high-speed network between the machines. Mellanox ConnectX*-3 VPI InfiniBand adapters are installed into PCIe slots in these machines and are connected using an InfiniBand cable with no intervening router. After rebooting the machines, we first verify that the Mellanox network adapter is installed on the host:

[host-device]$ lspci | grep Mellanox
84:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

And on the target:

[knl-sb2 ~]$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Software Installation

The host machine and target machines are running Red Hat Enterprise Linux 7.2. On the host, you can verify the current Linux kernel version:

[host-device]$ uname -a
Linux host-device 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

You can also verify the current operating system kernel running on the target:

[knl-sb2 ~]$ uname -a
Linux knl-sb2 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

On the host machine, install the latest OoF software here to enable OoF. In this tutorial, the OoF software version 1.4.0 for Red Hat Enterprise Linux 7.2 (xppsl-1.4.0-offload-host-rhel7.2.tar) was installed. Refer to the document “Intel® Xeon Phi™ Processor x200 Offload over Fabric User’s Guide” for details on the installation. In addition, the Intel® Parallel Studio XE 2017 is installed on the host to enable the OoF support, specifically support of offload programming models provided by the Intel compiler.

On the target machine, install the latest Intel Xeon Phi processor software here. In this tutorial, the Intel Xeon Phi processor software version 1.4.0 for Red Hat Enterprise Linux 7.2 (xppsl-1.4.0-rhel7.2.tar) was installed. Refer to the document “Intel® Xeon Phi™ Processor Software User’s Guide” for details on the installation.

On both host and target machines, the Mellanox OpenFabrics Enterprise Distribution (OFED) for Linux driver MLNX_OFED_LINUX 3.2-2 for Red Hat Enterprise Linux 7.2 is installed to set up the InfiniBand network between the host and target machines. This driver can be downloaded from www.mellanox.com (navigate to Products > Software > InfiniBand/VPI Drivers, and download Mellanox OFED Linux).

Basic Hardware Testing

After you have installed the Mellanox driver on both the host and target machines, test the network cards to ensure the Mellanox InfiniBand HCAs are working properly. To do this, bring the InfiniBand network up, and then test the network link using the ibping command.

First start InfiniBand and the subnet manager on the host, and then display the link information:

[host-device ~]$ sudo service openibd start
Loading HCA driver and Access Layer:                       [  OK  ]

[host-device ~]$ sudo service opensm start
Redirecting to /bin/systemctl start  opensm.service

[host-device ~]$ iblinkinfo
CA: host-device HCA-1:
      0x7cfe900300a13b41      1    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    1[  ] "knl-sb2 HCA-1" ( )
CA: knl-sb2 HCA-1:
      0xf4521403007d2b91      2    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       1    1[  ] "host-device HCA-1" ( )

Similarly, start InfiniBand and the subnet manager on the target, and then display the link information of each port in the InfiniBand network:

[knl-sb2 ~]$ sudo service openibd start
Loading HCA driver and Access Layer:                       [  OK  ]

[knl-sb2 ~]$ sudo service opensm start
Redirecting to /bin/systemctl start  opensm.service

[knl-sb2 ~]$ iblinkinfo
CA: host-device HCA-1:
      0x7cfe900300a13b41      1    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       2    1[  ] "knl-sb2 HCA-1" ( )
CA: knl-sb2 HCA-1:
      0xf4521403007d2b91      2    1[  ] ==( 4X       14.0625 Gbps Active/  LinkUp)==>       1    1[  ] "host-device HCA-1" ( )

iblinkinfo reports the link information for all the ports in the fabric, one at the target machine and one at the host machine. Next, use the ibping command to test the link (it is equivalent to the ping command for Ethernet). Start the ibping server on the host machine using:

[host-device ~]$ ibping -S

From the target machine, ping the port identification of the host:

[knl-sb2 ~]$ ibping -G 0x7cfe900300a13b41
Pong from host-device.(none) (Lid 1): time 0.259 ms
Pong from host-device.(none) (Lid 1): time 0.444 ms
Pong from host-device.(none) (Lid 1): time 0.494 ms

Similarly, start the ibping server on the target machine:

[knl-sb2 ~]$ ibping -S

This time, ping the port identification of the target from the host:

[host-device ~]$ ibping -G 0xf4521403007d2b91
Pong from knl-sb2.jf.intel.com.(none) (Lid 2): time 0.469 ms
Pong from knl-sb2.jf.intel.com.(none) (Lid 2): time 0.585 ms
Pong from knl-sb2.jf.intel.com.(none) (Lid 2): time 0.572 ms

IP over InfiniBand (IPoIB) Configuration

So far we have verified that the InfiniBand network is functional. Next, to use OoF, we must configure IP over InfiniBand (IPoIB). This configuration provides the target IP address that is used to offload computations over fabric.

First verify that the ib_ipoib driver is installed:

[host-device ~]$ lsmod | grep ib_ipoib
ib_ipoib              136906  0
ib_cm                  47035  3 rdma_cm,ib_ucm,ib_ipoib
ib_sa                  33950  5 rdma_cm,ib_cm,mlx4_ib,rdma_ucm,ib_ipoib
ib_core               141088  12 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
mlx_compat             16639  17 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_addr,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm ib_ipoib

If the ib_ipoib driver is not listed, you need to add the module to the Linux kernel using the following command:

[host-device ~]$ modprobe ib_ipoib

Next list the InfiniBand interface ib0 on the host using the ifconfig command:

[host-device ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Configure 10.0.0.1 as the IP address on this interface:

[host-device ~]$ sudo ifconfig ib0 10.0.0.1/24
[host-device ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 2238 (2.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
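
The /24 suffix passed to ifconfig is a CIDR prefix length, and the netmask and broadcast values reported in the output follow from it. As a minimal sketch of that arithmetic (the helper names below are hypothetical and are not part of any networking library):

```c
#include <assert.h>
#include <stdint.h>

/* Pack four octets a.b.c.d into a host-order 32-bit value. */
uint32_t ipv4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | d;
}

/* Netmask for a CIDR prefix length (0..32):
 * prefix 24 -> 0xFFFFFF00, that is, 255.255.255.0. */
uint32_t prefix_to_netmask(unsigned prefix)
{
    assert(prefix <= 32);
    return prefix == 0 ? 0u : (uint32_t)(0xFFFFFFFFu << (32 - prefix));
}

/* Broadcast address: keep the network bits and set every host bit. */
uint32_t broadcast_addr(uint32_t addr, unsigned prefix)
{
    uint32_t mask = prefix_to_netmask(prefix);
    return (addr & mask) | (uint32_t)~mask;
}
```

For 10.0.0.1 with a prefix of 24, these helpers give a netmask of 255.255.255.0 and a broadcast address of 10.0.0.255, matching the ifconfig output.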

Similarly on the target, configure 10.0.0.2 as the IP address on this InfiniBand interface:

[knl-sb2 ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[knl-sb2 ~]$ sudo ifconfig ib0 10.0.0.2/24
[knl-sb2 ~]$ ifconfig ib0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 1024  (InfiniBand)
        RX packets 3  bytes 168 (168.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10  bytes 1985 (1.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Finally, verify the new IP address 10.0.0.2 of the target using the ping command on the host to test the connectivity:

[host-device ~]$ ping 10.0.0.2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=0.443 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=0.410 ms<CTRL-C>

Similarly, from the target, verify the new IP address 10.0.0.1 of the host:

[knl-sb2 ~]$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.313 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.359 ms
64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.375 ms<CTRL-C>

SSH Password-Less Setting (Optional)

When a workload is offloaded to the target machine, Secure Shell (SSH) prompts for the target’s password in order to log on to the target and execute the workload. To allow this without manual intervention, enable SSH login without a password. To do this, first generate a pair of authentication keys on the host without entering a passphrase:

[host-device ~]$ ssh-keygen -t rsa

Then append the host’s new public key to the authorized keys on the target using the ssh-copy-id command (supply your user name on the target before the @):

[host-device ~]$ ssh-copy-id @10.0.0.2

Offload over Fabric

At this point, the high-speed network is enabled and functional. To enable OoF functionality, you need to install Intel® Parallel Studio XE 2017 for Linux on the host. For this tutorial, we installed Intel Parallel Studio XE 2017 Beta Update 1 for Linux. Next, set up your shell environment using:

[host-device]$ source /opt/intel/parallel_studio_xe_2017.0.024/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Beta Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

Below is the sample program used to test the OoF functionality. This sample program allocates and initializes a constant A and buffers x, y, and z on the host, and then offloads the computation to the target using an OpenMP device construct (#pragma omp target map…). The target directive creates a device data environment on the target. At runtime, the values of the variables x, y, and A are copied to the target before the computation begins, and the values of y are copied back to the host when the target completes the computation. In this example, the target prints CPU information from the lscpu command, and then spawns a team of OpenMP threads to multiply a vector by a scalar and add the result to another vector.

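#include <stdio.h>
#include <stdlib.h>   /* malloc, free, system */
#include <string.h>   /* strcpy */
#include <omp.h>      /* omp_get_thread_num, omp_get_num_threads */

int main(int argc, char* argv[])
{
    int i, num = 1024;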
    float A = 2.0f;

    float *x = (float*) malloc(num*sizeof(float));
    float *y = (float*) malloc(num*sizeof(float));
    float *z = (float*) malloc(num*sizeof(float));

    for (i=0; i<num; i++)
    {
       x[i] = i;
       y[i] = 1.5f;
       z[i] = A*x[i] + y[i];
    }

    printf("Workload is executed in a system with CPU information:\n");
    #pragma omp target map(to: x[0:num], A) \
                       map(tofrom: y[0:num])
    {
        char command[64];
        strcpy(command, "lscpu | grep Model");
        system(command);
        int done = 0;

        #pragma omp parallel for
        for (i=0; i<num; i++)
        {
            y[i] = A*x[i] + y[i];

            if ((omp_get_thread_num() == 0) && (done == 0))
            {
               int numthread = omp_get_num_threads();
               printf("Total number of threads: %d\n", numthread);
               done = 1;
            }
        }
    }

    /* Assume success, and fail if any element differs from the host result. */
    int passed = 1;

    for (i=0; i<num; i++)
        if (z[i] != y[i]) passed = 0;

    if (passed == 1)
        printf("PASSED!\n");
    else
        printf("FAILED!\n");

    free(x);
    free(y);
    free(z);

    return 0;
}
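
The sample compares y with the host-computed z using exact floating-point equality. That is safe here because 2.0*i + 1.5 is exactly representable in single precision for every i below 1024. For workloads where the host and target may round differently, a tolerance-based comparison is a common alternative. The helper below is a hypothetical sketch and is not part of the original sample:

```c
#include <assert.h>

/* Return 1 if every pair a[i], b[i] differs by at most tol, else 0.
 * Hypothetical helper; the name and signature are illustrative only. */
int arrays_match(const float *a, const float *b, int n, float tol)
{
    assert(n >= 0 && tol >= 0.0f);
    for (int i = 0; i < n; i++) {
        float diff = a[i] - b[i];
        if (diff < 0.0f) diff = -diff;   /* absolute difference */
        if (diff > tol)
            return 0;                    /* one mismatch fails the check */
    }
    return 1;
}
```

In the sample, a call such as arrays_match(z, y, num, 0.0f) would reproduce the exact comparison, while a small positive tolerance would accept rounding differences.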

Compile this OpenMP program with the Intel compiler option -qoffload-arch=mic-avx512 to indicate the offload portion is built for the 2nd generation Intel Xeon Phi processor. Prior to executing the program, set the environment variable OFFLOAD_NODES to the IP address of the target machine, in this case 10.0.0.2, to indicate that the high-speed network is to be used.

[host-device]$ icc -qopenmp -qoffload-arch=mic-avx512 -o OoF-OpenMP-Affinity OoF-OpenMP-Affinity.c

[host-device]$ export OFFLOAD_NODES=10.0.0.2

[host-device]$ ./OoF-OpenMP-Affinity
Workload is executed in a system with CPU information:
Model:                 87
Model name:            Intel(R) Xeon Phi(TM) CPU 7250 @000000 1.40GHz
PASSED!
Total number of threads: 268

Note that the offload processing is internally handled by the Intel® Coprocessor Offload Infrastructure (Intel® COI). By default, the offload code runs with all OpenMP threads available in the target. The target has 68 cores, and the Intel COI daemon running on one core of the target leaves the remaining 67 cores available; the total number of threads is 268 (4 threads/core). You can use the coitrace command to trace all Intel COI API invocations:

[host-device]$ coitrace ./OoF-OpenMP-Affinity
COIEngineGetCount [ThID:0x7f02fdd04780]
        in_DeviceType = COI_DEVICE_MIC
        out_pNumEngines = 0x7fffc8833e00 0x00000001 (hex) : 1 (dec)

COIEngineGetHandle [ThID:0x7f02fdd04780]
        in_DeviceType = COI_DEVICE_MIC
        in_EngineIndex = 0x00000000 (hex) : 0 (dec)
        out_pEngineHandle = 0x7fffc8833de8 0x7f02f9bc4320

Workload is executed in a system with CPU information:
COIEngineGetHandle [ThID:0x7f02fdd04780]
        in_DeviceType = COI_DEVICE_MIC
        in_EngineIndex = 0x00000000 (hex) : 0 (dec)
        out_pEngineHandle = 0x7fffc88328e8 0x7f02f9bc4320

COIEngineGetInfo [ThID:0x7f02fdd04780]
        in_EngineHandle = 0x7f02f9bc4320
        in_EngineInfoSize = 0x00001440 (hex) : 5184 (dec)
        out_pEngineInfo = 0x7fffc8831410
                DriverVersion:
                DeviceType: COI_DEVICE_KNL
                NumCores: 68
                NumThreads: 272

<truncate here>
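
The two thread counts seen so far (272 reported by COIEngineGetInfo and 268 reported by the sample program) follow from simple arithmetic on the core count. A minimal sketch, using a hypothetical helper name:

```c
#include <assert.h>

/* Threads usable by offloaded code: cores not reserved for system
 * services, multiplied by the hardware threads per core. */
int offload_threads(int total_cores, int reserved_cores, int threads_per_core)
{
    assert(reserved_cores >= 0 && total_cores >= reserved_cores);
    assert(threads_per_core > 0);
    return (total_cores - reserved_cores) * threads_per_core;
}
```

With 68 cores and 4 threads per core, reserving no cores gives 272 hardware threads, and reserving one core for the Intel COI daemon gives the 268 threads seen in the program output.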

OpenMP* Thread Affinity

The result from the above program shows the default number of threads (268) that run on the target; however, you can set the number of threads that run on the target explicitly. One method uses environment variables on the host to modify the target’s execution environment. First, define a target-specific environment variable prefix, and then add this prefix to the OpenMP thread affinity environment variables. For example, the following environment variable settings configure the offload runtime to use 8 threads on the target:

[host-device]$ export MIC_ENV_PREFIX=PHI
[host-device]$ export PHI_OMP_NUM_THREADS=8
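
Conceptually, the prefix marks which host variables are meant for the target: a host variable named PHI_OMP_NUM_THREADS corresponds to OMP_NUM_THREADS in the target's environment. The following sketch only illustrates that naming rule, under the assumption that the prefix and an underscore are stripped; the function is hypothetical and is not how the offload runtime is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Given a prefix (for example "PHI") and a host variable name (for example
 * "PHI_OMP_NUM_THREADS"), return a pointer to the name the target would
 * see ("OMP_NUM_THREADS"), or NULL if the name lacks "<prefix>_". */
const char *target_env_name(const char *prefix, const char *host_name)
{
    assert(prefix != NULL && host_name != NULL);
    size_t len = strlen(prefix);
    if (len == 0 || strncmp(host_name, prefix, len) != 0 || host_name[len] != '_')
        return NULL;
    return host_name + len + 1;   /* skip "<prefix>_" */
}
```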

The Intel OpenMP runtime extensions, the KMP_PLACE_THREADS and KMP_AFFINITY environment variables, can be used to bind threads to physical processing units such as cores (refer to the section Thread Affinity Interface in the Intel® C++ Compiler User and Reference Guide for more information). For example, the following environment variable settings configure the offload runtime to use 8 threads close to each other:

[host-device]$ export PHI_KMP_AFFINITY=verbose,granularity=thread,compact
[host-device]$ ./OoF-OpenMP-Affinity

You can also use OpenMP affinity by using the OMP_PROC_BIND environment variable. For example, to duplicate the previous example to run 8 threads close to each other using OMP_PROC_BIND use the following:

[host-device]$ export MIC_ENV_PREFIX=PHI
[host-device]$ export PHI_KMP_AFFINITY=verbose
[host-device]$ export PHI_OMP_PROC_BIND=close
[host-device]$ export PHI_OMP_NUM_THREADS=8
[host-device]$ ./OoF-OpenMP-Affinity

Or run with 8 threads and spread them out using:

[host-device]$ export PHI_OMP_PROC_BIND=spread
[host-device]$ ./OoF-OpenMP-Affinity

The result is shown in the following table:

OpenMP* thread number    Core number
0                        0
1                        10
2                        19
3                        31
4                        39
5                        48
6                        56
7                        65

To run 8 threads, 2 threads/core (4 cores total) use:

[host-device]$ export PHI_OMP_PROC_BIND=close
[host-device]$ export PHI_OMP_PLACES="cores(4)"
[host-device]$ export PHI_OMP_NUM_THREADS=8
[host-device]$ ./OoF-OpenMP-Affinity

The result is shown in the following table:

OpenMP* thread number    Core number
0                        0
1                        0
2                        1
3                        1
4                        2
5                        2
6                        3
7                        3
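
The pairing in this last result is what the close policy produces when there are more threads than places: consecutive thread numbers are grouped and each group is assigned to one place. A simplified model of that mapping, assuming the thread count is a multiple of the place count (the function name is hypothetical):

```c
#include <assert.h>

/* Place index for thread t when nthreads threads are bound "close" over
 * nplaces places. Simplified model: assumes nthreads is a multiple of
 * nplaces, so each place receives nthreads / nplaces consecutive threads. */
int close_place(int t, int nthreads, int nplaces)
{
    assert(nplaces > 0 && nthreads > 0 && nthreads % nplaces == 0);
    assert(t >= 0 && t < nthreads);
    return t / (nthreads / nplaces);
}
```

With 8 threads and 4 places, threads 0 and 1 map to place 0, threads 2 and 3 to place 1, and so on, which matches the core numbers listed for the cores(4) run.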

Summary

This tutorial shows details on how to set up and run an OoF application. Hardware and software installations were presented. Mellanox InfiniBand Host Channel Adapters were used in this example, but Intel OPA can be used instead. The sample code was an OpenMP offload programming model application that demonstrates running on an Intel Xeon processor host and offloading the computation to an Intel Xeon Phi processor target using a high-speed network. This tutorial also showed how to compile and run the offload program for the Intel Xeon Phi processor and control the OpenMP Thread Affinity on the Intel Xeon Phi processor.

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Intel® Software Guard Extensions Tutorial Series: Part 4, Enclave Design


In Part 4 of the Intel® Software Guard Extensions (Intel® SGX) tutorial series we’ll be designing our enclave and its interface. We’ll take a look at the enclave boundary that was defined in Part 3 and identify the necessary bridge functions, examine the impact the bridge functions have on the object model, and create the project infrastructure necessary to integrate the enclave into our application. We’ll only be stubbing the enclave ECALLs at this point; full enclave integration will come in Part 5 of the series.

You can find the list of all of the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

There is source code provided with this installment of the series: the enclave stub and interface functions are provided for you to download.

Application Architecture

Before we jump into designing the enclave interface, we need to take a moment and consider the overall application architecture. As discussed in Part 1, enclaves are implemented as dynamically loaded libraries (DLLs under Windows* and shared libraries under Linux*) and they can only link against 100-percent native C code.

The Tutorial Password Manager, however, will have a GUI written in C#. It uses a mixed-mode assembly written in C++/CLI to get us from managed to unmanaged code, but while that assembly contains native code it is not a 100-percent native module and it cannot interface directly with an Intel SGX enclave. Attempts to incorporate the untrusted enclave bridge functions in C++/CLI assemblies will result in a fatal error:

	Command line error D8045: cannot compile C file 'Enclave_u.c' with the /clr option

That means we need to place the untrusted bridge functions in a separate DLL that is all native code. As a result, our application will need to have, at minimum, three DLLs: the C++/CLI core, the enclave bridge, and the enclave itself. This structure is shown in Figure 1.


Figure 1. Component makeup for a mixed-mode application with enclaves.

Further Refinements

Since the enclave bridge functions must reside in a separate DLL, we’ll go a step further and place all the functions that deal directly with the enclave in that same DLL. This compartmentalization of the application layers will not only make the program easier to manage (and debug), but will also ease integration by lessening the impact on the other modules. When a class or module has a specific task with a clearly defined boundary, changes to other modules are less likely to impact it.

In this case, the PasswordManagerCoreNative class should not be burdened with the additional task of instantiating enclaves. It just needs to know whether or not Intel SGX is supported on the platform so that it can execute the appropriate function.

As an example, the following code block shows the vault_unlock() method:

int PasswordManagerCoreNative::vault_unlock(const LPWSTR wpassphrase)
{
	int rv;
	UINT16 size;

	char *mbpassphrase = tombs(wpassphrase, -1, &size);
	if (mbpassphrase == NULL) return NL_STATUS_ALLOC;

	rv= vault.unlock(mbpassphrase);

	SecureZeroMemory(mbpassphrase, size);
	delete[] mbpassphrase;

	return rv;
}

This is a pretty simple method that takes the user’s passphrase as a wchar_t string, converts it to a variable-length encoding (UTF-8), and then calls the unlock() method in the vault object. Rather than clutter up this class, and this method, with enclave-handling functions and logic, it would be best to add enclave support to this method through a one-line addition:

int PasswordManagerCoreNative::vault_unlock(const LPWSTR wpassphrase)
{
	int rv;
	UINT16 size;

	char *mbpassphrase = tombs(wpassphrase, -1, &size);
	if (mbpassphrase == NULL) return NL_STATUS_ALLOC;

	// Call the enclave bridge function if we support Intel SGX
	if (supports_sgx()) rv = ew_unlock(mbpassphrase);
	else rv= vault.unlock(mbpassphrase);

	SecureZeroMemory(mbpassphrase, size);
	delete[] mbpassphrase;

	return rv;
}

Our goal will be to put as little enclave awareness into this class as is feasible. The only other additions the PasswordManagerCoreNative class needs are a flag for Intel SGX support and methods to both set and get it.

class PASSWORDMANAGERCORE_API PasswordManagerCoreNative
{
	int _supports_sgx;

	// Other class members omitted for clarity

protected:
	void set_sgx_support(void) { _supports_sgx = 1; }
	int supports_sgx(void) { return _supports_sgx; }

Designing the Enclave

Now that we have an overall application plan in place, it’s time to start designing the enclave and its interface. To do that, we return to the class diagram for the application core in Figure 2, which was first introduced in Part 3. The objects that will reside in the enclave are shaded in green while the untrusted components are shaded in blue.


Figure 2. Class diagram for the Tutorial Password Manager with Intel® Software Guard Extensions.

The enclave boundary only crosses one connection: the link between the PasswordManagerCoreNative object and the Vault object. That suggests that the majority of our ECALLs will simply be wrappers around the class methods in Vault. We’ll also need to add some additional ECALLs to manage the enclave infrastructure. One of the complications of enclave development is that the ECALLs, OCALLs, and bridge functions must be native C code, and we are making extensive use of C++ features. Once the enclave has been launched, we’ll also need functions that span the gap between C and C++ (objects, constructors, overloads, and others).

The wrapper and bridge functions will go in their own DLL, which we’ll name EnclaveBridge.dll. For clarity, we’ll prefix the wrapper functions with ew_ (for “enclave wrapper”), and the bridge functions that make the ECALLs with ve_ (for “vault enclave”).

Calls from PasswordManagerCoreNative to the corresponding method in Vault will follow the basic flow shown in Figure 3.


Figure 3. Execution flow for bridge functions and ECALLs.

The method in PasswordManagerCoreNative will call into the wrapper function in EnclaveBridge.dll. That wrapper will, in turn, invoke one or more ECALLs, which enter the enclave and invoke the corresponding class method in the Vault object. Once all ECALLs have completed, the wrapper function returns back to the calling method in PasswordManagerCoreNative and provides it with a return value.

Enclave Logistics

The first step in designing the enclave is working out a system for managing the enclave itself. The enclave must be launched and the resulting enclave ID must be provided to the ECALLs. Ideally, this should be transparent to the upper layers of the application.

The easiest solution for the Tutorial Password Manager is to use global variables in the EnclaveBridge DLL to hold the enclave information. This design decision comes with a restriction: only one thread can be active in the enclave at a time. This is a reasonable solution because the password manager application would not benefit from having multiple threads operating on the vault. Most of its actions are driven by the user interface and do not consume a significant amount of CPU time.

To solve the transparency problem, each wrapper function will first call a function to check to see if the enclave has been launched, and launch it if it hasn’t. This logic is fairly simple:

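#include <tchar.h>     // _T()
#include "sgx_urts.h"  // sgx_create_enclave() and related types

#define ENCLAVE_FILE _T("Enclave.signed.dll")

static sgx_enclave_id_t enclaveId = 0;
static sgx_launch_token_t launch_token = { 0 };
static int updated= 0;
static int launched = 0;
static sgx_status_t sgx_status= SGX_SUCCESS;

// Forward declaration: get_enclave() calls create_enclave() before it is defined.
static int create_enclave(sgx_enclave_id_t *eid);

// Ensure the enclave has been created/launched.

static int get_enclave(sgx_enclave_id_t *eid)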
{
	if (launched) return 1;
	else return create_enclave(eid);
}

static int create_enclave(sgx_enclave_id_t *eid)
{
	sgx_status = sgx_create_enclave(ENCLAVE_FILE, SGX_DEBUG_FLAG, &launch_token, &updated, &enclaveId, NULL);
	if (sgx_status == SGX_SUCCESS) {
		if ( eid != NULL ) *eid = enclaveId;
		launched = 1;
		return 1;
	}

	return 0;
}

Each wrapper function will start by calling get_enclave(), which checks to see if the enclave has been launched by examining a static variable. If it has, get_enclave() returns immediately; if it has not, create_enclave() launches the enclave and (optionally) populates the eid pointer with the enclave ID. This step is optional because the enclave ID is also stored as a global variable, enclaveId, which can of course just be used directly.

What happens if an enclave is lost due to a power event or a bug that causes it to crash? For that, we check the return value of the ECALL: it indicates the success or failure of the ECALL operation itself, not of the function being called in the enclave.

sgx_status = ve_initialize(enclaveId, &vault_rv);

The return value of the function being called in the enclave, if any, is transferred via the pointer which is provided as the second argument to the ECALL (these function prototypes are generated for you automatically by the Edger8r tool). You must always check the return value of the ECALL itself. Any result other than SGX_SUCCESS indicates that the program did not successfully enter the enclave and the requested function did not run. (Note that we’ve defined sgx_status as a global variable as well. This is another simplification stemming from our single-threaded design.)
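
The two levels of return value can be illustrated without the Intel SGX SDK. In the sketch below every name is a hypothetical stand-in: fake_ecall_unlock() plays the role of a generated ECALL proxy, ST_SUCCESS and ST_ENCLAVE_LOST stand in for the SDK's status codes, and WRAPPER_ERR_ENCLAVE is an invented error code. It shows only the order of the checks, not the real API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status codes; the real ones are defined by the Intel SGX SDK. */
typedef enum { ST_SUCCESS = 0, ST_ENCLAVE_LOST = 1 } ecall_status_t;

#define WRAPPER_ERR_ENCLAVE (-100)   /* hypothetical wrapper error code */

/* Stand-in for an ECALL proxy. Its own return value reports whether the
 * enclave was entered; *retval carries the result of the function that
 * ran inside the enclave. */
ecall_status_t fake_ecall_unlock(int enclave_ok, int *retval, const char *passphrase)
{
    assert(retval != NULL);
    if (!enclave_ok) return ST_ENCLAVE_LOST;   /* never entered the enclave */
    *retval = (passphrase != NULL && passphrase[0] != '\0') ? 0 : -1;
    return ST_SUCCESS;
}

/* Wrapper in the style of the ew_ functions: check the ECALL status first,
 * and only then trust the value returned from inside the enclave. */
int wrapper_unlock(int enclave_ok, const char *passphrase)
{
    int vault_rv = 0;
    ecall_status_t st = fake_ecall_unlock(enclave_ok, &vault_rv, passphrase);
    if (st != ST_SUCCESS) return WRAPPER_ERR_ENCLAVE;
    return vault_rv;
}
```

If the status check were skipped, vault_rv would still hold its initial value of 0 after a failed ECALL and could be mistaken for success.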

We’ll add a function that examines the error returned by the ECALL and checks for a lost or crashed enclave:

static int lost_enclave()
{
	if (sgx_status == SGX_ERROR_ENCLAVE_LOST || sgx_status == SGX_ERROR_ENCLAVE_CRASHED) {
		launched = 0;
		return 1;
	}

	return 0;
}

These are recoverable errors. The upper layers don’t currently have logic to deal with these specific conditions, but we provide it in the EnclaveBridge DLL in order to support future enhancements.
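
One way an upper layer could eventually use this is to relaunch the enclave and retry the failed call. The sketch below is an assumption about how such logic might look, not part of the tutorial's code: the status codes, callback types, and demo callbacks are all hypothetical, and the SDK is not involved.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in status codes; the real ones come from the Intel SGX SDK. */
typedef enum { CALL_OK = 0, CALL_ENCLAVE_LOST = 1, CALL_OTHER_ERROR = 2 } call_status_t;

/* Call op(ctx); while it reports a lost enclave, call relaunch(ctx) and
 * try again, up to max_retries times. */
call_status_t call_with_retry(call_status_t (*op)(void *),
                              int (*relaunch)(void *),
                              void *ctx, int max_retries)
{
    assert(op != NULL && relaunch != NULL && max_retries >= 0);
    call_status_t st = op(ctx);
    while (st == CALL_ENCLAVE_LOST && max_retries-- > 0) {
        if (!relaunch(ctx)) break;   /* could not recreate the enclave */
        st = op(ctx);
    }
    return st;
}

/* Demo callbacks: the operation fails until failures_left reaches zero. */
struct demo_ctx { int failures_left; int relaunches; };

call_status_t demo_op(void *p)
{
    struct demo_ctx *c = p;
    return c->failures_left > 0 ? CALL_ENCLAVE_LOST : CALL_OK;
}

int demo_relaunch(void *p)
{
    struct demo_ctx *c = p;
    c->failures_left--;
    c->relaunches++;
    return 1;
}
```

A real implementation would also have to restore whatever state the lost enclave held, since a newly created enclave starts empty.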

Also notice that there is no function provided to destroy the enclave. As long as the user has the password manager application open, the enclave is in place even if they choose to lock their vault. This is not good enclave etiquette. Enclaves draw from a finite pool of resources, even when idle. We’ll address this problem in a future segment of the series when we talk about data sealing.

The Enclave Definition Language

Before moving on to the actual enclave design, we’ll take a few moments to discuss the Enclave Definition Language (EDL) syntax. An enclave’s bridge functions, both its ECALLs and OCALLs, are prototyped in its EDL file, whose general structure is as follows:

enclave {
	// Include files

	// Import other edl files

	// Data structure declarations to be used as parameters of the function prototypes in edl

	trusted {
	// Include file if any. It will be inserted in the trusted header file (enclave_t.h)

	// Trusted function prototypes (ECALLs)

	};

	untrusted {
	// Include file if any. It will be inserted in the untrusted header file (enclave_u.h)

	// Untrusted function prototypes (OCALLs)

	};
};

ECALLs are prototyped in the trusted section, and OCALLs are prototyped in the untrusted section.

The EDL syntax is C-like, and its function prototypes closely resemble C function prototypes, but the two are not identical. In particular, bridge function parameters and return values are limited to some fundamental data types, and the EDL includes additional keywords and syntax that define some enclave behavior. The Intel® Software Guard Extensions (Intel® SGX) SDK User’s Guide explains the EDL syntax in great detail and includes a tutorial for creating a sample enclave. Rather than repeat all of that here, we’ll just discuss those elements of the language that are specific to our application.

When parameters are passed to enclave functions, they are marshaled into the protected memory space of the enclave. For parameters passed as values, no special action is required as the values are placed on the protected stack in the enclave just as they would be for any other function call. The situation is quite different for pointers, however.

For parameters passed as pointers, the data referenced by the pointer must be marshaled into and out of the enclave. The edge routines that perform this data marshalling need to know two things:

  1. Which direction should the data be copied: into the bridge function, out of the bridge function, or both directions?
  2. What is the size of the data buffer referenced by the pointer?

Pointer Direction

When providing a pointer parameter to a function, you must specify the direction by the keywords in brackets: [in], [out], or [in, out], respectively. Their meaning is given in Table 1.

Direction | ECALL | OCALL
--------- | ----- | -----
in | The buffer is copied from the application into the enclave. Changes will only affect the buffer inside the enclave. | The buffer is copied from the enclave to the application. Changes will only affect the buffer outside the enclave.
out | A buffer will be allocated inside the enclave and initialized with zeros. It will be copied to the original buffer when the ECALL exits. | A buffer will be allocated outside the enclave and initialized with zeros. This untrusted buffer will be copied to the original buffer in the enclave when the OCALL exits.
in, out | Data is copied back and forth. | Same as ECALLs.

Table 1. Pointer direction parameters and their meanings in ECALLs and OCALLs.

Note from the table that the direction is relative to the bridge function being called. For an ECALL, [in] means “copy the buffer to the enclave,” but for an OCALL it’s “copy the buffer to the untrusted function.”

(There is also the option called user_check that can be used in place of these, but it’s not relevant to our discussion. See the SDK documentation for information on its use and purpose.)
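The copy semantics in Table 1 can be modeled in ordinary C. The sketch below is a simulation only: real edge routines are generated by the Edger8r tool and copy data across the enclave boundary, whereas here a local array stands in for enclave memory. The names dir_t, trusted_body(), and simulated_ecall() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BUF 64

typedef enum { DIR_IN, DIR_OUT, DIR_IN_OUT } dir_t;

/* What the "trusted" function observed in the first byte of its buffer. */
char seen_first;

/* Stand-in for the function running inside the enclave: it records the
   first byte it was given, then overwrites the whole buffer with 'X'. */
void trusted_body(char *buf, size_t len)
{
	seen_first = buf[0];
	memset(buf, 'X', len);
}

/* Models an ECALL edge routine for one pointer parameter.
   Returns the first byte the trusted function saw. */
char simulated_ecall(dir_t dir, char *app_buf, size_t len)
{
	char enclave_buf[MAX_BUF]; /* stands in for protected memory */

	assert(app_buf != NULL && len > 0 && len <= MAX_BUF);

	if (dir == DIR_IN || dir == DIR_IN_OUT)
		memcpy(enclave_buf, app_buf, len); /* [in]: copy into the enclave */
	else
		memset(enclave_buf, 0, len);       /* [out]: starts zero-filled */

	trusted_body(enclave_buf, len);

	if (dir == DIR_OUT || dir == DIR_IN_OUT)
		memcpy(app_buf, enclave_buf, len); /* [out]: copy back on exit */

	return seen_first;
}
```

With DIR_IN the caller's buffer is unchanged after the call, with DIR_OUT the trusted function sees zeros and the caller's buffer is overwritten on return, and with DIR_IN_OUT both copies happen.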

Buffer Size

The edge routines calculate the total buffer size, in bytes, as:

bytes = element_size * element_count

By default, the edge routines assume element_count = 1, and calculate element_size from the element referenced by the pointer parameter, e.g., for an integer pointer it assumes element_size is:

sizeof(int)

For a single element of a fixed data type, such as an int or a float, no additional information needs to be provided in the EDL prototype for the function. For a void pointer, you must specify an element size or you’ll get an error at compile time. For arrays, char and wchar_t strings, and other types where the length of the data buffer is more than one element you must specify the number of elements in the buffer or only one element will be copied.
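The size rule above amounts to a single multiplication. The helper below is a hypothetical illustration, not the code that Edger8r generates; the overflow guard is an added assumption about what a careful implementation would want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes bytes = element_size * element_count.
   Returns 1 and stores the result in *bytes, or returns 0 if the
   multiplication would overflow size_t. */
int edge_buffer_bytes(size_t element_size, size_t element_count, size_t *bytes)
{
	assert(bytes != NULL);

	if (element_size != 0 && element_count > SIZE_MAX / element_size)
		return 0;

	*bytes = element_size * element_count;
	return 1;
}
```

For an int pointer with no count, the call would be edge_buffer_bytes(sizeof(int), 1, &n). For a char buffer with count=12, it would be edge_buffer_bytes(sizeof(char), 12, &n).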

Add either the count or size parameter (or both) to the bracketed keywords for the pointer as appropriate. They can be set to a constant value or one of the parameters to the function. For most cases, count and size are functionally the same, but it’s good practice to use them in their correct contexts. Strictly speaking, you would only specify size when passing a void pointer. Everything else would use count.

If you are passing a C string or wstring (a NULL-terminated char or wchar_t array), then you can use the string or wstring parameter in place of count or size. In this case, the edge routines will determine the size of the buffer by getting the length of the string directly.

function([in, size=12] void *param);
function([in, count=len] char *buffer, uint32_t len);
function([in, string] char *cstr);

Note that you can only use string or wstring if the direction is set to [in] or [in, out]. When the direction is set only to [out], the string has not yet been created so the edge routine can’t know the size of the buffer. Specifying [out, string] will generate an error at compile time.

Wrapper and Bridge Functions

We are now ready to define our wrapper and bridge functions. As we pointed out above, the majority of our ECALLs will be wrappers around the class methods in Vault. The class definition for the public member functions is shown below:

class PASSWORDMANAGERCORE_API Vault
{
	// Non-public methods and members omitted for brevity

public:
	Vault();
	~Vault();

	int initialize();
	int initialize(const char *header, UINT16 size);
	int load_vault(const char *edata);

	int get_header(unsigned char *header, UINT16 *size);
	int get_vault(unsigned char *edata, UINT32 *size);

	UINT32 get_db_size();

	void lock();
	int unlock(const char *password);

	int set_master_password(const char *password);
	int change_master_password(const char *oldpass, const char *newpass);

	int accounts_get_count(UINT32 *count);
	int accounts_get_info(UINT32 idx, char *mbname, UINT16 *mbname_len, char *mblogin, UINT16 *mblogin_len, char *mburl, UINT16 *mburl_len);

	int accounts_get_password(UINT32 idx, char **mbpass, UINT16 *mbpass_len);

	int accounts_set_info(UINT32 idx, const char *mbname, UINT16 mbname_len, const char *mblogin, UINT16 mblogin_len, const char *mburl, UINT16 mburl_len);
	int accounts_set_password(UINT32 idx, const char *mbpass, UINT16 mbpass_len);

	int accounts_generate_password(UINT16 length, UINT16 pwflags, char *cpass);

	int is_valid() { return _VST_IS_VALID(state); }
	int is_locked() { return ((state&_VST_LOCKED) == _VST_LOCKED) ? 1 : 0; }
};

There are several problem functions in this class. Some of them are immediately obvious, such as the constructor, destructor, and the overloads for initialize(). These are C++ features that we must invoke using C functions. Some of the problems, though, are not immediately obvious because they stem from the function’s inherent design. (Some of these problem methods were poorly designed on purpose so that we could cover specific issues in this tutorial, but some were just poorly designed, period!) We’ll tackle each problem, one by one, presenting both the prototypes for the wrapper functions and the EDL prototypes for the proxy/bridge routines.

The Constructor and Destructor

In the non-Intel SGX code path, the Vault class is a member of PasswordManagerCoreNative. We can’t do this for the Intel SGX code path; however, the enclave can include C++ code so long as the bridge functions themselves are pure C functions.

Since we have already limited the enclave to a single thread, we can make the Vault class a static, global object in the enclave. This greatly simplifies our code and eliminates the need for creating bridge functions and logic to instantiate it.

The Overload on initialize()

There are two prototypes for the initialize() method:

  1. The method with no arguments initializes the Vault object for a new password vault with no contents. This is a password vault that the user is creating for the first time.
  2. The method with two arguments initializes the Vault object from the header of the vault file. This represents an existing password vault that the user is opening (and, later on, attempting to unlock).

This will be broken up into two wrapper functions:

ENCLAVEBRIDGE_API int ew_initialize();
ENCLAVEBRIDGE_API int ew_initialize_from_header(const char *header, uint16_t hsize);

And the corresponding ECALLs will be defined as:

public int ve_initialize ();
public int ve_initialize_from_header ([in, count=len] unsigned char *header, uint16_t len);

get_header()

This method has a fundamental design issue. Here’s the prototype:

int get_header(unsigned char *header, uint16_t *size);

This function accomplishes one of two tasks:

  1. It gets the header block for the vault file and places it in the buffer pointed to by header. The caller must allocate enough memory to store this data.
  2. If you pass a NULL pointer in the header parameter, the uint16_t pointed to by size is set to the size of the header block, so that the caller knows how much memory to allocate.

This is a fairly common compaction technique in some programming circles, but it presents a problem for enclaves: when you pass a pointer to an ECALL or an OCALL, the edge functions copy the data referenced by the pointer into or out of the enclave (or both). Those edge functions need to know the size of the data buffer so they know how many bytes to copy. The first usage involves a valid pointer with a variable size which is not a problem, but the second usage has a NULL pointer and a size of zero.

We could probably come up with an EDL prototype for the ECALL that could make this work, but clarity should generally trump brevity. It’s better to split this into two ECALLs:

public int ve_get_header_size ([out] uint16_t *sz);
public int ve_get_header ([out, count=len] unsigned char *header, uint16_t len);

The enclave wrapper function will take care of the necessary logic so that we don’t have to make changes to other classes:

ENCLAVEBRIDGE_API int ew_get_header(unsigned char *header, uint16_t *size)
{
	int vault_rv;

	if (!get_enclave(NULL)) return NL_STATUS_SGXERROR;

	if ( header == NULL ) sgx_status = ve_get_header_size(enclaveId, &vault_rv, size);
	else sgx_status = ve_get_header(enclaveId, &vault_rv, header, *size);

	RETURN_SGXERROR_OR(vault_rv);
}
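From the caller's side, the two-step convention that ew_get_header() preserves looks like the sketch below. The stub stands in for the real wrapper so that the example compiles on its own. Its 16-byte header, its 1/0 return convention, and the helper name fetch_header() are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STUB_HEADER_SIZE 16 /* hypothetical header size */

/* Stub with the same parameter convention as ew_get_header().
   Returns 1 on success and 0 on failure (an assumed convention). */
int stub_get_header(unsigned char *header, uint16_t *size)
{
	assert(size != NULL);

	if (header == NULL) {                   /* size query */
		*size = STUB_HEADER_SIZE;
		return 1;
	}
	if (*size < STUB_HEADER_SIZE) return 0; /* caller's buffer is too small */

	memset(header, 0xAB, STUB_HEADER_SIZE); /* placeholder header contents */
	return 1;
}

/* Query the size, allocate a buffer, then fetch the header.
   The caller owns the returned buffer and must free() it. */
unsigned char *fetch_header(uint16_t *size)
{
	unsigned char *buf;

	if (!stub_get_header(NULL, size)) return NULL;

	buf = malloc(*size);
	if (buf == NULL) return NULL;

	if (!stub_get_header(buf, size)) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

Because the wrapper keeps the original NULL-pointer convention, calling code written this way does not need to know that two separate ECALLs are made underneath.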

accounts_get_info()

This method operates similarly to get_header(): pass a NULL pointer and it returns the size of the object in the corresponding parameter. However, it is uglier and sloppier because it does this for several buffer parameters at once. It is better off being broken up into two wrapper functions:

ENCLAVEBRIDGE_API int ew_accounts_get_info_sizes(uint32_t idx, uint16_t *mbname_sz, uint16_t *mblogin_sz, uint16_t *mburl_sz);
ENCLAVEBRIDGE_API int ew_accounts_get_info(uint32_t idx, char *mbname, uint16_t mbname_sz, char *mblogin, uint16_t mblogin_sz, char *mburl, uint16_t mburl_sz);

And two corresponding ECALLs:

public int ve_accounts_get_info_sizes (uint32_t idx, [out] uint16_t *mbname_sz, [out] uint16_t *mblogin_sz, [out] uint16_t *mburl_sz);
public int ve_accounts_get_info (uint32_t idx,
	[out, count=mbname_sz] char *mbname, uint16_t mbname_sz,
	[out, count=mblogin_sz] char *mblogin, uint16_t mblogin_sz,
	[out, count=mburl_sz] char *mburl, uint16_t mburl_sz
);

accounts_get_password()

This is the worst offender of the lot. Here’s the prototype:

int accounts_get_password(UINT32 idx, char **mbpass, UINT16 *mbpass_len);

The first thing you’ll notice is that it passes a pointer to a pointer in mbpass. This method is allocating memory.

In general, this is not a good design. No other method in the Vault class allocates memory so it is internally inconsistent, and the API violates convention by not providing a method to free this memory on the caller’s behalf. It also poses a unique problem for enclaves: an enclave cannot allocate memory in untrusted space.

This could be handled in the wrapper function. It could allocate the memory and then make the ECALL and it would all be transparent to the caller, but we have to modify the method in the Vault class, regardless, so we should just fix this the correct way and make the corresponding changes to PasswordManagerCoreNative. The caller should be given two functions: one to get the password length and one to fetch the password, just as with the previous two examples. PasswordManagerCoreNative should be responsible for allocating the memory, not any of these functions (the non-Intel SGX code path should be changed, too).

ENCLAVEBRIDGE_API int ew_accounts_get_password_size(uint32_t idx, uint16_t *len);
ENCLAVEBRIDGE_API int ew_accounts_get_password(uint32_t idx, char *mbpass, uint16_t len);

The EDL definition should look familiar by now:

public int ve_accounts_get_password_size (uint32_t idx, [out] uint16_t *mbpass_sz);
public int ve_accounts_get_password (uint32_t idx, [out, count=mbpass_sz] char *mbpass, uint16_t mbpass_sz);

load_vault()

The problem with load_vault() is subtle. The prototype is fairly simple, and at first glance it may look completely innocuous:

int load_vault(const char *edata);

What this method does is load the encrypted, serialized password database into the Vault object. Because the Vault object has already read the header, it knows how large the incoming buffer will be.

The issue here is that the enclave’s edge functions don’t have this information. A length has to be explicitly given to the ECALL so that the edge function knows how many bytes to copy from the incoming buffer into the enclave’s internal buffer, but the size is stored inside the enclave. It’s not available to the edge function.

The wrapper function’s prototype can mirror the class method’s prototype, as follows:

ENCLAVEBRIDGE_API int ew_load_vault(const unsigned char *edata);

The ECALL, however, needs to take the size of the vault data as a parameter so that it can be used to define the size of the incoming data buffer in the EDL file:

public int ve_load_vault ([in, count=len] unsigned char *edata, uint32_t len);

To keep this transparent to the caller, the wrapper function will be given extra logic. It will be responsible for fetching the vault size from the enclave and then passing it through as a parameter to this ECALL.

ENCLAVEBRIDGE_API int ew_load_vault(const unsigned char *edata)
{
	int vault_rv;
	uint32_t dbsize;

	if (!get_enclave(NULL)) return NL_STATUS_SGXERROR;

	// We need to get the size of the password database before entering the enclave
	// to send the encrypted blob.

	sgx_status = ve_get_db_size(enclaveId, &dbsize);
	if (sgx_status == SGX_SUCCESS) {
		// Now we can send the encrypted vault data across.

		sgx_status = ve_load_vault(enclaveId, &vault_rv, (unsigned char *) edata, dbsize);
	}

	RETURN_SGXERROR_OR(vault_rv);
}

A Few Words on Unicode

In Part 3, we mentioned that the PasswordManagerCoreNative class is also tasked with converting between wchar_t and char strings. Given that enclaves support the wchar_t data type, why do this at all?

This is a design decision intended to minimize our footprint. In Windows, the wchar_t data type is the native encoding for Win32 APIs and it stores UTF-16 encoded characters. In UTF-16, each character takes at least 16 bits in order to support non-ASCII characters, particularly for languages that aren’t based on the Latin alphabet or have a large number of characters. The problem with UTF-16 is that a character is never shorter than 16 bits, even when encoding plain ASCII text.

Rather than store twice as much data both on disk and inside the enclave for the common case where the user’s account information is in plain ASCII and incur the performance penalty of having to copy and encrypt those extra bytes, the Tutorial Password Manager converts all of the strings coming from .NET to the UTF-8 encoding. UTF-8 is a variable-length encoding, where each character is represented by one to four 8-bit bytes. It is backwards-compatible with ASCII and it results in a much more compact encoding than UTF-16 for plain ASCII text. There are cases where UTF-8 will result in longer strings than UTF-16, but for our tutorial password manager we’ll accept that tradeoff.
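The size tradeoff between the two encodings can be checked with a few lines of C. This is a minimal sketch that counts bytes per Unicode code point using the standard UTF-8 and UTF-16 length rules. The function names are hypothetical, and the code assumes its input contains valid code points rather than surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
	assert(cp <= 0x10FFFF);
	if (cp < 0x80) return 1;    /* ASCII */
	if (cp < 0x800) return 2;
	if (cp < 0x10000) return 3; /* rest of the Basic Multilingual Plane */
	return 4;                   /* supplementary planes */
}

/* Bytes needed to encode one Unicode code point in UTF-16:
   one 16-bit unit, or two (a surrogate pair) above U+FFFF. */
size_t utf16_bytes(uint32_t cp)
{
	assert(cp <= 0x10FFFF);
	return (cp < 0x10000) ? 2 : 4;
}

/* Total encoded size of a sequence of code points in both encodings. */
void encoded_sizes(const uint32_t *cps, size_t n, size_t *utf8, size_t *utf16)
{
	size_t i;

	assert(cps != NULL && utf8 != NULL && utf16 != NULL);
	*utf8 = 0;
	*utf16 = 0;
	for (i = 0; i < n; i++) {
		*utf8 += utf8_bytes(cps[i]);
		*utf16 += utf16_bytes(cps[i]);
	}
}
```

A four-character ASCII string comes to 4 bytes in UTF-8 against 8 in UTF-16, while two CJK characters such as U+5BC6 and U+7801 come to 6 bytes in UTF-8 against 4 in UTF-16, which is the tradeoff described above.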

A commercial application would choose the best encoding for the user’s native language, and then record that encoding in the vault (so that it would know which encoding was used to create it in case the vault is opened on a system using a different native language).

Sample Code

As mentioned in the introduction, there is sample code provided with this part for you to download. The attached archive includes the source code for the Tutorial Password Manager bridge DLL and the enclave DLL. The enclave functions are just stubs at this point, and they will be filled out in Part 5.

Coming Up Next

In Part 5 of the tutorial we’ll complete the enclave by porting the Crypto, DRNG, and Vault classes to the enclave, and connecting them to the ECALLs. Stay tuned!

[Guest Post] The Art of User Engagement

In this guest article, Intel Innovator, Peter O'Hanlon, explains the art of user engagement, from how to create it to how to measure it.


 

Introduction

In this article, we are going to look at User Engagement: how our users react to the applications we provide, and how engaging the user goes beyond the boundaries of our applications to every interaction they have with us. Generally, when we see User Engagement discussed online, it is in the context of websites, but it is equally important for applications and mobile apps.

By the end of this article, we should reach a point where we can intelligently make decisions about how to measure how well we’re engaging with our users.
 

Definitions

Throughout this article, you’ll see references to application. We use this term to mean something the user uses, be it a website, an executable or a mobile app.
 

User Engagement Is not User Experience

Before we proceed any further in looking at what User Engagement (UE) means to us, it’s important to get a sense of how it fits into the whole idea of User Experience.

When we talk about User Experience (UX), we are talking about a superset of design and management features that indicate how useful an application is, what value it brings to us, and how easy it is to derive the value, as well as the aesthetics of the application, the workflow, and finally, the engagement of the application.

If UX is the superset, how do we define User Engagement (UE)? At a simplistic level, engagement refers to the “enjoyment” of the application experience, which should encourage appropriate repeated use. I put enjoyment in quotes there because this reflects the fact that UE is an emotional response. We want to engage our customers in a way that gives them a positive feeling, hence the enjoyment. We want our users to prefer to use us over the competition; so we want to give them a response that means that they associate good feelings with our application.

This leads us to an understanding that the “art” behind UE is measuring the likelihood that users will continue to come back at appropriate times as customers, and also to encourage others to become customers. Ultimately, this leads us to realize that UE ties in with the definition of engage that we see in sources such as the Cambridge dictionary:


 

It’s Bigger than the Application

It will probably not come as a surprise that UE doesn’t just stop at the application level. There are many factors that affect how our customers view our brand and these all affect how much they enjoy the experience.

While we may have a strong application, a user is unlikely to use it if there are pain points that get in the way of them using it. Unfortunately, these pain points tend to become accepted habits and when one application does it, others may follow them without thinking whether or not they bring any benefit.

There was a period of time when it was common for companies to require people to register to download pretty much anything. If we wanted a copy of the technical specifications, we had to sign up. Generally, this was required for marketing purposes; it provided the company with a readymade list of people who might be interested in its products, and it could follow up these potential leads later on.

The reality, of course, was that people made up email addresses because they didn’t want to receive these follow-up marketing requests. They might have been evaluating twenty or thirty products and they didn’t wish to be inundated with them. The belief, from the company’s point of view, was that this process made it easier to convert leads to sales.

What we saw was there was a barrier to engaging the user. By making it painful for the user to get at the information they wanted, a company was providing an experience that lessened the chance of signing someone up. There are ways, of course, to mitigate this negative experience; imagine how the user would feel if the act of signing up resulted in them being sent coupons (for instance) for unrelated products. While this might not translate into someone ultimately using your product, they are much more likely to have a positive impression of your application because you have shown that your processes are in place for the benefit of the customer.

Ultimately, we want to provide an experience and content that draws the user into our application. We want to minimize bounce where someone comes to our application, uses it once and moves on.

We touched on processes here because they can have such an impact on users. When something goes wrong and our users want our help, they are much more likely to have a positive experience if we have knowledgeable support staff available at that point. Queues or numerous phone menu options are likely to frustrate, and this leads to a lack of engagement.
 

Scenario

Imagine that you have a web-based email application and that you’ve spent a couple of hours typing a lengthy email message when your PC crashes. You’re going to be pretty unhappy that you’ve lost your work and that you are going to have to do this work all over again. Your experiences here are fairly negative at this point – the crash, while not the fault of the company that provides the email facility, means that your overall experience is unhappy.

Okay, you’ve rebooted your PC and gone back into your email editor, ready to start typing your email again. But wait, the email provider has thought things through and realized that crashes happen so the editor contains the email you were typing, minus a couple of edits because the crash happened between automatic saves. At this point, you are much more likely to have a favorable impression of the email provider. If none of the other email providers have this facility, you’re much more likely to come back to this one because you now experience a positive reaction.

Technically, this wasn’t too difficult for the email company to implement, but the impact a feature has on a user is quite major; a user is much more likely to assume that we have catered for other problem areas because we have shown here that we take care of the little details. More importantly, not only have we delighted our customer, we have potentially made someone an ambassador for our application as this makes them much more likely to recommend us to others.
 

The Importance of Recommendations

Recommendations play a big part in user engagement. An engaged user is much more likely to praise and recommend our application than a disinterested user. Recommendations can come from personal as well as professional sources, so we must treat every interaction with our users as a source for a positive experience.

How many times have we looked at a piece of software and then gone to see what others say about it? With the rise of video reviews on YouTube as well as numerous review sites, it’s all too easy for potential users to find out about us. We are far more likely to draw customers in if reviewers are engaged with us.

When we create an engaging experience, we are more likely to get users who will tell others about us. But that engagement doesn’t need to stop at the product level. Well-defined UE strategies will often coincide with activities outside our sites. Engaging our customers might also mean providing engagement scenarios in social media outlets such as Facebook. Social media is an effective way to understand what customers expect from our products, to give them feedback about upcoming changes, and to gauge the impact of those changes before committing actual development time.
 

Metrics

It’s appropriate, at this stage, to consider what decisions we need to make when we want to measure how our users are engaged with us. Is there a one size fits all solution to determining engagement, or is this something that we need a strategy to determine on our own?

It will probably not come as any surprise to realize that, just as our applications are our own, the criteria that we use to judge engagement are going to be our own as well. Determining engagement starts with defining what we consider useful metrics to track. For instance, Google Search can be considered to be engaging because it has a high rate of returning and new visitors, even though they don’t spend long there. If our application provides articles though, a high rate of returning and new visitors wouldn’t indicate an engaged user, if they left the application after 10 seconds – so we would be looking to consider the bounce rate here; how long we were retaining users. Again, the returning users might not be a useful metric on its own. We might decide that an engaged user is someone who has our application as part of their daily life, so “recency of use” is a relevant metric for us. The point is, there is no “one size fits all” set of metrics that we need to take into account. We need to choose our metrics based on what it is that our application does.

A simple example here might help. Suppose that we have a Christmas countdown tracker that starts on December 1st. Our metrics wouldn’t want to track the average number of users who use our application over the year. Instead, we would be much more likely to want to know the number of returning users from previous years, as well as the number of new users; we would also probably be interested in how often they returned to our tracker over the short period the application is relevant annually. While this might seem a trivial example, it does highlight that the choices of metric are important and that the purpose of the application is important.

Deciding on the metrics we want to track should take into account that user interactions may change over a period of time. The needs of a user may change over a period of time, so while they may engage in one part of our services for a certain period, over time this may shift so that they move away from using one application because it is too basic for them now, to using another of our applications because that offers a fuller feature set.

The way we track metrics largely depends on the type of application we are developing. If we are developing a website, then tools such as Google Analytics are an excellent resource. For mobile applications, we might want to use Facebook Analytics. Alternatively, we might want to roll out our own analytics services for our desktop applications. Whatever choice we make, we need to ensure that we have decided well beforehand what the appropriate metrics are for us.

For one application I launched, I decided that my user engagement criteria would be the average time a user spent in the application, against the number of active users. Beyond that, I wanted to track the number of times users navigated around the application and compare this to the number who stayed purely on the home screen. Finally, I was interested in seeing how many users would spend time customizing the application, which I tracked through the settings. These stats were fed back anonymized so I could aggregate them. In order to visualize the results, I developed a custom analytics application that allowed me to zoom in or expand out to visualize the data to as fine a degree as I needed. As this was a desktop application, I had the luxury of creating an application to do this, but it is possible to get a similar level of detail from analytics sources such as Google and WordPress.

Custom analytics tracking application usage from 2012 to date.
 

Conclusion

We have looked at what User Engagement is, and how it fits into the whole User Experience approach. We have also reached an understanding of what engagement with our users means, and how we should think about all the interactions we have with our users.

Finally, we have explored what goes into deciding which metrics to track and how other forms of interaction, such as social media sites, can work in our favor.
 

 

How to use Disk I/O analysis in Intel® VTune™ Amplifier for systems

Introduction

The Intel® VTune™ Amplifier XE 2017 has a new feature called disk input and output analysis that can be used to analyze disk-related performance issues based on device utilization, latency of requests and bandwidth to the device. It provides a consistent view of the storage subsystem combined with hardware events such as device queue utilization and I/O transfer rate, along with an easy-to-use method to match user-level source code with the I/O packets executed by hardware.

Overview

To access VTune Amplifier’s disk I/O feature click on “Disk Input and Output” analysis type under “Platform Analysis” in the Analysis Type tab. 

To illustrate the disk I/O analysis, this article uses a simple file copy example that reads from an input file of, say, 1 GB, computes a checksum, and writes to an output file. Here is a snippet of the code:

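    infile = fopen(infilename, "rb");
    if (infile == NULL) {
        fprintf(stderr, "%s can't be opened.\n", infilename);
        return 0;
    }
    outfile = fopen(outfilename, "wb");
    if (outfile == NULL) {
        fprintf(stderr, "%s can't be opened.\n", outfilename);
        fclose(infile);
        return 0;
    }
    srand(time(NULL));
    MD5_Init(&context);
    /* Read a chunk, fold it into the checksum, and write it out. */
    while ((bytes_read = fread(buffer, 1, buffer_size, infile)) != 0) {
        MD5_Update(&context, buffer, bytes_read);
        fwrite(buffer, 1, bytes_read, outfile);
        /* Pick a random size, from 1% to 100% of MAX_BUFFER_SIZE, for the next read. */
        buffer_size = ((((rand() % 100) + 1) / 100.0) * MAX_BUFFER_SIZE);
    }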

You can either use the GUI or the command line as below to perform collection for Disk I/O.

amplxe-cl -collect disk-io target-appl.exe

Once the collection is finished the summary window opens up as shown below:

The summary window indicates whether your application is I/O bound or CPU bound. The disk input and output histogram plots the read, write and flush operations for the file copy application by duration (fast, good, slow) in seconds. The triangles at the bottom of the fast, good and slow indicators can be moved to suit the user’s needs. As can be seen for the write operation, up to 56 operations qualify as slow. Similar information can be gathered for the other I/O operations, such as reads and flushes.

For further analysis of your application, move to the bottom-up tab. The top panel of the bottom-up tab lists the top hotspot functions along with a breakdown of the I/O operations for each function, as shown in the highlighted box. The top hotspot, in the filecopy function, has 2 slow reads which can probably be optimized. Double-clicking on the function takes you to the source code window, which can be used to narrow down the lines of code that can be optimized.

The grouping option can also be changed to storage device/partition and this will give a breakdown by the disk and can be used to identify utilization/latency issues when there are multiple disks.

To get a better understanding, switch to the platform tab.

With the thread checkbox enabled, this tab provides a timeline of CPU time spent, context switches, I/O APIs, and the slow tasks for your application. As can be seen, I/O wait times make up a significant part of the timeline, indicating that the application is mostly I/O bound. If you hover over the timeline, a pop-up appears that shows the operation, its duration, and the reason for the wait at that point. The highlighted box shows the 2 slow reads that were part of the top hotspot in the bottom-up pane. You can filter in by selection for a chosen time range, and all the other view windows will be updated for that range.

The I/O queue depth indicates the number of I/O requests submitted to the storage device. This gives an idea of the disk utilization over the lifespan of the application. Enabling the slow spike shows exactly where the slow packets are executed.

The I/O data transfer shows the number of bytes read from and written to the disk, indicating points of high disk bandwidth utilization.

Conclusion

With all this information, issues such as an imbalance between compute and I/O, high latency, and poor utilization can be identified and the appropriate optimizations applied. A few options for overcoming I/O issues are changing the application logic to run compute threads in parallel with I/O, changing the size of I/O operations, or using faster storage.
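As one illustration of changing the size of I/O operations, the sketch below copies a stream with a single fixed chunk size instead of a size that varies on every read. It is a minimal sketch rather than the article's code: `copy_fixed_chunks` and `chunk_size` are hypothetical names, the checksum step is left out, and the best chunk size for a given disk has to be found by measurement.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy everything from `in` to `out` using one fixed-size buffer.
   Returns the number of bytes copied, or -1 if the buffer cannot be
   allocated or a write comes up short. */
long copy_fixed_chunks(FILE *in, FILE *out, size_t chunk_size)
{
    assert(in != NULL && out != NULL && chunk_size > 0);

    unsigned char *buffer = malloc(chunk_size);
    if (buffer == NULL)
        return -1;

    long total = 0;
    size_t bytes_read;
    /* Every read requests the same number of bytes, so the request
       size seen by the storage stack stays predictable. */
    while ((bytes_read = fread(buffer, 1, chunk_size, in)) != 0) {
        if (fwrite(buffer, 1, bytes_read, out) != bytes_read) {
            free(buffer);
            return -1;
        }
        total += (long)bytes_read;
    }

    free(buffer);
    return total;
}
```

Running the same collection with a few different chunk sizes and comparing the slow-operation counts is one way to check whether the change helps.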

Boosting Kingsoft Cloud* Image Processing with Intel® Xeon® Processors

Background

Kingsoft [1] Cloud* is a public cloud service provider. It provides many services, including cloud storage, and a massive number of images are stored in Kingsoft Cloud storage. Besides data storage, Kingsoft provides image processing services to its public cloud customers. Customers can use these services for functions such as image scaling, cutting, quality changing, and watermarking according to their service requirements, which helps them provide the best experience to end users.

In the next section we will see how Kingsoft optimizes its image processing tasks to run on systems equipped with Intel® Xeon® processors.

Kingsoft Image Processing and Intel® Xeon® Processors

Intel® Advanced Vector Extensions 2 (Intel® AVX2) [8] accelerates compression and decompression while processing a JPEG file. Those tasks are usually done using libjpeg-turbo [2], a widely used JPEG software codec. Unfortunately, the libjpeg-turbo library is implemented using Intel® Streaming SIMD Extensions 2 (Intel® SSE2) [9], not Intel AVX2.

To take advantage of Intel AVX2, Kingsoft engineers modified libjpeg-turbo to add Intel AVX2 support. The modified library is listed as reference [3]. It accelerates color space conversion, down/up sampling, integer sample conversion, the fast integer forward discrete cosine transform (DCT) [4], the slow integer forward DCT, integer quantization, and the integer inverse DCT.
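To give a feel for what such a change looks like, here is a minimal sketch of one of the listed steps, integer sample conversion: widening unsigned 8-bit samples to 16 bits and subtracting 128 before the forward DCT. It is not taken from Kingsoft's library. `level_shift_samples` is a hypothetical name, and the AVX2 path is compiled only when the compiler defines `__AVX2__` (for example with `-mavx2` on GCC or Clang); otherwise the scalar loop does all the work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Convert unsigned 8-bit samples to signed 16-bit values centered on
   zero (sample - 128). */
void level_shift_samples(const uint8_t *in, int16_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;

#if defined(__AVX2__)
    /* AVX2 path: a 256-bit register holds 16 samples of 16 bits each,
       twice what a 128-bit SSE2 register holds. */
    const __m256i center = _mm256_set1_epi16(128);
    for (; i + 16 <= n; i += 16) {
        __m128i bytes = _mm_loadu_si128((const __m128i *)(in + i));
        __m256i wide = _mm256_cvtepu8_epi16(bytes);  /* zero-extend to 16 bits */
        _mm256_storeu_si256((__m256i *)(out + i),
                            _mm256_sub_epi16(wide, center));
    }
#endif

    /* Scalar tail, and the whole loop when AVX2 is not enabled. */
    for (; i < n; i++)
        out[i] = (int16_t)(in[i] - 128);
}
```

Both paths produce the same output, so the scalar loop doubles as a reference when checking the vectorized one.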

Besides taking advantage of Intel AVX2 to reduce processing time, Kingsoft image processing tasks also gain performance when running on systems equipped with Intel® Xeon® processors E5 v4 rather than Intel® Xeon® processors E5 v3, because the newer processors have more cores and a larger cache. Image processing tasks like image cutting, scaling, and quality changing are cache-sensitive workloads; therefore, a larger CPU cache makes them run faster. Also, more cores means more images can be processed in parallel. Together, individual tasks finish faster and more of them run at once, which increases overall performance.

Kingsoft also makes use of the Intel® Math Kernel Library (Intel® MKL) [7], whose functions are optimized using Intel AVX2.

The next section shows how we tested the Kingsoft image processing workload to compare performance between the current generation, Intel Xeon processors E5 v4, and the previous generation, Intel Xeon processors E5 v3.

Performance Test Procedure

We performed tests on two platforms. One system was equipped with the Intel® Xeon® processor E5-2699 v3 and the other with the Intel® Xeon® processor E5-2699 v4. We wanted to see how much performance improved when comparing the previous and the current generation of Intel Xeon processors and how Intel AVX2 plays a role in reducing the image processing time.

Test Configuration

System equipped with the dual-socket Intel Xeon processor E5-2699 v4

  • System: Preproduction
  • Processors: Intel Xeon processor E5-2699 v4 @2.2 GHz
  • Cache: 55 MB
  • Cores: 22
  • Memory: 128 GB DDR4-2133 MT/s

System equipped with the dual-socket Intel Xeon processor E5-2699 v3

  • System: Preproduction
  • Processors: Intel Xeon processor E5-2699 v3 @2.3 GHz
  • Cache: 45 MB
  • Cores: 18
  • Memory: 128 GB DDR4-2133 MT/s

Operating System: Red Hat Enterprise Linux* 7.2, kernel 3.10.0-327

Software:

  • GNU* C Compiler Collection 4.8.2
  • GraphicsMagick* 1.3.22
  • libjpeg-turbo 1.4.2
  • Intel MKL 11.3

Application: Kingsoft image cloud workload

Test Results

The following test results show the performance improvement when running the application on systems equipped with the current and previous generations of Intel Xeon processors, and when running the application with the libjpeg-turbo library without and with Intel AVX2 support.

Comparing Xeon E5 Processors

Figure 1: Comparison between the application using the Intel® Xeon® processor E5-2699 v3 and the Intel® Xeon® processor E5-2699 v4.

Figure 1 compares the results of the application running on the Intel Xeon processor E5-2699 v3 and on the Intel Xeon processor E5-2699 v4. The performance improvement comes from the Intel Xeon processors E5 v4 having more cores, a larger cache, and Intel AVX2.

Comparing AVX2 Performance

Figure 2: Performance comparison between libjpeg-turbo without and with Intel® Advanced Vector Extensions 2 (Intel® AVX2) support.

Figure 2 shows that application performance improves up to 45 percent when using the libjpeg-turbo library with Intel AVX2 implemented over that with Intel SSE2 implemented. The improvement is achieved because Intel AVX2 instructions perform better than Intel SSE2 instructions. The application is running on a system equipped with the Intel Xeon processor E5 v4.

Conclusion

Kingsoft added Intel AVX2 support to the libjpeg-turbo library. This allows applications using the modified library to take advantage of the new features in Intel Xeon processors E5 v4. More cores and a larger cache also play an important role in improving the performance of applications running on systems equipped with these processors over systems with previous generations of Intel Xeon processors.

References

  1. Kingsoft company information

  2. jpeg turbo library

  3. jpeg turbo library with Intel® AVX2 supported

  4. Discrete cosine transform

  5. Lossy compression

  6. DGEMM

  7. Intel® Math Kernel Library

  8. Intel® AVX2

  9. Intel® Streaming SIMD Extensions 2

 
