
Why Efficient Use of the Memory Subsystem Is Critical to Performance


The cores and vector processors on modern multi-core processors are voracious consumers of data. If you do not organize your application to supply data at high-enough bandwidth to the cores, any work you do to vectorize or parallelize your code will be wasted because the cores will stall while waiting for data. The system will look busy, but not fast.

The following chart, which reflects the numbers for a hypothetical 4-processor system with 18 cores per processor, shows the:

  • Bandwidth that each major component of a core/processor/memory subsystem can produce or consume
  • Latencies from when the core requests data from a device until the data arrives

Bandwidth vs. Latency

Note: This chart data (from various public sources) is approximate.

As you can see:

  • The capabilities of the L1 cache, L2 cache, and vector processing units dwarf the other components, whether the data is streaming in or randomly accessed. To provide the core operations with data at the rate they can consume it, most of the data must come from the L1 or L2 caches.
  • A small fraction can come from the in-processor L3 cache, but even 1% of the accesses traveling beyond that cache make the memory subsystem the dominant source of delays in applications that are load/store intensive.
  • While the MCDRAM on the Intel® Xeon Phi™ processor code named Knights Landing is a huge performance boost for memory-bound applications, it is clearly insufficient to keep the cores busy.

Performance Improvement Overview

You can often achieve huge performance gains by choosing better algorithms or using better coding techniques. There are many resources on this subject, so this topic is not covered here except for some best design practices at the end of this article. This article assumes you have the best algorithm and concentrates solely on reducing the time spent accessing the data.

Almost all data used across a long set of instructions – the working set or active data for those instructions – must be loaded into the L1 or L2 cache once and should be used from there repeatedly before it is finally evicted to make room for other needed data.

It is possible to get more active data into a cache by decreasing the number of bytes needed to store it. For example, use:

  • 32-bit pointers instead of 64-bit pointers
  • int array indices instead of pointers
  • float instead of double
  • char instead of bool

Avoid storing bytes that do not contain active data. Data is stored in the caches in aligned, contiguous, 64-byte quantities called cache lines. It is often possible to ensure all 64 bytes contain active data by rearranging data structures.

It is unlikely the above techniques alone will fit your data into cache, but they may reduce your memory traffic beyond L2 significantly.
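
As a rough illustration of the techniques above, the following C++ sketch (with hypothetical names) shows how replacing 64-bit pointers with 32-bit array indices and double with float halves the size of a record, so twice as many records fit in each 64-byte cache line.

#include <cstdint>

// Hypothetical record: an 8-byte pointer plus an 8-byte double is 16 bytes,
// so only 4 records fit in a 64-byte cache line.
struct NodeWide {
    NodeWide* next;     // 8-byte pointer
    double    weight;   // 8-byte value
};

// Slimmed record: a 4-byte index into the owning array plus a 4-byte float
// is 8 bytes, so 8 records fit in the same cache line -- half the traffic.
struct NodeNarrow {
    std::uint32_t next;   // array index instead of a pointer
    float         weight; // float instead of double, if the precision suffices
};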

Minimizing data movement is your best opportunity for improvement.

  • Consume the data on the core where it is produced, soon after it is produced.
  • Put as many consumers of the data as possible on the same core.
  • Consume the data before it is evicted from the cache or register.
  • Use all the bytes of the 64-byte cache lines that are moved.

If you cannot sufficiently reduce data movement, try one or both of the following techniques. Although they do not transform a memory-bound situation into a compute-bound situation, they may let you overlap more memory and compute operations than the hardware does automatically, so an application that is alternately compute bound and memory bound changes to using both resources concurrently. You can use the following techniques to optimize a loop that is only a few lines of code, or an algorithm coded in many tens of thousands of lines of code.

  • Start moving the data earlier. There are Intel® architecture prefetch instructions to support this.
  • Do something useful while the data is moving. Most Intel architecture cores can perform out-of-order execution.

Performance Improvement Preparation

Before investing your time to ensure most data accesses hit in the registers, L1 cache, or L2 cache, measure performance to verify your application is indeed stalling while waiting for data. Intel® VTune™ Amplifier XE (version 2016 and later) has enhanced support for making these measurements.

After you identify the portions of the execution that are waiting for memory accesses, subdivide the relevant parts of application execution into steps where one step’s output data is another step’s input data. You can view each step as a whole, or subdivide each step the same way. A large step is often referred to as a phase.

For our purposes, a convenient representation is data flowing through a graph of compute nodes performing the steps and data nodes holding the data.

Data flowing through a graph of compute nodes

Often there is a startup phase that loads databases or other initial data from a file system into memory. (With non-volatile memory, the data may be available without this phase.)

For each step you need to know its data, the amount of data, and the number and pattern of accesses:

  • The inputs, temporary data, shared data, and outputs may be anything from under 1 MB to over 1000 GB.
  • Each data item can be read or written between zero and a huge number of times per step.
  • Data item accesses can be evenly spread across the step, or concentrated at intervals during the step.
  • Data items can be accessed in many patterns, such as sequentially, randomly, and in sequential bursts.

Consider the behavior of each step. Ask yourself, or measure, if the step is memory bound as it accesses its inputs, outputs, and temporaries. If it is, get answers to the following questions:

  • Where are the step inputs produced, and how large are they?
  • Where are the step outputs consumed, and how large are they?
  • What access patterns are used to read the input and produce the output?
  • Which portion of the outputs is affected by each portion of the inputs?
  • How large is the temporary data and what access patterns does it have?

Once you have the answers, you are in a position to modify the code to reduce the memory traffic the accesses are causing. For instance, you can replace “produce a huge array; consume it item by item” with “loop { produce a small portion of the huge array; consume it item by item }” so the portions stay in one of the closer, faster, private caches.
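
A minimal C++ sketch of that transformation follows; producePortion and consumeItem are hypothetical stand-ins for the application's real produce and consume steps, and the chunk size is an assumption you would tune to your cache sizes.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the application's real produce and consume steps.
static void producePortion(std::vector<double>& buf, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        buf[i - begin] = static_cast<double>(i);
}
static double consumeItem(double item) { return item * 0.5; }

double processInChunks(std::size_t total) {
    const std::size_t chunk = 4096;              // small enough to stay in L1/L2
    std::vector<double> buffer(chunk);
    double result = 0.0;
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        const std::size_t end = begin + std::min(chunk, total - begin);
        producePortion(buffer, begin, end);      // produce a small portion ...
        for (std::size_t i = 0; i < end - begin; ++i)
            result += consumeItem(buffer[i]);    // ... and consume it while it is still cached
    }
    return result;
}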

Performance Improvement Tactics

The operating system and the hardware choose the specific hardware used to execute instructions and store data. The choices meet requirements specified by the compilers, which in turn reflect the specifications implied by the programming language or explicit in the source code.

For instance, application source code specifies memory or file, but the operating system and hardware choose which levels of cache are used, when the data moves between caches, and which wires and transistors move the data.

The performance improvement tactics involve modifying the code or execution environment to guide these choices. If the choices are important enough, write code that enables and constrains these choices.

  • Where to place the data – Specify an alignment in the allocating call instead of letting malloc or new choose a default for you (a sketch follows this list).
  • When to move it – Execute instructions specifying prefetching, cache flushing, and reads and writes that do not go through the cache.
  • Where to execute the step – Assign a computation to a thread bound to a core.
  • When to do it – Change the priority of a thread, tile loops, and rearrange the code.
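
For the placement item, here is a minimal C++17 sketch of specifying cache-line alignment explicitly rather than accepting the allocator's default; the Tile struct and sizes are hypothetical.

#include <cstdlib>
#include <cstddef>

struct alignas(64) Tile {      // the whole object starts on a cache-line boundary
    float data[16];            // 16 floats = 64 bytes, exactly one cache line
};

int main() {
    Tile* tiles = new Tile[1024];             // C++17 honors the over-alignment

    // C-style alternative: the requested size must be a multiple of the alignment.
    std::size_t bytes = 1024 * sizeof(float); // 4096 bytes, a multiple of 64
    void* p = std::aligned_alloc(64, bytes);  // cache-line-aligned block

    std::free(p);
    delete[] tiles;
    return 0;
}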

Performance Improvement Tactics to Reduce Traffic

The solution for long stall times caused by memory reads or (rarely) writes is to reduce the number of transfers between the more distant and closer storage devices. This results in fewer cache misses and thus fewer or shorter stalls. You can reduce the number of transfers with the following tactics.

  • Move more data per transfer by increasing the density of the information in the cache lines.
  • Reuse the data moved to the closer caches by changing the access patterns (tiling).
  • Store data near where it is produced, especially if it repeatedly changes before going to the next phase.
  • Store data near where it is consumed, especially if it is produced just once.
  • Larger memories and persistence provide the opportunity for more data to stay in memory rather than in a file system. Because accessing a file system is always slower than accessing memory, carefully consider if you can keep the data in memory, perhaps in a compressed form.

If these tactics are insufficient, try increasing the bandwidth to the storage where the data is held. This requires changing the assignment of the data to storage devices, including changing the hardware if the needed hardware is not available.

Performance Improvement Tactics to Increase Available Bandwidth or Decrease Latency

  • Use more of the hardware by splitting a computation into several computations on several cores.
  • Request data before it is needed so it arrives in time.
  • Find other things to do while waiting for data to arrive.
  • If data is needed in several places, duplicate it.

The remainder of this article discusses each tactic in more detail.

Bring More Useful Data per Transfer

Objective: Reduce traffic.

Because transfers across most of the fabric move 64-byte cache lines, it is desirable to fit as much data as possible into these bytes.

To eliminate unnecessary data from a 64-byte cache line:

  • Place used data into the fewest possible cache lines by both aligning the containing struct and ordering the data members appropriately.
  • Place used and unused data into different variables or different cache lines within the variable.
  • Use smaller numeric types, such as float instead of double.
  • Use packed data members or bit masks.

Such techniques almost always result in faster execution, because the same data can be transferred in fewer transactions and hence in less time.
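
One common way to apply this, sketched below with hypothetical field names, is to split "hot" members (touched every iteration) from "cold" members (touched rarely) so the hot ones pack densely into cache lines.

#include <string>
#include <vector>

// Before: hot and cold data interleaved -- each 64-byte line holds mostly cold bytes.
struct OrderMixed {
    double      price;        // hot: read every iteration
    double      quantity;     // hot
    std::string description;  // cold: read only when printing reports
    std::string customer;     // cold
};

// After: hot fields are packed densely; cold fields live in a parallel array.
struct OrderHot  { double price; double quantity; };
struct OrderCold { std::string description; std::string customer; };

std::vector<OrderHot>  hotOrders;   // streamed by the inner loop
std::vector<OrderCold> coldOrders;  // touched only by the reporting path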

Tile

Objective: Reduce traffic.

If possible, reorder accesses to reuse data before dropping or flushing it. This is a critical optimization, and you should be aware of the research into cache-aware and cache-oblivious algorithms, especially techniques such as loop tiling. Read more in Tiling.
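
As a minimal, generic illustration (not taken from the Tiling article), the sketch below tiles a matrix transpose so each TILE x TILE block is reused while it is still cached; the tile size is an assumption you would tune.

#include <cstddef>

// A naive transpose walks b column-wise and misses in cache for large n.
// The tiled version works on TILE x TILE blocks that fit in the L1/L2 cache.
constexpr std::size_t TILE = 64;

void transposeTiled(const float* a, float* b, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                    b[j * n + i] = a[i * n + j];   // each block is reused while cached
}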

Place Data on Closer or Faster Devices

Objective: Reduce traffic.

Tiling, which uses caches to reduce data movement, is an instance of a more general idea: Place data on closer or faster devices.

Typically the devices closer to a core have higher bandwidth and lower latency. Moving all the data is often not an option, because the closer storage usually has a higher cost and smaller size. One exception is moving data from another node to the node containing the core, but in this case the data moves away from another core, and that has a penalty also.

Storing data closer may not require adding devices to the system if some of the existing hardware is not fully used or if other data can move farther from the core. Deciding which data to move can be difficult, especially for applications with long run times that are hard to extrapolate from sample workloads.

Control Data Placement

To control where data is placed relative to the thread that writes or reads it, you need to control both where the data is stored and on which core the thread runs.

Store Data Near Where It Is Produced

If your algorithm repeatedly updates data during a step, then it is clearly ideal if the data stays in the L1 or L2 cache of the producing system until its final value is determined.

Use tiling to achieve this effect.

Store Data Near Where It Is Consumed

If the data is only written once, it starts in the L1 cache and drifts outwards as its cache slot is needed for something else. Ideally, the consumer should get data from the closest common cache before it is flushed beyond this level.

Use tiling to achieve this effect.

Handle Shared Variables

If two or more variables are in the same cache line and are accessed by different cores, the whole cache line moves between the cores, slowing down both cores.

To fix this, try one of the following (a sketch follows the list):

  • Move the variables to separate cache lines.
  • Change to not sharing the variable.
  • Accumulate changes in a local variable, and only rarely modify the shared variable.
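
A minimal C++ sketch of the first and third fixes follows; the counter array and thread count are hypothetical.

#include <atomic>
#include <cstddef>

// Counters updated by different threads. Without padding they can share a
// cache line, so every increment bounces the line between cores.
struct PaddedCounter {
    alignas(64) std::atomic<long> value{0};   // each counter gets its own cache line
};

PaddedCounter perThreadCount[8];              // one slot per worker thread

void work(std::size_t tid, long iterations) {
    long local = 0;                           // accumulate privately ...
    for (long i = 0; i < iterations; ++i)
        ++local;
    perThreadCount[tid].value += local;       // ... and touch the shared slot once
}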

Keep Data in Memory

Non-volatile DIMMs (NVDIMMs) are becoming available, as are fast SSDs. If your application spends a lot of time doing I/O, consider keeping the data in NVDIMMs or faster I/O devices.

Use More of the Hardware

Objective: Increase available bandwidth or decrease latency.

The extra hardware may be an additional core, an additional processor, or an additional motherboard. You may need to split a step into several steps that can be spread across the additional hardware. Be careful: You may need to move the inputs to more cores than before, and the extra movement must be offset by the amount of computation on the extra cores. You may need to execute many tens, and perhaps hundreds, of instructions or accesses on a new core for every 64 bytes moved to recover the cost of moving them.

Be sure you are getting more of the critical hardware – any hardware shared by the existing core and the additional core is a potential bottleneck. For instance, the moved step may need to use its L1 cache effectively, because two cores thrashing a shared L2 or L3 cache may run slower than a single core doing all the work itself.

If enough work is moved and the L1 and L2 caches are effectively used, this change can decrease the elapsed time by a factor almost equal to the number of cores used.

If you can accomplish this on a large scale, you may be able to spread the step over a whole MPI cluster. If you can keep portions of the data on specific nodes and not move them around, this change can decrease the elapsed time by the number of nodes used.

Duplicate Data If It Is Needed in Several Places

Objective: Increase available bandwidth or decrease latency.

The caches are good at duplicating data being read; however, if cache size is too small, it may be more effective to have copies of the data in the memories near the cores that require it, rather than fetching the data from memory attached to a different processor.

The caches are less effective when data is written by several cores. In this case, accumulate updates in local storage before combining them into shared data. OpenMP* technology automatically does this with reductions, but other frameworks may require explicit coding.
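
As a minimal sketch of that pattern, the OpenMP reduction below gives each thread a private copy of the accumulator and combines the copies once at the end (compile with your compiler's OpenMP flag, for example -qopenmp or -fopenmp); the function name is hypothetical.

#include <cstddef>
#include <vector>

// Each thread accumulates into a private copy of 'sum'; OpenMP combines the
// copies once, so the shared variable is written only once per thread.
double sumAll(const std::vector<double>& data) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(data.size()); ++i)
        sum += data[i];
    return sum;
}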

It may be best to write the duplicate when first computed (if there are few rewrites) or after rewriting is finished (if there are many rewrites).

Request Data Before It Is Needed

Objective: Increase available bandwidth or decrease latency.

Some processors are good at automatically prefetching data accessed by constant strides in a loop. For other situations, you may need to use the instructions and techniques described in such materials as Compiler Prefetching for the Intel® Xeon Phi™ coprocessor.

Find Other Things to Do While Waiting for Data to Arrive

Objective: Increase available bandwidth or decrease latency.

Processors capable of out-of-order execution, such as Intel® Xeon® processors and later generations of Intel® Xeon Phi™ processors, may be able to execute other nearby instructions while waiting. For other cases, you may be able to restructure your code to move such operations nearby.

In the worst case, if the L1 and L2 caches have room for several threads to run without increasing the cache misses, you may need to put more threads on the core to exploit its hyperthreading capability.

Efficiently Access Far Memory That Is Near Another Processor

Objective: Increase Available Bandwidth or Decrease Latency.

If there is a processor nearer the memory containing the data, it may be more effective to have cores on that processor do the operations.

One of the most difficult programs to optimize can be modelled as randomly accessing a huge array.

size_t hugeNUMAArray[trillionsOfEntries];

void updateManyEntries() {
    // trillionsOfEntries, loopCount, and randomBetween() are placeholders kept
    // from the original sketch; the array is assumed to span several NUMA nodes.
    for (size_t i = 0; i < loopCount; i++) {
        // Each iteration touches a random, probably remote, cache line.
        hugeNUMAArray[randomBetween(0, trillionsOfEntries)]++;
    }
}

If the hugeNUMAArray is far enough away, the time taken to fetch the entry and send it back may be more than the time taken to identify the nearby processor and send a thread on that processor a message to update the entry.

This is especially true if several operations can be sent in a single message. Use all the techniques above to optimize this approach.

Best Design Practices

Algorithms That Sequentially Access Data

The ideal streaming algorithm:

  • Reads its inputs exactly once, from a nearby cache into the closest private cache.
  • Has all its temporary data in a private cache.
  • Writes all its outputs exactly once into a cache near the consumer, where it stays until the consumer needs it.

If the data is too large to be kept in the caches, it is important to use all the bytes transferred between the caches and dual inline memory modules (DIMMs). But be aware that data is transferred in 64-byte cache lines; do not intermingle useful and useless data within a cache line.

Algorithms That Randomly Access Data

This section applies if you have determined you cannot efficiently transform an algorithm with non-sequential reads to an algorithm with sequential reads.

Note: There are rare algorithms that serially read their data but randomly write it. The writes are rarely a problem because the processor does not wait for a write to complete.

The performance of non-sequential reads may vary significantly across processors. Processors that support out-of-order execution may be able to hide some of the read delays behind other operations. Similarly, hyperthreading may allow another thread to use the core when one is stalled. However, sharing the L1 and L2 caches may limit its effectiveness.

For processor cores that do not perform out-of-order execution (such as the first generations of Intel Xeon Phi processors), and for cases where out-of-order execution cannot hide the read delays, performance can be improved by prefetching data using:

  • _mm_prefetch or similar instructions
  • A second thread to load the data into the L3 cache

These techniques are well documented in such materials as Compiler Prefetching for the Intel® Xeon Phi™ coprocessor (PDF).
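
Here is a minimal, hedged sketch of the first technique using the _mm_prefetch intrinsic on an indirectly indexed loop; the prefetch distance of 16 is a hypothetical tuning parameter, not a recommendation from the referenced material.

#include <xmmintrin.h>   // _mm_prefetch
#include <cstddef>

// Indirectly indexed reads: the hardware prefetcher cannot predict idx[],
// so we issue a software prefetch a fixed distance ahead of the use.
double gatherSum(const double* table, const int* idx, std::size_t n) {
    const std::size_t distance = 16;          // tune for your latency and bandwidth
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n)
            _mm_prefetch(reinterpret_cast<const char*>(&table[idx[i + distance]]),
                         _MM_HINT_T0);        // request the cache line early
        sum += table[idx[i]];
    }
    return sum;
}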

Summary

The previous article, MCDRAM and HBM, gives more details about Intel’s on-package memory, one of Intel’s new hardware technologies. This article discusses the algorithms and analysis that are critical to getting the best possible performance out of an Intel Xeon Phi processor and its MCDRAM. If you have not already read the article on Tiling, we recommend you do so now; otherwise, the next article is How Memory Is Accessed, which gives a detailed description of how data moves through the system – knowledge that will help you become a real expert in optimizing memory traffic.

You may also want to read Performance Improvement Opportunities with NUMA, a series that covers the basics of efficiently using Intel’s new memory technologies.

About the Author

Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. He has always been fascinated by memory, and enjoyed a recent plane ride because the passenger beside him was a neuroscientist who gracefully answered Bevin’s questions about the current understanding of how human memory works. Hint: It is much less reliable than computer memory.

What's New? - Intel® VTune™ Amplifier XE 2016 Update 4


Intel® VTune™ Amplifier XE 2016 performance profiler

A performance profiler for serial and parallel performance analysis. Overview, training, support.

New for the 2016 Update 4! (Optional update unless you need...)

As compared to 2016 Update 3 release

  • Support for the Intel® Xeon Phi™ processor codenamed Knights Landing (KNL), including General Exploration, Memory Access, and HPC Performance Characterization analyses, and a PMU event reference.
  • PMU event reference for Intel® Xeon® Processor E5 v4 Family (formerly codenamed "Broadwell-EP") 

Note: We are now labeling analysis tool updates as "Recommended for all users" or "Optional update unless you need…".  Recommended updates will be available about once a quarter for users who do not want to update frequently.  Optional updates may be released more frequently, providing access to new processor support, new features, and critical fixes.

Note: You may receive a warning message about an "Unsigned driver" during installation on Windows* 7 and Windows* Server 2008 R2 systems.  The VTune™ Amplifier hardware event-based sampling drivers (sepdrv.sys and vtss.sys) are now signed with a SHA-2 digital certificate for compliance with Windows* 10 requirements.  To install the drivers on Windows* 7 and Windows* Server 2008 R2 operating systems, you must add support for the SHA-2 hashing algorithm to the systems by applying Microsoft* Security update 3033929: https://technet.microsoft.com/en-us/library/security/3033929

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2016_update4.tar.gz

Installer for Intel® VTune™ Amplifier XE 2016 for Linux* Update 4

File: VTune_Amplifier_XE_2016_update4_setup.exe

Installer for Intel® VTune™ Amplifier XE 2016 for Windows* Update 4

File: vtune_amplifier_xe_2016_update4.dmg

Installer for Intel® VTune™ Amplifier XE 2016 - OS X* host only Update 4

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Efficient SIMD in Animation with SIMD Data Layout Templates (SDLT) and Data Preconditioning


Introduction

In order to get the most out of SIMD [1], the key is to put in effort beyond just getting it to vectorize [2]. It is tempting to add a #pragma omp simd [3] to your loop, see that the compiler successfully vectorized it, and then be satisfied, especially if you get a speedup. However, it is possible that there is no speedup at all, or even a slowdown. In either case, to maximize the benefits and speedups of SIMD execution, you often have to rework your algorithm and data layout so that the generated SIMD code is as efficient as possible. Furthermore, often there is the added bonus that even the scalar (non-vectorized) version of your code will perform better.

In this paper, we walk through a 3D Animation algorithm example to demonstrate step-by-step what was done beyond just adding a “#pragma”. In the process, there will be some techniques and methodologies that may benefit your next vectorization endeavors. We also integrate the algorithm with SIMD Data Layout Templates (SDLT), which is a feature of Intel® C++ Compiler, to improve data layout and SIMD efficiency. All the source code in this paper is available for download, and includes other details not mentioned here.

Background and Problem Statement

Sometimes just getting your loop to vectorize is not enough to improve your algorithm's performance. The Intel® C++ Compiler may tell you that it “could vectorize but may be inefficient.” Just because a loop can be vectorized does not mean the generated code is more efficient than if the loop were not vectorized. If vectorization does not provide a speedup, it is up to you to investigate why. Getting efficient SIMD code often requires refactoring the data layout and the algorithm. In many instances, the optimizations that benefit SIMD account for the majority of the speedups whether or not the code is vectorized; however, by improving the efficiency of your algorithm, the SIMD speedups will be much greater.

In this paper, we introduce a loop from the example source code plus four other versions of it that represent the changes that were made to improve SIMD efficiency. Use Figure 1 as a reference for this paper as well as for the downloaded source code. Sections for Versions #0 through #3 provide the core of this paper. And for extra credit, Version #4 discusses an advanced SDLT feature to overcome SIMD conversion overhead.

Algorithm Version
#0: Original
#1: SIMD
#2: SIMD, data sorting
#3: SIMD, data sorting, SDLT container
#4: SIMD, data sorting, SDLT container, sdlt::uniform_soa_over1d

Figure 1: Legend of version number and corresponding description of the set of code changes in available source code. Version numbers also imply the order of changes.

Algorithms requiring data to be gathered and scattered can inhibit performance, for both scalar and SIMD. And if you have chains of gathers (or scatters), this further exacerbates underperformance. If your loop contains indirect accesses (or non-unit strided memory accesses [4]) as in Figure 2, the compiler will likely generate gather instructions (whether it be an explicit gather instruction or several instructions emulating a gather). And with indirect accesses to large structs, the number of gathers grows proportionally with the number of data primitives. For example, if struct “A” contains 4 doubles, an indirect access to this struct generates 4 gathers. It may be the case that indirect accesses in your algorithm are unavoidable. However, you should investigate and search for solutions to avoid indirect accesses if possible. Avoiding inefficiencies such as gathers (or scatters) can greatly improve SIMD performance.
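
A hypothetical sketch of that access pattern: idx[i] selects non-adjacent records of a 4-double struct, so a vectorized version of this loop needs roughly one gather per member per iteration.

#include <cstddef>

struct A { double x, y, z, w; };   // 4 doubles per record

// Indirect (gathered) access: idx[i] selects non-adjacent records, so a
// vectorized loop needs about 4 gathers per iteration, one per member.
void accumulate(double* out, const A* table, const int* idx, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const A& a = table[idx[i]];
        out[i] = a.x + a.y + a.z + a.w;
    }
}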

Furthermore, data alignment can improve SIMD performance. If your loop is operating on data that is not aligned to SIMD data lanes, your performance may be reduced.

Indirect memory gather or scatter

Figure 2: Indirect memory addressing can be either a gather or a scatter, and is where a loop index is used to look up another index. Gathers are indexed loads. Scatters are indexed stores.

We present a simple 3D mesh deformation algorithm example to illustrate some techniques that can be employed to improve efficiency of generated code that benefits both scalar and SIMD. In Figure 3, each Vertex of the 3D mesh has an Attachment that contains data that influences the deformation of that Vertex. Each Attachment indirectly references 4 Joints. The Attachments and Joints are stored in 1D arrays.

Example algorithm of 3D mesh deformation

Figure 3: Example algorithm of 3D mesh deformation.

Version #0: The Algorithm

In Figure 4, the algorithm loop iterates over an array of “Attachments.” Each Attachment contains 4 Joint index values that indirectly access an array of “Joints,” and each Joint contains a transformation matrix (3x4) of 12 doubles. So each loop iteration requires gathers for 48 doubles (12 doubles for each of the 4 Joints). This large number of gathers can contribute to slower SIMD performance, so if we can somehow reduce or avoid these gathers, the SIMD performance can be greatly improved.

typedef std::vector<Attachment> AttachmentsArray;
AttachmentsArray mAttachments;

void
computeAffectedPositions(std::vector<Vec3d>& aAffectedPositions)
{
    const int count = mAttachments.size();
#pragma novector
    for (unsigned int i=0u; i < count; ++i) {
        Attachment a = mAttachments[i];
 
        // Compute affected position
        // NOTE: Requires many gathers (indirect accesses)
        Vec3d deformed0 = a.offset * mJoints[a.index0].xform * a.weight0;
        Vec3d deformed1 = a.offset * mJoints[a.index1].xform * a.weight1;
        Vec3d deformed2 = a.offset * mJoints[a.index2].xform * a.weight2;
        Vec3d deformed3 = a.offset * mJoints[a.index3].xform * a.weight3;
 
        Vec3d deformedPos = deformed0 + deformed1 + deformed2 + deformed3;
 
        // Scatter result
        aAffectedPositions[i] = deformedPos;
    }
}

Figure 4: Version #0: Example algorithm with 48 gathers per loop iteration.

Version #1: SIMD

For Version #1, we vectorize the loop. In our example, the loop successfully vectorizes by simply adding “#pragma omp simd” (see Figure 5) because it already meets the criteria to be vectorizable (for example, no function calls, single entry and single exit, and straight-line code [5]). In addition, it follows SDLT’s vectorization strategy, which is to restrict objects to help the compiler succeed in the privatization [6]. However, it should be noted that in many cases, simply adding the pragma will result in compilation errors or incorrectly generated code [7]. There is often code refactoring required to get to the state where the loop is vectorizable.

#pragma omp simd

Figure 5: Version #1: Change line 8 of Version #0 (see Figure 4) to vectorize the loop.

Figure 6 shows the Intel® C++ Composer (ICC) XE Opt-report [8] for the loop in Version #1. For an Intel® Advanced Vector Extensions (Intel® AVX) [9] build, you can see that the Opt-report states that even though the loop did vectorize, it estimates only a 5-percent speedup. However, in our case, the actual performance of Version #1 was 15 percent slower than that of Version #0. Regardless of the estimated speedup reported by the Opt-report, you should test for actual performance.

Furthermore, Figure 6 shows 48 “indirect access” masked indexed loads for each of the doubles in the transformation matrix of all 4 Joints. Correspondingly, it generates 48 “indirect access” remarks such as the one in Figure 7. Opt-report remarks should not be ignored; you should investigate the cause and attempt to resolve them.

Intel® C++ Compiler Opt-report for loop

Figure 6: Version #1: Intel® C++ Compiler Opt-report for loop.

Intel® C++ Compiler Opt-report, indirect access remark

Figure 7: Version #1: Intel® C++ Compiler Opt-report, indirect access remark.

Even though the loop was vectorized, any potential performance gain from SIMD is hindered by the large number of gathers from indirect accesses.

Solution

After successful vectorization, you may or may not get speedups. In either case, just getting your loop to vectorize should be just the beginning of the optimization process, not the end. Instead, utilize tools (for example, Opt-report, assembly code, Intel® VTune™ Amplifier XE, Intel® Advisor XE) to help investigate inefficiencies and implement solutions to improve your SIMD code.

Version #2 (Part 1): Data Preconditioning by Sorting for Uniform Data

In our example, the Opt-report reported 48 gathers and corresponding “indirect access” remarks. The indirect access remarks were particularly alarming, since the report was littered with them. Investigating further, we discovered that they corresponded to the 4x3 matrix values for each of the 4 Joints that were being indirectly accessed from inside the vectorized loop, totaling 48 gathers. We know that gathers (and scatters) can affect performance. But what can we do about them? Are they necessary, or is there a way to avoid them?

For our example, we asked ourselves, “Is there any uniform data being accessed from within the loop body that can be hoisted to outside the loop?” The initial answer was “no,” so then we asked, “Is there a way to refactor the algorithm so that we do have uniform data that is loop-invariant?”

Sorting algorithm data

Figure 8: Sorting algorithm data. On the left, the loop iterates over all Attachments where Joint indexes vary. On the right, the Attachments are sorted so that each (inner) sub-loop has the same set of uniform Joint data.

As shown in Figure 8, many individual Attachments share the same set of Joint index values. By sorting the Attachments so that all the ones that share the same indexes are grouped together, this creates an opportunity to loop over a sub-set of Attachments where the Joints are uniform (loop-invariant) over the sub-loop’s iteration space. This would allow hoisting of the Joint data to outside the vectorized inner loop. And subsequently, the inner vectorized loop would not have any gathers.

void
computeAffectedPositions(std::vector<Vec3d>& aAffectedPositions)
{
    // Here we have a "sorted" array of Attachments, and an array of IndiceSets.
    // Each IndiceSet specifies the range of Attachment-indexes that share common
    // set of Joint-indexes. So we loop over the IndiceSets (outer loop), and
    // loop over the Attachments over the range (inner SIMD loop).
    const int setCount = static_cast<int>(mIndiceSetArray.size());
    for (int setIndex = 0; setIndex < setCount; ++setIndex) {
        const auto & indiceSet = mIndiceSetArray[setIndex];
        const int startAt = indiceSet.rangeStartAt;
        const int endBefore = indiceSet.rangeEndBefore;

        // Uniform (loop-invariant) data, hoisted outside inner loop
        // NOTE: Avoids indirection, therefore gathers
        const Joint joint0 = mJoints[indiceSet.index0];
        const Joint joint1 = mJoints[indiceSet.index1];
        const Joint joint2 = mJoints[indiceSet.index2];
        const Joint joint3 = mJoints[indiceSet.index3];

#pragma omp simd
        for (int i = startAt; i < endBefore; ++i) {
            const Attachment a = mAttachmentsSorted[i];

            // Compute an affected position
            const Vec3d deformed0 = a.offset * joint0.xform * a.weight0;
            const Vec3d deformed1 = a.offset * joint1.xform * a.weight1;
            const Vec3d deformed2 = a.offset * joint2.xform * a.weight2;
            const Vec3d deformed3 = a.offset * joint3.xform * a.weight3;

            const Vec3d deformedPos = deformed0 + deformed1 + deformed2 + deformed3;

            // Scatter result
            aAffectedPositions[a.workIdIndex] = deformedPos;
        }
    }
}

Figure 9: Version #2: Algorithm refactored to create uniform (loop-invariant) data.

Figure 9 shows the resulting code using a sorted data array, which groups together elements sharing uniform data; the original loop is converted into an outer loop and an inner (vectorized) loop that avoids gathers. The array of IndiceSets, mIndiceSetArray, tracks the start and stop indices in the sorted array, which is why we have an outer loop and an inner loop. Also, since the data is reordered, we added workIdIndex to track the original location for writing the results out.

Now the Opt-report (see Figure 10) no longer reports 48 indexed masked loads (or gathers) due to the Joints. And it “estimates” a 2.35x speedup for Intel® AVX. In our case, the actual speedup was 2.30x.

Intel® C++ Compiler Opt-report of refactored loop with uniform Joint data

Figure 10: Version #2: Intel® C++ Compiler Opt-report of refactored loop with uniform Joint data.

In Figure 10, notice that the Opt-report still reports 8 “gathers” or “masked strided loads.” They result from the array-of-structures memory layout of the mAttachmentsSorted array. Ideally, we want to achieve “unmasked aligned unit stride” loads; later we will demonstrate how to improve this with SDLT. Also notice in the Opt-report (see Figure 10) that we now have 3 scatters. This is because we reordered the input data and thus need to write the results to the output in the correct order (shown at line number 29 in Figure 9). But it is better to scatter 3 values than to gather 48 values: we introduce a small overhead to remove a much larger cost.

Version #2 (Part 2): Data Padding

At this point, the Opt-report estimated a good speedup. However, we have reordered our original, very large attachments loop into many smaller sub-loops, and actual performance of the loop may not be optimal when processing short trip counts. For short trip counts, a significant portion of the execution time might be spent in the Peel or Remainder loop, which are not fully vectorized. Figure 11 provides an example where unaligned data can result in execution in the Peel, Main, or Remainder loop. This happens when either the start or end index (or both) of the iteration space is not a multiple of the SIMD vector lane count. Ideally, we want all the execution time to be in the Main SIMD loop.

Anatomy of a SIMD loop

Figure 11: Anatomy of a SIMD loop. When the compiler does vectorization, it generates code for 3 types of loops: the main SIMD loop, the Peel and the Remainder loop. In this diagram, we have a 4 Vector lane example where the loop iteration space is 3 to 18. The Main loop will process 4 elements at a time, starting at SIMD lane boundary 4 and ending at 15, while the peel loop will process element 3, and the Remainder will process 16–18.

Intel® VTune™ Amplifier XE (2016) can be used to see corresponding assembly code

Figure 12: Intel® VTune™ Amplifier XE (2016) can be used to see where time is spent in the corresponding assembly code. When inspecting (executed) assembly within Intel VTune Amplifier XE, alongside the scrollbar there are blue horizontal bars that indicate execution time. By identifying the Peel, Main, and Remainder loops in the assembly, you can determine how much time is being spent outside your vectorized Main loop (if any).

Therefore, in addition to sorting the Attachments, the SIMD performance could be improved by padding the Attachment data so that it aligns to multiples of the SIMD vector lane count. Figure 13 illustrates how padding the data array can allow for all execution to occur in the Main SIMD loop, which is ideal. Results may vary, but padding your data is generally a beneficial technique.

Padding data array

Figure 13: Padding data array. In this example of 4 vector lanes, it shows the Attachments, sorted and grouped into two sub-loops. (Left) For sub-loop #1, attachments 0–3 are processed in the Main loop, while the remaining two elements (4 and 5) are processed by Remainder. For sub-loop #2, with only a trip count of 3, all 3 are processed by the Peel loop. (Right) We padded each sub-loop to align with a multiple of 4 SIMD lanes, allowing all attachments to be processed by the vectorized loop.
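
A minimal C++ sketch of the padding arithmetic follows; the lane count of 4 matches the 4-vector-lane example above but is otherwise an assumption, and the padded slots must of course be filled with benign values so the vectorized loop can process them harmlessly.

#include <cstddef>

// Round a sub-loop's trip count up to the next multiple of the SIMD lane count
// so the whole range is handled by the Main vectorized loop (no Peel/Remainder).
constexpr std::size_t laneCount = 4;          // e.g., 4 doubles per AVX register

constexpr std::size_t paddedCount(std::size_t n) {
    return (n + laneCount - 1) / laneCount * laneCount;
}

static_assert(paddedCount(6) == 8, "6 attachments padded to 8");
static_assert(paddedCount(3) == 4, "3 attachments padded to 4");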

Version #3: SDLT Container

Now that we have refactored our algorithm to avoid gathers and significantly improve our SIMD performance, we can leverage SDLT to help further improve the efficiency of the generated SIMD code. Until now, all loads have been “masked” and unaligned. Ideally, we want unmasked, aligned, and unit-stride. We utilize SDLT Primitives and Containers to achieve this. SDLT helps with the success of the privatization of local variables in the SIMD loop, meaning each SIMD lane gets a private instance of the variable. The SDLT Containers and Accessors will automatically handle data transformation and alignment.

In Figure 14, the source code shows the changes to integrate SDLT. The key changes are to declare the SDLT_PRIMITIVE for the struct AttachmentSorted and then convert the input data container for the array of Attachments from the std::vector container, which is an Array of Structures (AOS) data layout, to an SDLT container. The programmer uses the operator [] on SDLT accessors as if they were C arrays or std::vector. Initially, we used the SDLT Structure of Arrays (SOA) container (sdlt::soa1d_container), but the Array of Structures of Arrays (ASA) container (sdlt::asa1d_container) yielded better performance. It is easy to switch (that is, using typedef) between SDLT container types to experiment and test for best performance, and you are encouraged to do so. In Figure 14, we also introduce the SDLT_SIMD_LOOP macros, which is a “Preview” feature in ICC 16.2 (SDLT v2), and is compatible with both ASA and SOA container types.

// typedef sdlt::soa1d_container<AttachmentSorted> AttachmentsSdltContainer;
typedef sdlt::asa1d_container<AttachmentSorted, sdlt::simd_traits<double>::lane_count>    AttachmentsSdltContainer;
AttachmentsSdltContainer mAttachmentsSdlt;

void
computeAffectedPositions(std::vector<Vec3>& aAffectedPositions)
{
    // SDLT access for inputs
    auto sdltInputs = mAttachmentsSdlt->const_access();

    math::Vec3* affectedPos = &aAffectedPositions[0];
    for (int setIndex=0; setIndex < setCount; ++setIndex) {
        // . . .

        // SIMD inner loop
        // The ‘sdlt::asa1d_container’ needs a compound index that identifies the AOS index as
        // well as the SOA lane index, and the macro SDLT_SIMD_LOOP provides a compatible index
        // over ranges that begin/end on SIMD lane count boundaries (because we padded our data).
        // NOTE: sdlt::asa_container and SDLT_SIMD_LOOP are “Preview” features in ICC 16.2, SDLT v2.
        SDLT_SIMD_LOOP_BEGIN(index, startAt, endBefore, sdlt::simd_traits<double>::lane_count)
        {
            const AttachmentSorted a = sdltInputs[index];

            // . . .

            affectedPos[a.workIdIndex] = deformedPos;
        }
        SDLT_SIMD_LOOP_END
    }
}

Figure 14: Version #3. Integrate SDLT Container (lines 1–3,7) and Accessor (lines 8 and 19); also, using “Preview” features of SDLT_SIMD_LOOP macros (lines 17 and 23). Only shows differences to Version #2.

Intel® C++ Compiler Opt-report using SDLT Primitive and Containers

Figure 15: Version #3: Intel® C++ Compiler Opt-report using SDLT Primitive and Containers.

In Figure 15, the Opt-report estimates a 1.88x speedup for Version #3. But keep in mind that this is just an estimate, not the actual speedup; in our case the actual speedup was 3.17x. Furthermore, recall that the Opt-report for Version #2 (Figure 10) reported “masked strided” loads. Notice now (Figure 15) that the loads are “unmasked,” “aligned,” and “unit stride.” This is ideal for optimal performance and was facilitated by using an SDLT container to improve data layout and memory access efficiency.

Version #4: sdlt::uniform_soa_over1d

In Version #4 of the algorithm, we can discover more opportunities for improvement. Notice that from one sub-loop to its subsequent sub-loops, the same 3 out of 4 Joint data are being pulled for uniform access in the inner loop. Also, be aware that there is large overhead in getting uniform data ready for every entry into a SIMD loop, and we incur this cost at every iteration of the outer loop for every piece of uniform data.

For SIMD loops, depending on the SIMD instruction set [10], there is overhead in prepping uniform data before iteration starts. For each uniform value the compiler may (1) load the scalar value into the register, (2) broadcast the scalar value in the register to all lanes of the SIMD register, and then (3) store the SIMD register to a new location on the stack for use inside the SIMD loop body. For long trip counts, this overhead can be easily amortized. But for short trip counts, it can hurt performance. In Version #3, every iteration of the outer loop incurs this overhead for the 4 Joints, which means 48 doubles (12 doubles per Joint) total.

Finding trip counts for loop execution

Figure 16: Finding trip counts: Intel® Advisor XE (2016) has a useful feature that provides trip counts for loop execution. This allows you to easily identify short versus long trip counts.

For a scenario such as this, SDLT provides a way to explicitly manage this SIMD data conversion by determining when to incur the overhead, rather than having to automatically incur the overhead cost. This advanced feature of SDLT is sdlt::uniform_soa_over1d. It offers the ability to decouple SIMD conversion overhead from the loop, and puts the user in control of when to incur this overhead. It does this by storing the loop-invariant data in a SIMD-ready format so that the SIMD loops can directly access the data without conversion. It enables partial updates and reuse of uniform data, which benefits the performance of our example.

To illustrate where the SIMD data conversion overhead occurs and how SDLT can help mitigate this overhead, we provide a pseudo-code example in Figures 17 and 18. Figure 17 shows that the overhead is incurred for every iteration of the outer loop (at line 8) and for every double that is accessed from UniformData (12 doubles). And Figure 18 shows how the usage of sdlt::uniform_soa_over1d can reduce the overall cost by having to incur the overhead only once (at line 6). This is an advanced feature that may provide a benefit in specific scenarios. Users should experiment. Results may vary.

Incur SIMD data conversion overhead of uniform data

Figure 17: Before entering a SIMD loop on line 12, for each uniform value (of UniformData) the compiler may (1) load the scalar value into the register, (2) broadcast the scalar value in the register to all lanes of SIMD register, and then (3) store the SIMD register to a new location on the stack for use inside the SIMD loop body. For long trip counts, the overhead can be easily amortized. But for short trip counts, it can hurt performance.

The SIMD loop can use 'UniformOver1d' data without need for conversion

Figure 18: By using sdlt::uniform_soa_over1d, you can explicitly manage this SIMD data conversion by determining when to incur the overhead, rather than having to automatically incur the cost. This advanced feature of SDLT offers the ability to decouple SIMD conversion overhead from the loop and puts you in control of when to incur this overhead. It does this by storing the loop-invariant data in a SIMD-ready format so that the SIMD loops can directly access the data without conversion.

So, as the first step to further improve performance for short trip counts, we refactor the algorithm so that we reuse the data of 3 out of 4 Joints from outer-loop index i to i+1, as illustrated in Figure 19. Using the SDLT feature helps mitigate the accumulated overhead of prepping the SIMD data for the sub-loops.

Uniform data for 3 out of 4 Joints can be reused in subsequent sub-loops

Figure 19: Version #3: Uniform data for 3 out of 4 Joints can be reused in subsequent sub-loops. Partial updating of uniform data can be implemented to minimize loads (or gathers in outer loop). And use sdlt::uniform_soa_over1d to store uniform data in SIMD-ready format to minimize the SIMD conversion overhead for all sub-loops.

By refactoring our loop so that we can reuse uniform data from sub-loop to subsequent sub-loop and only have to do partial updates, on average we have to update only a quarter of the uniform data. Thus, we save 75 percent of the overhead involved in setting up the uniform data for use in an SIMD loop.

Conclusion

Performance speedups for Intel® Advanced Vector Extensions

Figure 20: Performance speedups for an Intel® Advanced Vector Extensions build on the Intel® Xeon® processor E5-2699 v3 (code-named Haswell) [i]

Getting your code to vectorize is only the beginning of SIMD speedups, but it should not be the end. Thereafter, you should use available resources and tools (for example, optimization reports, Intel VTune Amplifier XE, Intel Advisor XE) to investigate the efficiency of the generated code. Through analysis, you may then discover opportunities that would benefit both scalar and SIMD code. Then you can employ and experiment with techniques, whether they are common or like the ones provided in this document. You may need to rethink the algorithm and the data layout in order to ultimately improve the efficiency of your code and especially your generated assembly.

Taking our example, the biggest payoff was from Version #2, which implemented data preconditioning to the algorithm so that we could eliminate all the indirection (gathers). And then, Version #3 yielded additional speedups from using SDLT to help improve memory accesses with unmasked aligned unit-stride loads and also padding data to align with SIMD lane boundaries. And for a scenario of short trip counts, we utilized a SDLT advanced feature to help minimize the overall cost of uniform data overhead.

References

  1. Github repository to download example source code:
    https://github.intel.com/amwells/animation-simd-sdlt-whitepaper
  2. SDLT documentation (contains some code examples):
    https://software.intel.com/en-us/node/600110
  3. SIGGRAPH 2015: DreamWorks Animation (DWA): How We Achieved a 4x Speedup of Skin Deformation with SIMD:
    http://www.slideshare.net/IntelSoftware/dreamwork-animation-dwa
  4. For “try before buy” evaluation copy of Intel® Parallel Studio XE:
    http://software.intel.com/intel-parallel-studio-xe/try-buy
  5. For free copy of Intel® Parallel Studio XE for qualified students, educators, academic researchers and open source contributors:
    https://software.intel.com/en-us/qualify-for-free-software
  6. Intel® VTune™ Amplifier 2016:
    https://software.intel.com/en-us/intel-vtune-amplifier-xe
  7. Intel® Advisor:
    https://software.intel.com/en-us/intel-advisor-xe

Footnotes

1 Single instruction, multiple data (SIMD), refers to the exploitation of data-level parallelism, where a single instruction processes multiple data simultaneously. This is in contrast to the conventional “scalar operations” of using a single instruction to process each individual data.

2 Vectorization is where a computer program is converted from a scalar implementation to a vector (or SIMD) implementation.

3 pragma simd: https://software.intel.com/en-us/node/583427. pragma omp simd: https://software.intel.com/en-us/node/583456

4 Non-Unit Stride Memory Access means that as your loop increments consecutively, you access data from non-adjacent locations in memory. This can add significant overhead on performance. In contrast, accessing memory in a unit-strided (or sequential) fashion can be much more efficient.

5 For reference: https://software.intel.com/sites/default/files/8c/a9/CompilerAutovectorizationGuide.pdf

6 SDLT Primitives restrict objects to help the compiler succeed in the privatization of local variables in a SIMD loop, which means that each SIMD lane gets a private instance of the variable. To meet this criteria, the objects must be Plain Old Data (POD) and have in-lined object members, no nested arrays, and no virtual functions.

7 In the process of vectorizing, the developer should experiment with various pragmas (for example, simd, omp simd, ivdep, and vector {always [assert]}) and utilize Opt-reports.

8 For the Intel® C++ Compiler 16.0 (2016) on Linux*, we added command-line options “-qopt-report=5 –qopt-report-phase=vec” to generate the Opt-Report (*.optrpt).

9 To generate Intel® Advanced Vector Extensions (Intel® AVX) instructions using the Intel® C++ Compiler 16.0, add the option “-xAVX” to the compile command line.

10 The AVX512 instruction set has broadcast load instructions that can reduce the SIMD overhead in prepping uniform data before iteration starts.

i Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark *and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit http://www.intel.com/performance.

Configurations: Intel® Xeon® processor E5-2699 v3 (45M Cache, 2.30 GHz). CPUs: Two 18-Core C1, 2.3 GHz. UNCORE: 2.3 GHz. Intel® QuickPath Interconnect: 9.6 GT/sec. RAM: 128 GB, DDR4-2133 MHz (16 x 8 GB). Disks: 7200 RPM SATA disk. 800 GB SSD. Hyper-Threading OFF, Turbo OFF. Red Hat Enterprise Linux* Server release 7.0. 3.10.0-123.el7.x86_64.

Recipe: Building and Running MASNUM WAVE for Intel® Xeon Phi™ Processors


I. Overview

This article provides a recipe for how to obtain, compile, and run an optimized version of MASNUM WAVE (0.2 degree high resolution) workload on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

The source for this version of MASNUM WAVE as well as the workload can be obtained by contacting Prof. Zhenya Song at songroy@fio.org.cn.

II. Introduction

The MASNUM WAVE model is a third-generation surface wave model proposed in the early 1990s at LAGFD (Laboratory of Geophysical Fluid Dynamics), FIO. The application simulates and predicts the wave process by solving the wave energy spectrum balance equation and its complicated characteristic equations in wave-number space. It is written in Fortran* and parallelized with MPI.

This version of MASNUM WAVE is optimized for the performance on both Intel Xeon processors and Intel Xeon Phi processors. Optimizations in this package include:

  • Removing repeated computation
  • Vectorization by loop unrolling, loop interchange, and removing data dependency
  • Multi-thread support by OpenMP*
  • Compiler options tuning

III. Preliminaries

  1. To build this package, install Intel® MPI Library 5.1 or higher and Intel® Parallel Studio XE 2016 or higher on your host system.
  2. Please contact Prof. Zhenya Song at songroy@fio.org.cn to get the optimized MASNUM WAVE source package and test workload.
  3. Set up the Intel MPI Library and Intel® Fortran Compiler environments:
     

    > source /opt/intel/compilers_and_libraries_<version>/linux/mpi/bin64/mpivars.sh

    > source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64

  4. In order to run MASNUM WAVE on the Intel Xeon Phi processor, reboot the system with SNC-4 cluster mode and cache memory mode via BIOS settings. Please refer to Intel® Xeon Phi™ Processor - Memory Modes and Cluster Modes: Configuration and Use Cases for more details on memory configuration.

IV. Build MASNUM WAVE for the Intel Xeon processor

  1. Unpack the source code to any directory of /home/<user>
     

    > tar xvf WAVE_opt.tar.bz2

  2. Build the executables for the Intel Xeon processor.
     

    > cd /home/<user>/WAVE/src/bin

    > cp makfile.cpu makefile

    The executables are located at the path /home/<user>/WAVE/exp with the name masnum.wam.mpi.cpu

V. Build MASNUM WAVE for Intel Xeon Phi processor

  1. Build the executables for the Intel Xeon Phi processor.
     

    > cd /home/<user>/WAVE/src/bin

    > cp makfile.knl makefile

    This will build executables for the Intel Xeon Phi processor; the executables are located at the path /home/<user>/WAVE/exp/, with the name masnum.wam.mpi.knl

VI. Run MASNUM WAVE on the Intel Xeon processor and Intel Xeon Phi processor

  1. Run MASNUM WAVE with the test workload on the Intel Xeon processor.
     

    > cd /home/<user>/WAVE/exp

    > mpirun -n 36 -env OMP_NUM_THREADS 1 ./masnum.wam.mpi.cpu

  2. Run MASNUM WAVE with the test workload on the Intel Xeon Phi processor. Make sure all the binary and workload files are located on KNL.
     

    > cd /home/<user>/WAVE/exp

    > mpirun -n 34 -env OMP_NUM_THREADS 8 ./masnum.wam.mpi.knl

VII. Performance gain

For the test workload, the following graph shows the speedup achieved from the Intel Xeon Phi processor, compared to the Intel Xeon processor. As you can see, we get:

  • Up to 1.25x faster with the Intel® Xeon Phi™ processor 7210 compared to the two-socket Intel® Xeon® processor E5-2697 v4.
  • Up to 1.41x faster with the Intel® Xeon Phi™ processor 7250 compared to the two-socket Intel Xeon processor E5-2697 v4.

MASNUM WAVE Performance Improvement with the Intel® Xeon Phi™ Processor

Comments on performance improvement on Intel Xeon Phi:

  • MASNUM WAVE has good parallel scalability and benefits from more cores. However, it is memory-bandwidth bound on the Intel Xeon processor, where Hyper-Threading does not help performance: the 36 cores reach the same best performance with either 1 or 2 threads per core. The best performance on the Intel® Xeon Phi™ processor 7250 is achieved with 34 MPI ranks and 8 OpenMP threads per rank, which makes good use of all 272 logical cores. In our tests, 8 threads per rank gave the best multi-threaded scalability; it also reduces MPI communication time by cutting the number of MPI ranks from 68 to 34.
  • MASNUM WAVE is well vectorized, and therefore the wider registers available with AVX-512 improve performance significantly.
  • MASNUM WAVE also benefits from MCDRAM because it is memory-bandwidth bound.

Testing platform configuration:

Intel Xeon processor E5-2697 v4: Dual-socket Intel Xeon processor E5-2697 v4, 2.3 GHz, 18 cores/socket, 36 cores, 72 threads (HT and Turbo ON), DDR4 128 GB, 2400 MHz, Oracle Linux* Server release 6.7.

Intel Xeon Phi processor 7210 (64 cores): Intel Xeon Phi processor 7210, 64 core, 256 threads, 1300 MHz core freq. (HT and Turbo ON), 1600 MHz uncore freq., MCDRAM 16 GB 6.4 GT/s, BIOS 10D42, DDR4 96 GB, 2133 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.

Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250, 68 core, 272 threads, 1400 MHz core freq. (HT and Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 10R00, DDR4 96 GB, 2400 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.

Recipe: Pricing Options Using Barone-Adesi Whaley Approximation


Introduction

Many authors have discussed how parallel programming practice can be applied to the Black-Scholes model and to the Black-Scholes formula, which prices European options analytically, and these parallelization methods have achieved high performance for European-style option pricing algorithms. However, nearly all options written on the wide variety of underlying financial products, including stocks, commodities, and foreign exchange, are American-style, with an early-exercise clause embedded in the options contract. Unlike European-style options pricing problems, the American-style option pricing problem has no closed-form solution. The pricing of American options has mainly relied on the finite difference methods of Brennan and Schwartz [1978], the binomial tree of Cox, Ross, and Rubinstein [1979], and the trinomial tree of Tian [1993]. While these numerical methods are capable of producing accurate solutions to American option pricing problems, they are also difficult to use and consume at least two orders of magnitude more computational resources. As a result, for the past 40 years, many talented financial mathematicians have searched for newer and better numerical methods that produce results with a more efficient use of computational resources. In this paper, we look at one of these successful efforts, pioneered by Barone-Adesi and Whaley [1989], and apply the high-performance parallel computing capabilities of modern microprocessors to create a program that delivers high performance with suitable numerical results.

Quadratic Approximation

Consider an option on a stock providing a dividend yield equal to q. We will denote the difference between the American and European option price by v. Because both the American and the European option prices satisfy the Black–Scholes differential equation, v also does so.

For convenience, we define

Without loss of generality, we can also write:

v = h(τ)g(S, h)

Change variables and substitution

The approximation involves assuming that the final term on the left-hand side is zero, so that

The ignored term is generally fairly small. When τ is large, 1-h is close to zero; when τ is small, ∂g/∂h is close to zero.

The American call and put prices at time t will be denoted by C(S, t) and P(S, t), where S is the stock price, and the corresponding European call and put price will be denoted by c(S, t) and p(S, t). Equation (1) can be solved using standard techniques. After boundary conditions have been applied, it is found that

The variable S* is the critical price of the stock above which the option should be exercised. It is estimated by solving the following equation iteratively:

For a put option, the valuation formula is

The variable S** is the critical price of the stock below which the option should be exercised. It is estimated by solving the following equation iteratively:

The other variables that have been used here are

Options on stock indices, currencies, and futures contracts are analogous to options on a stock providing a constant dividend yield. Hence the quadratic approximation approach can easily be applied to all of these types of options.
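
The article's original equation images are not reproduced here, so the following is only a sketch of the standard Barone-Adesi and Whaley expressions, written in LaTeX with this recipe's notation (X is the strike, τ the time to maturity, q the dividend yield, N the cumulative normal distribution); the article's own figures remain the authoritative statement of the formulas.

    \alpha = \frac{2r}{\sigma^2}, \qquad
    \beta  = \frac{2(r-q)}{\sigma^2}, \qquad
    h(\tau) = 1 - e^{-r\tau}

    C(S,\tau) =
      \begin{cases}
        c(S,\tau) + A_2\,(S/S^{*})^{\gamma_2}, & S < S^{*} \\
        S - X, & S \ge S^{*}
      \end{cases}
    \qquad
    \gamma_2 = \tfrac{1}{2}\Bigl[-(\beta-1) + \sqrt{(\beta-1)^2 + 4\alpha/h}\Bigr]

    A_2 = \frac{S^{*}}{\gamma_2}\Bigl\{1 - e^{-q\tau} N\bigl(d_1(S^{*})\bigr)\Bigr\},
    \qquad
    S^{*} - X = c(S^{*},\tau) + \frac{S^{*}}{\gamma_2}\Bigl\{1 - e^{-q\tau} N\bigl(d_1(S^{*})\bigr)\Bigr\}

    P(S,\tau) =
      \begin{cases}
        p(S,\tau) + A_1\,(S/S^{**})^{\gamma_1}, & S > S^{**} \\
        X - S, & S \le S^{**}
      \end{cases}
    \qquad
    \gamma_1 = \tfrac{1}{2}\Bigl[-(\beta-1) - \sqrt{(\beta-1)^2 + 4\alpha/h}\Bigr]

    A_1 = -\frac{S^{**}}{\gamma_1}\Bigl\{1 - e^{-q\tau} N\bigl(-d_1(S^{**})\bigr)\Bigr\},
    \qquad
    d_1(S) = \frac{\ln(S/X) + (r - q + \sigma^2/2)\,\tau}{\sigma\sqrt{\tau}}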

Code Access

The source code for the Barone-Adesi Whaley approximation is maintained by Shuo Li and is available under the BSD 3-Clause Licensing Agreement. The program runs natively on Intel® Xeon Phi™ processors in a single-node environment.

To get access to the code and test workloads, go to the source location and download the BAWAmericanOptions.tar file.

Build and Run Directions

Here are the steps for rebuilding the program:

  1. Install Intel® Parallel Studio XE 2016 SP 3 on your system
  2. Source the environment variable script file
  3. Untar the BAWAmericanOptions.tar
  4. Type make to build the binaries for Single and Double Precision
     
    1. For Single Precision processing: am_call_sp.knl
    2. For Double Precision processing: am_call_dp.knl
  5. Make sure the host machine is powered by Intel® Xeon Phi™ Processors
     

    [sli@ortce-knl7 ~]$ lscpu

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                272
    On-line CPU(s) list:   0-271
    Thread(s) per core:    4
    Core(s) per socket:    68
    Socket(s):             1
    NUMA node(s):          8
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 87
    Model name:            Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
    Stepping:              1
    CPU MHz:               1400.273
    BogoMIPS:              2793.61
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              1024K
    NUMA node0 CPU(s):     5,6,11,12,17,18,23-25,73-86,135-148,197-210,259-271
    NUMA node1 CPU(s):     1,2,7,8,13,14,19,20,43-58,105-120,167-182,229-244
    NUMA node2 CPU(s):     3,4,9,10,15,16,21,22,59-72,121-134,183-196,245-258
    NUMA node3 CPU(s):     0,26-42,87-104,149-166,211-228
  6. Run am_call_sp.knl and am_call_dp.knl 
    [sli@wsl-knl-02 test_baw]$ ./am_call_sp.knl
     Call price using Barone-Adesi Whaley approximation Optimized = 5.743389
      cycles consumed is 99246
    Pricing American Options using BAW Approximation in single precision.
    Compiler Version  = 16
    Release Update    = 3
    Build Time        = May 27 2016 20:27:47
    Input Dataset     = 142606336
    Worker Threads    = 272
    
    Completed pricing 142.60634 million options in 0.56671 seconds:
    Parallel version runs at 251.64027 million options per second.
    [sli@wsl-knl-02 test_baw]$ ./am_call_dp.knl
     Call price using Barone-Adesi Whaley approximation Optimized = 5.743386
      cycles consumed is 122640
    Pricing American Options using BAW Approximation in double precision.
    Compiler Version  = 16
    Release Update    = 3
    Build Time        = May 27 2016 20:34:32
    Input Dataset     = 142606336
    Worker Threads    = 272
    
    Completed pricing 142.60634 million options in 2.10704 seconds:
    Parallel version runs at 67.68101 million options per second.
    

Intel Sample Source Code License Agreement

Recipe: RELION for Intel® Xeon Phi™ 7250 processor


I. Overview

This article provides a recipe for how to obtain, compile, and run an optimized version of relion-1.4 on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

The source for this version of relion-1.4 can be downloaded from: http://www2.mrc-lmb.cam.ac.uk/relion/index.php

II. Introduction

RELION is an image-processing software package that is widely used to obtain high-resolution cryo-EM structures. It uses a Bayesian MAP+EM algorithm that provides more reliable structures than earlier methods and is better suited to heterogeneous data. RELION is distributed under a GPL license and is completely free, open-source software for both academia and industry. The code is written in C++, and parallelization is achieved through MPI and Pthreads. More information about RELION is available at http://www2.mrc-lmb.cam.ac.uk/relion/index.php

This project optimizes the performance of the auto-refine part of RELION on both Intel® Xeon® processor and Intel® Xeon Phi™ processor.

Optimizations in this package include:

  1. Improve data alignment to 64 bytes. This data alignment alone brings about a 10% performance improvement for this workload.
  2. Vectorize the hotspot loops. The first hotspot loop in particular is executed very frequently, so after vectorizing the two hotspot loops this workload gains more than 30% in performance.
  3. RELION is a memory-bound application, so taking advantage of the fast MCDRAM on the Intel Xeon Phi 7250 processor should improve performance. Using the MCDRAM in cache mode, we see about a 10% performance boost for this workload.

III. Preliminaries

  1. To match these results, the Intel® Xeon Phi™ processor machine needs to be booted with bios settings for quad cluster mode and MCDRAM cache mode. Please review this document for further information.
  2. To build this package, install the Intel® MPI Library for Linux* 5.1(Update 3) and Intel® Parallel Studio XE Composer Edition for C++ Linux* Version 2016 (Update 3) or higher products on your system.
  3. Download relion-1.4.tar.bz2 from http://www2.mrc-lmb.cam.ac.uk/relion/index.php
  4. Set up the Intel® MPI Library and Intel® C++ Compiler environments:

    > source /opt/intel/impi/<version>/bin64/mpivars.sh
    > source /opt/intel/composer_xe_<version>/bin/compilervars.sh intel64
  5. Unpack the source code to /home/users.

    > cp relion-1.4.tar.bz2 /home/users
    > tar -xjvf relion-1.4.tar.bz2
    > cd ./relion-1.4
  6. Please contact Yanan Zhu <ynzhu@pku.edu.cn> at Peking University to get the test workload. Please request the version used for the Intel KNL recipes.
  7. Copy the workload to your home directory and unpack it:

    > cp relion-autorefine-5000.tar.gz /home/users
    > cd /home/users
    > tar -xzvf relion-autorefine-5000.tar.gz

IV. Add optimized code into relion

  1. Overload operator new and operator delete of class MultidimArray in src/multidim_array.h

    > cd /home/users/relion-1.4
    > vi src/multidim_array.h

    Insert the below optimized code before line 496

    void *operator new(size_t size)
    {
         return _mm_malloc(size, 64);
    }
    void operator delete(void *p)
    {
         _mm_free(p);
    }
  2. Vectorize the hotspot loop

    > cd /home/users/relion-1.4
    > vi src/multidim_array.h

    Replace the original code with optimized code

    The original code (at about line 930) is:
           for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++)
                        for (long int j = 0; j < Xdim; j++)
                        {
                            T val;
                            if (k >= ZSIZE(*this))
                                val = 0;
                            else if (i >= YSIZE(*this))
                                val = 0;
                            else if (j >= XSIZE(*this))
                                val = 0;
                            else
                                val = DIRECT_A3D_ELEM(*this, k, i, j);
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
                        }

    Optimized code is:

    if ( (ZSIZE(*this)<= Zdim) && (YSIZE(*this)<= Ydim) && (XSIZE(*this)<= Xdim) ) {
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++) {
       #pragma simd
                        for (long int j = 0; j < Xdim; j++)
                        {
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = 0;
                        }
                    }
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < ZSIZE(*this); k++)
                    for (long int i = 0; i < YSIZE(*this); i++) {
       #pragma simd
                        for (long int j = 0; j < XSIZE(*this); j++)
                        {
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = DIRECT_A3D_ELEM(*this, k, i, j);
                        }
                    }
    } else {
    
            for (long int l = 0; l < Ndim; l++)
                for (long int k = 0; k < Zdim; k++)
                    for (long int i = 0; i < Ydim; i++)
                        for (long int j = 0; j < Xdim; j++)
                        {
                            T val;
                            if (k >= ZSIZE(*this))
                                val = 0;
                            else if (i >= YSIZE(*this))
                                val = 0;
                            else if (j >= XSIZE(*this))
                                val = 0;
                            else
                                val = DIRECT_A3D_ELEM(*this, k, i, j);
                            new_data[l*ZYXdim + k*YXdim+i*Xdim+j] = val;
               }
    }
  3. Vectorize the hotspot loop

    > cd /home/users/relion-1.4
    > vi src/ml_optimiser.cpp

    Replace the original code with the optimized code.
    The original code (at line 3652) is:

    FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
    {
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
    suma2 += norm(DIRECT_MULTIDIM_ELEM(Frefctf, n));
    }

    Optimized code is:

    Complex *opp;
    FOR_ALL_DIRECT_ELEMENTS_IN_MULTIDIMARRAY(Frefctf)
    {
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).real * (*(Fimg_shift + n)).real;
    diff2 -= (DIRECT_MULTIDIM_ELEM(Frefctf, n)).imag * (*(Fimg_shift + n)).imag;
    opp = & DIRECT_MULTIDIM_ELEM(Frefctf, n);
    suma2 += opp->real * opp->real + opp->imag * opp->imag;
    }
  4. There is a known issue in RELION 1.4, as referenced at the following link:

    http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Known_issue

    Change line 405 in src/ml_optimiser_mpi.cpp from:

    length_fn_ctf = exp_fn_img.length() + 1; // +1 to include \0 at the end of the string

    into:

    length_fn_ctf = exp_fn_ctf.length() + 1; // +1 to include \0 at the end of the string

V. Prepare for Intel® Xeon® processor

  1. Set environment variables for compilation of relion:

    > export CC=icc
    > export CXX=icpc
    > export F77=ifort
    > export MPICC=mpiicc
    > export MPICXX=mpiicpc
    > export CFLAGS="-O3 -xHost -fno-alias -align"
    > export FFLAGS="-O3 -xHost -fno-alias -align"
    > export CXXFLAGS="-O3 -xHost -fno-alias -align"

    Suggestion: you can also add -qopt-report=5 into CFLAGS/FFLAGS/CXXFLAGS to see optimization report

  2. Build the library for the Intel® Xeon processor.

    > cd /home/users
    > cd ./relion-1.4
    > ./INSTALL.sh

VI. Prepare for Intel® Xeon Phi™ processor

  1. Set environment variables for compilation of relion:

    > export CC=icc
    > export CXX=icpc
    > export F77=ifort
    > export MPICC=mpiicc
    > export MPICXX=mpiicpc
    > export CFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
    > export FFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"
    > export CXXFLAGS="-O3 -xMIC-AVX512 -fno-alias -align"

    Suggestion: you can also add -qopt-report=5 into CFLAGS/FFLAGS/CXXFLAGS to see optimization report

  2. Build the library for the Intel® Xeon Phi™ processor.

    cd /home/users
    cd ./relion-1.4
    ./INSTALL.sh 

VII. Run the test workload on Intel® Xeon processor

  1. Create a run script for this workload

    > vi autorefine.sh
    #!/bin/sh
    nprocs=9
    nthreads=4
    mrcsfile="adkc_05000.mrcs.mrcs"
    starfile="adkc_05000.star"
    defocusfile="3eulerctf_05000.dat"
    
    echo ""> $starfile
    echo "data_">> $starfile
    echo "">> $starfile
    echo "loop_">> $starfile
    echo "_rlnVoltage #1">> $starfile
    echo "_rlnDefocusU #2">> $starfile
    echo "_rlnDefocusV #3">> $starfile
    echo "_rlnDefocusAngle #4">> $starfile
    echo "_rlnSphericalAberration #5">> $starfile
    echo "_rlnAmplitudeContrast #6">> $starfile
    echo "_rlnImageName #7">> $starfile
    
    awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
    mkdir -p Refine/adkc_05000
    mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads --iter 10 --split_random_halves --i $starfile  --particle_diameter 86 --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
  2. Set PATH and LD_LIBRARY_PATH to run the relion autorefine workload

    > export PATH=$PATH:/home/users/relion-1.4/bin
    > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib
  3. Run the workload

    > cd /home/users/5000_vec1
    > sh autorefine.sh

VIII. Run the test workload on Intel® Xeon Phi™ processor

  1. Create a run script for this workload

    > vi autorefine.sh
    #!/bin/sh
    nprocs=65
    nthreads=4
    mrcsfile="adkc_05000.mrcs.mrcs"
    starfile="adkc_05000.star"
    defocusfile="3eulerctf_05000.dat"
    
    echo ""> $starfile
    echo "data_">> $starfile
    echo "">> $starfile
    echo "loop_">> $starfile
    echo "_rlnVoltage #1">> $starfile
    echo "_rlnDefocusU #2">> $starfile
    echo "_rlnDefocusV #3">> $starfile
    echo "_rlnDefocusAngle #4">> $starfile
    echo "_rlnSphericalAberration #5">> $starfile
    echo "_rlnAmplitudeContrast #6">> $starfile
    echo "_rlnImageName #7">> $starfile
    
    awk 'BEGIN{npart=0}{if($1 !~/^;/){npart++; printf("%s %s %s %s %s %s %d@'"$mrcsfile"'\n", $1,$3,$4,$5,$6,$7,npart)}}' $defocusfile >> $starfile
    mkdir -p Refine/adkc_05000
    mpirun -np $nprocs relion_refine_mpi --o Refine/adkc_05000/run01 --j $nthreads --iter 10 --split_random_halves --i $starfile  --particle_diameter 86 --angpix 0.86 --ref adkc.mrc --firstiter_cc --ini_high 60 --ctf --tau2_fudge 4 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 10 --offset_step 2 --sym C1 --norm --scale --memory_per_thread 0.5 --auto_local_healpix_order 5 --low_resol_join_halves 40
  2. Set PATH and LD_LIBRARY_PATH to run the relion autorefine workload

    > export PATH=$PATH:/home/users/relion-1.4/bin
    > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/users/relion-1.4/lib
  3. Run the workload

    > cd /home/users/5000_vec1
    > sh autorefine.sh

IX. Performance gain

For the autorefine workload described above, the following graph shows the speedup achieved from this optimization. As you can see, up to a 1.31x speedup can be achieved when running this code on one Intel® Xeon Phi™ 7250 processor vs. one 2-socket Intel® Xeon® processor E5-2697 v4, and up to a 1.23x speedup when running on one Intel® Xeon Phi™ 7210 processor vs. one 2-socket Intel® Xeon® processor E5-2697 v4.

  • 2S Intel® Xeon® processor E5-2697 v4, (18 Ranks)
  • 1 Intel® Xeon Phi™ 7210 (63 Ranks)
  • 1 Intel® Xeon Phi™ 7250 (65 Ranks)

Testing platform configuration:

Intel® Xeon® processor E5-2697 v4: Dual-socket Intel® Xeon® processor E5-2697 v4, 2.3 GHz (Turbo ON), 18 cores/socket, 36 cores, 72 threads (HT off), DDR4 128 GB, 2400 MHz, CentOS release 6.7 (Final)

Intel® Xeon Phi™ processor 7210 (64 cores): Intel® Xeon Phi™ processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (Turbo ON), MCDRAM 16 GB 6.4 GT/s, BIOS GVPRCRB1.86B.0010.R00.1603251732, DDR4 96 GB 2133 MHz, Red Hat 7.2 (Maipo), quad cluster mode, MCDRAM cache mode

Intel® Xeon Phi™ processor 7250 (68 cores): Intel® Xeon Phi™ processor 7250 68 core, 272 threads, 1400 MHz core freq. (Turbo ON), MCDRAM 16 GB 7.2 GT/s, BIOS GVPRCRB1.86B.0010.D42.1604182214, DDR4 96GB 2400 MHz, Red Hat 7.2(Maipo), quad cluster mode, MCDRAM cache mode

Recipe: The Black-Scholes-Merton Formula Optimization for Intel® Xeon Phi™ Processor


Introduction

Financial derivative pricing is a cornerstone of quantitative finance. The most common form of financial derivatives is common stock options, which are contracts between two parties regarding buying or selling an asset (specifically shares of stock) at a certain time at an agreed price. The two types of options are calls and puts. A call option gives the holder the right to buy the asset by a certain date for a certain price. A put option gives the holder the right to sell the asset by a certain date for a certain price. The contract price is called the exercise price or strike price. The date in the contract is known as the expiration date or maturity. American options can be exercised at any time before the expiration date. European options can be exercised only on the expiration date.

Typically, the value of an option f, is determined by the following factors:

  • S – the current price of the underlying asset
  • X – the strike price of the option
  • T – the time to the expiration
  • σ – the volatility of the underlying asset
  • r – the continuously compounded risk-free rate

In their 1973 paper, “The Pricing of Options and Corporate Liabilities,” Fischer Black and Myron Scholes created a mathematical description of financial markets and stock options within the frameworks built by researchers from Louis Bachelier to Paul Samuelson. Building on ideas from Jack Treynor, they arrived at the partial differential equation that Robert Merton first referred to as the Black-Scholes model.

Black-Scholes Model

This PDE has many solutions, corresponding to all the different derivatives with the same underlying asset S. The specific derivative obtained depends on the boundary conditions used while solving this equation. In the case of European call options, the key boundary condition is 

fcall = max(S-X, 0) when t=T

In the case of European put options, it is

fput = max(X-S, 0) when t=T

Black-Scholes-Merton Formula

Shortly after Black and Scholes's historic paper, Robert Merton was the first to publish a paper recognizing its significance, and he coined the term Black-Scholes option pricing model. Merton is also credited with the closed-form solution of the Black-Scholes equation for the European call option price c and the European put option price p, known as the Black-Scholes-Merton formula.

Black-Scholes-Merton Formula

The function N(x) is the cumulative normal distribution function. It gives the probability that a variable with a standard normal distribution, φ(0, 1), will be less than x. In most cases, N(x) is approximated using a polynomial function defined as:

Black-Scholes-Merton Formula
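
Since the original formula images are not reproduced here, the standard closed-form expressions can be sketched in LaTeX using this article's variables (S, X, T, σ, r); the polynomial approximation of N(x) mentioned above is the usual Abramowitz-and-Stegun style expansion, whose tabulated constants are omitted in this sketch.

    c = S\,N(d_1) - X e^{-rT} N(d_2), \qquad
    p = X e^{-rT} N(-d_2) - S\,N(-d_1)

    d_1 = \frac{\ln(S/X) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad
    d_2 = d_1 - \sigma\sqrt{T}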

Code Access

The source code for Black-Scholes-Merton formula is maintained by Shuo Li and is available under the BSD 3-Clause Licensing Agreement. The program runs natively on Intel® Xeon Phi™ processors in a single node environment.

To get access to the code and test workloads, go to the source location and download the BlackScholes.tar file.

Build and Run Directions

Here are the steps for building the program:

  1. Install Intel® Parallel Studio XE 2016 SP 3 on your system.
  2. Source the environment variable script file
  3. Untar the BlackScholes.tar
  4. Type make to build the binaries for Single and Double Precision:
    1. For Single Precision processing: BlackScholesSP.knl
    2. For Double Precision processing: BlackScholesDP.knl
  5. Make sure the host machine is powered by Intel® Xeon Phi™ processors

    $ lscpu

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                272
    On-line CPU(s) list:   0-271
    Thread(s) per core:    4
    Core(s) per socket:    68
    Socket(s):             1
    NUMA node(s):          8
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 87
    Model name:            Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
    Stepping:              1
    CPU MHz:               1400.273
    BogoMIPS:              2793.61
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              1024K
    NUMA node0 CPU(s):     5,6,11,12,17,18,23-25,73-86,135-148,197-210,259-271
    NUMA node1 CPU(s):     1,2,7,8,13,14,19,20,43-58,105-120,167-182,229-244
    NUMA node2 CPU(s):     3,4,9,10,15,16,21,22,59-72,121-134,183-196,245-258
    NUMA node3 CPU(s):     0,26-42,87-104,149-166,211-228

  6. Make sure the memory mode for Intel® Xeon Phi™ processors is flat

    # syscfg /d biossettings "memory mode"
    Memory Mode
    ===========
    Current Value : Flat
       ---------------------
       Possible Values
       ---------------
       Cache : 00
       Flat : 01
       Hybrid : 02
       Auto : 03
  7. Run BlackScholesSP.knl and BlackScholesDP.knl under numactl with the parameter -m 1.

    [sli@wsl-knl-02 clean]$ numactl -m1 ./BlackScholesSP.knl
    Black-Scholes Formula Single Precision.
    Compiler Version  = 16
    Release Update    = 3
    Build Time        = May 27 2016 17:57:13
    Input Dataset     = 160432128
    Repetitions       = 1000
    Chunk Size        = 64
    Worker Threads    = 272
    
    ==========================================
    Total Cycles                   = 19090695400
    Cycles/OptionPair at thread 0  = 32.37
    Time Elapsed                   = 13.6678
    Options/sec                    = 23.4759
    
    ==========================================
    [sli@wsl-knl-02 clean]$ numactl -m1 ./BlackScholesDP.knl
    Black-Scholes Formula Double Precision.
    Compiler Version  = 16
    Release Update    = 3
    Build Time        = May 27 2016 17:57:14
    Input Dataset     = 160432128
    Repetitions       = 1000
    Chunk Size        = 256
    Worker Threads    = 272
    
    =============================================
    Total Cycles                   = 22208060392
    Cycles/OptionPair at thread 0  = 37.65
    Time Elapsed                   = 15.8996
    Options/sec                    = 20.1806
    =============================================

Intel Sample Source Code License Agreement

Tutorial on Intel® Xeon Phi™ Processor Optimization


Download File[TAR 20KB]

1. Introduction

In this tutorial, we demonstrate some possible ways to optimize an application to run on the Intel® Xeon Phi™ processor. The optimization process in this tutorial is divided into three parts:

  • The first part describes the general optimization techniques that are used to vectorize (data parallelism) the code.
  • The second part describes how thread-level parallelism is added to utilize all the available cores in the processor.
  • The third part optimizes the code by enabling memory optimization on the Intel Xeon Phi processor.

We conclude the tutorial by showing the graph that shows the performance improvement at each optimization step.

This work is organized as follows: a serial, suboptimal sample code is used as the base. Then we apply some optimization techniques to that code to obtain the vectorized version of the code, and we add threaded parallelism to the vectorized code to have the parallel version of the code. Finally, we use Intel® VTune™ Amplifier to analyze memory bandwidth of the parallel code to improve the performance further using the high bandwidth memory. All three versions of the code (mySerialApp.c, myVectorizedApp.c, and myParallelApp.c) are included as an attachment with this tutorial.

The sample code is a streaming application with two large sets of buffers containing the inputs and outputs. The first large set, the input data, contains the coefficients of quadratic equations. The second large set, the output data, holds the roots of each quadratic equation. For simplicity, the coefficients are chosen so that each quadratic equation always has two real roots.

Consider the quadratic equation a*x^2 + b*x + c = 0.

The two roots are given by the well-known formula x1,2 = (-b ± sqrt(b^2 - 4*a*c)) / (2*a).

The conditions for having two real, distinct roots are a ≠ 0 and b^2 - 4*a*c > 0.
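
As a concrete reference, here is a minimal scalar sketch of this computation (the function name is illustrative; the attached mySerialApp.c is the actual baseline implementation):

#include <math.h>

/* Sketch: solve a*x^2 + b*x + c = 0 for its two real roots, assuming a != 0 and
   b*b - 4*a*c > 0, which the tutorial's generated coefficients guarantee. */
static void solve_quadratic(float a, float b, float c, float *x1, float *x2)
{
    float det = sqrtf(b * b - 4.0f * a * c);
    *x1 = (-b + det) / (2.0f * a);
    *x2 = (-b - det) / (2.0f * a);
}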

2. Hardware and Software

The program runs on a preproduction Intel® Xeon Phi™ processor, model 7250, with 68 cores clocked at 1.4 GHz, 96 GB of DDR4 RAM, and 16 GB Multi Channel Dynamic Random Access Memory (MCDRAM). With 4 hardware threads per core, this system can run with a total of 272 hardware threads. We install Red Hat Enterprise Linux* 7.2, the Intel® Xeon Phi™ processor software version 1.3.1, and Intel® Parallel Studio XE 2016 update 3 on this system.

To check the type of processor and the number of processors in your system, you can display /proc/cpuinfo. For example:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 87
model name      : Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
stepping        : 1
microcode       : 0xffff0180
cpu MHz         : 1515.992
cache size      : 1024 KB
physical id     : 0
siblings        : 272
core id         : 0
cpu cores       : 68
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt
bogomips        : 2793.59
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
……………………………………

The complete output from the test system shows 272 CPUs, or hardware threads. Note that in the flags field, it shows instruction extensions avx512f, avx512pf, avx512er, avx512cd; those are the instruction extensions that the Intel Xeon Phi processor supports.

You can also display information about the CPU by running lscpu:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                272
On-line CPU(s) list:   0-271
Thread(s) per core:    4
Core(s) per socket:    68
Socket(s):             1
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 87
Model name:            Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
Stepping:              1
CPU MHz:               1365.109
BogoMIPS:              2793.59
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-271
NUMA node1 CPU(s):

The above command shows that the system has 1 socket, 68 cores, and 272 CPUs. It also indicates that this system has 2 NUMA nodes; all 272 CPUs belong to NUMA node 0. For more information on NUMA, please refer to An Intro to MCDRAM (High Bandwidth Memory) on Knights Landing.

Before analyzing and optimizing the sample program, compile the program and run the binary to get the baseline performance.

3. Benchmarking the Baseline Code

A naïve implementation of the solution is shown in the attached program mySerialApp.c. The coefficients a, b, and c are grouped in the structure Coefficients; the roots x1 and x2 are grouped in the structure Roots. Coefficients and roots are single-precision floating-point numbers. Every coefficients tuple corresponds to a roots tuple. The program allocates N coefficient tuples and N root tuples, where N is a large number (N = 512M elements or, to be exact, 512*1024*1024 = 536,870,912 elements). The coefficients structure and the roots structure are shown below:

struct Coefficients {
        float a;
        float b;
        float c;
    } coefficients;

struct Roots {
        float x1;
        float x2;
    } roots;

This simple program computes the real roots x1 and x2 according to the above formula. Also, we use the standard system timer to measure the computation time. The buffer allocation time and initialization time are not measured. The simple program repeats the calculation process 10 times.

To start, benchmark the application by compiling the baseline code using the Intel® C++ Compiler:

$ icc mySerialApp.c

By default, the compiler compiles with the switch -O2, which is optimized for maximum speed. We then run the application:

$ ./a.out
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
SERIAL
Elapsed time in msec: 461,222 (after 10 iterations)

The output indicates that for that large number of entries (N = 512M elements), the system takes 461,222 msec to complete 10 iterations to stream data, to compute the roots, and to store the results. For each coefficients tuple, this program calculates the roots tuple. Note that this baseline code does not take advantage of the high number of available cores in the system or SIMD instructions since it runs in the serial and scalar mode (only one thread that processes one tuple element at a time). Therefore, only one hardware thread (CPU) is running, all the rest of the CPUs are idle. You can verify this by generating a vectorization report (*.optrpt) using the compiler option -qopt-report=5 -qopt-report-phase:vec.

$ icc mySerialApp.c -qopt-report=5 -qopt-report-phase:vec

After measuring the baseline code performance, we can start vectorizing the code.

4. Vectorizing Code

4.1. Change the array of structures to a structure of arrays. Avoid multiple levels of indirection in buffer allocation.

The first way to improve the code performance is to convert the array of structures (AoS) to a structure of arrays (SoA). SoA increases the amount of data accessed with unit strides. Instead of defining a large number of coefficients tuples (a, b, c) and roots tuples (x1, x2), we rearrange the data so that it is allocated in 5 large arrays called a, b, c, x1, and x2 (refer to the program myVectorizedApp.c). In addition, instead of using malloc to allocate memory, we use _mm_malloc to align data on a 64-byte boundary (see next section).

float *coef_a  = (float *)_mm_malloc(N * sizeof(float), 64);
float *coef_b  = (float *)_mm_malloc(N * sizeof(float), 64);
float *coef_c  = (float *)_mm_malloc(N * sizeof(float), 64);
float *root_x1 = (float *)_mm_malloc(N * sizeof(float), 64);
float *root_x2 = (float *)_mm_malloc(N * sizeof(float), 64);
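
With this layout, the compute loop sweeps each array with unit stride, which is what lets the compiler pack consecutive iterations into SIMD lanes. A minimal sketch of that loop follows (the function name is illustrative; the loop body mirrors the root formula used in the attached code):

#include <math.h>

/* Sketch: unit-stride traversal of the SoA arrays; iteration i touches only
   element i of each array, so consecutive iterations map onto vector lanes. */
static void compute_roots(long n, const float *a, const float *b, const float *c,
                          float *x1, float *x2)
{
    for (long i = 0; i < n; i++) {
        float det = sqrtf(b[i] * b[i] - 4.0f * a[i] * c[i]);
        x1[i] = (-b[i] + det) / (2.0f * a[i]);
        x2[i] = (-b[i] - det) / (2.0f * a[i]);
    }
}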

4.2. Further improvement: removing type conversion, data alignment

The next step is to remove unnecessary type conversion. For example, the function sqrt() takes a double precision as input. But since we pass a single precision as input in this program, the compiler needs to convert single precision to double precision. To remove unnecessary data type conversion, use sqrtf() instead of sqrt(). Similarly, use single precision instead of integer, and so on. For example, instead of using 4, we use 4.0f. Note that 4.0 (without the f suffix) is a double-precision floating number while 4.0f is a single-precision floating number.

Data alignment helps data move efficiently when data have to be moved from and to memory. For the Intel Xeon Phi processor, data movement is optimal when the data starting address lies on the 64-byte boundary, just like for the Intel® Xeon Phi™ coprocessor. To help the compiler vectorize, you need to allocate memory with 64-byte alignment, and use pragma/directives where data is used to tell the compiler that memory access is aligned. Vectorization performs best with properly aligned data. In this document, vectorization refers to the ability to process multiple data in a single instruction (SIMD).

In the above example, to align heap-allocated data, we use _mm_malloc() and _mm_free() to allocate the arrays. Note that _mm_malloc() behaves like malloc() but takes an alignment parameter (in bytes) as a second argument, which is 64 for the Intel Xeon Phi processor. We also need to inform the compiler that the data is aligned: insert the clause __assume_aligned(a, 64) before the code that uses array a to indicate that it is aligned, or add the clause #pragma vector aligned before a loop to tell the compiler that all arrays accessed in that loop are aligned.
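
A minimal sketch combining these pieces is shown below (the function and variable names are illustrative, and __assume_aligned is an Intel compiler feature, so this sketch assumes icc):

#include <immintrin.h>   /* _mm_malloc, _mm_free */
#include <math.h>

/* Sketch: 64-byte-aligned allocation plus the alignment hints described above,
   so the compiler can generate aligned vector loads and stores. */
void aligned_sqrt_example(int n)
{
    float *a  = (float *)_mm_malloc(n * sizeof(float), 64);
    float *x1 = (float *)_mm_malloc(n * sizeof(float), 64);

    __assume_aligned(a, 64);     /* the compiler may assume a is 64-byte aligned */
    __assume_aligned(x1, 64);

    for (int i = 0; i < n; i++)
        a[i] = (float)i;         /* simple initialization */

#pragma vector aligned           /* all arrays accessed in this loop are aligned */
    for (int i = 0; i < n; i++)
        x1[i] = sqrtf(a[i]);

    _mm_free(a);
    _mm_free(x1);
}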

4.3. Use auto-vectorization, run a compiler report, and disable vectorization via a compiler switch

Vectorization refers to programming techniques that use the vector processing units (VPUs) available to perform an operation on multiple values simultaneously. Auto-vectorization is the capability of a compiler to identify opportunities in loops and perform vectorization accordingly. You can take advantage of the auto-vectorization feature in the Intel compiler, since auto-vectorization is enabled at the default optimization level -O2 or higher.

For example, when the mySerialApp.c sample code is compiled using the Intel compiler icc, by default the compiler looks for vectorization opportunities in the loops. However, the compiler needs to follow certain rules (the loop trip count must be known, single entry and single exit, straight-line code, the innermost loop of a nest, and so on) in order to vectorize these loops. You can help the compiler to vectorize these loops by providing additional information.

To determine whether or not your code is vectorized, you can generate a vectorization report by specifying the option -qopt-report=5 -qopt-report-phase:vec. A vectorization report (*.optrpt) is then generated by the compiler. The report tells you whether or not vectorization is done at each loop, with a brief explanation of each decision. Note that the vectorization report option is -qopt-report=<n>, where n specifies the level of detail.

4.4. Compile with optimization level -O3

Now compile the application with optimization level -O3. This optimization level is for maximum speed and is a more aggressive optimization than the default optimization level -O2.

With auto-vectorization, at each iteration loop, instead of processing one element at a time, the compiler packs 16 single-precision floating numbers in a vector register and performs the operation on that vector.

$ icc myVectorizedApp.c -O3 -qopt-report -qopt-report-phase:vec -o myVectorizedApp

The compiler generates the following output files: a binary, myVectorizedApp, and a vectorization report, myVectorizedApp.optrpt. To run the binary:

$ ./myVectorizedApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
Elapsed time in msec: 30496 (after 10 iterations)

The binary runs with only one thread but with vectorization. The myVectorizedApp.optrpt report should confirm that all the inner loops are vectorized.

To compare, also compile the program with the -no-vec option:

$ icc myVectorizedApp.c -O3 -qopt-report -qopt-report-phase:vec -o myVectorizedApp-noVEC -no-vec
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location

Now run the myVectorizedApp-noVEC binary:

$ ./myVectorizedApp-noVEC
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
Elapsed time in msec: 180375 (after 10 iterations)

This time the myVectorizedApp.optrpt report shows that loops are not vectorized because auto-vectorization is disabled as expected.

We now can observe two improvements. The improvement from the original version (461,222 msec) to the no-vec version (180,375 msec) is basically due to general optimization techniques. The improvement from the version without vectorization (180,375 msec) to the one with vectorization (30,496 msec) is due to auto-vectorization.

Even with this improvement in performance, it is still only one thread performing the calculation. The code can be further enhanced by enabling multiple threads running in parallel to take advantage of the multi-core architecture.

5. Enabling Multi-Threading

5.1. Thread-level parallelism: OpenMP*

To take advantage of the high number of cores on the Intel Xeon Phi processor (68 cores in this system), you can scale the application by running OpenMP threads in parallel. OpenMP is the standard API and programming model for shared memory.

To use OpenMP threads, you need to include the header file "omp.h" and link the code with the flag -qopenmp. In the myParallelApp.c program, the following directive is added before the for-loop:

#pragma omp parallel for simd

This pragma directive added before the for-loop tells the compiler to generate a team of threads and to break the work in the for-loop into many chunks. Each thread executes a number of work chunks according to the OpenMP runtime schedule. The SIMD construct simply indicates that multiple iterations of the loop can be executed concurrently using SIMD instructions. It informs the compiler to ignore assumed vector dependencies found in the loop, so use it carefully.

In this program, thread parallelism and vectorization occur in the same loop. Each thread starts at its own lower bound of the loop. To keep each thread's chunk aligned under OpenMP static scheduling, we limit the parallel loop to N1 iterations, where N1 is a multiple of numthreads * 16, and process the remaining iterations serially.

#pragma omp parallel
#pragma omp master
    {
        int tid = omp_get_thread_num();
 numthreads = omp_get_num_threads();

        printf("thread num=%d\n", tid);
        printf("Initializing\r\n");

// Assuming omp static scheduling, carefully limit the loop-size to N1 instead of N
        N1 = ((N / numthreads)/16) * numthreads * 16;
        printf("numthreads = %d, N = %d, N1 = %d, num-iters in remainder serial loop = %d, parallel-pct = %f\n", numthreads, N, N1, N-N1, (float)N1*100.0/N);
    }

And the function that computes the roots becomes

for (j=0; j<ITERATIONS; j++)
    {
#pragma omp parallel for simd
#pragma vector aligned
        for (i=0; i<serial; i++)   // Perform in parallel fashion
        {
            x1[i] = (- b[i] + sqrtf((b[i]*b[i] - 4.0f*a[i]*c[i])) ) / (2.0f*a[i]);
            x2[i] = (- b[i] - sqrtf((b[i]*b[i] - 4.0f*a[i]*c[i])) ) / (2.0f*a[i]);
        }

#pragma vector aligned
        for( i=serial; i<vectorSize; i++)   // Remainder iterations, processed serially (still vectorized)
        {
            x1[i] = (- b[i] + sqrtf((b[i]*b[i] - 4.0f *a[i]*c[i])) ) / (2.0f*a[i]);
            x2[i] = (- b[i] - sqrtf((b[i]*b[i] - 4.0f *a[i]*c[i])) ) / (2.0f*a[i]);
        }
    }

Now you can compile the program and link it with –qopenmp:

$ icc myParallelApp.c -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o myParallelApp -qopenmp

Check the myParallelApp.optrpt report that confirms all loops are vectorized and parallelized with OpenMP.

5.2. Use environment variables to set the number of threads and to set affinity

The OpenMP implementation can start a number of threads in parallel. By default, the number of threads is set to the maximum hardware threads in the system. In this case, 272 OpenMP threads will be running by default. However, we can use the OMP_NUM_THREADS environment variable to set the number of OpenMP threads. For example, the command below starts 68 OpenMP threads:

$ export OMP_NUM_THREADS=68

Thread affinity (the ability to bind OpenMP thread to a CPU) can be set by using the KMP_AFFINITY environment variable. To distribute the threads evenly across the system, set the variable to scatter:

$ export KMP_AFFINITY=scatter

Now run the program using all the cores in the system and vary the number of threads running per core. Here is the output from our test runs comparing the performance when running 1, 2, 3, and 4 threads per core.

Running 1 thread per core in the test system:

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 1722 (after 10 iterations)

Running 2 threads per core:

$ export OMP_NUM_THREADS=136
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 136, N = 536870912, N1 = 536869248, num-iters in remainder serial loop = 1664, parallel-pct = 99.999690
Starting Compute on 136 threads
Elapsed time in msec: 1781 (after 10 iterations)

Run 3 threads per core:

$ export OMP_NUM_THREADS=204
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 204, N = 536870912, N1 = 536869248, num-iters in remainder serial loop = 1664, parallel-pct = 99.999690
Starting Compute on 204 threads
Elapsed time in msec: 1878 (after 10 iterations)

Running 4 threads per core:

$ export OMP_NUM_THREADS=272
$ ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 272, N = 536870912, N1 = 536867072, num-iters in remainder serial loop = 3840, parallel-pct = 99.999285
Starting Compute on 272 threads
Elapsed time in msec: 1940 (after 10 iterations)

From the above results, the best performance is obtained when one thread runs on one core, and uses all 68 cores.

6. Optimizing the code for the Intel Xeon Phi processor

6.1. Memory bandwidth optimization

There are two different types of memory on the system: the on-package memory 16 GB of MCDRAM and 96 GB of traditional on-platform 6 channels of DDR4 RAM (with option to extend to a maximum of 384 GB). MCDRAM bandwidth is about 500 GB/s while DDR4 peak performance bandwidth is about 90 GB/s.

There are three possible configuration modes for the MCDRAM: flat mode, cache mode, or hybrid mode. If MCDRAM is configured as addressable memory (flat mode), the user can explicitly allocate memory in MCDRAM. If MCDRAM is configured as cache, the entire MCDRAM is used as a last-level cache between the L2 cache and DDR4 memory. If MCDRAM is configured as hybrid, a portion of MCDRAM is used as cache and the rest is used as addressable memory. The pros and cons of these configurations are shown in the table below:

Memory Mode: Flat
  • Pros: User can control MCDRAM to take advantage of the high bandwidth memory
  • Cons: User needs to use numactl or modify the code

Memory Mode: Cache
  • Pros: Transparent to the user; extends the cache level
  • Cons: Potentially increases latency to load/store memory in DDR4

Memory Mode: Hybrid
  • Pros: Application can take advantage of both flat and cache modes
  • Cons: Cons associated with both flat and cache modes

With respect to Non Uniform Memory Access (NUMA) architecture, the Intel Xeon Phi processor can appear as one or two nodes depending on how the MCDRAM is configured. If MCDRAM is configured as cache, the Intel Xeon Phi processor appears as 1 NUMA node. If MCDRAM is configured as flat or hybrid, the Intel Xeon Phi processor appears as 2 NUMA nodes. Note that Clustering Mode can further partition the Intel Xeon Phi processor into up to 8 NUMA nodes; however, in this tutorial, Clustering Mode is not covered.
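
When MCDRAM is addressable (flat or hybrid mode), an application can either be bound to the MCDRAM node with numactl, as this tutorial does later, or be modified to allocate specific buffers there. The following is a hedged sketch of the code-modification route using the memkind library's hbwmalloc interface; it is not part of the attached sample code and assumes libmemkind is installed (link with -lmemkind):

#include <hbwmalloc.h>   /* memkind library's high-bandwidth-memory interface */
#include <stdlib.h>

/* Sketch: try to place a 64-byte-aligned float array in MCDRAM, falling back
   to ordinary DDR4 memory when MCDRAM is not addressable (e.g., cache mode).
   Allocations made with hbw_posix_memalign must be released with hbw_free(). */
static float *alloc_hbw_floats(size_t n)
{
    void *p = NULL;
    if (hbw_check_available() == 0 &&
        hbw_posix_memalign(&p, 64, n * sizeof(float)) == 0)
        return (float *)p;
    if (posix_memalign(&p, 64, n * sizeof(float)) == 0)   /* DDR4 fallback */
        return (float *)p;
    return NULL;
}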

The numactl utility can be used to display the NUMA nodes in the system. For example, executing “numactl -H” on this system, where MCDRAM is configured in flat mode, shows two NUMA nodes. Node 0 consists of the 272 CPUs and the 96 GB of DDR4 memory, while node 1 consists of the 16 GB of MCDRAM.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92888 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15926 MB
node distances:
node 0 1
0: 10 31
1: 31 10

The "numactl" tool can be used to allocate memory in certain NUMA modes. In this example, the node 0 contains all CPUs and the on-platform memory DDR4, while node 1 has the on-packet memory MCDRAM. The switch –m , or –-membind, can be used to force the program to allocate memory to a NUMA node number.

To force the application to allocate DDR memory (node 0), run the command:
$ numactl -m 0 ./myParallelApp

This is equivalent to:
$ ./myParallelApp

Now run the application with 68 OpenMP threads:

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68

$ numactl -m 0 ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 1730 (after 10 iterations)

To display another view of the NUMA nodes, run the command “lstopo”. This command displays not only the NUMA nodes, but also the L1 and L2 cache associated with these nodes.

6.2. Analyze memory usage

Is this application bandwidth bound? Use the Intel VTune Amplifier to analyze memory accesses. DDR4 DRAM peak performance bandwidth is about 90 GB/s (gigabytes per second), while MCDRAM memory peak performance is around 500 GB/s.

Install Intel VTune Amplifier on your system, and then run the following Intel VTune Amplifier command to collect memory access information while the application allocates DDR memory:

$ export KMP_AFFINITY=scatter; export OMP_NUM_THREADS=68; amplxe-cl -collect memory-access -- numactl -m 0 ./myParallelApp

You can view bandwidth utilization of your application by looking at the “Bandwidth Utilization Histogram” field. The histogram shows that DDR bandwidth utilization is high.

Bandwidth Utilization Histogram

Looking at the memory access profile, we observe that the peak DDR4 bandwidth reaches 96 GB/s, which is around the peak performance bandwidth of DDR4. The result suggests that this application is memory bandwidth bound.

Memory access profile peak DDR4 bandwidth reaches 96 GB/s

Looking at the memory allocation in the application, we see that it allocates 5 large arrays of 512 M elements (to be precise, 512 * 1024 * 1024 elements). Each element is a single-precision float (4 bytes), so the total size of each array is about 4 * 512 M, or 2 GB, and the total memory allocation is 2 GB * 5 = 10 GB. This fits well within the MCDRAM (16 GB capacity), so it is likely that allocating memory in MCDRAM (flat mode) will benefit the application.

To allocate the memory in the MCDRAM (node 1), pass the argument –m 1 to the command numactl as below:

$ numactl -m 1 ./myParallelApp
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 498 (after 10 iterations)

We clearly see a significant performance improvement when the application allocates memory in MCDRAM.

For comparison purpose, run the following Intel VTune Amplifier command collecting memory access information while the application allocates MCDRAM memory:

$ export KMP_AFFINITY=scatter; export OMP_NUM_THREADS=68; amplxe-cl -collect memory-access -- numactl -m 1 ./myParallelApp

The histogram shows that DDR bandwidth utilization is low and MCDRAM use is high:

Histogram shows DDR bandwidth utilization is low and MCDRAM use is high

Bandwidth Utilization Histogram

Looking at the memory access profile, we observe that DDR4 peak bandwidth reaches 2.3 GB/s, while MCDRAM peak bandwidth reaches 437 GB/s.

Memory access profile DDR4 peak bandwidth 2.3 GB/s, MCDRAM peak bandwidth 437 GB/s

6.3. Compiling using the compiler knob -xMIC-AVX512

The Intel Xeon Phi processor supports x87, Intel® Streaming SIMD Extensions (Intel® SSE), Intel® SSE2, Intel® SSE3, Supplemental Streaming SIMD Extensions 3, Intel® SSE4.1, Intel® SSE4.2, Intel® Advanced Vector Extensions (Intel® AVX), Intel Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel AVX-512) Instruction Set Architecture (ISA). It does not support Intel® Transactional Synchronization Extensions.

Intel AVX-512 is implemented in the Intel Xeon Phi processor, which supports the following groups: Intel AVX-512F, Intel AVX-512CD, Intel AVX-512ER, and Intel AVX-512PF. Intel AVX-512F (Intel AVX-512 Foundation Instructions) extends the Intel AVX and Intel AVX2 SIMD instructions to 512-bit vector registers; Intel AVX-512CD (Intel AVX-512 Conflict Detection) enables efficient conflict detection to allow more loops to be vectorized; Intel AVX-512ER (Intel AVX-512 Exponential and Reciprocal instructions) provides instructions for base-2 exponential functions, reciprocals, and inverse square roots; and Intel AVX-512PF (Intel AVX-512 Prefetch instructions) is useful for reducing memory operation latency.
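
As a rough illustration of what the 512-bit registers mean for this code, the sketch below adds 16 single-precision floats in one instruction. The tutorial itself relies on compiler auto-vectorization rather than hand-written intrinsics, and the function shown here is hypothetical:

#include <immintrin.h>

/* Sketch: with AVX-512F, one zmm register holds 16 floats, so a single
   instruction performs 16 additions at once. Requires -xMIC-AVX512 (or an
   equivalent AVX-512 target) when compiling. */
void add16(const float *a, const float *b, float *out)
{
    __m512 va = _mm512_loadu_ps(a);                 /* load 16 floats from a */
    __m512 vb = _mm512_loadu_ps(b);                 /* load 16 floats from b */
    _mm512_storeu_ps(out, _mm512_add_ps(va, vb));   /* 16 sums stored to out */
}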

To take advantage of Intel AVX-512, compile the program with the compiler knob -xMIC-AVX512:
$ icc myParallelApp.c -o myParallelApp-AVX512 -qopenmp -O3 -xMIC-AVX512

$ export KMP_AFFINITY=scatter
$ export OMP_NUM_THREADS=68

$ numactl -m 1 ./myParallelApp-AVX512
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 316 (after 10 iterations)

Note that you can run the following command, which generates an assembly file named myParallelApp.s:

$ icc -O3 myParallelApp.c -qopenmp -xMIC-AVX512 -S -fsource-asm

By examining the assembly file, you can confirm that Intel AVX-512 instructions are generated.

6.4. Using the -no-prec-div -fp-model fast=2 optimization flags

If high precision is not required, we can relax the floating-point model by compiling with -fp-model fast=2, which enables more aggressive (but value-unsafe) floating-point optimizations. The compiler then uses faster, less precise implementations of square root and division. For example:

$ icc myParallelApp.c -o myParallelApp-AVX512-FAST -qopenmp -O3 -xMIC-AVX512 -no-prec-div -no-prec-sqrt -fp-model fast=2
$ export OMP_NUM_THREADS=68

$ numactl -m 1 ./myParallelApp-AVX512-FAST
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 310 (after 10 iterations)

6.5. Configuring MCDRAM as cache

In the BIOS setting, configure MCDRAM as cache and reboot the system. The numactl utility should confirm that there is only one NUMA node since the MCDRAM is configured as cache, thus transparent to this utility:

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 94409 MB
node distances:
node 0
0: 10

Recompile the program:

$ icc myParallelApp.c -o myParallelApp-AVX512-FAST -qopenmp -O3 -xMIC-AVX512 -no-prec-div -no-prec-sqrt -fp-model fast=2

And run it:

$ export OMP_NUM_THREADS=68
$ ./myParallelApp-AVX512-FAST
No. of Elements : 512M
Repetitions = 10
Start allocating buffers and initializing ....
thread num=0
Initializing
numthreads = 68, N = 536870912, N1 = 536870336, num-iters in remainder serial loop = 576, parallel-pct = 99.999893
Starting Compute on 68 threads
Elapsed time in msec: 325 (after 10 iterations)

Observe that there is no additional benefit of using MCDRAM as cache in this application.

7. Summary and Conclusion

In this tutorial, the following topics were discussed:

  • Memory alignment
  • Vectorizing
  • Generating the compiler report to assist code analysis
  • Using command line utilities cpuinfo, lscpu, numactl, lstopo
  • Using OpenMP to add thread-level parallelism
  • Setting environment variables
  • Using Intel VTune Amplifier to profile bandwidth utilization
  • Allocating MCDRAM memory using numactl
  • Compiling with the Intel AVX-512 flag to get better performance

The graph below shows the performance improvement for each step from the baseline code: general optimization with data alignment, vectorizing, adding thread-level parallelism, allocating MCDRAM memory in flat mode, compiling with Intel AVX-512, compiling with the reduced-precision floating-point flags, and using MCDRAM as cache.

Performance Improvement (in seconds)

We were able to reduce the execution time significantly making use of all the available cores, Intel AVX-512 vectorization, and MCDRAM bandwidth.


About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel Sample Source Code License Agreement


Meet the Experts - Gergana Slavova


Gergana Slavova
Technical Consulting Engineer

Gergana Slavova received her bachelor’s degree in computer science from the University of Illinois at Urbana-Champaign in 2005. Following graduation, she joined the Intel® Software and Services Group as a Technical Consulting Engineer, a position she has held ever since.

She works in the high performance computing area where she provides technical support, training, and consulting expertise for a suite of distributed memory development tools. Specifically, Gergana works with the Message Passing Interface (MPI) – a programming model that defines the explicit communication rules for a set of parallel processes to solve complex scientific computing problems (e.g., fluid dynamics, structural mechanics, weather forecasting, bioinformatics, and astrophysics).

As a woman in technology, she’s passionate about recruitment and retention of women in the STEM fields. Her efforts vary from teaching middle school girls about what an LED is to speaking at a local Women in Tech conference. Her favorite time of year is when attending the Grace Hopper Celebration of Women in Computing (followed closely by Halloween).

Meet our Modern Code experts

Recipe: Building and Running VLPL-S for Intel® Xeon Phi™ Processors


I. Overview

This article provides a recipe for how to obtain, compile, and run an optimized version of VLPL-S on Intel® Xeon® processors and Intel® Xeon Phi™ processors.

The source for this version of VLPL-S as well as the workload can be obtained by contacting Prof. Minchen at minchen@sjtu.edu.cn

II. Introduction

VLPL-S is an in-house code from HHUD (Germany) and SJTU (PRC), parallelized with MPI and written in C++. The application implements a Particle-in-Cell method for laser plasma simulation by solving the particle motion equations, the current density distribution, and the Maxwell equations. This version of VLPL-S is optimized for performance on both Intel Xeon processors and Intel Xeon Phi processors. Optimizations in this package include:

  • Removing unnecessary computation and memory access
  • Improving cache hit rate by prefetch
  • Avoiding unnecessary precision conversion of constant and function call
  • Improving function call efficiency by removing the virtual function call and inter-procedural optimization
  • Vectorization

III. Preliminaries

  1. To build this package, install Intel® MPI Library 5.1 or higher and Intel® Parallel Studio XE 2016 or higher on your host system.
  2. Contact Prof. Minchen at minchen@sjtu.edu.cn to get the optimized VLPL-S source package and test workload. Please specify that you’d like the Intel Recipes version.
  3. Set up the Intel MPI Library and Intel® Fortran Compiler environments.
     

    > source /opt/intel/compilers_and_libraries_<version>/linux/mpi/bin64/mpivars.sh

    > source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64

  4. To run VLPL-S on the Intel Xeon Phi processor, reboot the system with SNC-4 cluster mode and cache memory mode via BIOS settings. Please refer to Intel® Xeon Phi™ Processor - Memory Modes and Cluster Modes: Configuration and Use Cases for more details on memory configuration.

IV. Build VLPL-S for the Intel Xeon processor

  1. Unpack the source code to any directory under /home/<user>.
     

    > tar xvf VLPL-S.tar.bz2

  2. Build the executables for the Intel Xeon processor.
     

    > cd /home/<user>/VLPL-S

    > cp Makefile.cpu Makefile

    The executable is located at /home/<user>/VLPL-S/ with the name v2d_sjtu.e.cpu.

V. Build VLPL-S for the Intel Xeon Phi processor

  1. Build the executables for the Intel Xeon Phi processor.
     

    > cd /home/<user>/VLPL-S

    > cp Makefile.knl Makefile

    This builds the executable for the Intel Xeon Phi processor; it is located at /home/<user>/VLPL-S with the name v2d_sjtu.e.knl.

VI. Run VLPL-S on the Intel Xeon processor and Intel Xeon Phi processor

  1. Run VLPL-S with the test workload on the Intel Xeon processor.
     

    > cd /home/<user>/VLPL-S

    > mpirun -n 36 ./v2d_sjtu.e.cpu ./v2d.ini

  2. Run VLPL-S with the test workload on the Intel Xeon Phi processor. Make sure all binary and workload files are located on the Intel Xeon Phi (KNL) system.
     

    > cd /home/<user>/VLPL-S

    > mpirun -n 272 ./v2d_sjtu.e.knl ./v2d.ini

VII. Performance gain

For the test workload, the following graph shows the speedup achieved from the Intel Xeon Phi processor, compared to the Intel Xeon processor. As you can see, we get:

  • Up to 1.55x faster with the Intel® Xeon Phi™ processor 7210 compared to the two-socket Intel® Xeon® processor E5-2697 v4.
  • Up to 2.00x faster with the Intel® Xeon Phi™ processor 7250 compared to the two-socket Intel Xeon processor E5-2697 v4.

VLPL-S Performance Improvement with the Intel® Xeon Phi™ Processor

Comments on performance improvement on Intel Xeon Phi:

  • VLPL-S has good parallel scalability and benefits from more cores. The best performance on the Intel® Xeon Phi™ processor 7250 is achieved with 272 MPI ranks, which means it can make good use of all 272 logical cores.
  • The VLPL-S code can be vectorized, so the 512-bit vectors of Intel AVX-512 improve performance significantly.
  • VLPL-S also benefits from MCDRAM because it is memory-bandwidth bound.

Testing platform configuration:

Intel Xeon processor E5-2697 v4: Dual-Socket Intel Xeon processor E5-2697 v4, 2.3 GHz, 18 cores/socket, 36 cores, 72 threads (HT and Turbo ON), DDR4 128 GB, 2400 MHz, Oracle Linux* Server release 6.7

Intel Xeon Phi processor 7210 (64 cores): Intel Xeon Phi processor 7210, 64 cores, 256 threads, 1300 MHz core freq. (HT and Turbo ON), 1600 MHz uncore freq., MCDRAM 16 GB 6.4 GT/s, BIOS 10D42, DDR4 96 GB, 2133 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.

Intel Xeon Phi processor 7250 (68 cores): Intel Xeon Phi processor 7250, 68 cores, 272 threads, 1400 MHz core freq. (HT and Turbo ON), 1700 MHz uncore freq., MCDRAM 16 GB 7.2 GT/s, BIOS 10R00, DDR4 96 GB, 2400 MHz, Red Hat 7.2, SNC-4 cluster mode, MCDRAM cache memory mode.

[Series] Know Your Customer: Defining Audience Personas


No matter what you’re building, whether it’s a racing game or a calendar app, it’s important to remember the customer who’s at the center of it all. Ask yourself who you’re building this for. Your first answer might be yourself—and it’s good to have passion for the product. But your product will be out in the world, so it’s important to think about who’s really going to use it.

When you know your customer, you can understand their behavior—what they want leads to where they are, and that's how you can get your marketing in front of them at the right time and place. In this three-part series, we’ll explore this topic in depth, addressing the different ways to know your customer—and why it’s important.

First, read on for tips and tools for defining your target audience. Come back soon for parts two and three, which will discuss where to find your target and how to convert them into customers.

Why Does It Matter So Much?

When you create something in a vacuum, or build it with only your own needs in mind, you run the risk of creating something no one needs or wants, even if it's a great product. Knowing your customer forms the basis not only for all of your marketing plans, but also for initial development and ongoing iteration. It helps you decide which features to include, how to write your value proposition, how to monetize your product, how to message your product, and how to price it.

Create a Persona

One of the best ways to define your target—and truly understand them—is to create a persona. In other words, rather than a general demographic, such as “Males 18-30,” create a high-level profile of your most likely customer—and then use it as a framework for decision-making. Give your persona details such as:

  • Name
  • Picture you can associate with them
  • Personality
  • Values
  • Attitudes
  • Interests
  • Relevant characteristics
  • Relevant behavior
  • Spare-time activities
  • Common tasks
  • Goals (especially as it relates to your game or product)

By creating a profile that feels like a real person, you’ll be able to sympathize with their problems—putting you in a much better position to help solve them. For example, if you’re developing an app that reminds people when to feed their fish, the persona you create might be: John, 28, programmer, loves tropical fish, works long hours, eats takeout 5 times a week, loves to cross things off of checklists, watches SportsCenter, plays Mobile Strike on his phone to relax.

Download the worksheet below to get started!

Do Your Research

Creating a persona is a great way of defining and knowing your target audience, but don’t forget to research and talk to people. You need to confirm your ideas and make sure your assumptions are correct.

Surveys and Focus Groups – If you have budget for primary research, surveys and focus groups can provide a lot of great information. While focus groups allow you to talk in more depth to a smaller group of people, surveys can be distributed more widely—but both can be really helpful in defining who your target is and how you can expect them to behave.

Published Studies – Don’t assume that you’re out of luck if you aren’t able to conduct your own research. Especially when you're new, it's a good idea to take advantage of the information that already exists. Research on the latest consumer behavior and habits is published every year, and there are a lot of experts writing and publishing their thoughts. Look for guidance in some of these places:

  • Industry groups
  • Competitors
  • Google
  • SlideShare
  • Social networks
  • Event organizations (E3, etc)
  • Business school case studies

Now’s the Time to Get to Know Them!

The sooner you define your target audience, the better. So if you’re still thinking through initial ideas, great—you can develop your entire product with your customer in mind. If development is already well underway, or you’ve already released, no need to panic. Now is still the right time, and you can use this thinking to guide your iterations.

Where Do You Go from Here?

The power of creating a persona is that it makes it easier to understand and anticipate consumer behavior. Not just who the customer is, but what their behavior is likely to be. Once you’ve gone through the initial definition process, it’s time to turn toward the customer journey—or the path your target takes as they discover, become interested in, and hopefully buy your product. This will help you anticipate what they may need at each step along the way—and better meet those needs. Check back soon for Part 2 of this series: Know Your Customer! Pick Your Channels.

 

Have you ever created a persona? Tell us about it in the comments!

Intel® Threading Building Blocks Celebrates 10 Years!


The new special issue of The Parallel Universe magazine celebrates the 10th anniversary of what some have called the most important new addition to parallel programming in the last decade: Intel® Threading Building Blocks (Intel® TBB).

Intel® TBB is a widely used, award-winning C++ library for creating high-performance, scalable parallel applications. It includes a rich set of components to efficiently implement higher-level, task-based parallelism and is compatible with multiple compilers and portable to various operating systems.

In this don’t-miss special issue, you’ll learn how this important tool suite came to be, what it’s done for parallel programing, and where it’s going in the future. You’ll also get the inside view from the experts who’ve helped develop Intel TBB over the years.

Join the celebration!

Read it now >

Get previous issues >

SDN, NFV, DPDK, ONP, OPNFV and more!


Why does Intel contribute to the SDN and NFV communities?

Software Defined Networking (SDN) and Network Functions Virtualization (NFV) are moving out of the labs and into production deployments as a cheaper, faster, more flexible alternative to traditional hardware network 'appliances'.  

Just as virtualization has changed OS and application rollouts, NFV at layers 4 and above (control plane) and SDN at layers 2 and 3 (controlling packet movement) are revolutionizing the deployment and management of network traffic on off-the-shelf hardware and branded or open source OSs. This is where OPNFV, including the Intel® Open Network Platform (ONP) Server (reference architecture), can quickly get you started in the design and testing of your network. Just use the instructions at https://software.intel.com/en-us/articles/quick-set-up-onp for initial setup on standard off-the-shelf server platforms, ranging from Intel® Atom™ to Intel® Xeon® processors.

What are all these Open names?

The network and virtualization infrastructure tools define an information model, a set of APIs, and control protocols such as OpenFlow (a communication protocol between the control and forwarding layers) developed for OpenStack* (and branded) OS.

  • OpenStack* (Juno*-Mitaka) provides the framework to create and manage VMs (virtual machines). VMs are the base OSs for the virtualized functions and can have multiple virtual network interfaces.
  • OpenStack Neutron* is the networking component that abstracts Linux network configurations using a common API wrapped around the network function solutions (Open vSwitch, VLANs, iptables/netfilter, etc.).
  • OpenDaylight* (Helium, Lithium, Beryllium) provides the code and architecture for virtualizing the network controller (control plane functions for configuration, monitoring, and management).
  • Open vSwitch* (OVS) 2.5.0 is a production-quality, multilayer, virtual network switch. (OVS can be a node connected to an OpenDaylight controller.)
  • Data Plane Development Kit (DPDK) v16.04 is a set of data plane libraries and NIC drivers that provide a programming framework for fast, high-speed network packet processing on general-purpose processors.

Note: Mitaka contains OVS 2.5 accelerated with DPDK 2.2

https://software.intel.com/en-us/articles/using-open-vswitch-with-dpdk-on-ubuntu

Intel® Open Network Platform (ONP) Server is a reference design providing scripts and more to quickly set up a test network. It is based on OPNFV, which is a hardware and software reference for the advancement of NFV.

Intel Technologies that boost performance

Intel (including Wind River) is a major contributor to the DPDK and to Linux. Intel recently merged the Intel® DPDK vSwitch into the Open vSwitch main branch so that Neutron can use Intel’s accelerated packet processing while avoiding proprietary plugins. Building the switching logic on top of the Intel DPDK library gives a significant boost to packet switching throughput, which can be realized in both the host and guests of the OpenStack network compute nodes.

The Intel DPDK also adds samples of L3 forwarding, load balancing, and timers, all of which can help reduce development time. It also exposes resources as multiple virtual functions making them available to multiple VMs and available to speed up inter-VM communication.

Additionally Intel is prototyping Open NFV (OPNFV) concepts using the OpenDaylight platform to leverage Intel’s network performance enhancements.

Intel® QuickAssist Acceleration Technology provides accelerator services (encryption, compression and algorithm offload) for up to 14 separate virtualized instantiations. Intel QuickAssist Technology is available on the Intel® EP80579 Integrated Processor, the Intel® Xeon® Processor E5-2600 and E5-2400 Series, and Intel® Core™, Intel® Pentium® and Intel® Celeron® Processors with Intel® Communications Chipset 89xx Series.

Intel is also providing the Intel® Open Network Platform Server. The Intel® ONP Server Reference Architecture provides hardware that is optimized for a comprehensive stack of the latest open source software releases, as a validated template to enable rapid development. The Reference Architecture includes architecture specs, test reports, scripts for optimization, and support for Intel network interfaces from 1GbE to 40GbE (FTXL710-AM2 4x10GbE).

What SDN/NFV applications are available?

Many solution partners are listed at  https://networkbuilders.intel.com/solutionslibrary

Where can I find more information? 

Intel maintains several sites including: 

An Introduction to Intel® Active Management Technology Wireless Connections


Introduction to Intel® Active Management Technology Wireless

With the introduction of wireless-only platforms starting with Intel Active Management Technology (Intel® AMT) 10, it is even more important for an ISV to integrate support for wireless management of AMT devices.

The wireless feature of Intel AMT works like any other wireless connection; the initial connection is not established automatically. However, there are several major differences between wired and wireless Intel AMT communication, including the following:

  • Wireless Intel AMT interfaces are disabled by default and must be enabled and configured with a wireless profile (friendly name, SSID, passwords, encryption, and authentication at a minimum) which is pushed to the client using one of several methods.
  • Where a wired interface is shared by the host OS and Intel AMT (two different IP addresses), the wireless interface must use a single DHCP-assigned IP address and is controlled by the OS unless the host is deemed unavailable, in which case control is given to the Intel AMT firmware.

This article will address the Intel AMT wireless configuration and describe how to handle the various aspects that are important for a clean integration.

Intel AMT Wireless Support Progression for Intel AMT 2.5 through 11

  • Intel AMT 2.5 and 2.6: Wireless is supported only when the OS is in a powered-on state (S0).
  • Intel AMT 4.0: Wireless is supported in all sleep states (Sx) but depends on configuration settings (Note: Intel AMT 5.0 did not support wireless).
  • Intel AMT 6: Syncs Intel AMT and host OS wireless profiles.
  • Intel AMT 7.0: Wireless is supported and host-based configuration is available; however remote configuration is not available over wireless.
  • Intel AMT 9.5: First release to support wireless-only platforms. USB provisioning is not supported on these devices.

Understanding Intel AMT Wireless Connection Requirements

The connection parameters for an Intel AMT wireless device closely resemble those required for the Host OS connection. The firmware requires information including SSID, the authentication method, encryption type, and passphrase at a minimum. In more advanced connections, 802.1x profile information is also required.

All these settings are wrapped into a profile, which is either an Admin or a User profile, and saved within the Intel AMT firmware. Admin (IT) profiles are added to the firmware using Intel AMT APIs; see the list of configuration methods below. User profiles cannot be added to the MEBx via an Intel AMT API; they are created using the Intel AMT WebUI, or with profile syncing using the Intel® PROSet wireless drivers.

The Intel AMT firmware holds a maximum of 16 profiles in total, of which at most 8 can be user profiles. When a ninth user profile is added, the oldest user profile is overwritten. Admin and user profiles together cannot exceed the 16-profile limit.

Key Differences between Wired and Wireless Intel AMT Support

  • Default state. The wireless management interface is initially disabled and must be enabled in addition to creating and deploying the wireless profile. In contrast, wired connections are on by default.
  • Network type. Only infrastructure network types are supported by Intel AMT, not ad hoc or peer-to-peer networks.
  • DHCP dependence. While wired Intel AMT connections support either DHCP or static IP assignment, wireless AMT connection requires DHCP, and it will share its IP with the host OS.
  • Power state limitations. Wireless Intel AMT is only available when the system is plugged into AC power and in the S0 or S5 state.
  • Microsoft Active Directory* integration. 802.1x wireless authentication requires Active Directory integration with the Intel® Setup and Configuration Software (Intel® SCS.)
  • OS control of packets. On the wireless connection, all traffic goes directly to the OS (which can then forward it to Intel AMT), unless the OS is off, failed, or in a sleep state. In those cases manageability traffic goes directly to Intel AMT, which means that when the host returns to S0 or the driver is restarted, Intel AMT must return control to the host, or the host will not have wireless connectivity. This affects remote connections to Intel AMT including IDE-R and KVM.  See Link Preference details below (added in 6.0 and automated in 8.1)
  • Wired-only Intel AMT features are not supported on wireless-only platforms: Heuristic Policies, Auto-Synch of IP Addresses, Local Setup Directly to Admin Control Mode, and 802.1x Profile Configuration.

Configuration Methods

Basic configuration of wireless for Intel AMT is covered in the article: Intel® vPro™ Setup and Configuration Integration, but here is additional information specific to wireless setup.

Wireless profiles can be placed in the Intel AMT firmware several ways. However, any system that is wireless only (no RJ45 connector) cannot be provisioned by a USB key.

  • Initial Intel AMT configuration
    • Profile type: Admin or Client, basic or advanced 802.1x
    • Tools available: Acuconfig, ACUWizard or Intel SCS
  • Intel AMT WebUI
    • Profile type: User, basic only.
    • Tool used: For web browser, use http://<IPorFQDNofDevice>:16992, or for TLS use https://<FQDNofDevice>:16993

Intel® AMT

  • Delta configuration
    • Profile type: Admin for reconfiguring specific settings only
    • Tools available: Acuconfig, ACU Wizard, or Intel SCS
  • Wi-Fi profile syncing  (Intel AMT 6.0 and later)
    • Profile type: User
    • Requires Intel® PROSet wireless drivers and the Intel® AMT Local Manageability Service (LMS)
    • Enables or disables synced OS and AMT wireless profiles (during configuration).
  • WS-Management
    • Profile type: Admin
    • Tools: Intel® vPro PowerShell module, WirelessConfiguration.exe, WS-Man custom using CIM_elements

Connection Types - Authentication/Encryption

Intel AMT supports several authentication and encryption types for wireless connections.

  • User profiles can be configured with WEP or no encryption.
  • Admin profiles must be TKIP or CCMP with WPA or higher security.
  • 802.1x profiles are not automatically synchronized by the Intel PROSet wireless driver

Table 1 shows the possible security settings for Intel AMT wireless profiles.

Security option                                      None  WEP  TKIP  CCMP
Open System                                           X     X
Shared Key                                            X     X
Wi-Fi* Protected Access (WPA) Pre-Shared Key (PSK)                X     X
WPA IEEE 802.1x                                                   X     X
WPA2 PSK                                                          X     X
WPA2 IEEE 802.1x                                                  X     X

Table 1: Security settings for Intel® Active Management Technology wireless profiles.

Settings to Ensure Connectivity during Remote Connection

Link Control and Preference

In a typical Intel AMT remote power management command, the Intel AMT system is rebooted immediately. With a wireless KVM session, the session is dropped because control of the wireless interface is not passed from the OS to the firmware; it can then take up to two minutes for the Intel AMT wireless connection to be reestablished.

To prevent this connectivity loss, the preferred method is to programmatically perform the change of link control prior to making the power control request.

For additional Information see my blogs: KVM and Link Control and more general Link Preference and Control.

TCP Time Outs

During link control changes and power transitions, wireless connectivity is temporarily lost. If that interruption lasts too long, sessions created using the redirection library will be terminated because the HB setting within the redirection library is exceeded (see Table 2).

Time Out                          Default Value    Suggested Value
Hb (client heartbeat interval)    5 seconds        7.5 seconds
RX (client receive)               2 x Hb           3 x Hb

Table 2: TCP default and suggested changes.

Currently the default session time-out setting works most of the time. However, we now recommend changing the HB interval and the client receive interval by adding parameters during calls to the redirection library. These time-out values need to be applied to both the IDER TCP and SOL TCP sessions. For additional information, see IMR_IDEROpenTCPSession or IMR_SOLOpenTCPSessionEx.

Wireless Link Policy

Another aspect is the wireless power policy of the firmware. This policy governs power control in different sleep states. The allowable values are Disable, EnableS0, and EnableS0AndSxAC. These settings are usually set during configuration. However, identifying whether an Intel AMT client will be able to maintain connectivity after a reboot or power down helps set technician expectations of client behavior.

To query the Wi-Fi link policy, use the HLAPI.Wireless.WiFiLinkPolicy enumeration.

To set the Wi-Fi link policy, use the HLAPI.Wireless.IWireless.SetWiFiLinkPolicy method of the Intel AMT HLAPI.

Summary

Intel AMT wireless functionality may be called a feature, but it should be a cornerstone for any integration of Intel AMT functionality into a console application. Without this integration, many devices will not be manageable, given the introduction of wireless-only platforms with Intel AMT version 10.

A successful basic integration is composed of several factors: Intel AMT wireless configuration, connection verification for wired or wireless, and wireless link control operations.

Resource Lists

About the Author

Joe Oster has been working with Intel® vPro™ technology and Intel AMT technology since 2006. When not working, he spends his time on his family’s farm or flying drones and RC aircraft.

Arduino 101* Bluetooth* Low Energy


Introduction

Bluetooth* Low Energy (Bluetooth LE or BLE) is a low-power, short-range wireless communication technology that is ideal for use on the Internet of Things (IoT). BLE is designed for small, discrete data transfers, providing a fast connection between client and server and a simple user interface, which makes it ideal for control and monitoring applications. The Arduino 101* includes on-board Bluetooth LE to enable developers to interact with Bluetooth-enabled devices such as phones and tablets. We will discuss how to create a BLE service and communicate with an Android device, and we set up a BLE blood pressure monitoring system to demonstrate the BLE capabilities of the Arduino 101*.

Hardware Components

The hardware components used in this project are listed below:

This project will use the angle rotary sensor from the Grove kit, as shown with the other components in Figure 1.

For details on installing the Intel® Curie Boards and setting up the software development environment for the Arduino 101* platform, go to https://software.intel.com/en-us/articles/fun-with-the-arduino-101-genuino-101.


Figure 1: Arduino 101* with rotary angle sensor.

Central and Peripheral Devices

Bluetooth LE supports two major roles for networked devices: central and peripheral.

Central: A Bluetooth device such as smart phone, tablet, or PC that initiates an outgoing connection request to an advertising peripheral device. Once connected to the peripheral, the central device can exchange data, read values from the peripheral device, and execute commands on the peripheral devices.

Peripheral: A BLE device that accepts an incoming connection request after advertising. It gathers and publishes data for other devices to consume.

The central device communicates with peripherals through advertising packages. Peripheral devices send out the advertisements, and the central device scans for advertisements.


Figure 2: Central and peripheral device communication.

Generic Attribute Profile (GATT)

The Arduino 101* Bluetooth LE implementation is based on the Generic Attribute Profile (GATT) architecture. GATT defines a hierarchical data structure that is exposed to connected Bluetooth LE devices. The GATT profile is a way of specifying small data transmissions over the BLE link; these small data transmissions are called attributes. GATT is built on top of the Attribute Protocol (ATT), which transports the attributes, and the attributes are formatted as characteristics and services. To learn more about Bluetooth LE and GATT, see https://www.bluetooth.com/what-is-bluetooth-technology/bluetooth-technology-basics/low-energy and https://www.bluetooth.com/specifications/gatt.

Peripheral Data Structure

In the GATT architecture, data is organized into services and characteristics. A Service is a set of features that encapsulate the behavior of the peripheral device. Characteristics are defined attributes of the service that provide additional information about it. For example, the characteristics of the blood pressure service are blood pressure measurement, intermediate cuff pressure, and blood pressure feature.


Figure 3: Bluetooth service and characteristics relationship.

Creating a Blood Pressure BLE Service

To create a BLE service, you’ll need to know the service number and a corresponding characteristic number. On the Bluetooth page, choose GATT Specifications -> Services for the full list of GATT-based services.


Figure 4: Bluetooth GATT Specification pull down menu.

Select the blood pressure service and get the service number for the BLEService constructor.


Figure 5: Bluetooth services.

On the Bluetooth page, select GATT Specifications -> Characteristics to access the blood pressure characteristics number.


Figure 6: Bluetooth characteristics.

Next, include the Arduino 101* CurieBLE library components to enable communication and interaction with other Bluetooth* devices. You can find the open-source CurieBLE library at https://github.com/01org/corelibs-arduino101.

#include <CurieBLE.h>
BLEPeripheral blePeripheral;                // BLE Peripheral Device
BLEService bloodPressureService("1810");    // Blood Pressure Service
// BLE Blood Pressure Characteristic
BLECharacteristic bloodPressureChar("2A35",  // standard 16-bit characteristic UUID
BLERead | BLENotify, 2);     // remote clients will be able to
// get notifications if this characteristic changes

Set a local name for the peripheral BLE device. When the phone (central device) connects to this peripheral Bluetooth* device, the local name will appear on the phone to identify the connected peripheral.

blePeripheral.setLocalName("BloodPressureSketch");
blePeripheral.setAdvertisedServiceUuid(bloodPressureService.uuid());  // add the service UUID
blePeripheral.addAttribute(bloodPressureService);   // Add the BLE Blood Pressure service
blePeripheral.addAttribute(bloodPressureChar);       // add the blood pressure characteristic

Connect the blood pressure device to analog pin A0 of the Arduino 101* platform. For this example, use the angle rotary sensor to simulate the blood pressure device.

int pressure = analogRead(A0);
int bloodPressure = map(pressure, 0, 1023, 0, 100);

Update the blood pressure measurement characteristic. This updated blood pressure value will be seen by the central device. For example, if the phone is connected to the peripheral blood pressure device, the phone can read the updated blood pressure value through an Android app.

bloodPressureChar.setValue(bloodPressureCharArray, 2);

Android Device Communicates with Arduino Sensor

The peripheral communicates with the central Android device through Bluetooth* advertising. In advertising, the peripheral device broadcasts packages to every device around it. The central device scans and connects to the peripheral device to receive data and get more information. Follow the steps below to enable communication between the Android device and Arduino sensor.

  • Enable Bluetooth on the Android device.
  • There are many free BLE Android apps available on Google Play. Search for BLE on Google Play and install a BLE Android app on the Android device.
  • Start the BLE Android app.
  • Scan and connect to the BloodPressureSketch peripheral.
  • Read or write the blood pressure value.

Figure 7 shows an example of an Android device scan for the BloodPressureSketch peripheral.


Figure 7: Android device scans for BLE service.

Turn the rotary angle sensor to see the blood pressure value change on the screen of the Android device.

Develop an Android Application

Visit http://developer.android.com/guide/topics/connectivity/bluetooth-le.html for detailed information on developing your own Android app that communicates with the peripheral through the Arduino 101* platform. If you are new to Android, go to https://developer.android.com/training/basics/firstapp/index.html for instructions on creating an Android project and building your first Android app.

Example Arduino IDE Sketch

Code sample 1, below, provides sample code for blood pressure measurement. Open the serial console to see the resulting output.

#include <CurieBLE.h>

/*
   This sketch example partially implements the standard Bluetooth Low-Energy Battery service.
   For more information: https://developer.bluetooth.org/gatt/services/Pages/ServicesHome.aspx
*/

/*  */
BLEPeripheral blePeripheral;                // BLE Peripheral Device (the board you're programming)
BLEService bloodPressureService("1810");    // Blood Pressure Service


// BLE Blood Pressure Characteristic
//BLECharacteristic bloodPressureChar("2A49",  // standard 16-bit characteristic UUID
BLECharacteristic bloodPressureChar("2A35",  // standard 16-bit characteristic UUID
    BLERead | BLENotify, 2);     // remote clients will be able to
// get notifications if this characteristic changes

int oldBloodPressure = 0;   // last blood pressure reading from analog input
long previousMillis = 0;    // last time the blood pressure was checked, in ms

void setup() {
  Serial.begin(9600);       // initialize serial communication
  pinMode(13, OUTPUT);      // initialize the LED on pin 13 to indicate when a central is connected

  /* Set a local name for the BLE device
     This name will appear in advertising packets
     and can be used by remote devices to identify this BLE device
     The name can be changed but may be truncated based on space left in the advertisement packet */
  blePeripheral.setLocalName("BloodPressureSketch");
  blePeripheral.setAdvertisedServiceUuid(bloodPressureService.uuid());  // add the service UUID
  blePeripheral.addAttribute(bloodPressureService);   // Add the BLE Blood Pressure service
  blePeripheral.addAttribute(bloodPressureChar); // add the blood pressure characteristic

  const unsigned char charArray[2] = { 0, (unsigned char)0 };
  bloodPressureChar.setValue(charArray, 2);   // initial value for this characteristic

  /* Now activate the BLE device.  It will start continuously transmitting BLE
     advertising packets and will be visible to remote BLE central devices
     until it receives a new connection */
  blePeripheral.begin();
  Serial.println("Bluetooth device active, waiting for connections...");
}

void loop() {
  // listen for BLE peripherals to connect:
  BLECentral central = blePeripheral.central();

  // if a central is connected to peripheral:
  if (central) {
    Serial.print("Connected to central: ");
    // print the central's MAC address:
    Serial.println(central.address());
    // turn on the LED to indicate the connection:
    digitalWrite(13, HIGH);

    // check the blood pressure measurement every 200ms
    // as long as the central is still connected:
    while (central.connected()) {
      long currentMillis = millis();
      // if 200ms have passed, check the blood pressure measurement:
      if (currentMillis - previousMillis >= 200) {
        previousMillis = currentMillis;
        updateBloodPressure();
      }
    }
    // when the central disconnects, turn off the LED:
    digitalWrite(13, LOW);
    Serial.print("Disconnected from central: ");
    Serial.println(central.address());
  }
}

void updateBloodPressure() {
  /* Read the current voltage measurement on the A0 analog input pin.
     This is used here to simulate the blood pressure.
  */
  int pressure = analogRead(A0);
  int bloodPressure = map(pressure, 0, 1023, 0, 100);

  // If the blood pressure has changed
  if (bloodPressure != oldBloodPressure) {
    Serial.print("The current blood pressure is: ");
    Serial.println(bloodPressure);
    const unsigned char bloodPressureCharArray[2] = { 0, (unsigned char)bloodPressure };
    // Update the blood pressure measurement characteristic
    bloodPressureChar.setValue(bloodPressureCharArray, 2);
    // Save the measurement for next comparison
    oldBloodPressure = bloodPressure;
  }
}

Code Sample 1: Blood pressure sample code for Arduino IDE.

Summary

This document summarized how the Arduino 101* platform communicates with an Android device and described the steps for creating the BLE Service. See https://www.arduino.cc/en/Reference/CurieBLE for more examples of using BLE in the Arduino IDE. If you are interested in the Arduino 101* platform, browse to http://www.intel.com/buy/us/en/product/emergingtechnologies/intel-arduino-101-497161 for more information.

Helpful References

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.


A Support Vector Machine Implementation for Sign Language Recognition on Intel Edison.


Currently, more than 30 million people in the world have speech impairments and must use sign language to communicate, resulting in a language barrier between sign language users and non-sign language users. This project explores the development of a sign-language-to-speech translation glove by implementing a Support Vector Machine (SVM) on the Intel Edison to recognize various letters signed by sign language users. The data for the predicted signed gesture is then transmitted to an Android application, where it is vocalized.

The Intel Edison is the preferred board of choice for this project primarily because:

  1. The huge and immediate processing needs of the project: Support Vector Machines are machine learning algorithms that require a lot of processing power and memory, and the output is needed in real time.
  2. The inbuilt Bluetooth module on the Edison is used to transmit the predicted gesture to the companion Android application for vocalization.

The project source code can be downloaded as a zip package from the download link in this article.

Sign Language glove

1. The Hardware Design

The sign language glove has five flex sensors mounted on each finger to quantify how much a finger is bent. 

Flex sensors are sensors that change their resistance depending on the amount of bend on the sensor. They convert the change in bend to electrical resistance - the more the bend, the more the resistance.

flex sensor

The flex sensors used on this project are the unidirectional 4.5” Spectra Symbol flex sensors. Flex sensors are analog resistors that work as variable voltage dividers. 

Below is the design of the glove PCB in KiCad

pcb design

We use the flex sensor library in the Intel XDK IoT Edition to read values from each of the flex sensors.

var flexSensor_lib = require('jsupm_flex');
var Flex1 = new flexSensor_lib.Flex(4);

We want the data from all the flex sensors in a standardized format, since the raw readings span a wide range and are difficult to interpret. To achieve this, we first establish the minimum and maximum bend values for each flex sensor and then use these values to scale the readings to between 1.0 and 2.0. The snippet below shows how this is achieved for one of the sensors.

var ScaleMin = 1.0;
var ScaleMax = 2.0;
var flexOneMin = 280;
var flexOneMax = 400;

var flex1 = (scaleDown(Flex1.value(), flexOneMin, flexOneMax)).toFixed(2);

function scaleDown(flexval, flexMin, flexMax) {

    var new_val = (flexval - flexMin) / (flexMax - flexMin) * (ScaleMax - ScaleMin) + ScaleMin;

    return new_val;
}

We then pass these values to our support vector classifier.

2. Support Vector Machine Implementation

Support Vector Machines (SVMs) are supervised machine learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new examples.

We used node-svm, a JavaScript implementation of LIBSVM, the most popular SVM library. To install the library, run:

npm install node-svm

We then copy the node-svm library build folder into our project directory.

Additionally, to use node-svm you will have to install all the required npm packages used by the library: stringify-object, mout, graceful-fs, optimist, osenv, numeric, q, and underscore. You do this by running:

npm install <package name>

1. We create the classifier, setting all the required kernel parameters

var clf = new svm.CSVC({
    gamma: 0.25,
    c: 1,
    normalize: false,
    reduce: false,
    kFold: 2 // enable k-fold cross-validation
});

The parameter C controls the tradeoff between errors of the SVM on the training data and margin maximization; it is used during the training phase and determines how strongly outliers are taken into account in calculating the support vectors. In LIBSVM, the default value for gamma is the inverse of the number of features. The best values for the C and gamma parameters are determined using a grid search; a sketch of how this can be done with node-svm follows. We do not perform dimensionality reduction, as each of the values (dimensions) from the 5 flex sensors is important in predicting signed gestures.
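
The following sketch is our own addition (not part of the original project) and relies on node-svm accepting arrays of candidate values for c and gamma, scoring every combination with k-fold cross-validation during training and keeping the best pair. The candidate grids shown here are hypothetical and should be tuned for your own dataset.

var svm = require('node-svm');

// Hypothetical candidate grids for C and gamma; every combination is scored
// with k-fold cross-validation and the best-scoring pair is used for the model.
var tunedClf = new svm.CSVC({
    c: [0.25, 1, 4, 16],
    gamma: [0.0625, 0.25, 1],
    normalize: false,
    reduce: false,
    kFold: 4
});

svm.read('training.ds')
    .then(function (dataset) {
        return tunedClf.train(dataset);
    })
    .spread(function (model, report) {
        // The report summarizes the cross-validation results for the selected parameters.
        console.log('Grid search finished:\n%s', JSON.stringify(report, null, 2));
    })
    .done();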

2. We build the model by training the classifier using a training dataset and generate a training report. Training the classifier should take a few seconds.

svm.read(fileName)
    .then(function (dataset) {
        return clf.train(dataset)
            .progress(function (progress) {
                console.log('training progress: %d%', Math.round(progress*100));
            });
    })
    .spread(function (model, report) {
        console.log('SVM trained. \nReport:\n%s', so(report));
    }).done(function () {
        console.log('Training Complete.');
    });

3. We then use the classifier to synchronously predict the signed gesture. The classifier accepts a 1-D array as input to make the predictions. We pass the flex sensor values as the parameters.

prediction = clf.predictSync([flex1, flex2, flex3, flex4, flex5]);

We can also get the probability of each of the predicted gestures at each instance by running the following:

probability = clf.predictProbabilitiesSync([flex1, flex2, flex3, flex4, flex5]);

The predicted symbol is transmitted to the Android device each time the application receives a read request from the Android app.

Creating the training file.

The training.ds file, which is the training file for the project, has 832 lines of training data. It would be time-consuming to manually key in all these values, so this was done using the code snippet below from logtrainingdata.js.

var data = "X" + " " + "1:" + flex1 + " " + "2:" + flex2 + " " + "3:" + flex3 + " " + "4:" + flex4 + " " + "5:" + flex5 + "\n";
//X is the current alphabet letter we are training on. We represent these as numbers e.g. A=0, B=1, C=2…
//append the data to the dataset file
fs.appendFile('training.ds', data, function(err) {
    if (err) {
        console.log(err)
    }
});

Screenshot of a small section of the training data

3. Running the program

We need to turn on the Bluetooth radio on the Edison before we begin advertising. To do this we run:

rfkill unblock bluetooth
killall bluetoothd
hciconfig hci0 up

You can verify that the Bluetooth radio is up with the command below which will return the Bluetooth MAC address of the Edison.

hcitool dev

We then execute the JavaScript program with:

node main.js

4. Android Companion App

The Android companion application uses the Android text-to-speech engine to vocalize the various predicted gestures. The application presents the user with the option of testing and setting the language locale, voice speed, and voice pitch, as shown in the screenshots. The settings are stored in SharedPreferences. The user also sets the interval of read requests to the Edison for getting the advertised characteristics. It is important that users set their own preferred interval, since a user just beginning to learn sign language will sign at a slower rate than a sign language expert.

companion app screenshot

In the next activity, the user scans and connects to the Intel Edison device, which is currently advertising the predicted gestures. They then proceed to the activity where the predicted gestures are displayed and vocalized, as shown in the video below.

 

Roy Allela is an intern with the Software and Services Group at Intel.

Retired Articles Related to the Intel® C++ and Fortran Compilers

Selftime-based FLOPS computing (Vectorization Advisor)


Let's talk about important specifics of computing FLOPS that can significantly affect FLOPS data interpretation, especially in the Roofline chart.

At the moment, we compute FLOPS using self time (noninclusive). This means that if you have nested loops, the FLOPS and arithmetic intensity computed for the outer loop do not include operations happening in the inner loop. Our recommendation is to use FLOPS and Roofline information for outer loops with these specifics in mind.

This becomes trickier when you call functions inside your loop. Again, a noninclusive approach is used, so operations that happen inside such a function are not counted toward the loop's FLOPS and arithmetic intensity; the self time of the loop is used to compute the loop's FLOPS. Roofline analysis results for such a loop can therefore lead to the wrong conclusion and action plan.

Consider the following example where a modified matrix multiplication is used.

double compute(double a, double b)
{
	double factor = a/b;
	return (((((1+factor)*factor+1)*factor+1)*factor+1)*factor+1)*factor+1;
}
void multiply_d_noinline(int arrSize, double **aMatrix, double **bMatrix, double **product)
{
    for(int i=0;i<arrSize;i++) {
        for(int j=0;j<arrSize;j++) {
            double sum = 0;
            for(int k=0;k<arrSize;k++) {
#pragma noinline
                sum += compute(aMatrix[i][k],bMatrix[k][j]);
            }
	    product[i][j] = sum;
        }
    }
}

void multiply_d_inline(int arrSize, double **aMatrix, double **bMatrix, double **product)
{
    for(int i=0;i<arrSize;i++) {
        for(int j=0;j<arrSize;j++) {
            double sum = 0;
#pragma novector
            for(int k=0;k<arrSize;k++) {
                sum += compute(aMatrix[i][k],bMatrix[k][j]);
            }
	    product[i][j] = sum;
        }
    }
}

In the multiply_d_noinline function, most of the computation is performed in the compute routine called from the innermost computational loop, so all of those computational and memory operations are excluded from the loop's FLOPS and arithmetic intensity. In the multiply_d_inline function, all computations are inlined and are included in the FLOPS metric calculation.
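
As a rough, hand-counted illustration (our own estimate, not Advisor output): each call to compute() performs one division, five multiplications, and six additions, roughly 12 floating-point operations. In the noinline version these operations are charged to the self time of compute() rather than to the innermost loop, so the loop itself appears to do almost no floating-point work; in the inlined version the same operations count toward the loop's FLOPS and arithmetic intensity.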

Let's see what the roofline plot looks like.

Significant difference in the position of inlined and not inlined loops on the roofline plot

Although both loops do similar compute work, their positions on the plot diverge considerably. Moreover, interpreting the roofline plot suggests that the non-inlined version of the loop is memory bound and requires better cache usage, while the inlined loop is compute bound and vectorization is essential for performance improvement. What is actually required is vectorization improvement for both loops. To account for these specifics, we recommend enabling "Loops and Functions" filtering in the filter bar. An extra dot then appears on the chart representing the FLOPS of the compute function. So, when interpreting roofline data for a loop with nested calls, you should take into account not only the loop itself but also all the nested calls.

Roofline functions view is enabled

The sample code used in this article can be downloaded from the following link.

Getting started with Intel® Advisor Roofline feature


This document describes the suggested scenario for using the Roofline feature of Intel® Advisor.

  1. Intel® Advisor XE can be run as a standalone GUI tool, as an integrated Visual Studio plug-in on Windows, or from the command line. If you plan to use the command line or the standalone GUI, run advixe-vars.bat (on Windows) or source advixe-vars.sh (on Linux) to set up the environment variables. (A command-line sketch of the two collections used below appears after this list.)
  2. To run the standalone GUI, use the advixe-gui.exe or advixe-gui command.
  3. For the standalone tool, create an Advisor project. Make sure the checkbox “Collect information about FLOPS…” is checked on the “Survey Trip Count Analysis” project settings page. (Figure: Advisor Project Settings)
  4. In Visual Studio, you can configure the project settings by pressing the toolbar button.
  5. To get Roofline analysis results, you have to run two different analyses of the Vectorization workflow. First, run the Survey analysis to collect information about program structure and loop execution times: press the Collect button below the Survey Target analysis, or press the Run Survey button on the toolbar.
  6. After the Survey analysis finishes, you’ll have general information about the loops in your program. Different types of information are collected for every loop. Refer to the Advisor documentation for help on what is displayed in the grid and how to use it to improve the performance of your application. (Figure: Advisor Survey Report page)
  7. The second analysis to run is the "Trip Counts and FLOPS" analysis. Run it to collect the number of floating-point operations and memory operations: press the Collect button for “Find Trip Counts and FLOPs”.

  8. Open the “Roofline chart” tab to see the collected roofline data for your application. (Figure: Advisor Roofline Plot with Tooltip Settings)

  9. On the chart you can see the different rooflines available on your machine: memory/cache bounds and compute bounds. These rooflines are obtained dynamically by running a small benchmark prior to running your application. The memory/cache rooflines define a performance ceiling when data cannot fit into a particular cache. The compute rooflines show the compute performance bounds when scalar, single/double-precision vector, or FMA computations are used.

  10. You can enable or disable rooflines in the toolbox hidden behind the small three-stripe box in the upper-right corner of the chart. The display parameters for hot loops can also be tuned here.

  11. The red dot near the bottom represents the position of the hottest loop on the roofline plot; as you can see, there is a huge performance improvement opportunity in this application. Sometimes, when opening the roofline tab, some of your loops may be missing: they are so slow that they fall below the bottom of the chart. To locate them, simply resize the roofline chart panel.

  12. If your application is not threaded, you can use single-threaded rooflines by checking the check box at the top of the chart. There are also zooming controls available.

  13. Selecting a particular loop on the roofline plot displays the source code of the loop in the bottom pane. You can also select loops on the survey report page and then switch to the roofline page with the loop highlighted on the roofline.

  14. The bottom pane contains several tabs with various information about loops. Please refer to the Advisor documentation for additional help.

  15. If you have nested loops in nested routines, changing the filtering mode to “Loops And Functions” can be helpful because only the self-time FLOPS metric is calculated. To analyze FLOPS data for outer loops, all nested loops and function calls should be carefully reviewed. For more information on this topic, refer to the article Selftime-based FLOPS computing.

  16. For every hot loop in your program, analyze the loop's position on the roofline plot. Identify performance gaps and opportunities for every loop. Use the other information and recommendations exposed by Advisor to improve the performance of your application.
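
The same two collections can also be run from the command line with advixe-cl. The following is only a sketch, assuming your application is ./myApp and results go to the ./advi project directory; option spellings (in particular the flag that enables FLOPS collection during the Trip Counts analysis) vary between Advisor versions, so check advixe-cl -help collect for the exact syntax of your release:

$ advixe-cl -collect survey -project-dir ./advi -- ./myApp
$ advixe-cl -collect tripcounts -flops-and-masks -project-dir ./advi -- ./myApp

When both collections finish, open the project in the GUI (advixe-gui ./advi) to view the Roofline chart.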

If you have any questions or problems, please contact the Advisor team by email at vector_advisor@intel.com.

Intel® XDK FAQs - General


How can I get started with Intel XDK?

There are plenty of videos and articles that you can go through here to get started. You could also start with some of our demo apps. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

Having prior understanding of how to program using HTML, CSS and JavaScript* is crucial to using the Intel XDK. The Intel XDK is primarily a tool for visualizing, debugging and building an app package for distribution.

You can do the following to access our demo apps:

  • Select Project tab
  • Select "Start a New Project"
  • Select "Samples and Demos"
  • Create a new project from a demo

If you have specific questions following that, please post it to our forums.

Can I use an external editor for development in Intel® XDK?

Yes, you can open your files and edit them in your favorite editor. However, note that you must use Brackets* to use the "Live Layout Editing" feature. Also, if you are using App Designer (the UI layout tool in Intel XDK) it will make many automatic changes to your index.html file, so it is best not to edit that file externally at the same time you have App Designer open.

Some popular editors among our users include:

  • Sublime Text* (Refer to this article for information on the Intel XDK plugin for Sublime Text*)
  • Notepad++* for a lightweight editor
  • Jetbrains* editors (Webstorm*)
  • Vim* the editor

How do I get code refactoring capability in Brackets* (the Intel XDK code editor)?

...to be written...

Why doesn’t my app show up in Google* play for tablets?

...to be written...

What is the global-settings.xdk file and how do I locate it?

global-settings.xdk contains information about all your projects in the Intel XDK, along with many of the settings related to panels under each tab (Emulate, Debug etc). For example, you can set the emulator to auto-refresh or no-auto-refresh. Modify this file at your own risk and always keep a backup of the original!

You can locate global-settings.xdk here:

  • Mac OS X*
    ~/Library/Application Support/XDK/global-settings.xdk
  • Microsoft Windows*
    %LocalAppData%\XDK
  • Linux*
    ~/.config/XDK/global-settings.xdk

If you are having trouble locating this file, you can search for it on your system using something like the following:

  • Windows:
    > cd /
    > dir /s global-settings.xdk
  • Mac and Linux:
    $ sudo find / -name global-settings.xdk

When do I use the intelxdk.js, xhr.js and cordova.js libraries?

The intelxdk.js and xhr.js libraries were only required for use with the Intel XDK legacy build tiles (which have been retired). The cordova.js library is needed for all Cordova builds. When building with the Cordova tiles, any references to intelxdk.js and xhr.js libraries in your index.html file are ignored.
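
As a minimal sketch of what this means in practice (the js/app.js file name is a placeholder), a Cordova-ready index.html only references cordova.js; the cordova.js file itself is normally supplied by the Cordova build system, so a plain relative reference is sufficient:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My App</title>
  </head>
  <body>
    <!-- cordova.js is supplied by the Cordova build; use the plain relative reference -->
    <script src="cordova.js"></script>
    <!-- your application code; js/app.js is a placeholder name -->
    <script src="js/app.js"></script>
  </body>
</html>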

How do I get my Android (and Crosswalk) keystore file?

New with release 3088 of the Intel XDK, you may now download your build certificates (aka keystore) using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Convert a Legacy Android Certificate" in that document, for details regarding how to do this.

It may also help to review this short, quick overview video (there is no audio) that shows how you convert your existing "legacy" certificates to the "new" format that allows you to directly manage your certificates using the certificate management tool that is built into the Intel XDK. This conversion process is done only once.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I rename my project that is a duplicate of an existing project?

See this FAQ: How do I make a copy of an existing Intel XDK project?

How do I recover when the Intel XDK hangs or won't start?

  • If you are running Intel XDK on Windows* it must be Windows* 7 or higher. It will not run reliably on earlier versions.
  • Delete the "project-name.xdk" file from the project directory that Intel XDK is trying to open when it starts (it will try to open the project that was open during your last session), then try starting Intel XDK. You will have to "import" your project into Intel XDK again. Importing merely creates the "project-name.xdk" file in your project directory and adds that project to the "global-settings.xdk" file.
  • Rename the project directory Intel XDK is trying to open when it starts. Create a new project based on one of the demo apps. Test Intel XDK using that demo app. If everything works, restart Intel XDK and try it again. If it still works, rename your problem project folder back to its original name and open Intel XDK again (it should now open the sample project you previously opened). You may have to re-select your problem project (Intel XDK should have forgotten that project during the previous session).
  • Clear Intel XDK's program cache directories and files.

    On a Windows machine this can be done using the following on a standard command prompt (administrator is not required):

    > cd %AppData%\..\Local\XDK
    > del *.* /s/q

    To locate the "XDK cache" directory on [OS X*] and [Linux*] systems, do the following:

    $ sudo find / -name global-settings.xdk
    $ cd <dir found above>
    $ sudo rm -rf *

    You might want to save a copy of the "global-settings.xdk" file before you delete that cache directory and copy it back before you restart Intel XDK. Doing so will save you the effort of rebuilding your list of projects. Please refer to this question for information on how to locate the global-settings.xdk file.
  • If you saved and restored the "global-settings.xdk" file in the step above and you're still having hang troubles, try deleting the directories and files above along with the "global-settings.xdk" file, and try again.
  • Do not store your project directories on a network share (Intel XDK currently has issues with network shares that have not yet been resolved). This includes folders shared between a Virtual machine (VM) guest and its host machine (for example, if you are running Windows* in a VM running on a Mac* host). This network share issue is a known issue with a fix request in place.
  • There have also been issues with running behind a corporate network proxy or firewall. To check this, try running the Intel XDK from your home network where, presumably, you have a simple NAT router and no proxy or firewall. If things work correctly there, then your corporate firewall or proxy may be the source of the problem.
  • Issues with Intel XDK account logins can also cause Intel XDK to hang. To confirm that your login is working correctly, go to the Intel XDK App Center and confirm that you can login with your Intel XDK account. While you are there you might also try deleting the offending project(s) from the App Center.

If you can reliably reproduce the problem, please send us a copy of the "xdk.log" file that is stored in the same directory as the "global-settings.xdk" file to html5tools@intel.com.

Is Intel XDK an open source project? How can I contribute to the Intel XDK community?

No, it is not an open source project. However, it utilizes many open source components that are then assembled into Intel XDK. While you cannot contribute directly to the Intel XDK integration effort, you can contribute to the many open source components that make up Intel XDK.

The following open source components are the major elements that are being used by Intel XDK:

  • Node-Webkit
  • Chromium
  • Ripple* emulator
  • Brackets* editor
  • Weinre* remote debugger
  • Crosswalk*
  • Cordova*
  • App Framework*

How do I configure Intel XDK to use 9 patch png for Android* apps splash screen?

Intel XDK does support the use of 9 patch png for Android* apps splash screen. You can read up more at https://software.intel.com/en-us/xdk/articles/android-splash-screens-using-nine-patch-png on how to create a 9 patch png image and link to an Intel XDK sample using 9 patch png images.

How do I stop AVG from popping up the "General Behavioral Detection" window when Intel XDK is launched?

You can try adding nw.exe as the app that needs an exception in AVG.

What do I specify for "App ID" in Intel XDK under Build Settings?

Your app ID uniquely identifies your app. For example, it can be used to identify your app within Apple’s application services allowing you to use things like in-app purchasing and push notifications.

Here are some useful articles on how to create an App ID:

Is it possible to modify the Android Manifest or iOS plist file with the Intel XDK?

You cannot modify the AndroidManifest.xml file directly with our build system, as it only exists in the cloud. However, you may do so by creating a dummy plugin that only contains a plugin.xml file containing directives that can be used to add lines to the AndroidManifest.xml file during the build process. In essence, you add lines to the AndroidManifest.xml file via a local plugin.xml file. Here is an example of a plugin that does just that:

<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0" id="my-custom-intents-plugin" version="1.0.0">
    <name>My Custom Intents Plugin</name>
    <description>Add Intents to the AndroidManifest.xml</description>
    <license>MIT</license>
    <engines>
        <engine name="cordova" version=">=3.0.0" />
    </engines>
    <!-- android -->
    <platform name="android">
        <config-file target="AndroidManifest.xml" parent="/manifest/application">
            <activity android:configChanges="orientation|keyboardHidden|keyboard|screenSize|locale" android:label="@string/app_name" android:launchMode="singleTop" android:name="testa" android:theme="@android:style/Theme.Black.NoTitleBar">
                <intent-filter>
                    <action android:name="android.intent.action.SEND" />
                    <category android:name="android.intent.category.DEFAULT" />
                    <data android:mimeType="*/*" />
                </intent-filter>
            </activity>
        </config-file>
    </platform>
</plugin>

You can inspect the AndroidManifest.xml created in an APK, using apktool with the following command line:

$ apktool d my-app.apk
$ cd my-app
$ more AndroidManifest.xml

This technique exploits the config-file element that is described in the Cordova Plugin Specification docs and can also be used to add lines to iOS plist files. See the Cordova plugin documentation link for additional details.

Here is an example of such a plugin for modifying the iOS plist file, specifically for adding a BIS key to the plist file:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
    xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-bis-plugin"
    version="0.0.2">
    <name>My Custom BIS Plugin</name>
    <description>Add BIS info to iOS plist file.</description>
    <license>BSD-3</license>
    <preference name="BIS_KEY" />
    <engines>
        <engine name="cordova" version=">=3.0.0" />
    </engines>
    <!-- ios -->
    <platform name="ios">
        <config-file target="*-Info.plist" parent="CFBundleURLTypes">
            <array>
                <dict>
                    <key>ITSAppUsesNonExemptEncryption</key><true/>
                    <key>ITSEncryptionExportComplianceCode</key><string>$BIS_KEY</string>
                </dict>
            </array>
        </config-file>
    </platform>
</plugin>

How can I share my Intel XDK app build?

You can send a link to your project via an email invite from your project settings page. However, a login to your account is required to access the file behind the link. Alternatively, you can download the build from the build page, onto your workstation, and push that built image to some location from which you can send a link to that image.

Why does my iOS build fail when I am able to test it successfully on a device and the emulator?

Common reasons include:

  • The App ID specified in your project settings does not match the one you specified in Apple's developer portal.
  • The provisioning profile does not match the certificate you uploaded. Double-check on Apple's developer site that you are using the correct and current distribution certificate and that the provisioning profile is still active. Download the provisioning profile again and add it to your project to confirm.
  • In Project Build Settings, your App Name is invalid. It should contain only letters, spaces, and numbers.

How do I add multiple domains in Domain Access?

Here is the primary doc source for that feature.

If you need to insert multiple domain references, then you will need to add the extra references in the intelxdk.config.additions.xml file. This StackOverflow entry provides a basic idea and you can see the intelxdk.config.*.xml files that are automatically generated with each build for the <access origin="xxx" /> line that is generated based on what you provide in the "Domain Access" field of the "Build Settings" panel on the Project Tab.
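
As a sketch (the domains below are placeholders), the additional references placed in the intelxdk.config.additions.xml file follow the standard Cordova <access> whitelist syntax:

<!-- placed in intelxdk.config.additions.xml; example.com and api.example.com are placeholder domains -->
<access origin="https://example.com" />
<access origin="https://api.example.com" subdomains="true" />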

How do I build more than one app using the same Apple developer account?

On Apple developer, create a distribution certificate using the "iOS* Certificate Signing Request" key downloaded from Intel XDK Build tab only for the first app. For subsequent apps, reuse the same certificate and import this certificate into the Build tab like you usually would.

How do I include search and spotlight icons as part of my app?

Please refer to this article in the Intel XDK documentation. Create an intelxdk.config.additions.xml file in your top-level directory (same location as the other intelxdk.*.config.xml files) and add the following lines for supporting icons in Settings and other areas in iOS*.

<!-- Spotlight Icon -->
<icon platform="ios" src="res/ios/icon-40.png" width="40" height="40" />
<icon platform="ios" src="res/ios/icon-40@2x.png" width="80" height="80" />
<icon platform="ios" src="res/ios/icon-40@3x.png" width="120" height="120" />
<!-- iPhone Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-small.png" width="29" height="29" />
<icon platform="ios" src="res/ios/icon-small@2x.png" width="58" height="58" />
<icon platform="ios" src="res/ios/icon-small@3x.png" width="87" height="87" />
<!-- iPad Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-50.png" width="50" height="50" />
<icon platform="ios" src="res/ios/icon-50@2x.png" width="100" height="100" />

For more information related to these configurations, visit http://cordova.apache.org/docs/en/3.5.0/config_ref_images.md.html#Icons%20and%20Splash%20Screens.

For accurate information related to iOS icon sizes, visit https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHIG/IconMatrix.html

NOTE: The iPhone 6 icons will only be available if iOS* 7 or 8 is the target.

Cordova iOS* 8 support JIRA tracker: https://issues.apache.org/jira/browse/CB-7043

Does Intel XDK support Modbus TCP communication?

No, since Modbus is a specialized protocol, you need to write either some JavaScript* or native code (in the form of a plugin) to handle the Modbus transactions and protocol.

How do I sign an Android* app using an existing keystore?

New with release 3088 of the Intel XDK, you may now import your existing keystore into Intel XDK using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Import an Android Certificate Keystore" in that document, for details regarding how to do this.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I build separately for different Android* versions?

Under the Projects Panel, you can select the Target Android* version under the Build Settings collapsible panel. You can change this value and build your application multiple times to create numerous versions of your application that are targeted for multiple versions of Android*.

How do I display the 'Build App Now' button if my display language is not English?

If your display language is not English and the 'Build App Now' button is proving to be troublesome, you may change your display language to English which can be downloaded by a Windows* update. Once you have installed the English language, proceed to Control Panel > Clock, Language and Region > Region and Language > Change Display Language.

How do I update my Intel XDK version?

When an Intel XDK update is available, an Update Version dialog box lets you download the update. After the download completes, a similar dialog lets you install it. If you did not download or install an update when prompted (or on older versions), click the package icon next to the orange (?) icon in the upper-right to download or install the update. The installation removes the previous Intel XDK version.

How do I import my existing HTML5 app into the Intel XDK?

If your project contains an Intel XDK project file (<project-name>.xdk) you should use the "Open an Intel XDK Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round green "eject" icon, on the Projects tab). This would be the case if you copied an existing Intel XDK project from another system or used a tool that exported a complete Intel XDK project.

If your project does not contain an Intel XDK project file (<project-name>.xdk) you must "import" your code into a new Intel XDK project. To import your project, use the "Start a New Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round blue "plus" icon, on the Projects tab). This will open the "Samples, Demos and Templates" page, which includes an option to "Import Your HTML5 Code Base." Point to the root directory of your project. The Intel XDK will attempt to locate a file named index.html in your project and will set the "Source Directory" on the Projects tab to point to the directory that contains this file.

If your imported project did not contain an index.html file, your project may be unstable. In that case, it is best to delete the imported project from the Intel XDK Projects tab ("x" icon in the upper right corner of the screen), rename your "root" or "main" HTML file to index.html, and import the project again. Several components in the Intel XDK depend on the assumption that the main HTML file in your project is named index.html. See Introducing Intel® XDK Development Tools for more details.

It is highly recommended that your "source directory" be located as a sub-directory inside your "project directory." This ensures that non-source files are not included in your build package when building your application. If the "source directory" and "project directory" are the same, the result is longer upload times to the build server and unnecessarily large application executables returned by the build system. See the sketch below for the recommended project file layout.
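
The following layout is a sketch; all directory and file names other than index.html and the *.xdk/*.xdke files are placeholders:

my-project/                        (project directory)
    my-project.xdk
    my-project.xdke
    intelxdk.config.additions.xml
    www/                           (source directory)
        index.html
        css/
        js/
        images/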

I am unable to login to App Preview with my Intel XDK password.

On some devices you may have trouble entering your Intel XDK login password directly on the device in the App Preview login screen. In particular, sometimes you may have trouble with the first one or two letters getting lost when entering your password.

Try the following if you are having such difficulties:

  • Reset your password, using the Intel XDK, to something short and simple.

  • Confirm that this new short and simple password works with the XDK (logout and login to the Intel XDK).

  • Confirm that this new password works with the Intel Developer Zone login.

  • Make sure you have the most recent version of Intel App Preview installed on your devices. Go to the store on each device to confirm you have the most recent copy of App Preview installed.

  • Try logging into Intel App Preview on each device with this short and simple password. Check the "show password" box so you can see your password as you type it.

If the above works, it confirms that you can log into your Intel XDK account from App Preview (because App Preview and the Intel XDK go to the same place to authenticate your login). When the above works, you can go back to the Intel XDK and reset your password to something else, if you do not like the short and simple password you used for the test.

How do I completely uninstall the Intel XDK from my system?

Take the following steps to completely uninstall the XDK from your Windows system:

  • From the Windows Control Panel, remove the Intel XDK, using the Windows uninstall tool.

  • Then:
    > cd %LocalAppData%\Intel\XDK
    > del *.* /s/q

  • Then:
    > cd %LocalAppData%\XDK
    > copy global-settings.xdk %UserProfile%
    > del *.* /s/q
    > copy %UserProfile%\global-settings.xdk .

  • Then:
    -- Goto xdk.intel.com and select the download link.
    -- Download and install the new XDK.

To do the same on a Linux or Mac system:

  • On a Linux machine, run the uninstall script, typically /opt/intel/XDK/uninstall.sh.
     
  • Remove the directory into which the Intel XDK was installed.
    -- Typically /opt/intel or your home (~) directory on a Linux machine.
    -- Typically in the /Applications/Intel XDK.app directory on a Mac.
     
  • Then:
    $ find ~ -name global-settings.xdk
    $ cd <result-from-above> (for example ~/Library/Application Support/XDK/ on a Mac)
    $ cp global-settings.xdk ~
    $ rm -Rf *
    $ mv ~/global-settings.xdk .

     
  • Then:
    -- Goto xdk.intel.com and select the download link.
    -- Download and install the new XDK.

Is there a tool that can help me highlight syntax issues in Intel XDK?

Yes, you can use the various linting tools that can be added to the Brackets editor to review any syntax issues in your HTML, CSS and JS files. Go to the "File > Extension Manager..." menu item and add the following extensions: JSHint, CSSLint, HTMLHint, XLint for Intel XDK. Then, review your source files by monitoring the small yellow triangle at the bottom of the edit window (a green check mark indicates no issues).

How do I delete built apps and test apps from the Intel XDK build servers?

You can manage them by logging into: https://appcenter.html5tools-software.intel.com/csd/controlpanel.aspx. This functionality will eventually be available within Intel XDK after which access to app center will be removed.

I need help with the App Security API plugin; where do I find it?

Visit the primary documentation book for the App Security API and see this forum post for some additional details.

When I install my app or use the Debug tab Avast antivirus flags a possible virus, why?

If you are receiving a "Suspicious file detected - APK:CloudRep [Susp]" message from Avast anti-virus installed on your Android device it is due to the fact that you are side-loading the app (or the Intel XDK Debug modules) onto your device (using a download link after building or by using the Debug tab to debug your app), or your app has been installed from an "untrusted" Android store. See the following official explanation from Avast:

Your application was flagged by our cloud reputation system. "Cloud rep" is a new feature of Avast Mobile Security, which flags apks when the following conditions are met:

  1. The file is not prevalent enough; meaning not enough users of Avast Mobile Security have installed your APK.
  2. The source is not an established market (Google Play is an example of an established market).

If you distribute your app using Google Play (or any other trusted market) your users should not see any warning from Avast.

Following are some of the Avast anti-virus notification screens you might see on your device. All of these are perfectly normal. They appear because you must enable the installation of "non-market" apps in order to use your device for debugging, and because the App IDs associated with your never-published app (or with the custom debug modules that the Debug tab in the Intel XDK builds and installs on your device) will not be found in an "established" (aka "trusted") market, such as Google Play.

If you choose to ignore the "Suspicious app activity!" threat you will not receive a threat for that debug module any longer. It will show up in the Avast 'ignored issues' list. Updates to an existing, ignored, custom debug module should continue to be ignored by Avast. However, new custom debug modules (due to a new project App ID or a new version of Crosswalk selected in your project's Build Settings) will result in a new warning from the Avast anti-virus tool.

How do I add a Brackets extension to the editor that is part of the Intel XDK?

The number of Brackets extensions provided in the built-in edition of the Brackets editor is limited to ensure stability of the Intel XDK product. Not all extensions are compatible with the edition of Brackets that is embedded within the Intel XDK. Adding incompatible extensions can cause the Intel XDK to quit working.

Despite this warning, there are useful extensions that have not been included in the editor and which can be added to the Intel XDK. Adding them is temporary, each time you update the Intel XDK (or if you reinstall the Intel XDK) you will have to "re-add" your Brackets extension. To add a Brackets extension, use the following procedure:

  • exit the Intel XDK
  • download a ZIP file of the extension you wish to add
  • on Windows, unzip the extension here:
    %LocalAppData%\Intel\XDK\xdk\brackets\b\extensions\dev
  • on Mac OS X, unzip the extension here:
    /Applications/Intel\ XDK.app/Contents/Resources/app.nw/brackets/b/extensions/dev
  • start the Intel XDK

Note that the locations given above are subject to change with new releases of the Intel XDK.

Why does my app or game require so many permissions on Android when built with the Intel XDK?

When you build your HTML5 app using the Intel XDK for Android or Android-Crosswalk you are creating a Cordova app. It may seem like you're not building a Cordova app, but you are. In order to package your app so it can be distributed via an Android store and installed on an Android device, it needs to be built as a hybrid app. The Intel XDK uses Cordova to create that hybrid app.

A pure Cordova app requires the NETWORK permission; it is needed to "jump" between your HTML5 environment and the native Android environment. Additional permissions will be added by any Cordova plugins you include with your application; which permissions are included is a function of what each plugin does and requires.
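
For illustration only (a hypothetical plugin fragment, not something the Intel XDK itself generates), a plugin declares the extra Android permissions it needs in its plugin.xml, and those entries are merged into AndroidManifest.xml at build time:

<!-- hypothetical fragment from a plugin.xml that uses the camera -->
<platform name="android">
    <config-file target="AndroidManifest.xml" parent="/manifest">
        <uses-permission android:name="android.permission.CAMERA" />
    </config-file>
</platform>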

Crosswalk for Android builds also require the NETWORK permission, because the Crosswalk image built by the Intel XDK includes support for Cordova. In addition, current versions of Crosswalk (12 and 14 at the time this FAQ was written) also require NETWORK STATE and WIFI STATE. There is an extra permission in some versions of Crosswalk (WRITE EXTERNAL STORAGE) that is only needed by the shared model library of Crosswalk; we have asked the Crosswalk project to remove this permission in a future Crosswalk version.

If you are seeing more than the following five permissions in your XDK-built Crosswalk app:

  • android.permission.INTERNET
  • android.permission.ACCESS_NETWORK_STATE
  • android.permission.ACCESS_WIFI_STATE
  • android.permission.INTERNET
  • android.permission.WRITE_EXTERNAL_STORAGE

then you are seeing permissions that have been added by some plugins. Each plugin is different, so there is no hard rule of thumb. The two "default" core Cordova plugins that are added by the Intel XDK blank templates (device and splash screen) do not require any Android permissions.

BTW: the permission list above comes from a Crosswalk 14 build. Crosswalk 12 builds do not include the last permission; it was added when the Crosswalk project introduced the shared model library option, which started with Crosswalk 13 (the Intel XDK does not support Crosswalk 13 builds).

How do I make a copy of an existing Intel XDK project?

If you just need to make a backup copy of an existing project, and do not plan to open that backup copy as a project in the Intel XDK, do the following:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)

If you want to use an existing project as the starting point of a new project in the Intel XDK, follow the process described below. It ensures that the build system does not confuse the project ID in your old project with the one stored in your new project. If you do not follow this procedure you will have multiple projects using the same project ID (a special GUID that is stored inside the Intel XDK <project-name>.xdk file in the root directory of your project). Each project in your account must have a unique project ID.

  • Exit the Intel XDK.
  • Make a copy of your existing project using the process described above.
  • Inside the new project that you made (that is, your new copy of your old project), make copies of the <project-name>.xdk file and <project-name>.xdke files and rename those copies to something like project-new.xdk and project-new.xdke (anything you like, just something different than the original project name, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open your new "project-new.xdk" file (whatever you named it) and find the projectGuid line, it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-0000-000000000000"
  • Save the modified "project-new.xdk" file.
  • Open the Intel XDK.
  • Goto the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-new.xdk" file inside the new project folder you copied above.
  • Don't forget to change the App ID in your new project. This is necessary to avoid conflicts with the project you copied from, in the store and when side-loading onto a device.

My project does not include a www folder. How do I fix it so it includes a www or source directory?

The Intel XDK HTML5 and Cordova project file structures are meant to mimic a standard Cordova project. In a Cordova (or PhoneGap) project there is a subdirectory (or folder) named www that contains all of the HTML5 source code and asset files that make up your application. For best results, it is advised that you follow this convention, of putting your source inside a "source directory" inside of your project folder.

This most commonly happens as the result of exporting a project from an external tool, such as Construct2, or as the result of importing an existing HTML5 web app that you are converting into a hybrid mobile application (e.g., an Intel XDK Cordova app). If you would like to convert an existing Intel XDK project into this format, follow the steps below:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)
  • Create a "www" directory inside the new duplicate project you just created above.
  • Move your index.html and other source and asset files to the "www" directory you just created -- this is now your "source" directory, located inside your "project" directory (do not move the <project-name>.xdk and xdke files and any intelxdk.config.*.xml files, those must stay in the root of the project directory)
  • Inside the new project that you made above (by making a copy of the old project), rename the <project-name>.xdk file and <project-name>.xdke files to something like project-copy.xdk and project-copy.xdke (anything you like, just something different than the original project, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open the new "project-copy.xdk" file (whatever you named it) and find the line named projectGuid, it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-0000-000000000000"
  • A few lines down find: "sourceDirectory": "",
  • Change it to this: "sourceDirectory": "www",
  • Save the modified "project-copy.xdk" file.
  • Open the Intel XDK.
  • Goto the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-copy.xdk" file inside the new project folder you copied above.

Can I install more than one copy of the Intel XDK onto my development system?

Yes, you can install more than one version onto your development system. However, you cannot run multiple instances of the Intel XDK at the same time. Be aware that new releases sometimes change the project file format, so it is a good idea, in these cases, to make a copy of your project if you need to experiment with a different version of the Intel XDK. See the instructions in a FAQ entry above regarding how to make a copy of your Intel XDK project.

Follow the instructions in this forum post to install more than one copy of the Intel XDK onto your development system.

On Apple OS X* and Linux* systems, does the Intel XDK need the OpenSSL* library installed?

Yes. Several features of the Intel XDK require the OpenSSL library, which typically comes pre-installed on Linux and OS X systems. If the Intel XDK reports that it could not find libssl, go to https://www.openssl.org to download and install it.
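
To quickly check whether the library is already available, you can run the following from a terminal (the exact version reported will vary by system); if the command is missing or very old, install OpenSSL through your system's package manager or from the link above:

$ openssl version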

I have a web application that I would like to distribute in app stores without major modifications. Is this possible using the Intel XDK?

Yes, if you have a true web app or “client app” that only uses HTML, CSS and JavaScript, it is usually not too difficult to convert it to a Cordova hybrid application (this is what the Intel XDK builds when you create an HTML5 app). If you rely heavily on PHP or other server scripting languages embedded in your pages you will have more work to do. Because your Cordova app is not associated with a server, you cannot rely on server-based programming techniques; instead, you must rewrite any such code to use RESTful APIs that your app interacts with using, for example, AJAX calls.
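
As a minimal sketch of this conversion (the URL, element id, and field names are placeholders, not part of any Intel XDK API), a page that previously relied on embedded server-side code would instead request its data from a REST endpoint using JavaScript:

// Hypothetical example: fetch data from a REST endpoint instead of embedding PHP in the page.
// https://api.example.com/items is a placeholder endpoint.
var xhr = new XMLHttpRequest();
xhr.open("GET", "https://api.example.com/items");
xhr.onload = function () {
    if (xhr.status === 200) {
        var items = JSON.parse(xhr.responseText);
        document.getElementById("item-count").textContent = items.length + " items";
    }
};
xhr.onerror = function () {
    // No server connection; a hybrid app must handle the offline case itself.
    console.log("Network request failed");
};
xhr.send();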

What is the best training approach to using the Intel XDK for a newbie?

First, become well-versed in the art of client web apps, apps that rely only on HTML, CSS and JavaScript and utilize RESTful APIs to talk to network services. With that you will have mastered 80% of the problem. After that, it is simply a matter of understanding how Cordova plugins are able to extend the JavaScript API for access to features of the platform. For HTML5 training there are many sites providing tutorials. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

What is the best platform to start building an app with the Intel XDK? And what are the important differences between the Android, iOS and other mobile platforms?

There is no one most important difference between the Android, iOS and other platforms. It is important to understand that the HTML5 runtime engine that executes your app on each platform will vary as a function of the platform. Just as there are differences between Chrome and Firefox and Safari and Internet Explorer, there are differences between iOS 9 and iOS 8 and Android 4 and Android 5, etc. Android has the most significant differences between vendors and versions of Android. This is one of the reasons the Intel XDK offers the Crosswalk for Android build option, to normalize and update the Android issues.

In general, if you can get your app working well on Android (or Crosswalk for Android) first you will generally have fewer issues to deal with when you start to work on the iOS and Windows platforms. In addition, the Android platform has the most flexible and useful debug options available, so it is the easiest platform to use for debugging and testing your app.

Is my password encrypted and why is it limited to fifteen characters?

Yes, your password is stored encrypted and is managed by https://signin.intel.com. Your Intel XDK userid and password can also be used to log into the Intel XDK forum as well as the Intel Developer Zone. The Intel XDK itself does not store or manage your userid and password.

The rules regarding allowed userids and passwords are answered on this Sign In FAQ page, where you can also find help on recovering and changing your password.

Why does the Intel XDK take a long time to start on Linux or Mac?

...and why am I getting this error message? "Attempt to contact authentication server is taking a long time. You can wait, or check your network connection and try again."

At startup, the Intel XDK attempts to automatically determine the proxy settings for your machine. Unfortunately, on some system configurations it is unable to reliably detect your system proxy settings. As an example, you might see something like this image when starting the Intel XDK.

On some systems you can get around this problem by setting some proxy environment variables and then starting the Intel XDK from a command line that includes those configured environment variables. To set those environment variables, use commands similar to the following:

$ export no_proxy="localhost,127.0.0.1/8,::1"
$ export NO_PROXY="localhost,127.0.0.1/8,::1"
$ export http_proxy=http://proxy.mydomain.com:123/
$ export HTTP_PROXY=http://proxy.mydomain.com:123/
$ export https_proxy=http://proxy.mydomain.com:123/
$ export HTTPS_PROXY=http://proxy.mydomain.com:123/

IMPORTANT! The name of your proxy server and the port (or ports) that your proxy server requires will be different than those shown in the example above. Please consult with your IT department to find out what values are appropriate for your site. Intel has no way of knowing what configuration is appropriate for your network.

If you use the Intel XDK in multiple locations (at work and at home), you may have to change the proxy settings before starting the Intel XDK after switching to a new network location. For example, many work networks use a proxy server, but most home networks do not require such a configuration. In that case, you need to be sure to "unset" the proxy environment variables before starting the Intel XDK on a non-proxy network.
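
For example, before starting the Intel XDK on a network with no proxy, you might clear the variables set above (a sketch for a bash-style shell):

$ unset http_proxy HTTP_PROXY
$ unset https_proxy HTTPS_PROXY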

After you have successfully configured your proxy environment variables, you can start the Intel XDK manually, from the command-line.

On a Mac, where the Intel XDK is installed in the default location, type the following (from a terminal window that has the above environment variables set):

$ open /Applications/Intel\ XDK.app/

On a Linux machine, assuming the Intel XDK has been installed in the ~/intel/XDK directory, type the following (from a terminal window that has the above environment variables set):

$ ~/intel/XDK/xdk.sh &

In the Linux case, you will need to adjust the directory name that points to the xdk.sh file in order to start. The example above assumes a local install into the ~/intel/XDK directory. Since Linux installations have more options regarding the installation directory, you will need to adjust the above to suit your particular system and install directory.

How do I generate a P12 file on a Windows machine?

See these articles:

How do I change the default dir for creating new projects in the Intel XDK?

You can change the default new project location manually by modifying a field in the global-settings.xdk file. Locate the global-settings.xdk file on your system (the precise location varies as a function of the OS) and find this JSON object inside that file:

"projects-tab": {"defaultPath": "/Users/paul/Documents/XDK","LastSortType": "descending|Name","lastSortType": "descending|Opened","thirdPartyDisclaimerAcked": true
  },

The example above came from a Mac. On a Mac the global-settings.xdk file is located in the "~/Library/Application Support/XDK" directory.

On a Windows machine the global-settings.xdk file is normally found in the "%LocalAppData%\XDK" directory. The part you are looking for will look something like this:

"projects-tab": {"thirdPartyDisclaimerAcked": false,"LastSortType": "descending|Name","lastSortType": "descending|Opened","defaultPath": "C:\\Users\\paul/Documents"
  },

Obviously, it's the defaultPath part you want to change.
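
For example, to make new projects default to D:\xdk-projects on a Windows machine (a placeholder path), the edited line would look like the following; note the doubled backslashes required inside a JSON string:

"defaultPath": "D:\\xdk-projects",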

BE CAREFUL WHEN YOU EDIT THE GLOBAL-SETTINGS.XDK FILE!! You've been warned...

Make sure the result is proper JSON when you are done, or it may cause your XDK to cough and hack loudly. Make a backup copy of global-settings.xdk before you start, just in case.

Where can I find a list of recent and upcoming webinars?

How can I change the email address associated with my Intel XDK login?

Login to the Intel Developer Zone with your Intel XDK account userid and password and then locate your "account dashboard." Click the "pencil icon" next to your name to open the "Personal Profile" section of your account, where you can edit your "Name & Contact Info," including the email address associated with your account, under the "Private" section of your profile.

What network addresses must I enable in my firewall to insure the Intel XDK will work on my restricted network?

Normally, access to the external servers that the Intel XDK uses is handled automatically by your proxy server. However, if you are working in an environment that has restricted Internet access and you need to provide your IT department with a list of URLs that you need access to in order to use the Intel XDK, then please provide them with the following list of domain names:

  • appcenter.html5tools-software.intel.com (for communication with the build servers)
  • s3.amazonaws.com (for downloading sample apps and built apps)
  • download.xdk.intel.com (for getting XDK updates)
  • debug-software.intel.com (for using the Test tab weinre debug feature)
  • xdk-feed-proxy.html5tools-software.intel.com (for receiving the tweets in the upper right corner of the XDK)
  • signin.intel.com (for logging into the XDK)
  • sfederation.intel.com (for logging into the XDK)

Normally this should be handled by your network proxy (if you're on a corporate network) or should not be an issue if you are working on a typical home network.

I cannot create a login for the Intel XDK, how do I create a userid and password to use the Intel XDK?

If you have downloaded and installed the Intel XDK but are having trouble creating a login, you can create the login outside the Intel XDK. To do this, go to the Intel Developer Zone and push the "Join Today" button. After you have created your Intel Developer Zone login you can return to the Intel XDK and use that userid and password to login to the Intel XDK. This same userid and password can also be used to login to the Intel XDK forum.

Installing the Intel XDK on Windows fails with a "Package signature verification failed." message.

If you receive a "Package signature verification failed" message (see image below) when installing the Intel XDK on your system, it is likely due to one of the following two reasons:

  • Your system does not have a properly installed "root certificate" file, which is needed to confirm that the install package is good.
  • The install package is corrupt and failed the verification step.

The first case can happen if you are attempting to install the Intel XDK on an unsupported version of Windows. The Intel XDK is only supported on Microsoft Windows 7 and higher. If you attempt to install on Windows Vista (or earlier) you may see this verification error. The workaround is to install the Intel XDK on a Windows 7 or greater machine.

The second case is likely due to a corruption of the install package during download or due to tampering. The workaround is to re-download the install package and attempt another install.

If you are installing on a Windows 7 (or greater) machine and you see this message it is likely due to a missing or bad root certificate on your system. To fix this you may need to start the "Certificate Propagation" service. Open the Windows "services.msc" panel and then start the "Certificate Propagation" service. Additional links related to this problem can be found here > https://technet.microsoft.com/en-us/library/cc754841.aspx
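
If you prefer the command line, the same service can be started from an administrator command prompt; CertPropSvc is the short service name that normally corresponds to "Certificate Propagation" (confirm the name in services.msc first):

> net start CertPropSvc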

See this forum thread for additional help regarding this issue > https://software.intel.com/en-us/forums/intel-xdk/topic/603992

Troubles installing the Intel XDK on a Linux or Ubuntu system, which option should I choose?

Choose the local user option, not root or sudo, when installing the Intel XDK on your Linux or Ubuntu system. This is the most reliable and trouble-free option and is the default installation option. This will insure that the Intel XDK has all the proper permissions necessary to execute properly on your Linux system. The Intel XDK will be installed in a subdirectory of your home (~) directory.

Inactive account/ login issue/ problem updating an APK in store, How do I request account transfer?

As of June 26, 2015 we migrated all of Intel XDK accounts to the more secure intel.com login system (the same login system you use to access this forum).

We have migrated nearly all active users to the new login system. Unfortunately, there are a few active user accounts that we could not automatically migrate to intel.com, primarily because the intel.com login system does not allow the use of some characters in userids that were allowed in the old login system.

If you have not used the Intel XDK for a long time prior to June 2015, your account may not have been automatically migrated. If you own an "inactive" account it will have to be manually migrated -- please try logging into the Intel XDK with your old userid and password, to determine if it no longer works. If you find that you cannot login to your existing Intel XDK account, and still need access to your old account, please send a message to html5tools@intel.com and include your userid and the email address associated with that userid, so we can guide you through the steps required to reactivate your old account.

Alternatively, you can create a new Intel XDK account. If you have submitted an app to the Android store from your old account you will need access to that old account to retrieve the Android signing certificates in order to upgrade that app on the Android store; in that case, send an email to html5tools@intel.com with your old account username and email and new account information.

Connection Problems? -- Intel XDK SSL certificates update

On January 26, 2016 we updated the SSL certificates on our back-end systems to SHA2 certificates. The existing certificates were due to expire in February of 2016. We have also disabled support for obsolete protocols.

If you are experiencing persistent connection issues (since Jan 26, 2016), please post a problem report on the forum and include in your problem report:

  • the operation that failed
  • the version of your XDK
  • the version of your operating system
  • your geographic region
  • and a screen capture

How do I resolve build failure: "libpng error: Not a PNG file"?  

If you are experiencing build failures with CLI 5 Android builds, and the detailed error log includes a message similar to the following:

Execution failed for task ':mergeArmv7ReleaseResources'.
> Error: Failed to run command: /Developer/android-sdk-linux/build-tools/22.0.1/aapt s -i .../platforms/android/res/drawable-land-hdpi/screen.png -o .../platforms/android/build/intermediates/res/armv7/release/drawable-land-hdpi-v4/screen.png

Error Code: 42

Output: libpng error: Not a PNG file

You need to change the format of your icon and/or splash screen images to PNG format.

The error message refers to a file named "screen.png" -- which is what each of your splash screens is renamed to before it is moved into the build project resource directories. In this case, JPG images were supplied as splash screen images rather than PNG images, so the renamed files were rejected by the build system as invalid.

Convert your splash screen images to PNG format. Renaming JPG images to PNG will not work! You must convert your JPG images into PNG format images using an appropriate image editing tool. The Intel XDK does not provide any such conversion tool.
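
For example, if you have the ImageMagick command-line tools installed, a command like the following performs a true format conversion (the file names are placeholders; any image editor that actually converts the format works equally well):

$ convert my-splash.jpg my-splash.png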

Beginning with Cordova CLI 5, all icons and splash screen images must be supplied in PNG format. This applies to all supported platforms. This is an undocumented "new feature" of the Cordova CLI 5 build system that was implemented by the Apache Cordova project.

Why do I get a "Parse Error" when I try to install my built APK on my Android device?

Because you have built an "unsigned" Android APK. You must click the "signed" box in the Android Build Settings section of the Projects tab if you want to install an APK on your device. The only reason you would choose to create an "unsigned" APK is if you need to sign it manually. This is very rare and not the normal situation.

My converted legacy keystore does not work. Google Play is rejecting my updated app.

The keystore you converted when you updated to 3088 (now 3240 or later) is the same keystore you were using in 2893. When you upgraded to 3088 (or later) and "converted" your legacy keystore, you re-signed and renamed your legacy keystore and it was transferred into a database to be used with the Intel XDK certificate management tool. It is still the same keystore, but with an alias name and password assigned by you and accessible directly by you through the Intel XDK.

If you kept the converted legacy keystore in your account following the conversion you can download that keystore from the Intel XDK for safe keeping (do not delete it from your account or from your system). Make sure you keep track of the new password(s) you assigned to the converted keystore.

There are two problems we have experienced with converted legacy keystores at the time of the 3088 release (April, 2016):

  • Foreign (non-ASCII) characters used in the new alias name and passwords were being corrupted.
  • Final signing of your APK by the build system was being done with RSA256 rather than SHA1.

Both of the above items have been resolved and should no longer be an issue.

If you are currently unable to complete a build with your converted legacy keystore (i.e., builds fail when you use the converted legacy keystore but they succeed when you use a new keystore) the first bullet above is likely the reason your converted keystore is not working. In that case we can reset your converted keystore and give you the option to convert it again. You do this by requesting that your legacy keystore be "reset" by filling out this form. For 100% surety during that second conversion, use only 7-bit ASCII characters in the alias name you assign and for the password(s) you assign.

IMPORTANT: using the legacy certificate to build your Android app is ONLY necessary if you have already published an app to an Android store and need to update that app. If you have never published an app to an Android store using the legacy certificate you do not need to concern yourself with resetting and reconverting your legacy keystore. It is easier, in that case, to create a new Android keystore and use that new keystore.

If you ARE able to successfully build your app with the converted legacy keystore, but your updated app (in the Google store) does not install on some older Android 4.x devices (typically a subset of Android 4.0-4.2 devices), the second bullet cited above is likely the reason for the problem. The solution, in that case, is to rebuild your app and resubmit it to the store (that problem was a build-system problem that has been resolved).

How can I have others beta test my app using Intel App Preview?

Apps that you sync to your Intel XDK account, using the Test tab's green "Push Files" button, can only be accessed by logging into Intel App Preview with the same Intel XDK account credentials that you used to push the files to the cloud. In other words, you can only download and run your app for testing with Intel App Preview if you log into the same account that you used to upload that test app. This restriction applies to downloading your app into Intel App Preview via the "Server Apps" tab, at the bottom of the Intel App Preview screen, or by scanning the QR code displayed on the Intel XDK Test tab using the camera icon in the upper right corner of Intel App Preview.

If you want to allow others to test your app, using Intel App Preview, it means you must use one of two options:

  • give them your Intel XDK userid and password
  • create an Intel XDK "test account" and provide your testers with that userid and password

For security sake, we highly recommend you use the second option (create an Intel XDK "test account"). 

A "test account" is simply a second Intel XDK account that you do not plan to use for development or builds. Do not use the same email address for your "test account" as you are using for your main development account. You should use a "throw away" email address for that "test account" (an email address that you do not care about).

Assuming you have created an Intel XDK "test account" and have instructed your testers to download and install Intel App Preview; have provided them with your "test account" userid and password; and you are ready to have them test:

  • sign out of your Intel XDK "development account" (using the little "man" icon in the upper right)
  • sign into your "test account" (again, using the little "man" icon in the Intel XDK toolbar)
  • make sure you have selected the project that you want users to test, on the Projects tab
  • goto the Test tab
  • make sure "MOBILE" is selected (upper left of the Test tab)
  • push the green "PUSH FILES" button on the Test tab
  • log out of your "test account"
  • log into your development account

Then, tell your beta testers to log into Intel App Preview with your "test account" credentials and instruct them to choose the "Server Apps" tab at the bottom of the Intel App Preview screen. From there they should see the name of the app you synced using the Test tab and can simply start it by touching the app name (followed by the big blue and white "Launch This App" button). Starting the app this way is actually easier than sending them a copy of the QR code. The QR code is very dense and is hard to read with some devices, dependent on the quality of the camera in their device.

Note that when running your test app inside of Intel App Preview they cannot test any features associated with third-party plugins, only core Cordova plugins. Thus, you need to insure that those parts of your apps that depend on non-core Cordova plugins have been disabled or have exception handlers to prevent your app from either crashing or freezing.

I'm having trouble making Google Maps work with my Intel XDK app. What can I do?

There are many reasons that can cause your attempt to use Google Maps to fail. Mostly it is due to the fact that you need to download the Google Maps API (JavaScript library) at runtime to make things work. However, there is no guarantee that you will have a good network connection, so if you do it the way you are used to doing it, in a browser...

<script src="https://maps.googleapis.com/maps/api/js?key=API_KEY&sensor=true"></script>

...you may get yourself into trouble, in an Intel XDK Cordova app. See Loading Google Maps in Cordova the Right Way for an excellent tutorial on why this is a problem and how to deal with it. Also, it may help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, especially item #3, to get a better understanding of why you shouldn't use the "browser technique" you're familiar with.
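
A minimal sketch of the safer pattern (YOUR_API_KEY, the map element id, and initMap are placeholders; the callback parameter is the standard Google Maps JavaScript API mechanism): wait for the Cordova deviceready event, then inject the Maps script and handle a failed download:

document.addEventListener("deviceready", function () {
    var script = document.createElement("script");
    script.src = "https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY&callback=initMap";
    script.onerror = function () {
        // The download failed (for example, no network); show a message instead of a blank map.
        document.getElementById("map").textContent = "Map is unavailable offline.";
    };
    document.body.appendChild(script);
}, false);

function initMap() {
    // Called by the Maps API once it has finished loading; create your map here.
}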

An alternative is to use a mapping tool that allows you to include the JavaScript directly in your app, rather than downloading it over the network each time your app starts. Several Intel XDK developers have reported very good luck with the open-source JavaScript library named LeafletJS, which uses OpenStreetMap as its map database source.

You can also search the Cordova Plugin Database for Cordova plugins that implement mapping features, in some cases using native SDKs and libraries.

How do I fix "Cannot find the Intel XDK. Make sure your device and intel XDK are on the same wireless network." error messages?

You can either disable your firewall or allow access through the firewall for the Intel XDK. To allow access through the Windows firewall, go to the Windows Control Panel and search for the Firewall (Control Panel > System and Security > Windows Firewall > Allowed Apps) and enable Node Webkit (nw or nw.exe) through the firewall.

See the image below (this image is from a Windows 8.1 system).

Google Services needs my SHA1 fingerprint. Where do I get my app's SHA fingerprint?

Your app's SHA fingerprint is part of your build signing certificate. Specifically, it is part of the signing certificate that you used to build your app. The Intel XDK provides a way to download your build certificates directly from within the Intel XDK application (see the Intel XDK documentation for details on how to manage your build certificates). Once you have downloaded your build certificate you can use these instructions provided by Google, to extract the fingerprint, or simply search the Internet for "extract fingerprint from android build certificate" to find many articles detailing this process.
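
As a sketch (the keystore file name and alias are placeholders), the JDK keytool utility will print the SHA1 fingerprint of a downloaded keystore:

$ keytool -list -v -keystore my-release.keystore -alias my-app-alias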

Why am I unable to test or build or connect to the old build server with Intel XDK version 2893?

This is an Important Note Regarding the use of Intel XDK Versions 2893 and Older!!

As of June 13, 2016, versions of the Intel XDK released prior to March 2016 (2893 and older) can no longer use the Build tab, the Test tab or Intel App Preview; and can no longer create custom debug modules for use with the Debug and Profile tabs. This change was necessary to improve the security and performance of our Intel XDK cloud-based build system. If you are using version 2893 or older, of the Intel XDK, you must upgrade to version 3088 or greater to continue to develop, debug and build Intel XDK Cordova apps.

The error message you see below, "NOTICE: Internet Connection and Login Required," when trying to use the Build tab is due to the fact that the cloud-based component used by those older versions of the Intel XDK has been retired and is no longer present. The error message appears to be misleading, but is the easiest way to identify this condition.

How do I run the Intel XDK on Fedora Linux?

See the instructions below, copied from this forum post:

$ find /opt/intel/XDK -name libudev.so.0
$ cd dir/found/above
$ rm libudev.so.0
$ sudo ln -s /lib64/libudev.so.1 libudev.so.0

