Building and Probing Prolog* with Intel® Architecture


Introduction

There is a lot of buzz on the Internet suggesting that machine learning and Artificial Intelligence (AI) are basically the same thing, but this is a misunderstanding. Machine learning and Knowledge Reasoning share the same concern: the construction of intelligent software. However, while machine learning is an approach to AI based on algorithms whose performance improves as they are exposed to more data over time, Knowledge Reasoning is a sibling approach based on symbolic logic.

Knowledge Reasoning strategies are usually developed in functional and logic-based programming languages such as Lisp*, Prolog*, and ML*, due to their ability to perform symbolic manipulation. This kind of manipulation is often associated with expert systems, where high-level rules are provided by humans and used to simulate knowledge, avoiding low-level language details. This focus is called Mind Centered. Commonly, some kind of (backward or forward) logical inference is needed.

Machine learning, in turn, is associated with low-level mathematical representations of systems and a set of training data that leads the system toward performance improvement. Since there is no high-level modeling, the process is called Brain Centered. Any language that facilitates writing vector algebra and numeric calculus in an imperative paradigm works just fine. For instance, several machine learning systems are written in Python* simply because the mathematical support is available as libraries for that language.

This article explores what happens when Intel solutions support the functional and logic programming languages that are regularly used for AI. Despite the success of machine learning systems over the last two decades, the place for traditional AI has neither disappeared nor diminished, especially in systems where it is necessary to explain why a computer program behaves the way it does. Hence, it is unreasonable to expect that the next generations of learning systems will be developed without high-level descriptions, and some problems will still demand symbolic solutions. Prolog and similar programming languages are valuable tools for solving such problems.

As detailed below, this article recompiles a Prolog interpreter with the Intel® C++ Compiler and libraries in order to evaluate their contribution to logic-based AI. The two main products used are Intel® Parallel Studio XE Cluster Edition and the SWI-Prolog interpreter. An experiment with a classical AI problem is also presented.

Building Prolog for Intel® Architecture

1. The following description uses a system equipped with an Intel® Core™ i7-4500U processor @ 1.8 GHz (64 bits), Ubuntu 16.04 LTS, 8 GB RAM, and hyper-threading turned on with 2 threads per core (you may check this by typing sudo dmidecode -t processor | grep -E '(Core Count|Thread Count)'). Different operating systems may require minor changes.

2. Preparing the environment.

Optimizing performance on hardware is an iterative process. Figure 1 shows a flow chart describing how the various Intel tools help you at each stage of this optimization task.


Figure 1: Optimizing performance flowchart and libraries. Extracted from Intel® Parallel Studio documentation1.

The most convenient way to install the Intel tools is to download and install Intel® Parallel Studio XE 2017. After extracting the .tgz file, you will obtain a folder called parallel_studio_xe_2017update4_cluster_edition_online (or similar, depending on the version). Open the terminal and choose the graphical installation:

<user>@<host>:~% cd parallel_studio_xe_2017update4_cluster_edition_online
<user>@<host>:~/parallel_studio_xe_2017update4_cluster_edition_online% ./install_GUI.sh

Although you may prefer to perform a full install, this article uses a custom installation with components that are frequently useful to developers. It is recommended to install these components as well, so that the performance libraries can be reused in subsequent projects.

  • Intel® Trace Analyzer and Collector
  • Intel® Advisor
  • Intel® C++ Compiler
  • Intel® Math Kernel Library (Intel® MKL) for C/C++
  • Intel® Threading Building Blocks (Intel® TBB)
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® MPI Library

The installation is straightforward and requires little comment. After it finishes, test the availability of the Intel® C++ Compiler by typing in your terminal:

<user>@<host>:~% cd ..
<user>@<host>:~% icc --version
icc (ICC) 17.0.4 20170411

If the icc command is not found, the environment variables for the compiler were not set. Set them by running a predefined script with an argument that specifies the target architecture:

<user>@<host>:~% source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh -arch intel64 -platform linux

If you wish, you may save disk space by doing:

<user>@<host>:~% rm -r parallel_studio_xe_2017update4_cluster_edition_online

3. Building Prolog.

This article uses the SWI-Prolog interpreter2, which is covered by the Simplified BSD license. SWI-Prolog offers a comprehensive free Prolog environment and is widely used in research and education as well as in commercial applications. Download the sources in .tar.gz format; at the time this article was written, the available version was 7.4.2. First, decompress the downloaded file:

<user>@<host>:~% tar zxvf swipl-<version>.tar.gz

Then, create a folder where the Prolog interpreter will be installed:

<user>@<host>:~% mkdir swipl_intel

After that, get ready to edit the building variables:

<user>@<host>:~% cd swipl-<version>
<user>@<host>:~/swipl-<version>% cp -p build.templ build
<user>@<host>:~/swipl-<version>% <edit> build

In the build file, look for the PREFIX variable, which indicates where SWI-Prolog will be installed. Set it to:

PREFIX=$HOME/swipl_intel

Then, it is necessary to set some compilation variables. The CC variable must be changed to indicate that the Intel® C++ Compiler will be used instead of other compilers. COFLAGS enables optimizations for speed. Compiler vectorization is enabled at -O2; you may choose a higher level (-O3), but the suggested flag is the generally recommended optimization level. With this option, the compiler performs basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the most common compiler optimizations. The -mkl=parallel option gives access to a set of math functions that are optimized and threaded to exploit the features of the latest Intel® Core™ processors; it must be paired with an Intel® MKL threading layer, which depends on the threading option provided. In this article, Intel® TBB is that option, selected with the -tbb flag. Finally, CMFLAGS indicates that the compilation will create a 64-bit executable.

export CC="icc"
export COFLAGS="-O2 -mkl=parallel -tbb"
export CMFLAGS="-m64"

Save your build file and close it.

Note that when this article was written, SWI-Prolog was not Message Passing Interface (MPI) ready3. Moreover, when checking its source code, no OpenMP* (OMP) macros were found, so it is likely that SWI-Prolog is not OpenMP ready either.

If you already have an SWI-Prolog instance installed on your computer, you might get confused about which interpreter was compiled with the Intel libraries and which was not. It is therefore useful to have the Intel build identify itself when the SWI-Prolog interpreter starts. The following instructions provide a customized welcome message:

<user>@<host>:~/swipl-<version>% cd boot
<user>@<host>:~/swipl-<version>/boot% <edit> messages.pl

Search for:

prolog_message(welcome) -->
    [ 'Welcome to SWI-Prolog (' ],
    prolog_message(threads),
    prolog_message(address_bits),
    ['version ' ],
    prolog_message(version),
    [ ')', nl ],
    prolog_message(copyright),
    [ nl ],
    prolog_message(user_versions),
    [ nl ],
    prolog_message(documentaton),
    [ nl, nl ].

and add @ Intel® architecture by changing it to:

prolog_message(welcome) -->
    [ 'Welcome to SWI-Prolog (' ],
    prolog_message(threads),
    prolog_message(address_bits),
    ['version ' ],
    prolog_message(version),
    [ ') @ Intel® architecture', nl ],
    prolog_message(copyright),
    [ nl ],
    prolog_message(user_versions),
    [ nl ],
    prolog_message(documentaton),
    [ nl, nl ].

Save your messages.pl file and close it. Start building.

<user>@<host>:~/swipl-<version>/boot% cd ..
<user>@<host>:~/swipl-<version>% ./build

The compilation performs several checks and takes some time; it is also quite verbose. Finally, you will get something like this:

make[1]: Leaving directory '~/swipl-<version>/src'
Warning: Found 9 issues.
No errors during package build

Now you may run the SWI-Prolog interpreter by typing:

<user>@<host>:~/swipl-<version>% cd ~/swipl_intel/lib/swipl-<version>/bin/x86_64-linux
<user>@<host>:~/swipl_intel/lib/swipl-<version>/bin/x86_64-linux% ./swipl

Welcome to SWI-Prolog (threaded, 64 bits, version 7.4.2) @ Intel® architecture
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

1 ?-

To exit the interpreter, type halt. (including the final period). Now you are ready to use Prolog, powered by Intel® architecture.

You may also save disk space by doing:

<user>@<host>:~/swipl_intel/lib/swipl-<version>/bin/x86_64-linux% cd ~
<user>@<host>:~% rm -r swipl-<version>

Probing Experiment

At this point, there is an Intel-compiled version of SWI-Prolog on your computer. Since this experiment intends to compare that combination with another environment, a SWI-Prolog interpreter built with a different compiler, such as gcc 5.4.0, is also needed. The procedure for building this alternative version is quite similar to the one described above.

The Tower of Hanoi puzzle4 is a classical AI problem and was used to probe the Prolog interpreters. The following code is a compact recursive implementation:

move(1,X,Y,_) :-
     write('Move top disk from '),
     write(X),
     write(' to '),
     write(Y),
     nl.

move(N,X,Y,Z) :-
    N>1,
    M is N-1,
    move(M,X,Z,Y),
    move(1,X,Y,_),
    move(M,Z,Y,X).

It moves the disks between pegs and logs each move. When this implementation is loaded and a 3-disk instance is run (move(3,left,right,center)), the following output is obtained after 48 inferences:

Move top disk from left to right
Move top disk from left to center
Move top disk from right to center
Move top disk from left to right
Move top disk from center to left
Move top disk from center to right
Move top disk from left to right
true .

This test intends to compare the performance of the Intel-compiled SWI-Prolog against the gcc-compiled version. Note that printing to the terminal is a slow operation, so it is not recommended in benchmarking tests, since it masks the results. Therefore, the program was changed to provide a better probe, replacing the output with a dummy sum of two integers.

move(1,X,Y,_) :-
     S is 1 + 2.

move(N,X,Y,Z) :-
    N>1,
    M is N-1,
    move(M,X,Z,Y),
    move(1,X,Y,_),
    move(M,Z,Y,X).

Recall that the SWI-Prolog source code did not seem to be OpenMP ready. However, many loops can be threaded by inserting the directive #pragma omp parallel for right before the loop. Thus, time-consuming loops in the SWI-Prolog proof procedure were located and the OpenMP directive was attached to them. The source code was compiled with the -openmp option, a third build of the Prolog interpreter was produced, and 8 threads were used. If the reader wishes to build this parallelized version of Prolog, the following must be done.

In ~/swipl-<version>/src/pl-main.c, add #include <omp.h> to the header section of pl-main.c. If you choose, you can also add omp_set_num_threads(8) inside the main function to specify 8 OpenMP threads; recall that this experiment's environment provides 4 cores with hyper-threading turned on (2 threads per core), hence 8 threads. Otherwise, leave it out and OpenMP will automatically use the maximum number of threads it can.

int main(int argc, char **argv)
{ omp_set_num_threads(8);

#if O_CTRLC
  main_thread_id = GetCurrentThreadId();
  SetConsoleCtrlHandler((PHANDLER_ROUTINE)consoleHandlerRoutine, TRUE);
#endif

#if O_ANSI_COLORS
  PL_w32_wrap_ansi_console();	/* decode ANSI color sequences (ESC[...m) */
#endif

  if ( !PL_initialise(argc, argv) )
    PL_halt(1);

  for(;;)
  { int status = PL_toplevel() ? 0 : 1;

    PL_halt(status);
  }

  return 0;
}

In ~/swipl-<version>/src/pl-prof.c, add #include <omp.h> to the header section of pl-prof.c; then add #pragma omp parallel for right before the for-loop in the functions activateProfiler, add_parent_ref, profResumeParent, freeProfileNode, and freeProfileData.

int activateProfiler(prof_status active ARG_LD){
  .......... < non-relevant source code omitted > ..........

  LD->profile.active = active;
  #pragma omp parallel for
  for(i=0; i<MAX_PROF_TYPES; i++)
  { if ( types[i] && types[i]->activate )
      (*types[i]->activate)(active);
  }
  .......... < non-relevant source code omitted > ..........

  return TRUE;
}



static void add_parent_ref(node_sum *sum,
	       call_node *self,
	       void *handle, PL_prof_type_t *type,
	       int cycle)
{ prof_ref *r;

  sum->calls += self->calls;
  sum->redos += self->redos;

  #pragma omp parallel for
  for(r=sum->callers; r; r=r->next)
  { if ( r->handle == handle && r->cycle == cycle )
    { r->calls += self->calls;
      r->redos += self->redos;
      r->ticks += self->ticks;
      r->sibling_ticks += self->sibling_ticks;

      return;
    }
  }

  r = allocHeapOrHalt(sizeof(*r));
  r->calls = self->calls;
  r->redos = self->redos;
  r->ticks = self->ticks;
  r->sibling_ticks = self->sibling_ticks;
  r->handle = handle;
  r->type = type;
  r->cycle = cycle;
  r->next = sum->callers;
  sum->callers = r;
}



void profResumeParent(struct call_node *node ARG_LD)
{ call_node *n;

  if ( node && node->magic != PROFNODE_MAGIC )
    return;

  LD->profile.accounting = TRUE;
  #pragma omp parallel for
  for(n=LD->profile.current; n && n != node; n=n->parent)
  { n->exits++;
  }
  LD->profile.accounting = FALSE;

  LD->profile.current = node;
}



static void freeProfileNode(call_node *node ARG_LD)
{ call_node *n, *next;

  assert(node->magic == PROFNODE_MAGIC);

  #pragma omp parallel for
  for(n=node->siblings; n; n=next)
  { next = n->next;

    freeProfileNode(n PASS_LD);
  }

  node->magic = 0;
  freeHeap(node, sizeof(*node));
  LD->profile.nodes--;
}



static void freeProfileData(void)
{ GET_LD
  call_node *n, *next;

  n = LD->profile.roots;
  LD->profile.roots = NULL;
  LD->profile.current = NULL;

  #pragma omp parallel for
  for(; n; n=next)
  { next = n->next;
    freeProfileNode(n PASS_LD);
  }

  assert(LD->profile.nodes == 0);
}

The test employs a 20-disk instance of the problem, which is accomplished after 3,145,724 inferences. The time was measured using the Prolog predicate time. Each test ran 300 times in a loop, and results much higher than the others were discarded as outliers. Figure 2 presents the CPU time consumed by all three configurations.


Figure 2: CPU time consumed by SWI-Prolog compiled with gcc, Intel tools, Intel tools+OpenMP.

Taking the gcc-compiled Prolog as the baseline, the speedup obtained with the Intel tools was 1.35x. This is a good result, since the source code was not changed at all: parallelism was not explicitly explored by the developer and no specialized functions were called; all the work was delegated to the Intel® C++ Compiler and libraries. When the Intel implementation of OpenMP 4.0 was also used, the speedup increased to 4.60x.

Conclusion

This article deliberately focused on logic-based AI. It shows that the benefits of using Intel development tools for AI problems are not restricted to machine learning. A common distribution of Prolog was compiled with the Intel® C++ Compiler, Intel® MKL, and the Intel implementation of OpenMP 4.0. A significant acceleration was obtained, even though the Prolog inference mechanism is not easily optimized. Therefore, any solution to a symbolic logic problem implemented in this Prolog interpreter will be powered by an enhanced engine.

References

1. Intel. Getting Started with Intel® Parallel Studio XE 2017 Cluster Edition for Linux*, Intel® Parallel Studio 2017 Documentation, 2017.

2. SWI-Prolog, 2017. http://www.swi-prolog.org/, accessed June 18, 2017.

3. Swiprolog - Summary and Version Information, High Performance Computing, Division of Information Technology, University of Maryland, 2017. http://hpcc.umd.edu/hpcc/help/software/swiprolog.html, accessed June 20, 2017.

4. A. Beck, M. N. Bleicher, D. W. Crowe, Excursions into Mathematics, A K Peters, 2000.

5. Russell, Stuart; Norvig, Peter. Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence, Pearson Education Inc., 2nd edition, 2003.


Machine learning on Intel® FPGAs


Introduction

Artificial intelligence (AI) originated in classical philosophy and has been loitering in computing circles for decades. Twenty years ago, AI surged in popularity, but interest waned as technology lagged. Today, technology is catching up, and AI’s resurgence far exceeds its past glimpses of popularity. This time, the compute, data sets, and technology can deliver, and Intel leads the AI pack in innovation.

Among Intel’s many technologies contributing to AI’s advancements, field-programmable gate arrays (FPGAs) provide unique and significant value propositions across the spectrum. Understanding the current and future capabilities of Intel® FPGAs requires a solid grasp on how AI is transforming industries in general.

AI Is Transforming Industries

Industries in all sectors benefit from AI. Three key factors contribute to today’s successful resurgence of AI applications:

  • Large data sets
  • Recent AI research
  • Hardware performance and capabilities

The combination of massive data collections, improved algorithms, and powerful processors enables today’s ongoing, rapid advancements in machine learning, deep learning, and artificial intelligence overall. AI applications now touch the entire data spectrum from data centers to edge devices (including cars, phones, cameras, home and work electronics, and more), and infiltrate every segment of technology, including:

  • Consumer devices
  • Enterprise efficiency systems
  • Healthcare, energy, retail, transportation, and others

Some of AI’s largest impacts are found in self-driving vehicles, financial analytics, surveillance, smart cities, and cyber security. Figure 1 illustrates AI’s sizable impact on just a few areas.


Figure 1. Examples of how AI is transforming industries.

To support AI’s growth today and well into the future, Intel provides a range of AI products in its AI ecosystem. Intel® FPGAs are a key component in this ecosystem.

Intel’s AI Ecosystem and Portfolio

As a technology leader, Intel offers a complete AI ecosystem that concentrates far beyond today’s AI—Intel is committed to fueling the AI revolution deep into the future. It’s a top priority for Intel, as demonstrated through in-house research, development, and key acquisitions. FPGAs play an important role in this commitment.

Intel’s comprehensive, flexible, and performance-optimized AI portfolio of products for machine and deep learning covers the entire spectrum from hardware platforms to end user applications, as shown in Figure 2, including:

  • Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)
  • Compute Library for Deep Neural Networks
  • Deep Learning Accelerator Library for FPGAs
  • Frameworks such as Caffe* and TensorFlow*
  • Tools like the Deep Learning Deployment Toolkit from Intel


Figure 2. Intel’s AI Portfolio of products for machine and deep learning.

Overall, Intel provides a unified front end for a broad variety of backend hardware platforms, enabling users to develop a system with one device today and seamlessly switch to a newer, different hardware platform tomorrow. The comprehensive nature of Intel's AI ecosystem and portfolio means Intel is uniquely situated to help developers at all levels access the full capacity of Intel hardware platforms, both current and future. This approach empowers hardware and software developers to take advantage of FPGAs' capabilities with machine learning, leading to increased productivity and shorter design cycles.

The Intel® FPGA Effect

Intel® FPGAs offer unique value propositions, and they are now enabled for Intel’s AI ecosystem. Intel® FPGAs provide excellent system acceleration with deterministic low latency, power efficiency, and future proofing, as illustrated in Figure 3.


Figure 3. Intel® FPGAs offer unique value propositions for AI.

System Acceleration

Today, people are looking for ways to leverage CPU and GPU architectures to get more total processing operations out of them, which helps with compute performance. FPGAs, in contrast, are concerned with system performance. Intel® FPGAs accelerate and aid the compute and connectivity required to collect and process the massive quantities of information around us by controlling the data path. In addition to being used for compute offload, FPGAs can also receive data directly and process it inline without going through the host system. This frees the processor to manage other system events and provides higher real-time system performance.

Real time is key. AI often relies on real-time processing to draw instantaneous conclusions and respond accurately. Imagine a self-driving car waiting for feedback after another car brakes hard or a deer leaps from the bushes. Immediacy has been a challenge given the amount of data involved, and lag can mean the difference between responding to an event and missing it entirely.

FPGAs’ flexibility enables them to deliver deterministic low latency (the guaranteed upper limit on the amount of time between a message sent and received under all system conditions) and high bandwidth. This flexibility supports the creation of custom hardware for individual solutions in an optimal way. Regardless of the custom or standard data interface, topology, or precision requirement, an FPGA can implement the exact architecture defined, which allows for unique solutions and fixed data paths. This also equates to excellent power efficiency and future proofing.

Power Efficiency

FPGAs' ability to implement custom solutions also makes them power efficient. They enable solutions that address specific problems, in the way each problem needs to be solved, by removing individual bottlenecks in the computation rather than pushing solutions through fixed architectures.

Intel® FPGAs have over 8 TB/s of on-die memory bandwidth. Therefore, solutions tend to keep the data on the device tightly coupled with the next compute. This minimizes the need to access external memory, which results in running at significantly lower frequencies. These lower frequencies and efficient compute implementations result in very powerful and efficient solutions. For example, FPGAs show up to an 80% power reduction when using AlexNet* (a convolutional neural network) compared to CPUs.

Future Proofing

In addition to system acceleration and power efficiency, Intel® FPGAs help future proof systems. With a technology as dynamic as machine learning, which is evolving and changing constantly, Intel® FPGAs provide flexibility unavailable in fixed devices. As precisions drop from 32-bit to 8-bit and even to binary/ternary networks, an FPGA has the flexibility to support those changes instantly. As next-generation architectures and methodologies are developed, FPGAs will be there to implement them. By reprogramming an FPGA's image, its functionality can be changed completely. Dedicated ASICs can carry a higher total cost of ownership (TCO) in the long run, and with such a dynamic technology there is an ever higher threshold to warrant building them, especially if FPGAs can meet a system's needs.

Some markets demand longevity and high reliability from hardware, with systems being deployed for 5, 10, 15, or more years in harsh environments. For example, imagine putting smart cameras on the street or compute systems in automobiles and requiring the same 18-month refresh cycle that CPUs and GPUs expect. The FPGA's flexibility enables users to update the hardware capabilities of a system without requiring a hardware refresh, which results in longer lifespans for deployed products. FPGAs have a history of long production cycles, with devices being built for well over 15 to 20 years, and they have been used in space, military, and extremely high-reliability environments for decades.

For these reasons and more, developers at all levels need to understand how Intel's AI ecosystem and portfolio employ Intel® FPGAs. This knowledge will enable developers to use Intel® FPGAs to accelerate AI applications and extend their life and efficiency.

Increased Productivity and Shortened Design Cycles

Most developers know FPGAs are flexible and robust devices providing a wide variety of uses:

  • FPGAs can become any digital circuit as long as the unit has enough logic blocks to implement that circuit.
  • Their flexible platform enables custom system architectures that other devices simply cannot efficiently support.
  • FPGAs can perform inline data processing, such as machine learning, on data from a video camera or Ethernet stream, for example, and then pass the results to a storage device or to the processor for further processing. FPGAs can do this while simultaneously performing compute offload in parallel.

But not all developers know how to access Intel® FPGAs’ potential or that they can do so with shorter-than-ever design cycles (illustrated in Figure 4).


Figure 4. Intel’s AI ecosystem is now enabled for FPGA.

To help developers bring FPGAs to market running machine learning workloads, Intel has shortened the design time for developers by creating a set of API layers. Developers can interface with the API layers based on their level of expertise, as outlined in Figure 5.


Figure 5. Four Entry Points for Developers

Typical users can start at the SDK or framework level. More advanced users who want to build their own software stack can enter at the Software API layer, which abstracts away the lower-level OpenCL™ runtime and is the same API the libraries use. Customers building their own software stack can also enter at the C++ Embedded API layer. Advanced platform developers who want to add more than machine learning to their FPGA (such as support for asynchronous parallel compute offload functions or modified source code) can enter at the OpenCL™ Host Runtime API level, or at the Intel Deep Learning Architecture Library level if they want to customize the machine learning library.

Several design entry methods are available for power users looking to modify source code and customize the topology by adding custom primitives. Developers can customize their solutions using traditional RTL (Verilog or VHDL), which is common for FPGA developers, or higher-level compute languages such as C/C++ or OpenCL™. By offering these entry points, Intel makes implementing FPGAs accessible to developers with a variety of skillsets in a timely manner.

Conclusion

Intel is uniquely positioned for AI development: its AI ecosystem offers solutions for all aspects of AI by providing a unified front end for a variety of backend technologies, from hardware to edge devices. In addition, Intel's ecosystem is now fully enabled for FPGAs. Intel® FPGAs provide numerous benefits, including system acceleration opportunities, power efficiency, and future proofing, thanks to their long lifespans, flexibility, and re-configurability. Finally, to help propel AI today and into the future, Intel AI solutions allow a variety of language-agnostic entry points for developers at all skillset levels.

Optimizing Computer Applications for Latency: Part 1: Configuring the Hardware


For most applications, we think about performance in terms of throughput. What matters is how much work an application can do in a certain amount of time. That’s why hardware is usually designed with throughput in mind, and popular software optimization techniques aim to increase it.

However, there are some applications where latency is more important, such as High Frequency Trading (HFT), search engines and telecommunications. Latency is the time it takes to perform a single operation, such as delivering a single packet. Latency and throughput are closely related, but the distinction is important. You can sometimes increase throughput by adding more compute capacity; for example: double the number of servers to do twice the work in the same amount of time. But you can’t deliver a particular message any quicker without optimizing the way the messages are handled within each server.

Some optimizations improve both latency and throughput, but there is usually a trade-off. Throughput solutions tend to store packets in a buffer and process them in batches, but low latency solutions require every packet to be processed immediately.

Consistency is also important. In HFT, huge profits and losses can be made on global events. When news breaks around elections or other significant events, there can be bursts of trading activity with significant price moves. Having an outlier (a relatively high latency trade) at this busy time could result in significant losses.

Latency tuning is a complex topic requiring a wide and deep understanding of networking, kernel organization, CPU and platform performance, and thread synchronization. In this paper, I’ll outline some of the most useful techniques, based on my work with companies in telecommunications and HFT.

Understanding the Challenge of Latency Optimization

Here’s an analogy to illustrate the challenge of latency optimization. Imagine a group of people working in an office, who communicate by passing paper messages. Each message contains the data of a sender, recipient and an action request. Messages are stored on tables in the office. Some people receive messages from the outside world and store them on the table. Others take messages from the table and deliver them to one of the decision makers. Each decision maker only cares about certain types of messages.

The decision makers read the messages and decide whether the action request is to be fulfilled, postponed or cancelled. The requests that will be fulfilled are stored on another table. Messengers take these requests and deliver them to the people who will carry out the actions. That might involve sending the messages to the outside world, and sending confirmations to the original message sender.

To complicate things even more, there is a certain topology of message-passing routes. For example, the office building might have a complicated layout of rooms and corridors and people may need access to some of the rooms. Under normal conditions the system may function reasonably well in handling, let’s say, two hundred messages a day with an average message turnaround of five minutes.

Now, the goal is to dramatically reduce the turnaround time. At the same time, you want to make sure the turnaround time for a message is never more than twice the average. In other words, you want to be able to handle the bursts in activity without causing any latency outliers.

So, how can you improve office efficiency? You could hire more people to move messages around (increasing throughput), and hire faster people (reducing latency). I can imagine you might reduce latency from five minutes to two minutes (maybe even slightly less if you manage to hire Usain Bolt). But you will eventually hit a wall. There is no one faster than Bolt, right? Comparing this approach to computer systems, the people represent processes and this is about executing more threads or processes (to increase throughput) and buying faster computers (to cut latency).

Perhaps the office layout is not the best for the job. It’s important that everyone has enough space to do their job efficiently. Are corridors too narrow so people get stuck there? Make them wider. Are rooms tiny, so people have to queue to get in? Make them bigger. This is like buying a computer with more cores, larger caches and higher memory and I/O bandwidth.

Next, you could use express delivery services, rather than the normal postal service, for messages coming into and out of the office. In a computer system, this is about the choice of network equipment (adapters and switches) and their tuning. As in the office, the fastest delivery option might be the right choice, but is also probably the most expensive.

So now the latency is down to one minute. You can also instruct people and train them to communicate and execute more quickly. This is like tuning software to execute faster. I’ll take 15 percent off the latency for that. We are at 51 seconds.

The next step is to avoid people bumping into each other, or getting in each other’s way. We would like to enable all the people taking messages from the table and putting messages on it to do so at the same time, with no delay. We may want to keep messages sorted in some way (in separate boxes on the table) to streamline the process. There may also be messages of different priorities. In software, this is about improving thread synchronization. Threads should have as parallel and as quick access to the message queue as possible. Fixing bottlenecks increases throughput dramatically, and should also have some effect on latency. Now we can handle bursts of activity, although we do still have the risk of outliers.

People might stop for a chat sometimes or a door may stick in a locked position. There are a lot of little things that could cause delay. The highest priority is to ensure the following: that there are never more people than could fit into a particular space, there are no restrictions on people’s speed, there are no activities unrelated to the current job, and there is no interference from other people. For a computer application, this means we need to ensure that it never runs out of CPU cores, power states are set to maximum performance, and kernel (operating system) or middleware activities are isolated so they do not evict application thread activities.

Now let’s consider whether the office environment is conducive to our goal. Can people open doors easily? Are the floors slippery, so people have to walk with greater care and less speed? The office environment is like the kernel of an operating system. If the office environment can’t be made good enough, perhaps we can avoid part of it. Instead of going through the door, the most dexterous could pass a message through a window. It might be inconvenient, but it’s fast. This is like using kernel bypass solutions for networking.

Instead of relying on a kernel network stack, kernel bypass solutions implement user space networking. It helps to avoid unnecessary memory copies (kernel space to user space) and avoids the scheduler delay when placing the receiver thread for execution. In kernel bypass, the receiver thread typically uses busy-waiting. Rather than waiting on a lock, it continuously checks the lock variable until it flags: “Go!”
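As a rough illustration of that busy-waiting idea (a minimal C sketch, not code from any particular kernel-bypass product; the flag and payload variables are invented for the example), the receiver thread below spins on an atomic flag in user space instead of blocking in the kernel:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int data_ready;   /* the "lock variable" the receiver keeps checking */
static int payload;             /* stands in for a received packet */

static void *receiver(void *arg)
{
    (void)arg;
    /* Busy-wait: no kernel sleep, so there is no wake-up latency to pay. */
    while (!atomic_load_explicit(&data_ready, memory_order_acquire))
        ;                       /* a pause/yield hint could be inserted here */
    printf("received payload %d\n", payload);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, receiver, NULL);
    payload = 42;                                                 /* the "packet" arrives */
    atomic_store_explicit(&data_ready, 1, memory_order_release);  /* the flag says "Go!" */
    pthread_join(t, NULL);
    return 0;
}

Real kernel-bypass stacks apply the same polling pattern to network descriptor rings rather than to a single flag, trading CPU cycles on a dedicated core for the lowest possible reaction time.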

On top of that there may be different methods of exchanging messages through windows. You would likely start with delivering hand to hand. This sounds reliable, but it’s not the fastest. That’s how the Transmission Control Protocol (TCP) protocol works. Moving to User Datagram Protocol (UDP) would mean just throwing messages into the receiver’s window. You don’t need to wait for a person’s readiness to get a message from your hand. Looking for further improvement? How about throwing messages through the window so they land right on the table in the message queue? In a networking world, such an approach is called remote direct memory access (RDMA). I believe the latency has been cut to about 35 seconds now.

What about an office purpose-built, according to your design? You can make sure the messengers are able to move freely and their paths are optimized. That could get the latency down to 30 seconds, perhaps. Redesigning the office is like using a field programmable gate array (FPGA). An FPGA is a compute device that can be programmed specifically for a particular application. CPUs are hardcoded, which means they can only execute a particular instruction set with a data flow design that is also fixed. Unlike CPUs, FPGAs are not hardcoded for any particular instruction set so programming them makes them able to run a particular task and only that task. Data flow is also programmed for a particular application. As with a custom-designed office, it’s not easy to create an FPGA or to modify it later. It might deliver the lowest latency, but if anything changes in the workflow, it might not be suitable any more. An FPGA is also a type of office where thousands of people can stroll around (lots of parallelism), but there’s no running allowed (low frequency). I’d recommend using an FPGA only after considering the other options above.

To go further, you’ll need to use performance analysis tools. In part two of this article, I’ll show you how Intel® VTune™ Amplifier and Intel® Processor Trace technology can be used to identify optimization opportunities.

Making the Right Hardware Choices

Before we look at tuning the hardware, we should consider the different hardware options available.

Processor choice

One of the most important decisions is whether to use a standard CPU or an FPGA.

The most extreme low latency solutions are developed and deployed on FPGAs. Despite the fact that FPGAs are not particularly fast in terms of frequency, they are nearly unlimited in terms of parallelism, because the device can be designed to satisfy the demands of the task at hand. This only makes a difference if the algorithm is highly parallel. There are two ways that parallelism helps. First, it can handle a huge number of packets simultaneously, so it handles bursts very well with a stable latency. As soon as there are more packets than cores in a CPU, there will be a delay; this has more of an impact on throughput than on latency. The second way that parallelism helps is at the instruction level. A CPU core can typically execute only about four instructions per cycle, whereas an FPGA can carry out a nearly unlimited number of instructions simultaneously. For example, it can parse all the fields of an incoming packet concurrently. This is why it delivers lower latency despite its lower frequency.

In low latency applications, the FPGA usually receives a network signal through a PHY chip and does a full parsing of the network packets. It takes roughly half the time, compared to parsing and delivering packets from a network adapter to a CPU (even using the best kernel bypass solutions). In HFT, Ethernet is typically used because exchanges provide Ethernet connectivity. FPGA vendors provide Ethernet building blocks for various needs.

Some low latency solutions are designed to work across CPUs and FPGAs. Currently a typical connection is by PCI-e, but Intel has announced a development module using Intel® Xeon® processors together with FPGAs, where connectivity is by Intel® QuickPath Interconnect (Intel® QPI) link. This reduces connection latency significantly and also increases throughput.

When using CPU-based solutions, the CPU frequency is obviously the most important parameter for most low latency applications. The typical hardware choice is a trade-off between frequency and the number of cores. For particularly critical workloads, it’s not uncommon for server CPUs and other components to be overclocked. Overclocking memory usually has less impact. For a typical trading platform, memory accounts for about 10 percent of latency, though your mileage may vary, so the gains from overclocking are limited. In most cases, it isn’t worth trying it. Be aware that having more DIMMs may cause a drop in memory speed.

Single-socket servers operating independently are generally better suited for latency because they eliminate some of the complications and delay associated with ensuring consistent inter-socket communication.

Networking

The lowest latencies and the highest throughputs are achieved by high-performance computing (HPC) specialized interconnect solutions, which are widely used in HPC clusters. For Infiniband* interconnect, the half-roundtrip latency could be as low as 700 nanoseconds. (The half-roundtrip latency is measured from the moment a packet arrives at the network port of a server, until the moment the response has been sent from the server’s network port).

In HFT and telco, long range networking is usually based on Ethernet. To ensure the lowest possible latency when using Ethernet, two critical components must be used - a low latency network adapter and kernel bypass software. The fastest half-roundtrip latency you can get with kernel bypass is about 1.1 microseconds for UDP and slightly slower with TCP. Kernel bypass software implements the network stack in user space and eliminates bottlenecks in the kernel (superfluous data copies and context switches).

DPDK

Another high throughput and low latency option for Ethernet networking is the Data Plane Development Kit (DPDK). DPDK dedicates certain CPU cores to be the packet receiver threads and uses a permanent polling mode in the driver to ensure the quickest possible response to arriving packets. For more information, see http://dpdk.org/

Storage

When we consider low latency applications, storage is rarely on a low latency path. When we do consider it, the best solution is a solid state drive (SSD). With access latencies of dozens of microseconds, SSDs are much faster than hard drives. There are PCI-e-based NVMe SSDs that provide the lowest latencies and the highest bandwidths.

Intel has announced the 3D XPoint™ technology, and released the first SSD based on it. These disks bring latency down to several microseconds. This makes the 3D XPoint technology ideal for high performance SSD storage, delivering up to ten times the performance of NAND across a PCIe NVMe interface. An even better alternative in the future will be non-volatile memory based on 3D XPoint technology.

Tuning the Hardware for Low Latency

The default hardware settings are usually optimized for the highest throughput and reasonably low power consumption. When we’re chasing latency, that’s not what we are looking for. This section provides a checklist for tuning the hardware for latency.

In addition to these suggestions, check for any specific guidance from OEMs on latency tuning for their servers.

BIOS settings
  • Ensure that Turbo is on.

  • Disable lower CPU power states. Settings vary among different vendors, so after turning C-states off, you should check whether there are extra settings like C1E, and memory and PCI-e power saving states, which should also be disabled.

  • Check for other settings that might influence performance. This varies greatly by OEM, but should include anything power related, such as fan speed settings.

  • Disable hyper-threading to reduce variations in latency (jitter).

  • Disable any virtualization options.

  • Disable any monitoring options.

  • Disable Hardware Power Management, introduced in the Intel® Xeon® processor E5-2600 v4 product family. It provides more control over power management, but it can cause jitter and so is not recommended for latency-sensitive applications.

Networking
  • Ensure that the network adapter is inserted into a PCI-e slot attached to the socket where the receiver thread is running. That shaves off inter-socket communication latency and allows Intel® Data Direct I/O Technology to place data directly into the last level cache (LLC) of the same socket.

  • Bind network interrupts to a core running on the same socket as the receiver thread. Check entry N in /proc/interrupts (where N is the interrupt queue number) and then set the affinity mask with:
    echo <cpu_mask> > /proc/irq/N/smp_affinity

  • Disable interrupt coalescing. Usually the default mode is adaptive, which is much better than any fixed setting, but it is still several microseconds slower than disabling coalescing. The recommended setting is:
    ethtool -C <interface> rx-usecs 0 rx-frames 0 tx-usecs 0 tx-frames 0 pkt-rate-low 0 pkt-rate-high 0

Kernel bypass
  • Kernel bypass solutions usually come tuned for latency, but there still may be some useful options to try out such as polling settings.

Kernel tuning
  • Set the correct power mode. Edit /boot/grub/grub.conf and add the following to the kernel line:
    nosoftlockup intel_idle.max_cstate=0 processor.max_cstate=0 mce=ignore_ce idle=poll
    For more information, see www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt

  • Turn off the cpuspeed service.

  • Disable unnecessary kernel services to avoid jitter.

  • Turn off the IRQ Balance service if interrupt affinity has been set.

  • Try tuning IPv4 parameters. Although this is more important for throughput, it can help to handle bursts of network activity.

  • Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0

  • Disable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=0

  • Increase the maximum length of processor input queues:
    sysctl -w net.core.netdev_max_backlog=250000

  • Increase the TCP maximum and default buffer sizes using setsockopt():
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.core.rmem_default=16777216
    sysctl -w net.core.wmem_default=16777216
    sysctl -w net.core.optmem_max=16777216

  • Increase memory thresholds to prevent packet dropping:
    sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

  • Increase the Linux* auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are shown below (in the order minimum, default, and maximum):
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

  • Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1

  • For tuning the network stack, there is a good alternative:
    tuned-adm profile network-latency

  • Disable iptables.

  • Set the scaling governor to “performance” mode for each core used by a process:
    for ((i=0; i<num_of_cores; i++)); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done

  • Configure the kernel as preemptive to help reduce the number of outliers.

  • Use a tickless kernel to help eliminate any regular timer interrupts causing outliers.

  • Finally, use the isolcpus parameter to isolate the cores allocated to an application from OS processes.

Conclusion

This article provides an introduction to the challenge of latency tuning, the hardware choices available, and a checklist for configuring that hardware for low latency. In the second article in this series, we look at application tuning, including a working example.

Optimizing Computer Applications for Latency: Part 2: Tuning Applications


For applications such as high frequency trading (HFT), search engines, and telecommunications, it is essential to minimize latency. My previous article, Optimizing Computer Applications for Latency: Part 1, looked at the architecture choices that support a low latency application. This article builds on that to show how latency can be measured and tuned within the application software.

Using Intel® VTune™ Amplifier

Intel® VTune™ Amplifier XE can collect and display a lot of useful data about an application’s performance. You can run a number of pre-defined collections (such as parallelism and memory analysis) and see thread synchronization on a timeline. You can break down activity by process, thread, module, function, or core, and break it down by bottleneck too (memory bandwidth, cache misses, and front-end stalls).

Intel VTune can be used to identify many important performance issues, but it struggles with analyzing intervals measured in microseconds. Intel VTune uses periodic interrupts to collect data and save it. The frequency of those interrupts is limited to roughly one collection point per 100 microseconds per core. While you can filter the data to observe some of the outliers, the data on any single outlier will be limited, and some might be missed by the sampling frequency.

You can download a free trial of Intel VTune Amplifier XE. Read about the VTune Amplifier XE capabilities.

Figure 1. Intel® VTune™ Amplifier XE 2017, showing hotspots (above) and concurrency (below) analyses

Using Intel® Processor Trace Technology

The introduction of Intel® Processor Trace (Intel® PT) technology in the Broadwell architecture, for example in the Intel® Xeon® processor E5-2600 v4, makes it possible to analyze outliers in low latency applications. Intel® PT is a hardware feature that logs information about software execution with minimal impact on system execution. It supports control flow tracing, so decoder software can be used to determine the exact flow of the software execution, including branches taken and not taken, based on the trace log. Intel PT can store both cycle count and timestamp information for deep performance analysis. If you can time stamp other measurements, traces, and screenshots you can synchronize the Intel PT data with them. The granularity of a capture is a basic block. Intel PT is supported by the “perf” performance analysis tool in Linux*.

Typical Low Latency Application Issues

Low latency applications can suffer from the same bottlenecks as any kind of application, including:

  • Using excessive system library calls (such as inefficient memory allocations or string operations)

  • Using outdated instruction sets, because of obsolete compilers or compiler options

  • Memory and other runtime issues leading to execution stalls

On top of those, latency-sensitive applications have their own specific issues. Unlike in high-performance computing (HPC) applications, where loop bodies are usually small, the loop body in a low latency application usually covers a packet processing instruction path. In most cases, this leads to heavy front-end stalls because the decoded instructions for the entire packet processing path do not fit into the instruction (uop) cache. That means instructions have to be decoded on the fly for each loop iteration. Between 40 and 50 per cent of CPU cycles can stall due to the lack of instructions to execute.

Another specific problem is due to inefficient thread synchronization. The impact of this usually increases with a higher packet/transaction rate. Higher latency may lead to a limited throughput as well, making the application less able to handle bursts of activity. One example I’ve seen in customer code is guarding a single-threaded queue with a lock to use it in a multithreaded environment. That’s hugely inefficient. Using a good multithreaded queue, we’ve been able to improve throughput from 4,000 to 130,000 messages per second. Another common issue is using thread synchronization primitives that go to kernel sleep mode immediately or too soon. Every wake-up from kernel sleep takes at least 1.2 microseconds.
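To illustrate the kind of threading primitive that avoids both locks and kernel sleeps, here is a minimal sketch of a lock-free single-producer/single-consumer ring buffer using C11 atomics. It only illustrates the idea (the names and the power-of-two capacity are choices made for this example) and is not the spsc.c test case used in the exercise later in this article:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QCAP 1024                           /* capacity, must be a power of two */

typedef struct {
    uint64_t buf[QCAP];
    _Atomic uint64_t head;                  /* advanced only by the consumer */
    _Atomic uint64_t tail;                  /* advanced only by the producer */
} spsc_queue;

/* Producer side: returns false when the queue is full. */
static bool spsc_push(spsc_queue *q, uint64_t v)
{
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QCAP)
        return false;                       /* full */
    q->buf[tail & (QCAP - 1)] = v;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the queue is empty. */
static bool spsc_pop(spsc_queue *q, uint64_t *v)
{
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;                       /* empty */
    *v = q->buf[head & (QCAP - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

Because the producer only writes tail and the consumer only writes head, a single release store on each side is enough to publish progress, and neither side ever has to block in the kernel.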

One of the goals of a low latency application is to reduce the quantity and extent of outliers. Typical reasons for jitter (in descending order) are:

  • Thread oversubscriptions, accounting for a few milliseconds

  • Runtime garbage collector activities, accounting for a few milliseconds

  • Kernel activities, accounting for up to 100s of microseconds

  • Power-saving states:

    • CPU C-states, accounting for 10s to 100s of microseconds

    • Memory states

    • PCI-e states

  • Turbo mode frequency switches, accounting for 7 microseconds

  • Interrupts, IO, timers: responsible for up to a few microseconds

Tuning the Application

Application tuning should begin by tackling any issues found by Intel VTune. Start with the top hotspots and, where possible, eliminate or reduce excessive activities and CPU stalls. This has been widely covered by others before, so I won’t repeat their work here. If you’re new to Intel VTune, there’s a Getting Started guide.

In this article, we will focus on the specifics of low latency applications. The biggest issue arises from front-end stalls in the instruction decoding pipeline. This issue is difficult to address, and results from the loop body being too big for the uop cache. One approach that might help is to split the packet processing loop and have a number of threads process it, passing execution from one to another. There will be a synchronization overhead, but if the instruction sequence fits into a few uop caches (each thread bound to a different core, one cache per thread), it may well be worth the exercise.

Thread synchronization issues are somewhat difficult to monitor. Intel VTune Amplifier has a collection that captures all standard thread sync events (Windows* Thread API, pthreads* API, Intel® Threading Building Blocks, and OpenMP*). It helps to understand what is going on in the application, but deeper analysis is required to see if a thread sync model introduces any limitations. This is a non-trivial exercise requiring considerable expertise. The best advice is to use a highly performant threading solution.

An interesting topic is thread affinities. For complex systems with multiple thread synchronization patterns along the workflow, setting the best affinities may bring some benefit. A synchronization object is a variable or data structure, plus its associated lock/release functionality. Threads synchronized on a particular object should be pinned to cores of the same socket, but they don’t need to be on the same core. Generally, the goal of this exercise is to keep thread synchronization on a particular object local to one of the sockets, because cross-socket thread sync is much costlier.
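On Linux*, pinning can be done with the taskset command (as in the exercise later in this article) or programmatically from the application. The following minimal sketch uses the Linux-specific pthread_setaffinity_np(); the core numbers 0 and 2 are placeholders that you would replace with two cores of the same socket on your machine:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin a thread to a single core; returns 0 on success. */
static int pin_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

static void *worker(void *arg)
{
    (void)arg;
    /* ... latency-critical work synchronizing on a shared object ... */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    /* The two threads share a synchronization object, so keep them on one socket. */
    if (pin_to_core(a, 0) != 0 || pin_to_core(b, 2) != 0)
        fprintf(stderr, "failed to set thread affinity\n");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}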

Tackling Outliers in Virtual Machines

If the application runs in a Java* or .NET* virtual machine, the virtual machine needs to be tuned. The garbage collector settings are particularly important. For example, try tuning the tenuring threshold to avoid unnecessary moves of long-lived objects. This often helps to reduce latency and cut outliers down.

One useful technology introduced in the Intel® Xeon® processor E5-2600 v4 product family is Cache Allocation Technology. It allows a certain amount of last level cache (LLC) to be dedicated to a particular core, process, or thread, or to a group of them. For example, a low latency application might get exclusive use of part of the cache so anything else running on the system won’t be able to evict its data.

Another interesting technique is to lock the hottest data in the LLC “indefinitely”. This is a particularly useful technique for outlier reduction. The hottest data is usually considered to be the data that’s accessed most often, but for low latency applications it can instead be the data that is on a critical latency path. A cache miss costs roughly 50 to 100 nanoseconds, so a few cache misses can cause an outlier. By ensuring that critical data is locked in the cache, we can reduce the number and intensity of outliers.

For more information on Cache Allocation Technology, see Using Hardware Features in Intel® Architecture to Achieve High Performance in NFV.           
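Cache Allocation Technology itself is configured at the platform level rather than from application code. As a rough stand-in that illustrates the same intent of keeping latency-critical data resident, the sketch below uses a plain cache-warming thread that periodically re-touches the hot data so it is less likely to be evicted between bursts. This is a simpler technique than CAT, it gives none of CAT's hard guarantees, and the buffer size and interval are arbitrary example values:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <unistd.h>

#define HOT_SIZE 4096                       /* size of the critical-path data (example value) */

static volatile unsigned char hot_data[HOT_SIZE];
static atomic_int keep_warming = 1;

/* Background thread: touch one byte per 64-byte cache line so the data stays warm. */
static void *cache_warmer(void *arg)
{
    (void)arg;
    while (atomic_load(&keep_warming)) {
        for (size_t i = 0; i < HOT_SIZE; i += 64)
            (void)hot_data[i];              /* volatile read pulls the line back into cache */
        usleep(100);                        /* re-warm roughly every 100 microseconds */
    }
    return NULL;
}

A thread running cache_warmer() would be started at application startup with pthread_create() and stopped by clearing keep_warming; the critical-path code then finds hot_data already resident instead of paying 50 to 100 nanoseconds per miss.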

Exercise

Let’s play with a code sample implementing a lockless single-producer single-consumer queue (spsc.c), available from this article’s download link.

To start, grab the source code for the test case from the download link above. Build it like this:

gcc spsc.c -lpthread -lm -o spsc

Or

icc spsc.c -lpthread -lm -o spsc

Here’s how you run the spsc test case:

./spsc 100000 10 100000

The parameters are: numofPacketsToSend bufferSize numofPacketsPerSecond. You can experiment with different numbers.

Let’s check how the latency is affected by CPU power-saving settings. Set everything in the BIOS to maximum performance, as described in Part 1 of this series. Specifically, CPU C-states must be set to off and the correct power mode should be used, as described in the Kernel Tuning section. Also, ensure that cpuspeed is off.

Next, set the CPU scaling governor to powersave. In this command, the loop index i should cover all logical CPUs on your system, so adjust the upper bound to match your machine:

for ((i=0; i<23; i++)); do echo powersave > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor;  done

Then set all threads to stay on the same NUMA node using taskset, and run the test case:

taskset -c 0,1 ./spsc 1000000 10 1000000

On a server based on the Intel® Xeon® Processor E5-2697 v2, running at 2.70GHz, we see the following results for average latency with and without outliers, the highest and lowest latency, the number of outliers and the standard deviation (with and without outliers). All measurements are in microseconds:

taskset -c 0,1 ./spsc 1000000 10 1000000

Avg lat = 0.274690, Avg lat w/o outliers = 0.234502, lowest lat = 0.133645, highest lat = 852.247954, outliers = 4023

Stdev = 0.001214, stdev w/o outliers = 0.001015

Now set the performance mode (overwriting the powersave mode) and run the test again:

for ((i=0; i<23; i++)); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor;  done

taskset -c 0,1 ./spsc 1000000 10 1000000

Avg lat = 0.067001, Avg lat w/o outliers = 0.051926, lowest lat = 0.045660, highest lat = 422.293023, outliers = 1461

Stdev = 0.000455, stdev w/o outliers = 0.000560

As you can see, all the performance metrics improved significantly when we enabled performance mode: the average latency, the lowest and highest latency, and the number of outliers. (Table 1 summarizes all the results from this exercise for easy comparison.)

Let’s compare how the latency is affected by NUMA locality. I’m assuming you have a machine with more than one processor. We’ve already run the test case bound to a single NUMA node.

Let’s run the test case over two nodes:

taskset -c 8,16 ./spsc 1000000 10 1000000

Avg lat = 0.248679, Avg lat w/o outliers = 0.233011, lowest lat = 0.069047, highest lat = 415.176207, outliers = 1926

Stdev = 0.000901, stdev w/o outliers = 0.001103

All of the metrics, except for the highest latency, are better on a single NUMA node. This results from the cost of communicating with another node, because data needs to be transferred over the Intel® QuickPath Interconnect (Intel® QPI) link and through the cache coherency mechanism.

Don’t be surprised that the highest latency is lower on two nodes. You can run the test multiple times and verify that the highest latency outliers are roughly the same for both one node and two nodes. The lower value shown here for two nodes is most likely a coincidence. The outliers are two to three orders of magnitude higher than the average latency, which shows that NUMA locality doesn’t matter for the highest latency. The outliers are caused by kernel activities that are not related to NUMA.

Test | Avg Lat | Avg Lat w/o Outliers | Lowest Lat | Highest Lat | Outliers | Stdev | Stdev w/o Outliers
Powersave | 0.274690 | 0.234502 | 0.133645 | 852.247954 | 4023 | 0.001214 | 0.001015
Performance, 1 node | 0.067001 | 0.051926 | 0.045660 | 422.293023 | 1461 | 0.000455 | 0.000560
Performance, 2 nodes | 0.248679 | 0.233011 | 0.069047 | 415.176207 | 1926 | 0.000901 | 0.001103

Table 1: The results of the latency tests conducted under different conditions, measured in microseconds.

I also recommend playing with Linux perf to monitor outliers; Intel® Processor Trace (Intel® PT) support starts with kernel 4.1. You need to add timestamps (start, stop) for all latency intervals, identify a particular outlier, and then drill down into the perf data to see what was going on during the interval of that outlier.
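As a sketch of how those start/stop timestamps can be produced in C (process_one_packet and the threshold are hypothetical placeholders, not part of the test case), you can bracket the critical path with clock_gettime and log any interval that exceeds your outlier threshold, so it can later be lined up with the perf/Intel PT trace:

/* outlier_log.c - build with: gcc outlier_log.c -o outlier_log */
#include <stdio.h>
#include <time.h>

/* Hypothetical unit of work being measured; replace with the real critical path. */
static void process_one_packet(void)
{
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x += i;
}

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void)
{
    const double outlier_threshold_us = 5.0;    /* pick a threshold for your workload */
    for (int i = 0; i < 100000; ++i) {
        struct timespec start, stop;
        clock_gettime(CLOCK_MONOTONIC, &start);
        process_one_packet();
        clock_gettime(CLOCK_MONOTONIC, &stop);

        double us = elapsed_us(start, stop);
        if (us > outlier_threshold_us)          /* record the window to drill into with perf */
            printf("outlier %d: %.3f us starting at %ld.%09ld\n",
                   i, us, (long)start.tv_sec, start.tv_nsec);
    }
    return 0;
}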

For more information, see https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt.

Conclusion

This two-part article has summarized some of the approaches you can take, and tools you can use, when tuning applications and hardware for low latency. Using the worked example here, you can quickly see the impact of NUMA locality and powersave mode, and you can use the test case to experiment with other settings, and quickly see the impact they can have on latency.

Intel® Software Development tools integration to Microsoft* Visual Studio 2017 issue


Issue: Installation of Intel® Parallel Studio XE with Microsoft* Visual Studio 2017 integration hangs and fails on some systems. The problem is intermittent and not reproducible on every system. Any attempt to repair it fails with the message "Incomplete installation of Microsoft Visual Studio* 2017 is detected". Note that in some cases the installation may complete successfully with no errors or crashes; however, the integration with VS2017 is not installed. The issue may be observed with Intel® Parallel Studio XE 2017 Update 4, Intel® Parallel Studio XE 2018 Beta and later versions, as well as Intel® System Studio installations.

Environment: Microsoft* Windows, Visual Studio 2017

Root Cause: A root cause was identified and reported to Microsoft*. Note that there may be different reasons for integration failures. We are documenting all cases and providing them to Microsoft for further root-cause analysis.

Workaround:

Note that with Intel Parallel Studio XE 2017 Update 4 there is no workaround for this integration problem. The following workaround is expected to be implemented in Intel Parallel Studio XE 2017 Update 5; it is already implemented in Intel Parallel Studio XE 2018 Beta Update 1.

Integrate the Intel Parallel Studio XE components manually. You need to run all the files from the corresponding folders:

  • C++/Fortran Compiler IDE: <installdir>/ide_support_2018/VS15/*.vsix
  • Amplifier: <installdir>/VTune Amplifier 2018/amplxe_vs2017-integration.vsix
  • Advisor: <installdir>/Advisor 2018/advi_vs2017-integration.vsix
  • Inspector: <installdir>/Inspector 2018/insp_vs2017-integration.vsix
  • Debugger: <InstallDir>/ide_support_2018/MIC/*.vsix
                      <InstallDir>/ide_support_2018/CPUSideRDM/*.vsix

If this workaround doesn't work and installation still fails, please report the problem to Intel through the Intel® Developer Zone Forums or the Online Service Center. You will need to supply the installation log file and the error message from the Microsoft installer.

Announcing the Intel Modern Code Developer Challenge from CERN openlab


It is always an exciting time when I get to announce a Modern Code Developer Challenge from my friends at Intel, but it is even more special when I get to announce a collaboration with the brilliant minds at CERN. Beginning this month (July 2017), and running for nine weeks, five exceptional students participating in the CERN openlab Summer Student Programme are working to research and develop solutions for five modern-code-centered challenges. These are no ordinary challenges, as you might have already guessed—here is a brief summary of what they are tackling:

  1. Smash-simulation software: Teaching algorithms to be faster at simulating particle-collision events.
  2. Connecting the dots: Using machine learning to better identify the particles produced by collision events.
  3. Cells in the cloud: Running biological simulations more efficiently with cloud computing.
  4. Disaster relief: Helping computers to get better at recognizing objects in satellite maps created by a UN agency.
  5. IoT at the LHC: Integrating Internet of Things devices into the control systems for the Large Hadron Collider.

After the nine weeks of interactive support from an open community of developers, scientists, fellow students, and other people passionate about science, one of the five students will be selected to showcase their winning project at a number of leading industry events. The winner will be announced at the upcoming Intel® HPC Developers Conference on November 11, 2017, and will also be shown at the SC17 SuperComputing conference in Denver, Colorado.

Follow the Intel Developer Zone on Facebook for more announcements and information, including those about this exciting new challenge that will surely teach us a thing or two about modern coding.

I will add comments to this blog as I learn more about the opportunities to review/comment/vote on the on-going work of these five CERN openlab Summer Student Programme students working to make the world a better place!

 

Rendering Researchers: Hugues Labbe


Hugues has been passionate about graphics programming since an early age, starting with the Commodore Amiga demo scene in the mid-80s.

He earned his Master's degree in Computer Graphics from IRISA (France) in 1995. After relocating to California and helping grow a couple of San Francisco Bay Area startups in the early 2000s, he joined Intel in 2005. There he worked on optimizations of the geometry pipe for Intel’s graphics driver stack, followed by shader compiler architecture and end-to-end optimizations of the Larrabee graphics pipeline, and more recently on an end-to-end virtual reality compositor, including design, architecture, implementation, and optimizations for Intel’s Project Alloy VR headset.

Working on competitive graphics innovation for future Intel platforms, his current research focus revolves around advancing the state-of-the-art in real-time rendering, ray-tracing GPU acceleration, and GPU compiler + hardware architecture.

System Analyzer Utility for Linux


Overview

This article describes a utility to help diagnose system and installation issues for Intel(R) Computer Vision SDK, Intel(R) SDK for OpenCL(TM) Applications and Intel(R) Media Server Studio.  It is a simple Python script with full source code available.

It is intended as a reference for the kinds of checks to consider from the command line and possibly from within applications.  However, this implementation should be considered a prototype/proof of concept -- not a productized tool.

Features

When executed, the tool reports back 

  • Platform readiness: check if processor has necessary GPU components
  • OS readiness: check if OS can see GPU, and if it has required glibc/gcc level
  • Install checks for Intel(R) Media Server Studio/Intel(R) SDK for OpenCL Applications components
  • Results from runs of small smoke test programs for Media SDK and OpenCL

System Requirements

The tool is based on Python 2.7. It should run on a variety of systems, with or without the components necessary to run GPU applications. However, it is still a work in progress, so it may not always exit cleanly when software components are missing.

 

Using the Software

For a successful installation, the display should look like the output below:

$ python sys_analyzer_linux.py -v
--------------------------
Hardware readiness checks:
--------------------------
 [ OK ] Processor name: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
 [ INFO ] Intel Processor
 [ INFO ] Processor brand: Core
 [ INFO ] Processor arch: Skylake
--------------------------
OS readiness checks:
--------------------------
 [ INFO ] GPU PCI id     : 1916
 [ INFO ] GPU description: SKL ULT GT2
 [ OK ] GPU visible to OS
 [ INFO ] no nomodeset in GRUB cmdline (good)
 [ INFO ] Linux distro   : Ubuntu 14.04
 [ INFO ] Linux kernel   : 4.4.0
 [ INFO ] glibc version  : 2.19
 [ INFO ] gcc version    : 4.8.4 (>=4.8.2 suggested)
 [ INFO ] /dev/dri/card0 : YES
 [ INFO ] /dev/dri/renderD128 : YES
--------------------------
Intel(R) Media Server Studio Install:
--------------------------
 [ OK ] user in video group
 [ OK ] libva.so.1 found
 [ INFO ] Intel iHD used by libva
 [ OK ] vainfo reports valid codec entry points
 [ INFO ] i915 driver in use by Intel video adapter
 [ OK ] /dev/dri/renderD128 connects to Intel i915

--------------------------
Media SDK Plugins available:
(for more info see /opt/intel/mediasdk/plugins/plugins.cfg)
--------------------------
    H264LA Encoder 	= 588f1185d47b42968dea377bb5d0dcb4
    VP8 Decoder 	= f622394d8d87452f878c51f2fc9b4131
    HEVC Decoder 	= 33a61c0b4c27454ca8d85dde757c6f8e
    HEVC Encoder 	= 6fadc791a0c2eb479ab6dcd5ea9da347
--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] Media SDK HW API level:1.19
 [ OK ] Media SDK SW API level:1.19
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

 

 


Pirate Cove as an Example: How to Bring Steam*VR into Unity*


View PDF [1065 kb]

Who is the Target Audience of This Article?

This article is aimed at an existing Unity* developer who would like to incorporate Steam*VR into their scene. I am making the assumption that the reader already has an HTC Vive* that is set up on their computer and running correctly. If not, follow the instructions at the SteamVR site.

Why Did I Create My Pirate Cove VR Scene?

My focus at work changed and I needed to ramp up on virtual reality (VR) and working in the Unity environment. I wanted to figure out how to bring SteamVR into Unity, layout a scene, and enable teleporting so I could move around the scene.

This article is intended to talk to some points that I learned along the way as well as show you how I got VR working with my scene. I will not be talking much about laying out the scene and how I used Unity to get everything up and running; rather, the main focus of this article is to help someone get VR incorporated into their scene.

What Was I Trying to Create?

I was trying to create a virtual reality visual experience. Not a game per se, even though I was using Unity to create my experience. I created a visual experience that would simulate what a small, pirate-themed tropical island could look like. I chose something that was pleasing to me; after all, I live in the rain-infested Pacific Northwest. I wanted to experience something tropical.

What Tools Did I Use?

I used Unity* 5.6. From there I purchased a few assets from the Unity Asset Store. The assets I chose were themed around an old, tropical, pirate setting:

  • Pirates Island
  • SteamVR
  • Hessburg Naval Cutter
  • Aquas

Along with a few other free odds and ends.

What Did I Know Going Into This Project?

I had some experience with Unity while working with our Intel® RealSense™ technology. Our SDK had an Intel RealSense Unity plugin, and I had written about the plugin as well as created a few training examples on it.

Up to this point I had never really tried to lay out a first-person type level in Unity, never worried about performance, frames per second (FPS), or anything like that. I had done some degree of scripting while using Intel RealSense and other simple ramp up tools. However, I’d never had to tackle a large project or scene and any issues that could come with that.

What Was the End Goal of This Project?

What I had hoped for is that I would walk away from this exercise with a better understanding of Unity and incorporating VR into a scene. I wanted to see how difficult it was to get VR up and running. What could I expect once I got it working? Would performance be acceptable? Would I be able to feel steady, not woozy, due to potential lowered frame rates?

And have fun. I wanted to have fun learning all this, which is also why I chose to create a tropical island, pirate-themed scene. I personally have a fascination with the old Caribbean pirate days.

What Misconceptions Did I Have?

As mentioned, I did have some experience with Unity, but not a whole lot.

The first misconception I had was about what gets rendered and when. What do I mean? For some reason I had assumed that if I have a terrain including, for example, a 3D model such as a huge cliff, and I placed the cliff so that only 50 percent of it was above the terrain, Unity would not try to render what was not visible. I somehow thought that there was some kind of rendering algorithm that would prevent Unity from rendering anything under a terrain object. Apparently that is not the case. Unity still renders the entire cliff 3D model.

This same concept applied to two different 3D cliff models. For example, if I had two cliff game objects, I assumed that if I pushed one cliff into the other to give the illusion of one big cliff, any geometry or texture information that was no longer visible would not get rendered. Again, not the case.

Apparently, if it has geometry and textures, no matter if it’s hidden inside something else, it will get rendered by Unity. This is something to take into consideration. I can’t say this had a big impact on my work or that it caused me to go find a fix; rather, just in the normal process of ramping up on Unity, I discovered this.

Performance

This might be where I learned the most. Laying out a scene using Unity’s terrain tools is pretty straightforward. Adding assets is also pretty straightforward. Before I get called out, I didn’t say it was straightforward to do a GOOD job; I’m just saying that you can easily figure out how to lay things out. While I think my Pirates Cove scene is a thing of beauty, others would scoff, and rightfully so. But, it was my first time and I was not trying to create a first-person shooter level. This is a faux visual experience.

FPS: Having talked with people about VR I had learned that the target FPS for VR is 90. I had initially tried to use the Unity Stats window. After talking with others on the Unity forum, I found out that this is not the best tool for true FPS performance. I was referred to this script to use instead, FPS Display script. Apparently it’s more accurate.

Occlusion culling: This was an interesting situation. I was trying to figure out a completely non-related issue and a co-worker came over to help me out. We got to talking about FPS and things you can do to help rendering. He introduced me to occlusion culling. He was showing me a manual way to do it, where you define the sizes and shapes of the boxes. I must confess, I simply brought up Unity’s Occlusion Culling window and allowed it to figure out the occlusions on its own. This seemed to help with performance.

Vegetation: I didn’t realize that adding grass to the terrain was going to have such a heavy impact. I had observed other scenes that seemed to have a lot of grass swaying in the wind. Thus, I went hog wild; dropped FPS to almost 0 and brought Unity to its knees. Rather than deal with this, I simply removed the grass and used a clover-looking texture that still made my scene look nice, yet without all the draw calls.

How I Got Vive* Working in My Scene

As mentioned at the top of the article, I’m making the assumption that the reader already has Vive set up and running on their workstation. This section is a condensed version of an existing article that I found, the HTC Vive Tutorial for Unity. I’m not planning to go into any detail on grabbing items with the controllers; for this article I will stick to teleporting. I did modify my version of teleporting, not so much because I think mine is better, but rather because I discovered things on my own by playing with it.

Before you can do any HTC Vive* development, you must download and import the SteamVR plugin.

screenshot of Steam*VR plugin logo

Once the plugin has been imported you will see the following file structure in your project:

screenshot of file structure in unity

In my project, I created a root folder in the hierarchy called VR. In there, I copied the SteamVR and CameraRig prefabs. You don’t have to create a VR folder; I just like to keep my project semi-organized.

screenshot of folder organization

I did not need to do anything with the SteamVR plugin other than add it to my project hierarchy; instead, we will be looking at the CameraRig.

I placed the CameraRig in my scene where I wanted it to initially start. 

screenshot of game environment

After placing the SteamVR CameraRig prefab, I had to delete the main camera; this is to avoid conflicts. At this point, I was able to start my scene and look around. I was not able to move, but from a stationary point I could look around and see the controllers in my hands. You can’t go anywhere, but at least you can look around.

screenshot of game environment

Getting Tracked Objects

Tracked objects are both the hand controllers as well the headset itself. For this code sample, I didn’t worry about the headset; instead, I needed to get input from the hand controllers. This is necessary for tracking things like button clicks, and so on.

First, we must get an instance of the tracked object that the script is on. In this case it will be the controller; this is done in the Awake function.

void Awake( )
{
    _trackedController = GetComponent<SteamVR_TrackedObject>( );
}

Then, when you want to test for input from one of the two hand controllers, you can select the specific controller by using the following Get function. It uses the tracked object (the hand controller) that this script is attached to:

private SteamVR_Controller.Device HandController
{
    get
    {
        return SteamVR_Controller.Input( ( int )_trackedController.index );
    }
}

Creating a Teleport System

Now we want to add the ability to move around in the scene. To do this, I had to create a script that knew how to read the input from the hand controllers. I created a new Unity C# script and named it TeleportSystem.cs.

Not only do we need a script but we need a laser pointer, and in this specific case, a reticle. A reticle is not mandatory by any means but does add a little flair to the scene because the reticle can be used as a visual feedback tool for the user. I will just create a very simple circle with a skull image on it.

Create the Laser

The laser was created by throwing a cube into the scene, high enough that it didn’t interfere with any of the other assets. From there I scaled it to x = 0.005, y = 0.005, and z = 1, which gives it a long, thin shape.

screenshot of model in the unity envrionment

After the laser was created, I saved it as a prefab and removed the original cube from the scene because the cube was no longer needed.

Create the Reticle

I wanted a customized reticle at the end of the laser; not required, but cool nonetheless. I created a prefab that is a circle mesh with a decal on it.

screenshot of pirate flag decal

screenshot of the unity inspector panel

Setting Up the Layers

This is an important step. You have to tell your laser/reticle what is and what is not teleportable. For example, you may not want to give the user the ability to teleport onto water, or you may not want to allow them to teleport onto the side of a cliff. You can restrict them to specific areas in your scene by using layers. I created two layers—Teleportable and NotTeleportable.

screenshot of game layers in unity

Things that are teleportable, like the terrain itself, the grass huts, and the stairs I would put on the Teleportable layer. Things like the cliffs or other items in the scene that I don’t want a user to teleport to, I put on the NotTeleportable layer.

When defining my variables, I defined two layer masks. One mask just had all layers in it. The other was a non-teleportable mask that indicates which layers are not supposed to be teleportable.

// Layer mask to filter the areas on which teleports are allowed
public LayerMask _teleportMask;

// Layer mask specifies which layers we can NOT teleport to.
public LayerMask _nonTeleportMask;

Because the layer masks are public, you will see them in the Unity Inspector. They contain drop-down lists that let you pick and choose which layers you do not want someone teleporting to.

screenshot of unity CSharp script

Setting up the layers works in conjunction with the LayerMatchTest function.

/// <summary>
/// Checks to see if a GameObject is on a layer in a LayerMask.
/// </summary>
/// <param name="layers">Layers we don't want to teleport to</param>
/// <param name="objInQuestion">Object that the raytrace hit</param>
/// <returns> true if the provided GameObject's Layer matches one of the Layers in the provided LayerMask.</returns>
private static bool LayerMatchTest( LayerMask layers, GameObject objInQuestion )
{
    return( ( 1 << objInQuestion.layer ) & layers ) != 0;
}

When LayerMatchTest() is called, I’m sending the layer mask that has the list of layers I don’t want people teleporting to, and the game object that the HitTest detected. This test will see if that object is or is not in the non-teleportable layer list.

Updating Each Frame

void Update( )
{
    // If the touchpad is held down
    if ( HandController.GetPress( SteamVR_Controller.ButtonMask.Touchpad ) )
    {
        _doTeleport = false;

        // Shoot a ray from controller. If it hits something make it store the point where it hit and show
        // the laser. Takes into account the layers which can be teleported onto
        if ( Physics.Raycast( _trackedController.transform.position, transform.forward, out _hitPoint, 100, _teleportMask ) )
        {
            // Determine if we are pointing at something which is on an approved teleport layer.
            // Notice that we are sending in layers we DON'T want to teleport to.
            _doTeleport = !LayerMatchTest( _nonTeleportMask, _hitPoint.collider.gameObject );

            if( _doTeleport )
            {
                PointLaser( );
            }
            else
            {
                DisplayLaser( false );
            }
        }
    }
    else
    {
        // Hide _laser when player releases touchpad
        DisplayLaser( false );
    }
    if( HandController.GetPressUp( SteamVR_Controller.ButtonMask.Touchpad ) && _doTeleport )
    {
        TeleportToNewPosition();
        DisplayLaser( false );
    }
}

On each update, the code will test to see if the controller’s touchpad button was pressed. If so, I’m getting a Raycast hit. Notice that I’m sending my teleport mask that has everything in it. I then do a layer match test on the hit point. By calling the LayerMatchTest function we determine whether it hit something that is or is not teleportable. Notice that I’m sending the list of layers that I do NOT want to teleport to. This returns a Boolean value that is then used to determine whether or not we can teleport.

If we can teleport, I then display the laser using the PointLaser function. In this function, I’m telling the laser prefab to look in the direction of the HitTest. Next, we stretch (scale) the laser prefab from the controller to the HitTest location. At the same time, I reposition the reticle at the end of the laser.

private void PointLaser( )
{
    DisplayLaser( true );

    // Position laser between controller and point where raycast hits. Use Lerp because you can
    // give it two positions and the % it should travel. If you pass it .5f, which is 50%
    // you get the precise middle point.
    _laser.transform.position = Vector3.Lerp( _trackedController.transform.position, _hitPoint.point, .5f );

    // Point the laser at position where raycast hits.
    _laser.transform.LookAt( _hitPoint.point );

    // Scale the laser so it fits perfectly between the two positions
    _laser.transform.localScale = new Vector3( _laser.transform.localScale.x,
                                            _laser.transform.localScale.y,
                                            _hitPoint.distance );

    _reticle.transform.position = _hitPoint.point + _VRReticleOffset;
}

If the HitTest is pointing to a non-teleportable layer, I ensure that the laser pointer is turned off via the DisplayLaser function.

At the end of the function, when the touchpad is released (GetPressUp) AND the _doTeleport variable is true, I call TeleportToNewPosition to move the user to the new location.

private void TeleportToNewPosition( )
{
    // Calculate the difference between the positions of the camera rig's center and the player's head.
    Vector3 difference = _VRCameraTransform.position - _VRHeadTransform.position;

    // Reset the y-position for the above difference to 0, because the calculation doesn’t consider the
    // vertical position of the player’s head
    difference.y = 0;

    _VRCameraTransform.position =  _hitPoint.point + difference;
}

In Closing

This is pretty much how I got my scene up and running. It involved a lot of discovering things on the Internet, reading other people’s posts, and a lot of trial and error. I hope that you have found this article useful, and I invite you to contact me.

For completeness, here is the full script:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;


/// <summary>
/// Used to teleport the player's location in the scene. Attach to SteamVR's CameraRig controllers, left and right
/// </summary>
public class TeleportSystem : MonoBehaviour
{
    // The controller itself
    private SteamVR_TrackedObject _trackedController;

    // SteamVR CameraRig transform
    public Transform _VRCameraTransform;

    // Reference to the laser prefab set in Inspector
    public GameObject _VRLaserPrefab;

    // Ref to teleport reticle prefab set in Inspector
    public GameObject _VRReticlePrefab;

    // Stores a reference to an instance of the laser
    private GameObject _laser;

    // Ref to instance of reticle
    private GameObject _reticle;

    // Ref to players head (the camera)
    public Transform _VRHeadTransform;

    // Reticle offset from the ground
    public Vector3 _VRReticleOffset;

    // Layer mask to filter the areas on which teleports are allowed
    public LayerMask _teleportMask;

    // Layer mask specifies which layers we can NOT teleport to.
    public LayerMask _nonTeleportMask;

    // True when a valid teleport location is found
    private bool _doTeleport;

    // Location where the user is pointing the hand held controller and releases the button
    private RaycastHit _hitPoint;




    /// <summary>
    /// Gets the tracked object. Can be either a controller or the head mount.
    /// But because this script will be on a hand controller, don't have to worry about
    /// knowing if it's a head or hand controller, this will only get the hand controller.
    /// </summary>
    void Awake( )
    {
        _trackedController = GetComponent<SteamVR_TrackedObject>( );
    }


    /// <summary>
    /// Initialize the two prefabs
    /// </summary>
    void Start( )
    {
        // Spawn prefabs, init the classes _hitPoint

        _laser      = Instantiate( _VRLaserPrefab );
        _reticle    = Instantiate( _VRReticlePrefab );

        _hitPoint   = new RaycastHit( );

    }


    /// <summary>
    /// Checks to see if the player is holding down the touchpad button and, if so, whether they are trying to teleport to a legitimate location
    /// </summary>
    void Update( )
    {
        // If the touchpad is held down
        if ( HandController.GetPress( SteamVR_Controller.ButtonMask.Touchpad ) )
        {
            _doTeleport = false;

            // Shoot a ray from controller. If it hits something make it store the point where it hit and show
            // the laser. Takes into account the layers which can be teleported onto
            if ( Physics.Raycast( _trackedController.transform.position, transform.forward, out _hitPoint, 100, _teleportMask ) )
            {
                // Determine if we are pointing at something which is on an approved teleport layer.
                // Notice that we are sending in layers we DON'T want to teleport to.
                _doTeleport = !LayerMatchTest( _nonTeleportMask, _hitPoint.collider.gameObject );

                if( _doTeleport )
                    PointLaser( );
                else
                    DisplayLaser( false );
            }
        }
        else
        {
            // Hide _laser when player releases touchpad
            DisplayLaser( false );
        }
        if( HandController.GetPressUp( SteamVR_Controller.ButtonMask.Touchpad ) && _doTeleport )
        {
            TeleportToNewPosition( );
            DisplayLaser( false );
        }
    }


    /// <summary>
    /// Gets the specific hand controller this script is attached to, left or right controller
    /// </summary>
    private SteamVR_Controller.Device HandController
    {
        get
        {
            return SteamVR_Controller.Input( ( int )_trackedController.index );
        }
    }


    /// <summary>
    /// Checks to see if a GameObject is on a layer in a LayerMask.
    /// </summary>
    /// <param name="layers">Layers we don't want to teleport to</param>
    /// <param name="objInQuestion">Object that the raytrace hit</param>
    /// <returns>true if the provided GameObject's Layer matches one of the Layers in the provided LayerMask.</returns>
    private static bool LayerMatchTest( LayerMask layers, GameObject objInQuestion )
    {
        return( ( 1 << objInQuestion.layer ) & layers ) != 0;
    }


    /// <summary>
    /// Displays the laser and reticle
    /// </summary>
    /// <param name="showIt">Flag </param>
    private void DisplayLaser( bool showIt )
    {
        // Show _laser and reticle
        _laser.SetActive( showIt );
        _reticle.SetActive( showIt );
    }



    /// <summary>
    /// Displays the laser prefab, stretches it out as needed
    /// </summary>
    private void PointLaser( )
    {
        DisplayLaser( true );

        // Position laser between controller and point where raycast hits. Use Lerp because you can
        // give it two positions and the % it should travel. If you pass it .5f, which is 50%
        // you get the precise middle point.
        _laser.transform.position = Vector3.Lerp( _trackedController.transform.position, _hitPoint.point, .5f );

        // Point the laser at position where raycast hits.
        _laser.transform.LookAt( _hitPoint.point );

        // Scale the laser so it fits perfectly between the two positions
        _laser.transform.localScale = new Vector3( _laser.transform.localScale.x,
                                                    _laser.transform.localScale.y,
                                                    _hitPoint.distance );

        _reticle.transform.position = _hitPoint.point + _VRReticleOffset;
    }



    /// <summary>
    /// Calculates the difference between the cameraRig and head position. This ensures that
    /// the head ends up at the teleport spot, not just the cameraRig.
    /// </summary>
    /// <returns></returns>
    private void TeleportToNewPosition( )
    {
        Vector3 difference = _VRCameraTransform.position - _VRHeadTransform.position;
        difference.y = 0;
        _VRCameraTransform.position =  _hitPoint.point + difference;
    }
}

About the Author

Rick Blacker works in the Intel® Software and Services Group. His main focus is virtual reality, with an emphasis on Primer VR application development.

Introducing: Movidius™ Neural Compute Stick


Intel recently announced the availability of the Movidius™ Neural Compute Stick, a new device for developing and deploying deep learning algorithms at the edge. The Intel® Movidius team created the Neural Compute Stick (NCS) to make deep learning application development on specialized hardware even more widely available.

The NCS is powered by the same low-power Movidius Vision Processing Unit (VPU) that can be found in millions of smart security cameras, gesture-controlled autonomous drones, and industrial machine vision equipment, for example. The convenient USB stick form factor makes it easier for developers to create, optimize and deploy advanced computer vision intelligence across a range of devices at the edge.

The USB form factor easily attaches to existing hosts and prototyping platforms, while the VPU inside provides machine learning via a low-power deep learning inference engine. You start using the NCS with a trained Caffe* framework-based feed-forward convolutional neural network (CNN), or you can choose one of our example pre-trained networks. Then, using our toolkit, you can profile the neural network and compile a tuned version ready for embedded deployment using our Neural Compute Platform API.

Here are some of its key features:

  • Supports CNN profiling, prototyping, and tuning workflow

  • All data and power provided over a single USB Type A port

  • Real-time, on device inference – cloud connectivity not required

  • Run multiple devices on the same platform to scale performance

  • Quickly deploy existing CNN models or uniquely trained networks

 

The Intel Movidius team is inspired by the incredible sophistication of the human brain’s visual system, and we would like to think we’re getting a little closer to matching its capabilities with our new Neural Compute Stick.

To get started, you can visit developer.movidius.com for more information.

The Modern Code Developer Challenge


As part of its ongoing support of the world-wide student developer community and advancement of science, Intel® Software has partnered with CERN through CERN openlab to sponsor the Intel® Modern Code Developer Challenge. The goal for Intel is to give budding developers the opportunity to use modern programming methods to improve code that helps move science forward. Take a look at the winners from the previous challenge here!

The Challenge will take place from July to October 2017, with the winners announced in November 2017 at the Intel® HPC Developer Conference.

Check back on this site soon for more information!

 

1) Smash-simulation software: Teaching algorithms to be faster at simulating particle-collision events

Physicists widely use a software toolkit called GEANT4 to simulate what will happen when a particular kind of particle hits a particular kind of material in a particle detector. In fact, this toolkit is so popular that it is also used by researchers in other fields who want to predict how particles will interact with other matter: it’s used to assess radiation hazards in space, for commercial air travel, in medical imaging, and even to optimise scanning systems for cargo security.

An international team, led by researchers at CERN, is now working to develop a new version of this simulation toolkit, called GeantV. This work is supported by a CERN openlab project with Intel on code modernisation. GeantV will improve physics accuracy and boost performance on modern computing architectures.

The team behind GeantV is currently implementing a ‘deep-learning' tool that will be used to make simulation faster. The goal of this project is to write a flexible mini-application that can be used to support the efforts to train the deep neural network on distributed computing systems.

 

2) Connecting the dots: Using machine learning to better identify the particles produced by collision events

The particle detectors at CERN are like cathedral-sized 3D digital cameras, capable of recording hundreds of millions of collision events per second. The detectors consist of multiple ‘layers’ of detecting equipment, designed to recognise different types of charged particles produced by the collisions at the heart of the detector. As the charged particles fly outwards through the various layers of the detector, they leave traces, or ‘hits’.

Tracking is the art of connecting the hits to recreate trajectories, thus helping researchers to understand more about and identify the particles. The algorithms used to reconstruct the collision events by identifying which dots belong to which charged particles can be very computationally expensive. And, with the rate of particle collisions in the LHC set to be further increased over the coming decade, it’s important to be able to identify particle tracks as efficiently as possible.

Many track-finding algorithms start by building ‘track seeds’: groups of two or three hits that are potentially compatible with one another. Compatibility between hits can also be inferred from what are known as ‘hit shapes’. These are akin to footprints; the shape of a hit depends on the energy released in the layer, the crossing angle of the hit at the detector, and on the type of particle.

This project investigates the use of machine-learning techniques to help recognise these hit shapes more efficiently. The project will explore the use of state-of-the-art many-core architectures, such as the Intel Xeon Phi processor, for this work.

 

3) Cells in the cloud: Running biological simulations more efficiently with cloud computing

BioDynaMo is one of CERN openlab’s knowledge-sharing projects. It is part of CERN openlab’s collaboration with Intel on code modernisation, working on methods to ensure that scientific software makes full use of the computing potential offered by today’s cutting-edge hardware technologies.

It is a joint effort between CERN, Newcastle University, Innopolis University, and Kazan Federal University to design and build a scalable and flexible platform for rapid simulation of biological tissue development.

The project focuses initially on the area of brain tissue simulation, drawing inspiration from existing, but low-performance software frameworks. By using the code to simulate the development of the normal and diseased brain, neuroscientists hope to be able to learn more about the causes of — and identify potential treatments for — disorders such as epilepsy and schizophrenia.

Late 2015 and early 2016 saw algorithms already written in Java code ported to C++. Once porting was completed, work was carried out to optimise the code for modern computer processors and co-processors. In order to be able to address ambitious research questions, however, more computational power will be needed. Work will, therefore, be undertaken to adapt the code for running using high-performance computing resources over the cloud. This project focuses on adding network support for the single-node simulator and prototyping the computation management across many nodes.

 

4) Disaster relief: Helping computers to get better at recognising objects in satellite maps created by a UN agency

UNOSAT is part of the United Nations Institute for Training and Research (UNITAR). It provides a rapid front-line service to turn satellite imagery into information that can aid disaster-response teams. By delivering imagery analysis and satellite solutions to relief and development organizations — both within and outside the UN system — UNOSAT helps to make a difference in critical areas such as humanitarian relief, human security, and development planning.

Since 2001, UNOSAT has been based at CERN and is supported by CERN's IT Department in the work it does. This partnership means UNOSAT can benefit from CERN's IT infrastructure whenever the situation requires, enabling the UN to be at the forefront of satellite-analysis technology. Specialists in geographic information systems and in the analysis of satellite data, supported by IT engineers and policy experts, ensure a dedicated service to the international humanitarian and development communities 24 hours a day, seven days a week.

CERN openlab and UNOSAT are currently exploring new approaches to image analysis and automated feature recognition to ease the task of identifying different classes of objects from satellite maps. This project evaluates available machine-learning-based feature-extraction algorithms. It also  investigates the potential for optimising these algorithms for running on state-of-the-art many-core architectures, such as the Intel Xeon Phi processor.

 

5) IoT at the LHC: Integrating ‘internet-of-things’ devices into the control systems for the Large Hadron Collider

The Large Hadron Collider (LHC) accelerates particles to over 99.9999% of the speed of light. It is the most complex machine ever built, relying on a wide range of industrial control systems for proper functioning.

This project will focus on integrating modern ‘systems-on-a-chip’ devices into the LHC control systems. The new, embedded ‘systems-on-a-chip’ available on the market are sufficiently powerful to run fully-fledged operating systems and complex algorithms. Such devices can also be easily enriched with a wide range of different sensors and communication controllers.

The ‘systems-on-a-chip’ devices will be integrated into the LHC control systems in line with the ‘internet of things’ (IoT) paradigm, meaning they will be able to communicate via an overlaying cloud-computing service. It should also be possible to perform simple analyses on the devices themselves, such as filtering, pre-processing, conditioning, monitoring, etc. By exploiting the IoT devices’ processing power in this manner, the goal is to reduce the network load within the entire control infrastructure and ensure that applications are not disrupted in case of limited or intermittent network connectivity.

 

Getting Started in Linux with Intel® SDK for OpenCL™ Applications


This article is a step-by-step guide to getting started developing with Intel® SDK for OpenCL™ Applications using the Linux SRB5 driver package.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

For SRB4.1 instructions, please see https://software.intel.com/en-us/articles/sdk-for-opencl-gsg-srb41.

Step 1: Install the driver

 

This script covers the steps needed to install the SRB5 driver package in Ubuntu 14.04, Ubuntu 16.04, CentOS 7.2, and CentOS 7.3.

 

To use

$ mv install_OCL_driver.sh_.txt install_OCL_driver.sh
$ chmod 755 install_OCL_driver.sh
$ sudo su
$ ./install_OCL_driver.sh install

This script automates downloading the driver package, installing prerequisites and user-mode components, patching the 4.7 kernel, and building it. 

You can check your progress with the System Analyzer Utility. If successful, you should see smoke test results like this at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

Experimental installation without kernel patch or rebuild:

If you are using Ubuntu 16.04 with the default 4.8 kernel, you may be able to skip the kernel patch and rebuild steps. This configuration works fairly well, but several features (e.g., OpenCL 2.x device-side enqueue, shared virtual memory, and VTune GPU support) require the patches. Installation without patches has been smoke-test validated to confirm it is viable for experimental use only; it is not fully supported or certified.

 

Step 2: Install the SDK

This script sets up all the prerequisites for a successful SDK install on Ubuntu.

$ mv install_SDK_prereq_ubuntu.sh_.txt install_SDK_prereq_ubuntu.sh
$ sudo su
$ ./install_SDK_prereq_ubuntu.sh

After this, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               

Check that the command line compiler ioc64 is installed with

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!
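If you also want to exercise the runtime from C, the following minimal host program is a sketch (error handling omitted; it assumes a GPU device and the OpenCL 2.0 headers and ICD from the driver install). It embeds the simpleAdd kernel above, builds it at run time, and runs it on 1024 elements.

/* simple_add_host.c - build with: gcc simple_add_host.c -lOpenCL -o simple_add_host */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void simpleAdd(__global int *pA, __global int *pB, __global int *pC)\n"
    "{ const int id = get_global_id(0); pC[id] = pA[id] + pB[id]; }\n";

int main(void)
{
    enum { N = 1024 };
    int a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    /* Pick the first platform and its first GPU device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, NULL, NULL);

    /* Build the embedded kernel source and create the kernel object. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "simpleAdd", NULL);

    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(k, 2, sizeof(cl_mem), &bufC);

    /* Launch one work-item per element and read the result back. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[10] = %d (expected 30)\n", c[10]);
    return 0;
}

If the driver and SDK are installed correctly, the program should print c[10] = 30.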

 

Step 3: Set up Eclipse

Intel SDK for OpenCL applications works with Eclipse Mars and Neon.

After installing, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

After teaching a tutorial, I’m going to go see the high-fidelity motion blurring at SIGGRAPH from the Intel Embree/OSPRay engineers


When I’m at SIGGRAPH, I’m planning to visit some friends in the Intel booth to see their new motion blurring technology. I’m sure people more knowledgeable in the field will find even more interesting developments from these engineers, who are helping drive software defined visualization work, in particular with their Embree open source ray tracing library and the OSPRay open source rendering engine for high-fidelity visualization (built on top of Embree). Key developers will be in the Intel booth at SIGGRAPH, and they have a couple of papers in the High-Performance Graphics (HPG) conference that is co-located with SIGGRAPH.

The Embree/OSPRay engineers have two interesting papers they will present at HPG.  Both will be presented on Saturday July 29, in “Papers Session 3: Acceleration Structures for Ray Tracing”:

The SDVis machine (I refer to it as a “dream machine for software visualization,” but they simply call it the “Software Defined Visualization Appliance”) will also be on display: one such machine will be in the Intel booth at SIGGRAPH. I did not go to see it at ISC in Germany, where they showed off HPC-related visualization work, with the hot topic being “in situ visualization.” At SIGGRAPH, they will have demos around high-fidelity (realistic) visualization, specifically demonstrating Embree's novel approach to handling multi-segment motion blur and OSPRay's photorealistic renderer interactively rendering a scene. These demos relate to their HPG papers.

showing the mblur approach: original (left) and with blurring (right)
original image (CC) caminandes.com

I’m sure the partial re-braiding is amazing, but it’s the blurring that caught my attention.

First of all, theoretically blurring is not needed: with a super high frame rate and amazing resolution, the scene would just appear to us like real life. At least, I think that’s right.

However, with realistic framerates and resolutions we detect a scene as being unrealistic when blurring is not there.  In fact, in some cases, spokes on wheels appear to go backwards.

The solution? Blurring. But, like many topics, what seems simple enough is not. A simple algorithm might be to take adjacent frames and create a blur based on changes; perhaps do this on a higher frame rate visualization as you sample it down to the target frame rate for your final production. Unfortunately, this approach is not efficient, because the geometry has to be processed multiple times per frame and adaptive noise reduction on parts of the image is not possible.

That’s where “A Spatial-Temporal BVH for Efficient Multi-Segment Motion Blur” kicks in. These engineers took a different approach in which they pay attention to the actual motion of objects. Imagine a scene with a helicopter blade turning around and around while a bird flies through the scene in something much closer to a straight line. Their method comprehends the actual motion and creates blurring based on that. Of course, doing this with both high performance and high fidelity is what really makes their work valuable. In the example images above, the train blurring varies in a realistic fashion.

If you want to read a better description of their work, and their comparisons with previous work, you should read their paper and/or visit them at HPG, or at the Intel booth at SIGGRAPH.

I hope to see some of you at SIGGRAPH.  I’m co-teaching a tutorial “Multithreading for Visual Effects” on Monday at 2pm. After that, I’m running over to see the Embree/OSPRay folks in the Intel booth.

 

 

 

PIN Errors in 2017 Update 3 and 4 Analysis Tools


Problem:

As of July 28th, 2017, we have been receiving many reports from people who are having problems with the analysis tools (Intel® VTune™ Amplifier, Advisor, and Inspector) as a result of a problem with PIN, the tool they use to instrument software.

PIN problems can produce several types of error. One of the more common ones is

__bionic_open_tzdata_path: PIN_CRT_TZDATA not set!

The PIN executable is located in the bin64 and/or bin32 folders in the installation directories of the analysis tools. You can test whether PIN is the source of your problems by running it on any executable. For example:

pin -- Notepad.exe

Solution:

On Windows*, certain virus checkers have been known to interfere with PIN. Excluding pin.exe from the virus checker may resolve the issue.

On Linux*, a recent security patch (CVE-2017-1000364) is causing problems with PIN. Intel® VTune™ Amplifier 2017 Update 4, available on the Registration Center, uses a new version of PIN which should fix these problems.

Intel® Advisor and Inspector have not yet received a patch. We apologize for the inconvenience, and we assure you we're working on getting it fixed as soon as possible. If PIN problems are causing a significant blockage of your work with Intel® Inspector or Advisor, please submit a ticket to the Online Service Center to let us know.

Optimizing Edge-Based Intelligence for Intel® HD Graphics


Background on AI and the Move to the Edge

Our daily lives intersect with artificial intelligence (AI)-based algorithms. With fits and starts, AI has been a domain of research over the last 60 years. Machine learning, and the many layers of deep learning, are propelling AI into all parts of modern life. Its applied usages are varied, from computer vision for identification and classification, to natural language processing, to forecasting. These base-level tasks then lead to higher-level tasks such as decision making.

What we call deep learning is typically associated with servers, the cloud, or high-performance computing. While AI usage in the cloud continues to grow, there is a trend toward running AI inference engines on the edge (i.e., PCs, IoT devices, etc.). Having devices perform machine learning locally, rather than relying solely on the cloud, is a trend driven by the need to lower latency, ensure persistent availability, reduce costs (for example, the cost of running inferencing algorithms on servers), and address privacy concerns. Figure 1 shows the phases of deep learning.

Image of a map
Figure 1. Deep learning phases

Moving AI to consumers: Personify* is an expert system performing real-time image segmentation. Personify enabled real-time segmentation with the Intel® RealSense™ camera in 2015. In 2016, Personify launched ChromaCam*, an application that can remove, replace, or blur the user's background in all major video chat apps such as Microsoft Skype*, Cisco WebEx*, and Zoom*, as well as streaming apps like OBS and XSplit*. ChromaCam uses deep learning to do dynamic background replacement in real time and works with any standard 2D webcam commonly found on laptops.

Image of a person
Figure 2. Personify segmentation

One of Personify's requirements is to run the inference process for its deep learning algorithm on the edge as fast as possible. To get good segmentation quality, Personify needs to run the inference algorithm on the edge to avoid the unacceptable latencies of the cloud. The Personify software stack runs on CPUs and graphics processing units (GPUs), and was originally optimized for discrete graphics. However, running an optimized deep learning inference engine that requires a discrete GPU limits the application to a relatively small segment of PCs. Further, the segmentation effect should ideally be very efficient, since it will usually be used alongside other intense applications such as gaming with segmentation, and most laptops are constrained in terms of total compute and system-on-chip (SOC) package thermal requirements. Intel and Personify started to explore optimizing an inference engine on Intel® HD Graphics with the goal of meeting these challenges and bringing this technology to the mainstream laptop.

We used Intel® VTune™ Amplifier XE to profile and optimize GPU performance for deep learning inference usage on Intel® 6th Generation Core™ i7 Processors with Intel HD Graphics 530.

Figure 3 shows the baseline execution profile for running the inference workload on a client PC. Even though the application uses the GPU for the deep learning algorithm, performance falls short of the requirements. The total time to run the non-optimized inference algorithm on Intel HD Graphics is about three seconds of compute, and the GPU execution units are stalled 70 percent of the time.

Image of a spreadsheet
Figure 3. Intel® VTune™ analyzer GPU profile for segmentation (baseline)

The two most important columns that stand out are total time in general matrix-to-matrix multiplication (GEMM) and execution unit (EU) stalls. Without optimization, the deep learning inference engine is very slow for image segmentation in a video conferencing scenario on Intel HD Graphics. Our task was to hit maximum performance from an Intel® GPU on a mainstream consumer laptop.

Optimization: Optimizing the matrix-to-matrix multiplication kernel and increasing EU active time were the top priorities.

Figure 4 shows the default pipeline of convolutional neural network.

Image of map
Figure 4. Default deep learning pipeline

We identified several items to address for deep learning inference on Intel HD Graphics (Figure 5):

  • CPU copies - The algorithm uses the CPU to copy data between CPU and GPU at every deep learning layer.
  • GEMM convolution algorithm - based on OpenCV* and OpenCL™.
  • Convert to columns (im2col) - uses extra steps and needs extra memory.

Figure 5. Remove extra copies (using spatial convolution)

We replaced GEMM convolution with spatial convolution, which avoids the additional memory copies and produces code optimized for speed. Dependencies on reading individual kernels in this architecture were overcome with an auto-tuning mechanism (see Figure 6).
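
To make the distinction concrete, the sketch below shows the data-access pattern of a direct ("spatial") convolution in plain C. It is only an illustration of the idea, not Personify's or Intel's actual OpenCL kernel: unlike the GEMM path, no convert-to-columns buffer is built, so every output value is accumulated straight from the input image without intermediate copies. The function name, single-channel layout, and 'valid' padding are assumptions made for brevity.

/* Minimal single-channel direct ("spatial") 2D convolution in plain C.
 * Unlike the GEMM approach, no convert-to-columns (im2col) buffer is
 * built: each output value is accumulated straight from the input, so
 * no extra copies or staging memory are needed.
 * Function name, row-major layout, and 'valid' padding are illustrative. */
#include <stdio.h>

void spatial_conv2d(const float *in, int in_h, int in_w,
                    const float *kernel, int k,   /* k x k kernel */
                    float *out)                   /* (in_h-k+1) x (in_w-k+1) */
{
    int out_h = in_h - k + 1;
    int out_w = in_w - k + 1;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)
                    acc += in[(oy + ky) * in_w + (ox + kx)] * kernel[ky * k + kx];
            out[oy * out_w + ox] = acc;
        }
}

int main(void)
{
    const float in[16] = {  1,  2,  3,  4,             /* 4 x 4 input image */
                            5,  6,  7,  8,
                            9, 10, 11, 12,
                           13, 14, 15, 16 };
    const float kern[9] = { 1/9.0f, 1/9.0f, 1/9.0f,    /* 3 x 3 box filter */
                            1/9.0f, 1/9.0f, 1/9.0f,
                            1/9.0f, 1/9.0f, 1/9.0f };
    float out[4];                                      /* 2 x 2 output */
    spatial_conv2d(in, 4, 4, kern, 3, out);
    printf("%.2f %.2f\n%.2f %.2f\n", out[0], out[1], out[2], out[3]);
    return 0;
}

A production GPU kernel would tile and vectorize this loop nest; choosing good tile and work-group sizes per graphics generation is exactly the kind of decision an auto-tuning mechanism can make.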

Figure 6. New simple architecture

Result: Our testing on Intel® 6th Generation Core™ i7 Processors with Intel HD Graphics 530 shows 13x better total time (2.8 seconds vs. 0.209 seconds) with about 69.6 percent GPU utilization, and an ~8.6x gain (1.495 seconds vs. 0.173 seconds) in the GEMM kernels (Figure 7 vs. Figure 3), which improves real-time segmentation quality as the frame rate increases.

Figure 7. Intel® VTune™ analyzer GPU profile for segmentation (after optimization)

To support mobile usage, battery life is another important metric for a laptop. Bringing high-performance deep learning algorithms to a client at the cost of higher power consumption would hurt the user experience. We analyzed estimated power consumption using the Intel® Power Gadget tool during a 120-second video conference workload.

Power Utilization for Personify Application

  • GPU power utilization: 5.5 W
  • Package SOC power: 11.28 W

Summary: We are witnessing a reshuffling of compute partitioning between cloud data centers and clients, in favor of moving deployed deep learning models to the client. Local model deployment has the advantages of lower latency and keeping personal data on the device, thus protecting privacy. Intel processor-based platforms provide capable CPUs and GPUs for inference engine applications on clients across a large, consumer-based ecosystem.


Intel® SDK for OpenCL™ Applications - Release Notes


This page provides the current Release Notes for Intel® SDK for OpenCL™ Applications. All files are in PDF format - Adobe Reader* (or compatible) required.

For questions or technical support, visit the OpenCL SDK support forum.

Intel® SDK for OpenCL™ Applications

What's New? / Release Notes

  • Intel® SDK for OpenCL™ Applications 2016, R3: What's New, Release Notes (English)
  • Intel® SDK for OpenCL™ Applications 2016, R2: What's New, Release Notes (English)
  • Intel® SDK for OpenCL™ Applications 2016: What's New, Release Notes (English)
  • Intel® SDK for OpenCL™ Applications 2015 R3: What's New, Release Notes (English)
  • OpenCL™ Code Builder 2015 R2: What's New, Release Notes for Intel® Integrated Native Developer Experience (Intel® INDE) (English)
  • OpenCL™ Code Builder 2015 R1: What's New, Release Notes for Intel® Integrated Native Developer Experience (Intel® INDE) (English)
  • Beta: What's New, Beta Release Notes (English)

Minimum Copy Edit Pass


Our copy editing team is currently at maximum capacity. To accommodate requests that are not plan of record (POR), we've instituted a minimum content quality standard for publication. If your content doesn't meet this standard, your updates will be delayed. We encourage you to thoroughly check your content before submitting it to our team.

 

Trademark & Brand

  • Verify all Intel trademarks in the Intel Names Database
  • Use the official Intel name, not the code name, unless the code name has been approved by Legal for external use
  • Provide documentation of Legal approval of new Intel names that do not appear in the Intel Names Database (include approval documentation in ticket)
  • Confirm correct usage of third-party names

Trademark & Brand Basics
Intel Trademarks
Third-Party Trademarks

 

Readability

  • Ensure content is comprehensible (correct major grammar issues)
  • Remove all internal notes (If needed, use "comments" to record them)
  • Spell check all copy
  • Remove all references to "this article," "this video," or "this website"

General Writing Guidelines

 

Specifications

Double-check that:

  • Version numbers are consistent
  • Units of measure are correct and consistent
  • Terms are correct and consistent (OS, IDEs, etc.)
  • Acronyms have been spelled out upon first use

Common Words List
Common Acronyms

 

Layout

Ensure new content doesn't break the template.

For example:
 

  • Use up to five related links
  • Limit top section to three bullet points
  • Consider width of the site when requesting multiple columns or large tables
  • Think mobile first (how will it look on a smartphone or tablet?)
  • Limit the number of embedded videos as excessive videos could slow down the performance of the site

 

Assets

  • Ensure all assets are live (articles, blog posts, books, pdfs, downloads, etc.)
  • Validate asset links and labels

 

 

 

A Scalable Path to Commercial IoT Solutions


Fast Track Your Development Cycle from Prototype to Production

Get to your development goal with hardware, software, and data solutions that work together seamlessly, securely, and at scale. From the cloud right out to the edge, Intel® IoT Technology offers a complete ecosystem for the connected future.

A Smarter World—for Every Enterprise

Moving forward is easier when you don’t have to innovate from scratch. Our complete dev kits, comprehensive code libraries and more start you off with a suite of flexible solutions that scale quickly. Intel offers SDKs, tools and samples to unlock core technologies for deep learning, computer vision, robotics and more for your platform, runtime, or language.

Retail

Create cutting-edge display solutions by innovating and optimizing your in-store applications. Combine digital content streaming, personalized interactions, and media hardware acceleration into a digital signage platform you can manage remotely.

 

Industrial

The factory of the future will be smarter and more automated than ever. Discover IoT solutions offered by Intel that support the rigorous requirements for programmable logic controllers (PLCs), Industrial PCs (IPCs), human machine interfaces (HMIs) and more.

 

Smart video

Expand the capabilities of your smart video system with the latest in HD graphics, real-time video encoding and transcoding, scalable video streaming and storage, video analytics, and artificial intelligence.

 

Automotive

Automated driving starts with a suite of end-to-end capabilities developed by Intel. Use SDKs to draft and implement scalable, interoperable automated driving solutions, in-vehicle media experiences, and interactive user interfaces.

 

Move Beyond Proof of Concept and Start Gaining Insight Right Away

Build a functional prototype with a scalable Intel® IoT Gateway that does a lot more than just handle communication between local sensors and remote users. Use these pre-integrated hardware and software solutions to also collect and analyze data from the edge to the cloud—regardless of your performance and scalability requirements, or thermal and space constraints.

Find your gateway solution. >

Learn more about cloud services. >

 

Software Development Kits (SDKs)

Developers can leverage a variety of SDKs to help enable their IoT solutions. Intel offers a number of SDKs, including:

Intel® Computer Vision SDK (Intel® CV SDK)

Use this SDK to develop and deploy vision-oriented solutions for autonomous vehicles, digital surveillance cameras, robotics, and mixed-reality headsets.

Intel® Media SDK

Use this SDK to develop media applications on Windows* and embedded Linux* for fast video playback, encoding, processing, and video conferencing.

Intel® SDK for OpenCL™ Applications

This SDK allows you to offload compute-intensive parallel workloads to Intel® Graphics Technology.

Deep Learning Training Tool Beta

Use this SDK to develop and train deep-learning solutions using your own hardware.

 

An IDE that Supports Consistency and Portability

Purpose-built for rapid code development in IoT environments, the Intel® System Studio is an Eclipse*-based IDE for developing in C++ or Java. 

 

Intel® System Studio

The IDE comes with built-in capabilities to easily integrate sensors via the UPM and MRAA libraries, templates, code libraries, samples, platform awareness, and other tools that work across the Intel® IoT Platform. Designed for IoT, the IDE is fine-tuned for embedded application performance, allows for easy sensor integration and is great for prototyping or production development. 

 

Sensors & Supporting Software that Just Works

Gain access to hundreds of sensors and sensor kits for both prototyping and industrial-grade applications, purpose-built for just about any conceivable application. Thanks to the sensor library framework and platform detection, you’ll also get code samples that remain portable throughout the development cycle and as you scale up the enterprise.

Find your sensor. Connect. Deploy. >

 

Educational Tools

Get support throughout the development lifecycle in the form of education, instructional materials, and community.

Online Training

This includes How-tos, live and on-demand tutorial videos, and documentation.

Online Communities

Learn, build and share ideas with developers around the world.

Events

Be among the first to learn about roadshows, webinars, conferences, hackathons, and workshops that offer hands-on practical experience.

 

Bring it all Together with Developer Kits from Intel

Get everything you need to prototype your way through business challenges with dev kits that work right out of the box. Choose the kit that matches your unique requirements, and in one package you'll have a platform for developing and deploying, complete with IDE, sensors, libraries, and cloud connectors.

Get started. >

 

Visit the Intel® Developer Zone for IoT to learn more.

VR Content Developer Guide


Get general guidelines for developing and designing your virtual reality (VR) application, and learn how to obtain optimal performance. This guide is based on performance characterizations across several VR workloads, and defines common bottlenecks and issues. Find solutions to bandwidth-bound behavior through the choice of texture formats, fused shader passes, and post-anti-aliasing techniques that improve the performance of VR application workloads.

Goals

  • Define general design point and budget recommendations for developers on creating VR content using 7th generation Intel® Core™ i7 processors and Intel® HD Graphics 615 (GT2).
  • Provide guidance and considerations for obtaining optimal graphics performance on 7th generation Intel® Core™ i7 processors.
  • Provide suggestions for optimal media, particularly 3D media.
  • Get tips on how to design VR apps for sustained power, especially for mobile devices.
  • Identify tools that help developers identify VR issues in computer graphics on VR-ready hardware.

Developer Recommended Design Points

General guidelines on design points and budgets to ISVs

  • Triangles/Frame - 200 K ‒ 300 K visible triangles in a given frame.* Use aggressive view frustum, back-face, and occlusion culling to reduce the number of triangles actually sent to the GPU.
  • Draws/Frame - 500 ‒ 1000.* Reduce the number of draw calls to improve performance and power. Batch draws by shader and draw front to back with 3D workloads (refer to the 3D guide).
  • Target Refresh - At least 60 frames per second (fps); 90 fps for the best experience.
  • Resolution - The head-mounted display (HMD) resolution can be downscaled if needed to hit 60 fps, but should not drop below 80 percent of the HMD resolution.* Dynamic scaling of the render target resolution can also be considered to meet frame rate requirements.*
  • Memory - 180 MB ‒ 200 MB per frame (DDR3, 1600 MHz) for 90 fps* (see the worked estimate after this list).

*Data is an initial recommendation and may change.
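
As a rough sanity check of the memory budget above (an illustrative estimate, not a measured figure, assuming a dual-channel DDR3-1600 configuration): 200 MB per frame at 90 fps is about 200 MB × 90 ≈ 18 GB/s of memory traffic, while the theoretical peak of dual-channel DDR3-1600 is 2 channels × 8 bytes × 1600 MT/s = 25.6 GB/s. That leaves little headroom for the CPU and the rest of the platform, which is why the sections below concentrate on reducing bandwidth.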

Considerations for Optimal Performance on General Hardware

Texture Formats and Filtering Modes

  • Texture formats and filtering modes can have a significant impact on bandwidth.
  • Generally 32-bit and 64-bit image formats are recommended for most filtering modes (bilinear and so on).
  • Filtering trilinear and volumetric surfaces with standard red green blue and high-dynamic range (sRGB/HDR) formats will be slower compared to 32-bit cases.

Uncompressed Texture Formats

Uncompressed formats (sRGB and HDR) consume greater bandwidth. Use linear formats if the app becomes heavily bandwidth-bound.

HDR Formats

The use of R10G10B10A2 over R16G16B16A16 and floating point formats is encouraged. 
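
The savings are easy to quantify (illustrative arithmetic only, assuming a hypothetical 2160 × 1200 stereo render target at 90 fps): R10G10B10A2 stores 4 bytes per texel versus 8 bytes for R16G16B16A16, so a single pass that writes such a target drops from roughly 2160 × 1200 × 8 bytes × 90 fps ≈ 1.9 GB/s to about 0.9 GB/s, halving the bandwidth cost of every pass that touches it.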

Filtering Modes

Filtering modes, like anisotropic filtering, can have a significant impact on performance, especially with uncompressed formats and HDR formats.

Anisotropic filtering is a trade-off between performance and quality. Generally, anisotropic level two is recommended based on our performance and quality studies. Mip-mapping textures along with higher anisotropic levels adds overhead to the filtering and hardware pipeline. If you choose anisotropic filtering, we recommend using BC1‒BC5 formats.

Anti-Aliasing

Temporally stable anti-aliasing is crucial for a good VR experience. Multisample anti-aliasing (MSAA) is bandwidth intense and consumes a significant portion of the rendering budget. Post-process anti-aliasing algorithms that are temporally stable, such as TSCMAA, can provide equivalent functionality at half the cost and should be considered as alternatives.

Low-Latency Preemption

Intel graphics (Gen) hardware supports object-level preemption, which usually translates into preemption on triangle boundaries. For effective scheduling of the compositor, it is important that primitives can be preempted in a timely fashion. To enable this, draw calls that take more than 1 ms should usually contain more than 64‒128 triangles. In particular, full-screen post-effects should use a grid of at least 64 triangles rather than 1 or 2.
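
To illustrate the last point, the sketch below builds an N × N grid of clip-space triangles that can replace the usual one- or two-triangle full-screen pass; with N = 8 it produces 128 triangles, comfortably above the 64-triangle guidance. This is a minimal, API-agnostic C sketch: the struct, function name, and winding order are assumptions, and a real engine would upload the resulting buffers through its own graphics API.

/* Builds an N x N grid of quads (2*N*N triangles) covering clip space
 * [-1,1] x [-1,1], so a "full-screen" post pass submits enough triangles
 * (N = 8 gives 128) for object-level preemption to find boundaries.
 * Struct, function name, and winding order are illustrative only;
 * error handling is omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { float x, y; } Vec2;

void build_fullscreen_grid(int n, Vec2 **out_verts, unsigned **out_idx,
                           int *vert_count, int *idx_count)
{
    int vc = (n + 1) * (n + 1);
    int ic = n * n * 6;                        /* 2 triangles * 3 indices per cell */
    Vec2 *v = malloc(vc * sizeof *v);
    unsigned *ix = malloc(ic * sizeof *ix);

    for (int j = 0; j <= n; ++j)               /* vertices on a regular lattice */
        for (int i = 0; i <= n; ++i) {
            v[j * (n + 1) + i].x = -1.0f + 2.0f * i / n;
            v[j * (n + 1) + i].y = -1.0f + 2.0f * j / n;
        }

    int k = 0;
    for (int j = 0; j < n; ++j)                /* two triangles per grid cell */
        for (int i = 0; i < n; ++i) {
            unsigned a = j * (n + 1) + i, b = a + 1;
            unsigned c = a + (unsigned)(n + 1), d = c + 1;
            ix[k++] = a; ix[k++] = c; ix[k++] = b;
            ix[k++] = b; ix[k++] = c; ix[k++] = d;
        }

    *out_verts = v; *out_idx = ix;
    *vert_count = vc; *idx_count = ic;
}

int main(void)
{
    Vec2 *verts; unsigned *idx; int vc, ic;
    build_fullscreen_grid(8, &verts, &idx, &vc, &ic);
    printf("%d vertices, %d indices (%d triangles)\n", vc, ic, ic / 3);
    free(verts); free(idx);
    return 0;
}

The vertex shader for such a pass can remain a trivial pass-through, so the extra triangles cost very little while giving the hardware many more preemption boundaries.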

App Scheduling

1. Recommendation: Nothing additional is required.

screenshot of frame rendering values

In the ideal case for a given frame, the app has ample time to complete its rendering work after the vsync and before the late-stage reprojection (LSR) packet is submitted. In this case, it is best for the app to synchronize on the vsync itself, so that rendering is performed with the newest HMD positional data. This helps to minimize motion sickness.

2. Recommendation: Start work earlier by syncing on the compositor's start of work rather than on vsync.

screenshot of frame rendering values

When the frame rendering time no longer fits within this interval, all available GPU time should be reclaimed for rendering the frame before the LSR occurs. If this interval is not met, the compositor can block the app from rendering the next frame by withholding the next available render target in the swap chain. This results in entire frames being skipped until the present workload for that frame has finished, degrading the app's frame rate. The app should synchronize with its compositor so that new rendering work is submitted as soon as the present or LSR workload is submitted. This is typically accomplished via a wait behavior provided by the compositor API.


3. Recommendation: Present asynchronously.

In the worst case, when the frame rendering time exceeds the vsync, the app should submit rendering work as quickly as possible to fully saturate the GPU to allow the compositor to use the newest frame data available, whenever that might occur relative to the LSR. To accomplish this, do not wait on any vsync or compositor events to proceed with rendering, and if possible build your application so that the presentation and rendering threads are decoupled from the rest of the state update.

For example, on the Holographic API, pass DoNotWaitForFrameToFinish to PresentUsingCurrentPrediction, or in DirectX*, pass SyncInterval=0 to Present.

4. Recommendation: Use GPU analysis tools to determine which case applies.

Use GPU analysis tools, such as GPUView, to see which rendering performance profile you have encountered, and make the necessary adjustments detailed above.

Other design considerations

Half float versus float: For compute-bound workloads, half floats can be used to increase throughput where precision is not an issue. Mixing half and full precision results in performance penalties and should be minimized.

Tools

The following tools will help you identify issues with VR workloads.

GPUView: Gives specifics on identifying issues with scheduling and dropped frames.

Intel® Graphics Performance Analyzers: Gives specifics on analyzing VR workloads and the expected patterns we see. For example, two sets of identical calls for the left and right eyes.

Summary

The biggest performance challenge for VR workloads is being bandwidth-bound. Careful choice of texture formats, fusing shader passes, and using post-process anti-aliasing techniques all help reduce the pressure on bandwidth.

Contributors

The developer guidelines provided in this document are created with input from Katen Shah, Bill Sadler, Prasoon Surti, Mike Apodaca, Rama Harihara, Travis Schuler, and John Gierach.

Explore the GPIO Example Application


Description

In this example application, you'll learn how to interact with the Terasic* DE10-Nano board's digital I/O:

  • 8 green user LEDs
  • 4 slide switches
  • User push button

The peripherals used to drive the LEDs and read the switch settings are implemented as "soft" GPIO peripherals within the FPGA. This simple FPGA design illustrates how programmable logic can be used to extend the peripheral set available to a processor. In this case we added more GPIOs, but we could have added more UARTs, SPI controllers, Ethernet ports, or some combination of each.

Software running on the CPU interacts with these peripherals using the Linux* general-purpose input/output (GPIO) framework. This article walks you through the process of reading from, and writing to those peripherals using the Linux GPIO framework.

Level: beginner.

Materials Needed

  • Terasic DE10-Nano kit

    The Terasic DE10-Nano development board, based on an Intel® Cyclone® V SoC FPGA, provides a reconfigurable hardware design platform for makers, IoT developers and educators. You can buy the kit here.

  • Virtual Network Computing (VNC) client software

    A VNC client application running on your host PC is used to remotely control the DE10-Nano board (which is running a VNC server application). If you don’t already have one, there is a link to a free download on the Software Utilities section of the Downloads page.

Visit GitHub* for this project's code samples. 

If you've already visited the ‘Play’ page of the web site served by the board, then you've probably interacted with the 8 user LEDs. The “Blink the LEDs” example on that page provides a simple web interface to turn ON, turn OFF, or blink the LEDs. You may be curious to learn what's happening behind the scenes of the demo application and we'll explore that in this example application tutorial.

Setup Steps

Follow the steps below to prepare your board to build and run the sample applications.

  1. Open VNC Viewer
    • Start a session with VNC* Viewer and type the DE10-Nano board's IP address: 192.168.7.1

    Note: If you attempt to connect and a black screen appears, power cycle the board.

  2. Navigate to the examples folder.

    a. Double-click the File System icon on the desktop.

    b. Locate the examples folder and double-click to open.

    c. Open the GPIO example folder, which contains a sandbox folder (code you can play with) and a tarball of the sandbox folder in case you need to restore the original.

    Note: There are three folders containing example design software for the DE10-Nano board; one for the GPIO, one for the FFT, and one for the accelerometer (adxl).

  3. Sandbox subfolders.
    • Open the sandbox folder, which contains three subfolders: gpio-leds, gpio-keys, and raw-gpio.

  4. README text file (optional).

    Each of these folders contains example applications, scripts, and a README text file which describes how the examples make use of the Linux GPIO framework to interact with the board hardware.

    To view the README_first.txt file:

    a. Right-click on README_first.txt.

    b. Select Open With.

    c. Choose one of the two editors - Vim or gedit.

    The README_first.txt file describes the contents of each subfolder and the examples contained in each. Here's what you will see if you open the file with gedit:

    Close the file.

  5. Open a terminal emulator.

    To interact with the board hardware, we are going to use a terminal emulator. Open terminal emulator on the Linux desktop as follows:

    a. Click the Applications button at the top of the desktop.

    b. Select Terminal Emulator.

gpio-leds

Let’s start by playing with the 8 user LEDs on the board.

Navigate to the gpio-leds folder by typing the following command in the terminal window:

cd /examples/gpio/sandbox/gpio-leds/

This directory contains some example code to toggle the LEDs on the DE10-Nano board. Two versions are provided, a shell script (toggle_fpga_leds.sh) and C program (toggle_fpga_leds.c). Both perform the same function.

Build the 'toggle_fpga_leds' Application

To build the 'toggle_fpga_leds' application type the following command in the terminal window:

./build_toggle_fpga_leds.sh

The script compiles the 'toggle_fpga_leds.c' source file and produces the executable 'toggle_fpga_leds' application. If you'd like to learn more about how the script and C program work, open them using the text editor.

Run the Application

Once you've built the application, you can run either the script or the compiled application by typing the following commands into the terminal window:

  • To run the script, type: ./toggle_fpga_leds.sh
  • To run the compiled application, type: ./toggle_fpga_leds

Pay attention to the LEDs on the board! Watch them turn on and off in sequential order. Both the compiled application and the script exit automatically after turning each LED on and then off.

If you'd like to learn more about how the script and C program work, refer to the shell script and C program source files. To learn more about how the Linux gpio-led framework works, open the README_gpio-leds.txt file using the text editor.

More Fun: Control the LEDs

You can also control the LEDs from the terminal window by writing to the 'brightness' files exposed by the gpio-led framework; a minimal C equivalent is sketched after this list.

  • Turn LED (0) ‘on’ by typing the following command into the terminal window:
    echo 1 > /sys/class/leds/fpga_led0/brightness
  • Turn LED (0) ‘off’ by typing the following command into the terminal window:
    echo 0 > /sys/class/leds/fpga_led0/brightness
  • Try typing a series of commands to turn on every other LED.
  • Query the status of LED (0) by typing the following command into the terminal window:
    cat /sys/class/leds/fpga_led0/brightness
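
The shell commands above map directly onto ordinary file I/O, so a C program can drive the same sysfs files. The sketch below is a rough approximation offered only as an illustration; the toggle_fpga_leds.c shipped on the board may be structured quite differently.

/* Minimal sketch: drive fpga_led0 through the Linux LED class sysfs file.
 * This only approximates the example's behavior; the toggle_fpga_leds.c
 * shipped on the board may be structured differently. */
#include <stdio.h>

static int set_led(const char *led, int on)
{
    char path[128];
    snprintf(path, sizeof path, "/sys/class/leds/%s/brightness", led);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                     /* wrong name, or insufficient permissions */
    fprintf(f, "%d\n", on ? 1 : 0);
    fclose(f);
    return 0;
}

int main(void)
{
    set_led("fpga_led0", 1);           /* same effect as: echo 1 > .../brightness */
    set_led("fpga_led0", 0);           /* same effect as: echo 0 > .../brightness */
    return 0;
}

You can compile and run a file like this on the board itself (for example, gcc -o set_led set_led.c && ./set_led) from the same terminal you used for the echo commands; the file name set_led.c is just a placeholder.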

gpio-keys

Next we’ll play with the gpio peripherals connected to the slide switches.

With the terminal emulator navigate to the gpio-keys folder by typing the following command into the terminal window.

cd /examples/gpio/sandbox/gpio-keys/

This directory contains an example application that reads the slide switches and reports their status (0 or 1). Just like the LEDs example, there is both a script ('watch_switch_events.sh') and C source file ('watch_switch_events.c') which accomplish the same task.

Build the 'watch_switch_events' Application

To build the 'watch_switch_events' application type the following command in the terminal window:

./build_watch_switch_events.sh

The script compiles the 'watch_switch_events.c' source file and produces the executable 'watch_switch_events' application.

Run the Application

Once you've built the application, you can run either the script or the compiled application by typing the following commands into the terminal window:

  • To run the script, type: ./watch_switch_events.sh
  • To run the compiled application, type: ./watch_switch_events

While the program is running, slide the switches SW0, SW1, SW2 and SW3 on the DE10-Nano board and notice the output in the terminal window.

The program and the script monitor the gpio-keys device registered in the system and print out the input events that they receive. To terminate the script or program, type CTRL-C in the console from which you launched it.
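
Under the hood, the gpio-keys driver exposes the slide switches as a standard Linux input device, so watching them boils down to reading struct input_event records from the corresponding /dev/input node. The sketch below illustrates that pattern; it is not the board's actual source, and the event node path is an assumption (the example may locate the device differently, for instance by scanning /proc/bus/input/devices).

/* Minimal sketch of watching a gpio-keys input device for switch events.
 * The /dev/input/event0 path is an assumption; the board's example may
 * discover the right node dynamically (e.g. via /proc/bus/input/devices).
 * Depending on the device tree, switches may report EV_SW instead of EV_KEY. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <linux/input.h>

int main(void)
{
    int fd = open("/dev/input/event0", O_RDONLY);    /* assumed gpio-keys node */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct input_event ev;
    while (read(fd, &ev, sizeof ev) == sizeof ev) {
        if (ev.type == EV_KEY || ev.type == EV_SW)   /* a switch changed state */
            printf("type %u code %u value %d\n",
                   (unsigned)ev.type, (unsigned)ev.code, ev.value);
    }
    close(fd);
    return 0;
}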

If you'd like to learn more about how the script and C program work, refer to the shell script and C program source files. To learn more about how the Linux gpio-keys framework works, open the README_gpio-keys.txt file using the text editor.

More Fun: Reading the Switch States

The ioctl version of the program will additionally print out the current state of all the switches at each input event. Build the 'watch_switch_events_ioctl.c' application by typing the following command:

./build_watch_switch_events_ioctl.sh

Run the application by typing:

./watch_switch_events_ioctl

Notice the output as you slide the switches back and forth. Terminate the program by typing CTRL-C. Refer to the C program source file for more details on how it works.
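
The extra capability of the ioctl variant (querying the current state of every input rather than waiting for events) is available through the standard evdev ioctls. The sketch below shows the general pattern with EVIOCGKEY; it is an illustration rather than the board's actual source, the device path is an assumption, and if the switches are registered as EV_SW the analogous EVIOCGSW request applies instead.

/* Minimal sketch of querying the current state of all inputs on an event
 * device via the evdev EVIOCGKEY ioctl (a bitmap of pressed keys).
 * The device path is an assumption, and if the switches are registered as
 * EV_SW the analogous EVIOCGSW request should be used instead. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/input.h>

int main(void)
{
    int fd = open("/dev/input/event0", O_RDONLY);          /* assumed gpio-keys node */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char keys[KEY_MAX / 8 + 1];
    memset(keys, 0, sizeof keys);
    if (ioctl(fd, EVIOCGKEY(sizeof keys), keys) < 0) {     /* snapshot of key states */
        perror("EVIOCGKEY");
        close(fd);
        return 1;
    }
    for (int code = 0; code < KEY_MAX; ++code)
        if (keys[code / 8] & (1u << (code % 8)))           /* bit set => currently on */
            printf("input code %d is currently active\n", code);

    close(fd);
    return 0;
}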

raw-gpio

Finally, we’ll play with the gpio peripherals connected to user push button 0.

With the terminal emulator navigate to the raw-gpio folder by typing the following command:

cd /examples/gpio/sandbox/raw-gpio/

This directory contains an example application that reads the state of push button 0 and reports its status (0 or 1). Just like the previous examples, there is both a script ('show_KEY0_pb_state.sh') and a C source file ('show_KEY0_pb_state.c') that accomplish the same task.

Build the 'show_KEY0_pb_state' Application

To build the 'show_KEY0_pb_state' application, type the following command in the terminal window:

./build_show_KEY0_pb_state.sh

The script compiles the 'show_KEY0_pb_state.c' source file and produces the executable 'show_KEY0_pb_state' application.

Run the Application

Once you've built the application, you can run either the script or the compiled application by typing the following commands into the terminal window:

  1. To run the script, type: ./show_KEY0_pb_state.sh
  2. To run the compiled application, type: ./show_KEY0_pb_state

Run the program several times, with the push button pressed and released, and notice the output in the terminal window.

The program and the script read a file associated with push button 0 and report its state.

If you'd like to learn more about how the script and C program work, refer to the shell script and C program source files. To learn more about how the Linux GPIO framework works, open the README_gpio.txt file using the text editor.

More Fun: Detect the Push Buttons

The C program 'poll_KEY0_pb_state.c' adds a poll() call to the 'show_KEY0_pb_state.c' program to demonstrate the interrupt functionality provided by the GPIO framework, detecting the push button press via a hardware interrupt.
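
For reference, the classic sysfs pattern behind this kind of interrupt-driven read is sketched below: poll() blocks on the GPIO's value file with POLLPRI until an edge interrupt fires, after which the file is rewound and re-read. This is only an illustration of the technique; the GPIO number is hypothetical (the real one for KEY0 comes from the board design and the example's README), and the shipped poll_KEY0_pb_state.c may differ in its details.

/* Minimal sketch of the sysfs GPIO interrupt pattern: poll() blocks on the
 * 'value' file with POLLPRI until an edge interrupt fires, then the file is
 * rewound and re-read. The GPIO number (NNN) is hypothetical, and the pin is
 * assumed to be exported already with its 'edge' attribute set (e.g. "both"). */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/class/gpio/gpioNNN/value", O_RDONLY);  /* hypothetical pin */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4];
    (void)read(fd, buf, sizeof buf);        /* initial read clears any pending state */

    struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
    if (poll(&pfd, 1, -1) > 0) {            /* blocks until the edge interrupt fires */
        lseek(fd, 0, SEEK_SET);             /* rewind before re-reading the value */
        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("KEY0 value after interrupt: %s", buf);
        }
    }
    close(fd);
    return 0;
}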

Build the 'poll_KEY0_pb_state.c' application by typing the following command:

./poll_KEY0_pb_state.sh

Run the application by typing:

./poll_KEY0_pb_state

Notice that the interrupt-enabled version of the program waits until the button is pressed before reporting the action and terminating.

Refer to the C program source file for more details on how it works. To learn more about how the Linux GPIO framework works, open the README_gpio.txt file using the text editor.

