Channel: Intel Developer Zone Articles

Tuning SIMD vectorization when targeting Intel® Xeon® Processor Scalable Family


Introduction

The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake.

For best possible performance on the Intel Xeon Processor Scalable Family, applications should be compiled with the processor-specific option [Q]xCORE-AVX512 using the Intel® C++ and Fortran compilers. Note that applications built with this option will not run on non-Intel processors or on older processors that do not support these instruction sets.

Alternatively, applications may also be compiled for multiple instruction sets targeting multiple processors; for example, [Q]axCORE-AVX512,CORE-AVX2 generates a fat binary with code-paths optimized for both CORE-AVX512 (codenamed Skylake server) and CORE-AVX2 (codenamed Haswell or Broadwell) target processors, along with the default Intel® SSE2 code-path. To generate a common binary for the Intel Xeon Processor Scalable Family and the Intel® Xeon Phi™ x200 processor family, applications should be compiled with option [Q]xCOMMON-AVX512.
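For instance, a hypothetical Linux* compile line producing such a fat binary (the source file name is only a placeholder) could look like:

    $ icpc -axCORE-AVX512,CORE-AVX2 -O3 -c myapp.cpp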

 

What has changed?

It is important to note that choosing the widest possible vector width, 512-bit on the Intel Xeon Processor Scalable Family, may not always result in the best vectorized code for all loops, especially for loops with low trip-counts commonly seen in non-HPC applications.

Based on a careful study of applications from several domains, it was decided to introduce flexibility into SIMD vectorization for the Intel Xeon Processor Scalable Family: 512-bit ZMM register usage defaults to low and can be tuned higher, if beneficial. Developers may use the Intel compilers' optimization reports or Intel® Advisor to understand the SIMD vectorization quality and look for further opportunities.

Starting with the 18.0 and 17.0.5 Intel compilers, a new compiler option [Q/q]opt-zmm-usage=low|high is added to enable a smooth transition from the Intel® Advanced Vector Extensions 2 (Intel® AVX2) with 256-bit wide vector registers to the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with 512-bit wide vector registers. This new option should be used in conjunction with the [Qa]xCORE-AVX512 option.

By default with [Qa]xCORE-AVX512, the Intel compilers will opt for more restrained ZMM register usage which works best for some types of applications. Other types of applications, such as those involving intensive floating-point calculations, may benefit from using the new option [Q/q]opt-zmm-usage=high for more aggressive 512-bit SIMD vectorization using ZMM registers.

 

What to do to achieve higher ZMM register usage for more 512-bit SIMD vectorization?

There are three potential paths to achieve this objective. Here is a trivial example code for demonstration purposes only:

$ cat Loop.cpp
#include <math.h>
void work(double *a, double *b, int size)
{
    #pragma omp simd
    for (int i=0; i < size; i++)
    {
        b[i]=exp(a[i]);
    }
}

 

1. The best option, starting with the 18.0 and 17.0.5 compilers, is to use the new compiler option [Q/q]opt-zmm-usage=high in conjunction with [Qa]xCORE-AVX512 for higher usage of ZMM registers for potentially full 512-bit SIMD vectorization. Using this new option requires no source-code changes, and hence is much easier to use in achieving more aggressive ZMM usage for the entire compilation unit.

Compiling with the default options, the compiler emits a remark suggesting the use of the new option:

    $ icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 Loop.cpp
    remark #15305: vectorization support: vector length 4
    …
    remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 19.500
    remark #15478: estimated potential speedup: 5.260
    …

Compiling with the newly recommended option, the above remark goes away and the estimated speedup increases for this example, thanks to better SIMD gains with higher ZMM usage:

    $ icpc -c -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -qopt-report:5 Loop.cpp
    remark #15305: vectorization support: vector length 8
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 9.870
    remark #15478: estimated potential speedup: 10.110
    …

2. As an alternative to using this new compiler option, applications may choose to use the simdlen clause with the OpenMP simd construct to specify a higher vector length and achieve 512-bit SIMD vectorization. Note that this type of change is localized to the loop in question, and needs to be applied to other such loops as needed, following typical hotspot tuning practices. So, using this path requires modest source-code changes.

Using the simdlen clause we get better SIMD gains for this example:

    #pragma omp simd simdlen(8)
    for (int i=0; i < size; i++) …

    $ icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 8
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 9.870
    remark #15478: estimated potential speedup: 10.110
    …

3. Applications built with the [Qa]xCOMMON-AVX512 option already get higher ZMM register usage and, therefore, don't need to take any additional action via either of the above two paths. Note, however, that while such applications have the advantage of being able to run on a common set of processors supporting Intel AVX-512, such as the Intel Xeon Processor Scalable Family and the Intel® Xeon Phi™ x200 processor family, they may miss out on the smaller subset of processor-specific Intel AVX-512 instructions that are not generated with [Qa]xCOMMON-AVX512. Note also that some types of applications may perform better with the default [Q/q]opt-zmm-usage=low option.

 

Conclusion

Developers building compute intensive applications for the Intel Xeon Processor Scalable Family may choose to benefit from higher ZMM register usage for more aggressive 512-bit SIMD vectorization using the options discussed above.

 


Using Intel® Advisor and VTune™ Amplifier with MPI


Introduction

This article describes how to use Intel® Advisor and VTune™ Amplifier in a Linux* distributed environment. While specifically designed to collect performance data at the node and core level, both tools can be used with MPI. The document covers basic utilization with an MPI launcher from the command line, as well as more advanced customizations.

Running a profiler at scale is often of interest for problems that cannot be reduced to execution on a single node. While in some cases reducing the workload size is an effective way to save time and effort in the profiling stages, it is often difficult to determine the minimum workload size that may be used per node without modifying the performance characteristics of full production cases.

It is important to remember that a separate tool exists to record the details of the communication patterns and communication costs of an MPI application, the Intel® Trace Analyzer and Collector. The information provided by VTune™ Amplifier and Intel® Advisor is focused on the core and node performance, and complements the specific MPI communication details provided by the Intel® Trace Analyzer and Collector.

Unless mentioned explicitly in the text, the MPI task counts and node numbers used in the examples are not special in any way, and do not represent limitations of the tools. Note that this document is not intended to serve as a complete description of how to use VTune™ Amplifier or Intel® Advisor, and it assumes that the environment has been set up in advance by sourcing the relevant files.

VTune™ Amplifier with MPI codes

The latest version of VTune™ Amplifier contains improved reporting on MPI related metrics, and it is extremely simple to use from the command line. In general, the simplest way to execute it in a distributed environment is:

<mpilauncher> [options] amplxe-cl [options] -r [results_dir] -- <application> [arguments]

Note: In a distributed environment the results directory name is a requirement and may not be left blank. VTune™ Amplifier will exit immediately with an appropriate message if an output directory name is not provided.

Let’s see an example using the Intel® MPI Library launcher, mpirun, and the pre-defined collection hpc-performance. Assuming that the application of interest is run using 64 MPI ranks and 16 ranks per node, the command line would look like this:

mpirun -np 64 -ppn 16 amplxe-cl -collect hpc-performance -r ./hpc_out -- ./app

This will generate four directories, with names starting with hpc_out and followed by a dot and the host name for each of the nodes. If the host names are node1, node2, node3, and node4 the output directories would be: hpc_out.node1, hpc_out.node2, hpc_out.node3, hpc_out.node4. Within each directory there will be sixteen data directories, one per rank executed in the node.
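Optionally, the per-node results can also be summarized from the command line once the run completes. For example, assuming the directory names above, a summary for the first node could be generated with:

    amplxe-cl -report summary -r ./hpc_out.node1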

When run in this manner, VTune™ Amplifier presents data regarding MPI imbalance and details about the MPI rank on the critical path of execution in the Summary, under the CPU Utilization section, as shown in the figure to the left.

 

It also allows for grouping functions under each process so that they can be compared directly in the Bottom-up tab. The figure on the right shows details for the first two ranks of an application named lbs. This ability to compare the performance of MPI processes to one another can provide insights into load balancing issues and other performance characteristics critical for scalability.

 

It is important to note that, if not using the Intel® MPI Library, the command line must add the option -trace-mpi for VTune™ Amplifier to be able to collect MPI rank information and display it as described above. For details on the many capabilities and command line options available, please visit the VTune™ Amplifier Documentation Site.

mpirun -np 64 -ppn 16 amplxe-cl -collect hpc-performance -trace-mpi -r ./hpc_out -- ./app

Intel® Advisor with MPI codes

Intel® Advisor can be run from the command line in a similar manner to VTune™ Amplifier:

<mpilauncher> [options] advixe-cl [options] -- <application> [arguments]

An output directory name is not a requirement, since a directory will be created for each MPI rank used in the execution. Let’s consider a survey study using 64 MPI ranks and 16 ranks per node:

mpirun -np 64 -ppn 16 advixe-cl -collect survey -- ./app

This generates 64 output directories named rank.0, rank.1, ... rank.63. While Intel® Advisor does not automatically aggregate multiple rank data, this collection allows for manually studying the differences across ranks. Data for each rank can be loaded separately into the GUI for further analysis.
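For example, assuming the directories above, the result for the first rank could be opened in the GUI with a command along these lines:

    advixe-gui ./rank.0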

The example shown above corresponds to just one of the many collection types available in Intel® Advisor. For additional information on the extensive capabilities and command line options available please consult the Intel® Advisor Documentation Site. For a more detailed description of the remote collection process you can read the in-depth article Analyzing Intel® MPI applications using Intel® Advisor.

Selective MPI rank profiling

You can use selective profiling to reduce the size of the results collected by Intel® Advisor and VTune™ Amplifier. In many cases valuable information can be obtained by profiling a subset of all the MPI ranks involved. Typical profiling runs would include profiling all the MPI ranks inside a node, or a single MPI rank on each node.

When using Intel® MPI Library version 5.0.2 or newer the gtool option may be used to restrict the collection to a subset of ranks:

mpirun -np 64 -ppn 16 -gtool "amplxe-cl -collect hpc-performance -r hpc_out :0-15" ./app

In this example VTune™ Amplifier will generate a single directory named hpc_out.node1. Inside this directory there will be sixteen data directories with the profiling information for MPI ranks 0 through 15.

It is also possible, but more cumbersome, to do this with a configuration file or through command line arguments. When using the Intel® MPI Library the command above can be executed as:

mpirun -host node1 -n 16 amplxe-cl -collect hpc-performance -r hpc_out -- ./app : -host node2 -n 16 ./app : -host node3 -n 16 ./app : -host node4 -n 16 ./app

Or simply creating a configuration file with the following content:

$ cat ./config.txt
-host node1 -n 16 amplxe-cl -collect hpc-performance -r hpc_out -- ./app
-host node2 -n 16 ./app
-host node3 -n 16 ./app
-host node4 -n 16 ./app

And then using it with the mpi launcher:

mpirun -configfile ./config.txt

If not using the Intel® MPI Library, please check your MPI provider documentation for information on how to execute heterogeneous workloads.

Simultaneous Intel® Advisor and VTune™ Amplifier collections

The ability to perform selective MPI rank profiling means that you can easily use VTune™ Amplifier to profile activity in one node, while collecting core performance data with Intel® Advisor for a rank in a different node. The following example collects performance data for the first node using VTune™ Amplifier, and core performance data for the first rank on the second node using Intel® Advisor:

mpirun -np 64 -ppn 16 -gtool "amplxe-cl -collect hpc-performance -r hpc_perf :0-15" -gtool "advixe-cl -collect survey :16" ./app

This run will generate a single hpc_perf.node1 directory with 16 data directories inside - one per MPI rank. An additional rank.16 directory will be generated at the top level of the working directory with the collected Intel® Advisor data.

Summary

It is simple to collect performance data with Intel® Advisor and VTune™ Amplifier for MPI and hybrid MPI+threads codes, with the flexibility of profiling all ranks or just a subset of them. The command lines for both tools follow the standards described in their user guides and require no special configuration other than the minor details presented in this article.

 

VTune™ is a trademark of Intel Corporation or its subsidiaries in the U.S. and/or other countries. For details see Legal Information.

Contiguity Checking for Pointer Assignments in the Intel® Fortran Compiler


The Fortran 2008 Standard introduced the CONTIGUOUS attribute for assumed shape arrays and pointers. The CONTIGUOUS attribute may help the compiler optimize more effectively, see https://software.intel.com/en-us/articles/vectorization-and-array-contiguity. However, specifying the CONTIGUOUS attribute for an assumed shape array or pointer that may not be contiguous may lead to incorrect behavior at run-time. The Intel® Fortran Compiler version 18 introduces a new feature that checks at run-time whether the targets of contiguous pointer assignments are indeed contiguous.

Consider the following example:

module mymod         !  source file inc.F90
contains
  subroutine inc(b)
    real, dimension(:),             pointer :: b
#ifdef CONTIG
    real, dimension(:), contiguous, pointer :: c
#else
    real, dimension(:),             pointer :: c
#endif

    c => b
    c =  1.
  end subroutine inc
end module mymod

program drive_inc     ! source file drive_inc.F90
  use mymod
  implicit none
  integer, parameter          :: n=25, stride=5
  real, target,  dimension(n) :: a=0.
  real, pointer, dimension(:) :: b

  b => a(:n:stride)
  call inc(b)

  print '(5F5.0)', a
end program drive_inc

$ ifort inc.F90 drive_inc.F90; ./a.out
   1.   0.   0.   0.   0.
   1.   0.   0.   0.   0.
   1.   0.   0.   0.   0.
   1.   0.   0.   0.   0.
   1.   0.   0.   0.   0.

Without the contiguity attribute, results are correct.

$ ifort -DCONTIG inc.F90 drive_inc.F90; ./a.out
   1.   1.   1.   1.   1.
   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.

Here, pointer c is declared as contiguous, but the actual target is not, leading to incorrect results.

$ ifort -DCONTIG -check contiguous -traceback inc.F90 drive_inc.F90; ./a.out
forrtl: severe (408): fort: (32): A pointer with the CONTIGUOUS attributes is being made to a non-contiguous target.

Image              PC                Routine            Line     Source
a.out              0000000000405220  Unknown            Unknown  Unknown
a.out              0000000000402AEF  mymod_mp_inc_      11       inc.F90
a.out              0000000000402C23  MAIN__             9        drive_inc.F90
a.out              0000000000402A9E  Unknown            Unknown  Unknown
libc-2.17.so       00007F1A0A6B0B35  __libc_start_main  Unknown  Unknown
a.out              00000000004029A9  Unknown            Unknown  Unknown

With the new option -check contiguous (Linux* or OS X*) or /check:contiguous (Windows*), the version 18 compiler detects the assignment of a contiguous pointer to a non-contiguous target. The option -traceback (/traceback) identifies the function and source file line number at which the incorrect assignment took place. It is not necessary to compile with the debugging option -g (Linux or OS X) or /Zi (Windows) in order to get this traceback.

Migration from Intel® Cilk™ Plus to OpenMP* or Intel® Threading Building Blocks


Introduction

Starting with the Intel® C++ Compiler 18.0 Gold release, Intel® Cilk™ Plus constructs are marked as deprecated and will eventually be removed in a future release. Intel® Cilk™ Plus, introduced in 2010, was a one-stop programming model for enabling both multi-threading and vectorization in programs. The Intel® Cilk™ Plus specification is publicly available at https://www.cilkplus.org/specs. With the OpenMP* 4.0 SIMD extensions, we can enable both multi-threading and explicit vectorization using OpenMP pragmas. C++ programmers who prefer Intel® TBB as the threading model can continue to use TBB for threading and use the OpenMP 4.0 SIMD pragmas for enabling vectorization. Below is a table that quickly summarizes the options for moving from Intel® Cilk™ Plus to either OpenMP or Intel® TBB.

Migration Cheat Sheet

Task Parallelism
- Intel® Cilk™ Plus: cilk_spawn; cilk_sync
- OpenMP: #pragma omp task; #pragma omp taskwait
- Intel® TBB: task_group t; t.run([](){ }); t.wait()

Data Parallelism (Threading)
- Intel® Cilk™ Plus: cilk_for
- OpenMP: #pragma omp parallel for
- Intel® TBB: tbb::parallel_for()

Data Parallelism (Explicit vectorization of loops)
- Intel® Cilk™ Plus: #pragma simd
- OpenMP: #pragma omp simd
- Intel® TBB: TBB doesn't provide vectorization; use TBB for threading and #pragma omp simd for vectorization.

Data Parallelism (Vector functions)
- Intel® Cilk™ Plus: __declspec(vector()) or __attribute__((vector()))
- OpenMP: #pragma omp declare simd
- Intel® TBB: #pragma omp declare simd (use the OpenMP pragma)

Control the number of threads
- Intel® Cilk™ Plus: __cilkrts_set_param("nworkers", nthreads); or the environment variable CILK_NWORKERS=nthreads
- OpenMP: omp_set_num_threads(nthreads); or the environment variable OMP_NUM_THREADS=nthreads
- Intel® TBB: task_scheduler_init init(nthreads); or global_control c(global_control::max_allowed_parallelism, nthreads);

Each scenario listed above in the table is elaborated in sequence with simple code snippets.

Case 1: Task Parallelism

Below is a single-threaded recursive implementation that computes the nth Fibonacci number:

int fibonacci(int num) {
	int a, b;
	if (num == 0)
		return 0;
	if (num == 1)
		return 1;
	a = fibonacci(num - 1);
	b = fibonacci(num - 2);
	return(a+b);

}

In the above kernel, fibonacci(num-1) and fibonacci(num-2) can be computed independently. The procedure used to implement task level parallelism in Intel® Cilk™ Plus is demonstrated below:

int fibonacci(int num) {
	int a, b;
	if (num == 0)
		return 0;
	if (num == 1)
		return 1;
	a = cilk_spawn fibonacci(num - 1);
	b = fibonacci(num - 2);
	cilk_sync;
	return(a+b);

}

cilk_spawn allows the spawned function call to execute in parallel with the rest of the caller (the continuation), using the Cilk worker threads. cilk_sync is a barrier at which the spawned work must complete before the function proceeds.

Equivalent OpenMP code:

int fibonacci(int num) {
	int a, b;
	if (num == 0)
		return 0;
	if (num == 1)
		return 1;
#pragma omp task shared(a)
	a = fibonacci(num - 1);
#pragma omp task shared(b)
	b = fibonacci(num - 2);
#pragma omp taskwait
	return(a+b);

}

The OpenMP standard provides #pragma omp task (the equivalent of cilk_spawn), which is used here to annotate the sections of code to be executed as separate tasks. The shared clause specifies that the variables a and b are shared between threads. The task barrier is implemented using #pragma omp taskwait (the equivalent of cilk_sync).
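Note that OpenMP tasks are executed by the threads of an enclosing parallel region. A minimal, hypothetical call site for the function above could look like this (the single construct makes one thread create the initial tasks while the whole team executes them):

#include <cstdio>

int main() {
	int result;
#pragma omp parallel
#pragma omp single
	result = fibonacci(20);   // tasks created inside are run by the team
	std::printf("%d\n", result);
	return 0;
}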

Equivalent Intel® TBB code:

#include "tbb\task_group.h"
int fibonacci(int num) {
	int a, b;
	task_group p;
	if (num == 0)
		return 0;
	if (num == 1)
		return 1;
	p.run([&] {a = fibonacci(num - 1); });
	p.run([&] {b = fibonacci(num - 2); });
	p.wait();
	return(a+b);

}

Intel® TBB allows the developer to create a task group by instantiating the task_group class and using its run() member function to create new independent tasks. The wait() member function of task_group implements the barrier.

Case 2: Data Parallelism (Threading)

Consider the below loop which does vector addition in serial fashion:

void vector_add(float *a, float *b, float *c, int N) {
	for (int i = 0; i < N; i++)
		c[i] = a[i] + b[i];
}

The above operation is highly data parallel, and it can be done using multiple threads. Intel® Cilk™ Plus offers the cilk_for keyword for this, as shown below:

void vector_add(float *a, float *b, float *c, int N) {
	cilk_for (int i = 0; i < N; i++)
		c[i] = a[i] + b[i];
}

Annotating the loop with the cilk_for keyword tells the compiler to generate parallel code for the loop body.

Equivalent OpenMP code:

void vector_add(float *a, float *b, float *c, int N) {
#pragma omp parallel for
	for (int i = 0; i < N; i++)
		c[i] = a[i] + b[i];
}

#pragma omp parallel for (the equivalent of cilk_for) is the OpenMP way of expressing parallelism for the loop.

Equivalent Intel® TBB code:

#include "tbb/parallel_for.h"
void vector_add(float *a, float *b, float *c, int N) {
	tbb::parallel_for(0, N, 1, [&](int i) { c[i] = a[i] + b[i]; });
}

tbb::parallel_for is a template function which performs parallel iteration over the above range.

Case 3: Data Parallelism (Explicit Vectorization)

Consider the below loop which does Gaussian Filter in serial fashion:

void GaussianFilter(Mat &output, Mat &input)
{
	float filter[3][3] = {{0.04491, 0.12210, 0.04491}, {0.12210, 0.33191, 0.12210}, {0.04491, 0.12210, 0.04491}};
	int rows = output.rows;
	int cols = output.cols;
	float value;
	for(int i = 1; i < rows; i++)
	{
		int index = i*cols+1;
		for(int j = index; j < index+cols; j++)
		{
			value = 0.0f;
			for(int k1 = -1; k1 <= 1; k1++)
			{
				int index1 = j+(k1*cols);
				for(int k2 = -1; k2 <= 1; k2++)
					value += filter[k1+1][k2+1]*input.data[index1+k2];
			}
			output.data[j] = value;
		}
	}
	return;
}

The above operation is highly data parallel. By default the compiler targets the innermost loop for vectorization, but here the innermost loop's trip count is just 3, so it makes sense to vectorize the outer loop (the loop with index j). Intel® Cilk™ Plus provides #pragma simd to annotate the loop, as shown below:

void GaussianFilter(Mat &output, Mat &input)
{
	float filter[3][3] = {{0.04491, 0.12210, 0.04491}, {0.12210, 0.33191, 0.12210}, {0.04491, 0.12210, 0.04491}};
	int rows = output.rows;
	int cols = output.cols;
	float value;
	for(int i = 1; i < rows; i++)
	{
		int index = i*cols+1;
		#pragma simd private(value)
		for(int j = index; j < index+cols; j++)
		{
			value = 0.0f;
			for(int k1 = -1; k1 <= 1; k1++)
			{
				int index1 = j+(k1*cols);
				for(int k2 = -1; k2 <= 1; k2++)
					value += filter[k1+1][k2+1]*input.data[index1+k2];
			}
			output.data[j] = value;
		}
	}
	return;
}

Equivalent OpenMP code:

void GaussianFilter(Mat &output, Mat &input)
{
	float filter[3][3] = {{0.04491, 0.12210, 0.04491}, {0.12210, 0.33191, 0.12210}, {0.04491, 0.12210, 0.04491}};
	int rows = output.rows;
	int cols = output.cols;
	float value;
	#pragma omp parallel for private(value)
	for(int i = 1; i < rows; i++)
	{
		int index = i*cols+1;
		#pragma omp simd private(value)
		for(int j = index; j < index+cols; j++)
		{
			value = 0.0f;
			for(int k1 = -1; k1 <= 1; k1++)
			{
				int index1 = j+(k1*cols);
				for(int k2 = -1; k2 <= 1; k2++)
					value += filter[k1+1][k2+1]*input.data[index1+k2];
			}
			output.data[j] = value;
		}
	}
	return;
}

The OpenMP 4.0 SIMD extensions offer #pragma omp simd, which is equivalent to #pragma simd in Intel® Cilk™ Plus and annotates loops that are good candidates for vectorization. In the above case, it also overrides the compiler's default behavior by enabling outer-loop vectorization.

Even if the developer chooses to use Intel® TBB threading for the outer loop, the loop with loop index j can be vectorized using OpenMP pragma as shown below:

void GaussianFilter(Mat &output, Mat &input)
{
	float filter[3][3] = {{0.04491, 0.12210, 0.04491}, {0.12210, 0.33191, 0.12210}, {0.04491, 0.12210, 0.04491}};
	int rows = output.rows;
	int cols = output.cols;
	float value;
	tbb::parallel_for(1, rows, 1, [&](int i)
	{
		int index = i*cols+1;
		#pragma omp simd private(value)
		for(int j = index; j < index+cols; j++)
		{
			value = 0.0f;
			for(int k1 = -1; k1 <= 1; k1++)
			{
				int index1 = j+(k1*cols);
				for(int k2 = -1; k2 <= 1; k2++)
					value += filter[k1+1][k2+1]*input.data[index1+k2];
			}
			output.data[j] = value;
		}
	});
	return;
}

Case 4: Data Parallelism (Vector Functions):

Consider the below code:

#include <iostream>
#include <stdlib.h>
__declspec(noinline)
void vector_add(float *a, float *b, float *c, int i){
        c[i] = a[i] + b[i];
}
int main(int argc, char *argv[])
{
        float a[100], b[100], c[100];
        for (int i = 0; i < 100; i++)
        {
                a[i] = i;
                b[i] = 100 - i;
                c[i] = 0;
        }
        for(int i = 0; i < 100; i++)
                vector_add(a,b,c,i);
        std::cout << c[0] << "\n";
    return 0;
}

Compiling the above loop produces the following vectorization report:

LOOP BEGIN at test.cc(16,2)
   remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.
LOOP END

Function calls traditionally take scalar arguments and return scalar values. By enabling SIMD mode for functions, they can accept vector arguments and return vector values, which enables vectorization of loop bodies that invoke these functions. Intel® Cilk™ Plus offers __declspec(vector(<clauses>)) on Windows and __attribute__((vector(<clauses>))) on Linux; when a function is annotated this way, the compiler generates a vector variant of the function body, as shown below:

#include <iostream>
#include <stdlib.h>
__declspec(noinline, vector(uniform(a,b,c), linear(i:1)))
void vector_add(float *a, float *b, float *c, int i){
        c[i] = a[i] + b[i];
}
int main(int argc, char *argv[])
{
        float a[100], b[100], c[100];
        for (int i = 0; i < 100; i++)
        {
                a[i] = i;
                b[i] = 100 - i;
                c[i] = 0;
        }
        for(int i = 0; i < 100; i++)
                vector_add(a,b,c,i);
        std::cout << c[0] << "\n";
    return 0;
}

Equivalent OpenMP Code:

#include <iostream>
#include <stdlib.h>
#pragma omp declare simd uniform(a,b,c) linear(i)
__declspec(noinline) void vector_add(float *a, float *b, float *c, int i){
        c[i] = a[i] + b[i];
}
int main(int argc, char *argv[])
{
        float a[100], b[100], c[100];
        for (int i = 0; i < 100; i++)
        {
                a[i] = i;
                b[i] = 100 - i;
                c[i] = 0;
        }
        for(int i = 0; i < 100; i++)
                vector_add(a,b,c,i);
        std::cout << c[0] << "\n";
    return 0;
}

The OpenMP 4.0 SIMD constructs support #pragma omp declare simd, which can be used to annotate a function so that the compiler generates a vector variant of it.
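Finally, for the last row of the cheat sheet (controlling the number of threads), here is a minimal sketch of the OpenMP and Intel® TBB equivalents of the Intel® Cilk™ Plus runtime controls; it assumes a TBB version recent enough to provide global_control (TBB 4.4 or later):

#include <omp.h>
#include "tbb/global_control.h"

int main() {
	const int nthreads = 4;

	// OpenMP: equivalent of __cilkrts_set_param("nworkers", nthreads)
	// or of setting the CILK_NWORKERS environment variable.
	omp_set_num_threads(nthreads);

	// Intel TBB: parallelism is capped while this object stays in scope,
	// so keep it alive around the parallel work.
	tbb::global_control limit(
		tbb::global_control::max_allowed_parallelism, nthreads);

	// ... run the OpenMP / TBB parallel work shown in the cases above ...
	return 0;
}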

Face It – The Artificially Intelligent Hairstylist


ABSTRACT

Face It is a mobile application that uses computer vision to acquire data about a user’s facial structure as well as machine learning to determine the user’s face shape. This information is then combined with manually inputted information to give the user a personalized set of hair and beard styles that are guaranteed to make the user look his best. A personalized list of tips is also generated for the user to take into account when getting a haircut.

1. INTRODUCTION

To create this application, various procedures, tools and coding languages were utilized.
The procedures that were used include:
(1) Computer vision with haar-Cascade files to detect a person’s face
(2) Machine learning, specifically using a convolutional neural network and transfer learning to identify a person’s face shape
(3) A preference sorting algorithm to determine what styles look best on a person based on collected data

The programs/tools that were used include:
(1) Ubuntu v17.04
(2) Android Studio
(3) Intel Optimized TensorFlow
(4) Intel’s OpenCV

The coding languages that were used include:
(1) Java
(2) Python

2. Computer Vision

For this application we used Intel’s OpenCV library along with haar cascade files to detect a person’s face.

Haar-like features are digital features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector. [1] A large number of these Haar-like features are put together to determine an object with sufficient accuracy, and the resulting files are called Haar-cascade classifier files. These methods were used and tested in the Viola-Jones object detection framework. [2]

In particular the Frontal Face Detection file is being used to detect the user’s face. This file, along with various other haar-cascade files can be found here: http://alereimondo.no-ip.org/OpenCV/34.

This library and file was incorporated into our application to ensure that the user’s face is detected since the main objective is to determine the user’s face shape.

Figure 1: Testing out the OpenCV Library as well as the Frontal Face Haar-Cascade Classifier file in real-time.

OpenCV was integrated into Android’s camera2 API in order for this real-time processing to occur. An Android device with an API level of 21 or higher is required to run tests and use the application, because the camera2 API can only be used by phones of that version or greater.
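To make the detection step concrete, below is a minimal desktop sketch using OpenCV's C++ API; the application itself uses the OpenCV Java bindings inside the camera2 pipeline, and the file names here are placeholders:

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
    // Load the frontal-face Haar-cascade classifier (path is a placeholder).
    cv::CascadeClassifier face_cascade;
    if (!face_cascade.load("haarcascade_frontalface_default.xml"))
        return 1;

    // Read one frame, convert to grayscale, and equalize for robustness.
    cv::Mat frame = cv::imread("frame.jpg");
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    // Detect faces; each detection is returned as a bounding rectangle.
    std::vector<cv::Rect> faces;
    face_cascade.detectMultiScale(gray, faces, 1.1, 3);

    // In the app, the detected region would be passed on to the classifier.
    return faces.empty() ? 2 : 0;
}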

3. Machine Learning

3.1 Convolutional Neural Networks

For the facial recognition aspect of our application, we used machine learning with a convolutional neural network (CNN).

CNNs are very commonly associated with image recognition, and they can be trained with relatively little difficulty. A well-trained CNN achieves high accuracy when classifying images.

CNN architectures are inspired by biological processes and include variations of multilayer receptors that result in minimal amounts of preprocessing. [3] In a CNN, there are multiple layers that each have distinct functions to help us recognize an image. These layers include a convolutional layer, pooling layer, rectified linear unit (ReLU) layer, fully connected layer and loss layer.

Figure 2: A diagram of a convolutional neural network in action[4]

- The Convolutional layer acts as the core of any CNN. It develops a 2-dimensional activation map that records the response of each filter at every spatial position set by the layer's parameters.

- The Pooling layer acts as a form of down sampling. Max Pooling is the most common implementation of pooling. Max Pooling is ideal when dealing with smaller data sets which is why we are choosing to use it.

- The ReLU layer is a layer of neurons which applies an activation function to increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolutional layer itself.

- The Fully Connected Layer, which occurs after several convolutional and max pooling layers, does the high-level reasoning in the neural network. Neurons in this layer have connections to all the activations in the previous layer. The activations for the Fully Connected layer are then computed by a matrix multiplication followed by a bias offset.

- The Loss layer specifies how the network training penalizes the deviation between the predicted and true labels. Softmax loss is the best fit for this application, as it is ideal for selecting a single class from a set of mutually exclusive classes.

3.2 Transfer Learning with TensorFlow

The layers of a CNN can be connected in various different orders and variations. The order depends on what type of data you are using and what kind of results you are trying to get back.

There are various well-known CNN models that have been created and released to the public for research and use. These models include AlexNet [5], which uses two GPUs to train the model and a combination of separate and merged layers. This model was entered in the ImageNet Large Scale Visual Recognition Competition [6] in 2012 and won. Another example is VGGNet [7], a very deep network that uses many convolutional layers in its architecture.

A very popular CNN architecture for image classification is Google's Inception (GoogLeNet) family of models. The original GoogLeNet model was entered in the ImageNet Large Scale Visual Recognition Competition in 2014 and won; Inception v3 is a later refinement of that architecture and is the model used here.

Figure 3: A diagram of Google’s Inception v3 convolutional neural network model[8]

As you can see, there are various convolutional, pooling, ReLU, fully connected and loss layers being used in a specific order which will help output extremely accurate results when trying to classify an image.

This model is so well put together that many developers use a method called transfer learning with the Inception v3 model. Transfer learning is a technique that shortens the process of training a model from scratch by taking a fully-trained model from a set of categories like ImageNet and re-training it with the existing weights but for new classes.

Figure 4: Diagram showing the difference between Traditional Machine Learning and Transfer Learning[9]

To use the process of transfer learning for the application, TensorFlow was used along with a Docker image. This image had all the repositories needed for the process. Then the Inception v3 re-train model was loaded on to TensorFlow where we were able to re-train it with the dataset needed for our application to recognize face shapes.

Figure 5: How the Inception v3 model looks during the process of transfer learning[10]

During the process of transfer learning, only the last layer of the pre-trained model is removed and retrained. This is where the dataset for our application was fed in for training. The model applies all the knowledge it acquired from its original training data to classify the new data as accurately as possible.

This is the beauty of transfer learning, and it is why using this technique can save so much time while remaining extremely accurate. Through a re-training algorithm, the images within the dataset were passed through the last layer of the model and the model was accurately re-trained.

3.3 Dataset

There are many popular datasets that were created and collected by many individuals to help further the advancement and research of convolutional neural networks. One common dataset used is the MNIST dataset for recognizing handwritten digits.

Figure 6: Example of the MNIST dataset that is used for training and recognizing hand written digits. [11]

This dataset consists of thousands of images of handwritten digits, and people can use this dataset to train and test the accuracies of their own convolutional neural networks. Another popular dataset is the CIFAR-10 [12] dataset that consists of thousands of images of 10 different objects/animals: an airplane, an automobile, a bird, a cat, a deer, a dog, a frog, a horse, a ship and a truck.

Large amounts of data are valuable but very hard to collect, which is why many ready-made collections exist for practice and training.

The objective of our CNN model was to recognize a user’s face shape and in order for it to do so, it was fed various images of people with different face shapes.

The face shapes were categorized into six different shapes: square, round, oval, oblong, diamond and triangular. A folder was created for each face shape and each folder contained various images of people with that certain face shape.

Figure 7: Example of the contents inside the folder for the square face shape

These images were gathered from various reliable articles about face shapes and hairstyles. We made sure to collect as accurate data as possible to get the best results. In total we had approximately 100 images of people with each type of face shape within each folder.

Figure 8: Example of a single image saved in the square face shape folder.[13]

These images were fed and trained through the model for 4000 iterations (steps) to get maximum accuracy.
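For reference, a re-training run with the stock TensorFlow retrain.py script looks roughly like the following; the directory and file names are placeholders rather than the project's actual paths:

    python retrain.py \
        --image_dir ./face_shapes \
        --how_many_training_steps 4000 \
        --bottleneck_dir ./bottlenecks \
        --output_graph ./retrained_graph.pb \
        --output_labels ./retrained_labels.txt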

While these images were being trained, various bottleneck files were created. A bottleneck stores the output of the layer just before the final classification layer for each image, so that repeated training passes do not have to recompute it.

Figure 9: Various bottlenecks being created while re-training the Inception v3 CNN

A few other files are also created, including a retrained graph that contains everything needed to recognize the categories the model was just trained on.

This file is fine to use as-is for recognizing images on a computer, but to use it on a mobile device we have to compress it while preserving all the information necessary for it to remain accurate.

In order to do this we have to optimize the file to fit the size that we need. To do this we modify the following features of the file:

(1) We remove all nodes that aren't needed for a given set of input and output nodes

(2) We merge explicit batch normalization operations

After this we are left with two main files that we will load into Android Studio to use with our application.

Figure 10: Files that need to be imported into Android Studio

These files consist of the information needed to identify an image that the model has been trained to recognize once it is seen through a camera.

3.4 Accuracy

The accuracy of the retrained model is very important since the face shape being determined should be as accurate as possible for the user.

To have a high level of accuracy we had to make sure that various factors were taken into account. We had to make sure that we had a sufficient number of images for the model to be trained on. We also had to make sure that the model trained on the images for a sufficient number of iterations.

For the first few trials we were getting a lot of mixed results and the accuracy for a predicted face shape was all over the place. For one image we were getting an 82% accuracy while for another image we were getting a 62% accuracy. This was obviously not good and we wanted to have much more accurate and precise data.

Figure 11: An example of a low accuracy level that we were receiving with our initial dataset.

At first we were using approximately 50 images of each face shape but to improve our low accuracy we increased this number to approximately 100 images of each face shape. These images were carefully hand-picked to fit the needs of our application and face shape recognition software. We wanted to reach a benchmark average accuracy of approximately 90%.

Figure 12: An example of a high accuracy level we were receiving after the changes we made with the dataset.

After these adjustments we saw a huge difference with our accuracy level and reached the benchmark we were aiming for. When it came time to compress the files necessary for the face shape detection software to work, we made sure that the accuracy level was not affected.

For ease of use by the user, after testing the accuracy levels of our application, we adjusted the code to output the highest percentage face shape that it detected in a simple and easy to read sentence rather than having various percentages appearing on the screen.

4. Application Functionality

4.1 User Interface

The user interface of the application consists of three main screens:

(1) The face detection screen with the front-side camera. This camera screen will appear first so that the user can figure out his face shape right away with no hesitation. After the face shape detector has figured out the user’s face shape, the user can click on the “Preferences” button to go to the next screen.

(2) The next screen is the preferences screen where the user inputs information about himself. The preference screen will ask the user to input certain characteristics about himself including the user’s face shape that he just discovered through the first screen (square, round, oval, oblong, diamond or triangular), the user’s hair texture (straight, wavy or coiled), the user’s hair thickness (thick or thin), if the user has facial hair (yes or no), the acne level of the user (none, moderate, excessive or prefer not to answer), and the lifestyle of the user (business, athlete  or student). After the user has selected all of his preferences he can click on the “Get Hairstyles!” button to go to the final screen.

(3) The final output screen is where a list of recommended hair/ beard styles along with tips the user can use when getting a haircut will be presented. The preferences that the user selects will go through a sorting algorithm that was created for this application. Afterwards, the user will be able to swipe through the final recommendation screen and be able to choose from various hair/beard styles. An image will complement each style so the user has a better idea of how the style looks. Also a list of tips will be generated so that the user will know what to say to his barber when getting a haircut.

Figure 13: This is a display of all the screens of the application. From left to right: Face shape detection screen, preferences screen, final recommendation screen with tips that the user can swipe through.

The application was meant to have a very simplistic design so we chose very basic complementary colors and a simple logo that got the point of the application across. To integrate our ideas of how the application should look into Android Studio we made sure to create a .png file of our logo and to take down the hexcolor code of the colors that we wanted to use. Once we had those, we used Android Studio’s easy to use user interface creator and added a layer for the toolbar and a layer for the logo.

4.2 Preference Sorting Algorithm

The preference screen was organized with six different spinners, one for every preference. Each option for each preference was linked to a specific array full of various different hair/beard styles that fit that one preference.

Figure 14: Snippet of the code used to assign each option of every preference an array of hairstyles.

These styles were categorized by doing extensive research on what styles fit every option within each preference. Then these arrays were compared to find the hairstyles common to every option the user chose.

For example, let’s say the user has a square face shape and straight hair. The hairstyles that look good with a square face shape may be a fade, a combover and a crew cut. These three hairstyles would be put into an array for the square face shape option. The hairstyles that look good with straight hair may be a combover, a crew cut and a side part. These three hairstyles would be put into an array for the straight hair option. Now these two arrays would be compared, and whatever hairstyles the two arrays have in common would be placed into a new, final array with the personalized hairstyles that look good for the user based on both the face shape and hair type preferences. In this case, the final array would consist of a combover and a crew cut, since these are the two hairstyles that both preferences had in common. These hairstyles would then be outputted and recommended to the user.

Figure 15: Snippet of the code used to compare the six different preference arrays so that one final personalized array of hairstyles can be formed.
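The same intersection idea can be sketched in a few lines; the app implements it in Java, and the style names below are invented purely for illustration:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main() {
    // Styles suited to a square face shape and to straight hair (sample data).
    std::vector<std::string> faceShape = {"combover", "crew cut", "fade"};
    std::vector<std::string> hairType  = {"combover", "crew cut", "side part"};

    // std::set_intersection requires sorted ranges.
    std::sort(faceShape.begin(), faceShape.end());
    std::sort(hairType.begin(), hairType.end());

    std::vector<std::string> recommended;
    std::set_intersection(faceShape.begin(), faceShape.end(),
                          hairType.begin(), hairType.end(),
                          std::back_inserter(recommended));

    for (const std::string& style : recommended)
        std::cout << style << "\n";   // prints "combover" and "crew cut"
    return 0;
}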

Once the final list of hairstyles is created, a matching array of images is built and used to create a gallery of personalized hairstyles that the user can swipe through to see what he likes and what he doesn’t like.

In addition, a list of tips is output for the user to view and take into consideration. These tips are based on the preferences the user selected. For example, if the user selected excessive acne, a tip may be to go for a longer hairstyle to keep the acne slightly hidden. These tips are generated by various if-statements and outputted on the final screen. Since this application cannot control every aspect of a user’s haircut, we hope that these tips will be taken into consideration by the user and used when describing to the barber what type of haircut the user is looking for.

Figure 16: An example of how the outputted tips would look for the user once he selects his preferences.

5. Programs and Hardware

5.1 Intel Optimized TensorFlow

TensorFlow was a key framework that made it possible for us to train our model and have our application actually detect a user’s face shape.

TensorFlow was installed onto the Linux operating system, Ubuntu by following this tutorial:

https://www.tensorflow.org/install/install_linux

Intel’s TensorFlow optimizations were installed by following this tutorial:

https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Intel has optimized the TensorFlow framework in various ways to help improve the results of training a neural network and using TensorFlow in general. They have made many modifications to help people use CPUs for this process through Intel® MKL (Math Kernel Library) optimized operations. They have also developed a custom pool allocator and a faster way to perform back propagation to further improve results.

After all this had been installed, Python was used to write commands to facilitate with the transfer learning process and to re-train the convolutional neural network.

5.2 Android Studio

Android Studio is the main development kit used to create the application and make it come to life. Since both TensorFlow and Android are Google projects, there are detailed tutorials explaining how to integrate the trained data from TensorFlow with Android Studio. [14] This made the process very simple as long as the instructions were followed.

Figure 17: Snippet of code that shows how the viewPager is used for sliding through various images

Android Studio also made it simple to create basic .xml files for the application. These .xml files were very customizable and allowed the original mock-ups of the application to come to life and take form. When creating these .xml files we were sure to click on the option to “infer constraints.” Without this option checked, the various displays such as the text-view box or the spinners would end up in random positions when the application is fully built. The application also runs very smoothly. Tutorials on how to connect two activities together [15] and how to create a view-pager image gallery [16] were used to help make the application easy to use and smooth.

Figure 18: An example of inferring constraints to make sure everything appears properly during the full build.

5.3 Mobile Device

A countless number of tests were required to make sure certain parts of the code were working whenever a new feature was added to the application. These tests were done on an actual Android smartphone that was given to us by Intel.

The camera2 API used for this application requires an Android phone with API level 21 (Android 5.0) or higher, so we used a phone model with an API level of 23. Though the camera was slow at times, the overall functionality of this device was great.

Whenever a slight modification was done to the code for this application, a full build and test was always done on this smartphone to ensure that the application was still running smoothly.

Figure 19: The Android phone we used with an API level of 23. You can see the Face It application logo in the center of the screen.

6. Summary and Future Work

Using various procedures, programs, tools and languages, we were able to form an application that uses computer vision to acquire data about a user’s facial structure and machine learning, specifically transfer learning, to detect a person’s face shape. We then put this information, as well as user-inputted information, through a preference sorting algorithm to output a personalized gallery of hairstyles for the user to view and choose from, as well as personalized tips the user can tell his barber when getting a haircut or take into consideration when styling or growing out his hair.

There is always room for improvement and we definitely plan to improve many aspects of this application including even more accurate face shape detection results, an even cleaner looking user interface and many more hair and beard styles for the user to choose and select from.

ACKNOWLEDGEMENTS

I would like to personally thank the Intel Student Ambassador Program for AI for supporting us through the creation of this application and for the motivation to keep on adding to it. I would also like to thank Intel for providing us with the proper hardware and software that was necessary for us to create and test the application.

ONLINE REFERENCES

[1]   https://en.wikipedia.org/wiki/Haar-like_features

[2]   https://www.cs.ubc.ca/~lowe/425/slides/13-ViolaJones.pdf

[3]   https://en.wikipedia.org/wiki/Convolutional_neural_network

[4]   https://www.mathworks.com/help/nnet/convolutional-neural-networks.html

[5]   https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

[6]   http://www.image-net.org/

[7]   http://www.robots.ox.ac.uk/~vgg/research/very_deep/

[8]   https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html

[9]   https://www.slideshare.net/hunkim/transfer-defectlearningnew-completed

[10] https://medium.com/@vinayakvarrier/significance-of-transfer-learning-in-the-domain-space-of-artificial-intelligence-1ebd7a1298f2

[11] http://yann.lecun.com/exdb/mnist/

[12] https://www.cs.toronto.edu/~kriz/cifar.html

[13] http://shaverguru.com/finding-a-great-beard-style-for-your-face/

[14] https://www.tensorflow.org/deploy/distributed

[15] https://developer.android.com/training/basics/firstapp/starting-activity.html

[16] https://developer.android.com/training/animation/screen-slide.html

VTune™ Amplifier 2018 License Upgrade


Note: If you have a pre-2018 serial number for Intel® VTune™ Amplifier as a standalone product with active support and you would like to download and install VTune Amplifier 2018 or later, the information below is relevant for you.  This does not apply to Parallel Studio XE licenses, even if they include VTune Amplifier.

There is an upgrade available for pre-2018 Intel® VTune™ Amplifier for Windows* or Linux* licenses with active support.  This is an optional upgrade and allows you to download the 2018 version of VTune Amplifier regardless of OS.

Why is the license changing for 2018?

As of update 2 of the 2017 release of VTune Amplifier, you can use your OS-specific license to download and run either the Linux* or Windows* version.  For example, if you have a single license for VTune Amplifier XE for Linux, you can also download and run the version for Windows, and vice versa. The Mac OS* host version has always been supported regardless of the OS associated with the license.

Pre-2018 VTune Amplifier license:

[Screenshots: 2017 serial number, license, and download pages]


VTune Amplifier 2018 no longer has separate licenses for the Linux* and Windows* versions.  We have created a single, standalone license to enable all supported platforms.  Upgrading your pre-2018 license will replace your OS-specific serial number with a full platform serial number.  If you do not upgrade, you will still be able to install and use all OS versions, but you will only be able to download the 2018 version corresponding to the OS on your valid license.  2017 version downloads are not affected.

VTune Amplifier license after upgrading:

[Screenshots: 2018 serial number, license, and download pages]

Note that although the 2018 Initial Release of VTune Amplifier for Linux download only has the option for Linux, the license gives you access to the Windows download page as well.

Do I need to upgrade my license?

Although it is recommended to keep your license current, you only need to upgrade your license if you have active support and want to download the 2018 version for an OS not listed on your license.

How do I upgrade my license?

From the Serial Numbers tab in the Intel Registration Center, click on the link for Intel VTune Amplifier for Linux (or Windows) under the Product Name column.  


It will show your pre-2018 serial number(s).  Click the Upgrade available >> link underneath.


Next you should see this page.  Click the Upgrade button to complete the process and receive your new serial number.


Once you've received your new serial number, the old one will be retired and invalid.  Make sure you replace your old license file.  If you have a floating license, follow these steps to replace the file on your license server.

Does the new license work with my older installed releases?

Yes. Once you download and install the 2018 or later release, the new license file will work with the latest release as well as with any existing older releases.  Upgrading the license will automatically retire the old serial number, so you must replace existing licenses using the old serial number with the new one.

Will my current license file work with the 2018 versions?

Yes. If you choose not to upgrade, the only limitation is that you may not be able to download a 2018 version not associated with your current license.  It does not affect installation and use of the product on any OS.

 

Have questions?

Check out the Licensing FAQ
Please visit our Get Help page for support options.

Usage of Intel® Performance Libraries with Intel® C++ Compiler in Visual Studio


 

Affected products: Intel® Parallel Studio XE 2016, 2017 or 2018; Intel® System Studio 2016, 2017
Visual Studio versions: all supported Visual Studio, see Intel C++ Compiler Release Notes for details.

Compilation of an application that uses the Intel® Performance Libraries with the Intel® C++ Compiler fails in Microsoft* Visual Studio and produces warnings like:
“Could not expand [MKL|TBB|DAAL|IPP] ProductDir variable. The registry information may be incorrect.”

There can be two root causes:

  1. The library was not installed with the selected version of Intel® C++ Compiler.
    The “Use Intel® MKL”, “Use Intel® DAAL”, “Use Intel® IPP” and “Use Intel® TBB” properties in Visual Studio mimic the behavior of the /Qmkl, /Qdaal, /Qipp and /Qtbb compiler options: they set up the include and library paths of the performance library installed together with the selected Intel® C++ Compiler (see the command-line example after this list).
    To fix the compilation, install the necessary performance libraries (Intel MKL, Intel DAAL, Intel IPP, and/or Intel TBB) from the same package from which the selected version of Intel® C++ Compiler was installed.
    If you need to use different versions of the Intel® Performance Libraries with the Intel® C++ Compiler in Visual Studio, do not use the “Use Intel® MKL”, “Use Intel® DAAL”, “Use Intel® IPP” and “Use Intel® TBB” properties; instead, manually specify the paths to the headers and libraries of the performance library in
    “Project” > “Properties” > “VC++ Directories” and the libraries in
    “Project” > “Properties” > “Linker” > “Input” > “Additional Dependencies”.
    For more information on the correct paths and the list of libraries, see the Intel® Math Kernel Library, Intel® DAAL, Intel® Integrated Performance Primitives, and Intel® Threading Building Blocks documentation.
  2. Installation failed and the registry is incorrect.
    Workaround: Repair the Intel® Parallel Studio XE / Intel® System Studio installation. If it still does not work, please report the issue to Intel Online Service Center.
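For reference, outside Visual Studio the same behavior comes from the corresponding compiler options on the command line; for example, with a placeholder source file name:

    icl /Qmkl myapp.cpp      (Windows*)
    icpc -mkl myapp.cpp      (Linux*)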

Introducing Batch GEMM Operations


The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently. When multiplying small matrices, however, individual GEMM calls may not optimally use all the cores. Developers wanting to improve utilization usually batch multiple independent small GEMM operations into a group and then spawn multiple threads for different GEMM instances within the group. While this is a classic example of an embarrassingly parallel approach, making it run optimally requires a significant programming effort that involves threads creation/termination, synchronization, and load balancing. That is, until now. 

Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM". This allows users to achieve the same objective described above with minimal programming effort. Users can specify multiple independent GEMM operations, which can be of different matrix sizes and different parameters, through a single call to the "Batch GEMM" API. At runtime, Intel MKL will intelligently execute all of the matrix multiplications so as to optimize overall performance. Here is an example that shows how "Batch GEMM" works:

Example

Let A0, A1 be two real double precision 4x4 matrices; Let B0, B1 be two real double precision 8x4 matrices. We'd like to perform these operations:

C0 = 1.0 * A0 * B0^T, and C1 = 1.0 * A1 * B1^T

where C0 and C1 are two real double precision 4x8 result matrices. 

Again, let X0, X1 be two real double precision 3x6 matrices; Let Y0, Y1 be another two real double precision 3x6 matrices. We'd like to perform these operations:

Z0 = 1.0 * X0 * Y0^T + 2.0 * Z0, and Z1 = 1.0 * X1 * Y1^T + 2.0 * Z1

where Z0 and Z1 are two real double precision 3x3 result matrices.

We could accomplish these multiplications using four individual calls to the standard DGEMM API. Instead, here we use a single "Batch GEMM" call to do the same, with potentially improved overall performance. We illustrate this using the "cblas_dgemm_batch" function in the example below.

#include <mkl.h>

#define    GRP_COUNT    2

MKL_INT    m[GRP_COUNT] = {4, 3};
MKL_INT    k[GRP_COUNT] = {4, 6};
MKL_INT    n[GRP_COUNT] = {8, 3};

MKL_INT    lda[GRP_COUNT] = {4, 6};
MKL_INT    ldb[GRP_COUNT] = {4, 6};
MKL_INT    ldc[GRP_COUNT] = {8, 3};

CBLAS_TRANSPOSE    transA[GRP_COUNT] = {CblasNoTrans, CblasNoTrans};
CBLAS_TRANSPOSE    transB[GRP_COUNT] = {CblasTrans, CblasTrans};

double    alpha[GRP_COUNT] = {1.0, 1.0};
double    beta[GRP_COUNT] = {0.0, 2.0};

MKL_INT    size_per_grp[GRP_COUNT] = {2, 2};

// Total number of multiplications: 4
double    *a_array[4], *b_array[4], *c_array[4];
a_array[0] = A0, b_array[0] = B0, c_array[0] = C0;
a_array[1] = A1, b_array[1] = B1, c_array[1] = C1;
a_array[2] = X0, b_array[2] = Y0, c_array[2] = Z0;
a_array[3] = X1, b_array[3] = Y1, c_array[3] = Z1;

// Call cblas_dgemm_batch
cblas_dgemm_batch (
        CblasRowMajor,
        transA,
        transB,
        m,
        n,
        k,
        alpha,
        a_array,
        lda,
        b_array,
        ldb,
        beta,
        c_array,
        ldc,
        GRP_COUNT,
        size_per_grp);



The "Batch GEMM" interface resembles the GEMM interface. It is simply a matter of passing arguments as arrays of pointers to matrices and parameters, instead of as matrices and the parameters themselves. We see that it is possible to batch the multiplications of different shapes and parameters by packaging them into groups. Each group consists of multiplications of the same matrices shape (same m, n, and k) and the same parameters. 

Performance

While this small example does not show the performance advantages of "Batch GEMM", the advantages become apparent once you have thousands of independent small matrix multiplications. The chart below shows the performance of 11K small matrix multiplications of various sizes using "Batch GEMM" and the standard GEMM, respectively. The benchmark was run on a 28-core Intel Xeon processor (Haswell). The performance metric is Gflops; higher bars mean higher performance, that is, a faster solution.

The second chart shows the same benchmark running on a 61-core Intel Xeon Phi co-processor (KNC). Because "Batch GEMM" is able to exploit parallelism across many concurrent threads, its advantages are more evident on architectures with a larger core count.

Summary

This article introduces the new API for batch computation of matrix-matrix multiplications. It is an ideal solution when many small independent matrix multiplications need to be performed. "Batch GEMM" supports all precision types (S/D/C/Z). It has Fortran 77 and Fortran 95 APIs, and also CBLAS bindings. It is available in Intel MKL 11.3 Beta and later releases. Refer to the reference manual for additional documentation.  



Boost Quality and Performance of Media Applications with the Latest Intel HEVC Encoder/Decoder


by Terry Deem, product marketing engineer, Intel Developer Products Division 

 

Media and video application developers can tune for even more brilliant visual quality and faster performance with the new HEVC technology inside the just released Intel® Media Server Studio Professional Edition (2017 R3). In this new edition, key enhancements to the analysis tools provide better and deeper insights into application performance characteristics, so developers can save time targeting and fixing optimization areas. The Intel Media Server Studio Professional Edition includes easy-to-use video encoding and decoding APIs, and visual quality and performance analysis tools that help media applications deliver higher resolutions and frame rates.

What's New

For this release, Intel improved the visual quality of its HEVC encoder and decoder. Developers can optimize for subjective video quality gains of 10% by using Intel’s HEVC encoder combined with Intel® Processor Graphics (GPU). Developers can also optimize for subjective quality gains of 5% by using the software-only HEVC encoder. Other feature enhancements include:

  • Improved compression efficiency for frequent Key Frames when using TU 1 / 2 encoding.
  • Support for MaxFrameSize for constrained variable bitrate encoding, which results in better quality streaming.
  • Advanced user-defined bitrate control. The software and GPU-accelerated HEVC encoders expose distortion metrics that can be used in concert with recently introduced external bitrate controls to support customized rate control for demanding customers.

[Chart: Intel® Media Server Studio 2017 R3 Professional Edition benchmark]

Benchmark Source: Intel Corporation. See below for additional notes and disclaimers.1

 

Optimize Performance and More with Super Component Tools Inside

Intel® VTune™ Amplifier

The Intel Media Server Studio Professional Edition includes several other tools that help developers understand their code and take advantage of the advanced Intel on-chip GPU. One of the more powerful analysis tools is Intel® VTune™ Amplifier. This tool allows for GPU in-kernel profiling, which helps developers quickly find memory latency issues or inefficient kernel algorithms. It also provides a GPU Hotspots summary view, which includes Packet Queue Depth and Packet Duration histograms for analyzing DMA packet execution.

With Intel® VTune™ Amplifier, developers can also:

  • Detect GPU stalled/idle issues to improve application performance.
  • Find GPU hotspots for determining compute bound tasks hindered by GPU L3 bandwidth or DRAM bandwidth.
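These GPU analyses can also be launched from the command line. Below is a minimal sketch, assuming the VTune Amplifier command-line tool (amplxe-cl) from that product generation is on the PATH; the application name is a placeholder, and the analysis-type name should be checked against your version's documentation:

$ amplxe-cl -collect gpu-hotspots -- ./my_media_app
$ amplxe-cl -report summary -r <result_dir>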

 

Intel® SDK for OpenCL™ Applications

Intel® SDK for OpenCL™ Applications (2017 version) is also included in this release of Intel Media Server Studio Professional Edition. This SDK adds OpenCL support for additional operating systems and platforms. It is fully compatible with Ubuntu* 16.04 and the latest OpenCL™ 2.0 CPU/GPU driver package for Linux* OS (SRB5).

 

About Intel Media Server Studio

Intel Media Server Studio lets video delivery and cloud graphics solution providers quickly create, optimize, and debug media applications, and enable real-time video transcoding for media delivery, cloud gaming, remote desktop, and other solutions targeting the latest Intel® Xeon® and Intel® Core™ processor-based platforms.**

 

More Resources

 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Benchmark Source: Intel Corporation.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice revision #20110804. 

**Specific technical specifications apply. Review product site for more details.

The 32-bit wrappers are deprecated in Intel® Compiler 18.0


Version:

  • Intel C++ Compiler 18.0
  • Intel Fortran Compiler 18.0

Operating System: Linux*

Support for the 32-bit wrappers is deprecated in the Intel® Compiler 18.0 and will be removed in a future version. A deprecation message is issued when any wrapper program in the "bin/ia32" directory is invoked.

For example:

$ /opt/intel/compilers_and_libraries_2018/linux/bin/ia32/icc
icc: remark #10421: The IA-32 target wrapper binary 'icc' is deprecated.  Please use the compiler startup scripts or the proper Intel(R) 64 compiler binary with the '-m32' option to target the intended architecture

Compiler users need to change their environment so that the 32-bit wrappers are no longer invoked. If the compiler startup scripts are not used, they also need to make sure the -m32 option is provided when generating 32-bit code.
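Here is a minimal sketch of the two alternatives; the source file name is a placeholder, and the installation path assumes a default /opt/intel layout:

# Option 1: source the compiler startup script for the ia32 target, then invoke icc as usual
$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh ia32
$ icc -c hello.c

# Option 2: invoke the Intel(R) 64 compiler binary directly with -m32
$ /opt/intel/compilers_and_libraries_2018/linux/bin/intel64/icc -m32 -c hello.c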

Refer to the Intel® Parallel Studio XE 2018 Composer Edition product documentation for additional details.

How to upgrade an existing floating license manager on Linux


As of version 2.5, the Intel® Software License Manager download package for Linux* uses an installer similar to the one used by Intel software development tools.  This is a change from older versions, which were simply extracted to a folder.  If you already have a pre-2.5 version of the license manager installed and running on your server, follow the steps below to upgrade to version 2.5 or higher.

Note that starting the license manager may change the port number used by the INTEL vendor daemon.  Be sure to note the port number and allow it through your firewall if necessary.

To install the new version using the recommended defaults:

  1. Download the latest version of the Intel Software License Manager for Linux package here
    1. Also see the User's Guide for more detailed installation instructions
  2. Extract the package to a temporary folder
  3. Shut down the current license manager (the lmgrd and INTEL daemons should be stopped)
  4. From the new license manager folder, run the installer
    1. From the command-line, run install.sh
    2. From the GUI, run install_GUI.sh
  5. Enter the serial number for your license or path to existing license(s) when prompted
  6. Continue with the installation using the defaults

After the installation completes successfully, the new license manager will be started automatically.  The default installation folder is /opt/intel/licenseserver.  If you entered a serial number, it will create and use the license file in /opt/intel/serverlicenses.

If you prefer to keep the existing installation and update the files manually:

Note that by following these steps you will not be able to use the new uninstall process, and it may cause difficulties with future license manager upgrades.
  1. Download the latest version of the Intel Software License Manager for Linux package here
    1. Also see the User's Guide for more detailed installation instructions
  2. Extract the package to a temporary folder
  3. From the new license manager folder, run the installer as a non-root user
  4. Ignore the warning about the lmgrd service running
  5. When prompted for the license, select the option to use the local license file.  Then enter the path to a valid license file.
  6. You may change the installation location in the next step
  7. When starting the installation, it will once again detect that lmgrd is running.  Ignore this message and continue.
  8. You may see an error that it could not create folder /opt/intel/serverlicenses.  You can ignore this message.
  9. After the installation completes, shut down the older license manager processes (lmgrd and INTEL daemons)
  10. Copy the contents of the new licenseserver folder to your existing license manager folder (such as flexlm)
  11. Restart lmgrd (see the command sketch after these steps)
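A minimal sketch of stopping and restarting the license manager from its installation folder, as required by steps 9 and 11; the folder and license file names are placeholders, and lmutil and lmgrd are the standard FlexNet* utilities shipped with the license manager:

$ cd /opt/intel/flexlm
$ ./lmutil lmdown -c server.lic          # stop lmgrd and the INTEL vendor daemon
$ ./lmgrd -c server.lic -l lmgrd.log     # restart, logging to lmgrd.log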

Have questions?

Check out the Licensing FAQ
Please visit our Get Help page for support options.

Why won't the Intel® Software License Manager start on Windows*?


Problem:

There is a known issue that may prevent the latest version of the Intel® Software License Manager from starting in Windows*.  Although the license manager process (lmgrd) may be able to start during installation, you may get an error when trying to restart using the ilmconfig.exe or lmtools.exe utilities.  This is the error you will see:

[Screenshot: license manager log error]

Solution:

Although the error message implies that the problem is with the license file, this is not the case.  The ilmconfig.exe and lmtools.exe utilities may try to write to a log file in a folder where users do not have write permission.  This file defaults to the Intel Software License Manager installation folder (C:\Program Files\Intel\licenseserver\Iflexlm.log), as defined in the lmtools.exe Config Services tab:

[Screenshot: lmtools.exe Config Services configuration]

The path to the debug log file must point to a folder where the user has write permission, because the utility does not run with administrator privileges.  You can either change this value or modify the permissions on the existing folder/file (see the example below).
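For example, the debug log path in the Config Services tab could be changed to a user-writable location such as the one below; the exact folder is only an illustration, and any folder the user account can write to will work:

C:\Users\Public\Iflexlm.log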

 

Have questions?

Check out the Licensing FAQ
Please visit our Get Help page for support options.

 


Intel® System Studio 2018 Beta Documentation


Release Notes and What's New

System Requirements and Installation

Getting Started Guides

User Guides

Developer, User, and Reference Guides

Compilers:

Threading and Performance Libraries:

Performance Analyzers:

Debuggers:

Downloadable Documentation

For your offline use, you can download the full documentation package. Click one of these links to start the download:
.tar.gz or .zip
