Improving Performance of Math Functions with Intel® Math Kernel Library


Introduction

Intel® Math Kernel Library1 (Intel® MKL) is a product that accelerates math processing routines to increase the performance of an application when running on systems equipped with Intel® processors. Intel MKL includes linear algebra, fast Fourier transforms (FFT), vector math, and statistics functions.

To illustrate performance improvement using Intel MKL, this paper selects matrix multiplication operation as an example. Matrix multiplication operation is used here because it is a fundamental mathematical operation that has many applications across most scientific fields.

Performance Test Procedure

To demonstrate how Intel MKL can help improve the performance of matrix operation, we used a code sample downloaded from GitHub.

The tests were done on two systems; one system equipped with the Intel® Xeon® processor E5-2699 v4 and the other equipped with the Intel® Xeon® Platinum 8180 processor.

The performance was measured by comparing the time, in seconds, it takes to compute the matrix multiplication.

The tests were done using the following steps:

  1. Measuring the time (in seconds) it takes to complete 2000 x 2000, 4000 x 4000, and 10000 x 10000 matrix multiplications using different methods of optimization. Figure 1 shows how to specify the matrix sizes and the optimization method options. More information about these methods can be found in the link above.

     


    Figure 1. Matrix size specifications and optimization method options.

    Figure 1 shows different optimized methods. Option 2 optimizes the matrix multiplication using vectorized sdot with Intel® Streaming SIMD Extensions (Intel SSE) and option 7 utilizes option 2 with loop tiling. All measurements were collected on the system equipped with the Intel Xeon processor E5-2699 v4.

  2. Comparing results and selecting the best results to be used for later steps.
  3. Repeating steps 1 and 2 on the system equipped with the Intel Xeon Platinum 8180 processor.
  4. Creating a new matrix multiplication function using Intel MKL in the file matmul.c (a sketch of such a call appears after this list).

    This involved two steps:

    a) Adding the mkl include file as follows:

         #include <mkl.h>

    b) Making a call to the mkl function cblas_sgemm as follows:

         cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n_a_rows, n_b_cols, n_a_cols, 1.0f, a[0], n_a_rows, b[0], n_b_rows, 0.0f, m[0], n_a_rows);

  5. Running the test again with the Intel MKL function implemented. Measuring the time it takes to do the matrix multiplication for 2000 x 2000, 4000 x 4000, and 10000 x 10000 matrices.
  6. Comparing the results in step 5 with the best results in steps 2 and 3.
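
To illustrate step 4, here is a minimal sketch of such a call for square n x n matrices in row-major layout (the function and variable names are ours for illustration; the matmul.c sample organizes its data differently):

#include <mkl.h>

/* Minimal sketch: c = a * b for square n x n matrices stored row-major.
   For square matrices, every leading dimension is simply n. */
void matmul_mkl(const float *a, const float *b, float *c, int n)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* rows of A, columns of B, inner dimension */
                1.0f, a, n,     /* A and its leading dimension */
                b, n,           /* B and its leading dimension */
                0.0f, c, n);    /* C and its leading dimension */
}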

Test Configurations

Hardware

   System #1

  • System: Preproduction
  • Processor: Intel Xeon processor E5-2699 v4 @ 2.2 GHz
  • Cores: 22
  • Memory: 256 GB DDR4

   System #2

  • System: Preproduction
  • Processor: Intel Xeon Platinum 8180 @ 2.5 GHz
  • Cores: 28
  • Memory: 256 GB DDR4

Software

  • Ubuntu* 16.04 LTS
  • GCC* version 5.4.0
  • Intel MKL 2017

Test Results


Figure 2. Results of different optimized methods on different-sized matrices.

Figure 2 shows that the optimized method using explicit vectorized sdot with loop tiling performed the best on all sizes of matrices. The results of this method will be compared against those of the Intel MKL method.
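
Loop tiling restructures the classic triple loop so that it operates on cache-sized blocks of the matrices, reusing each block many times before it is evicted from cache. A generic sketch of the idea follows (this is not the code from the GitHub sample, and the tile size is only an illustrative value):

/* Generic sketch of tiled (blocked) matrix multiplication c += a * b for
   n x n row-major matrices; c is assumed to be zero-initialized.
   TILE is a tuning parameter chosen so a tile's working set fits in cache. */
void matmul_tiled(const float *a, const float *b, float *c, int n)
{
    const int TILE = 64;
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE) {
                int i_end = ii + TILE < n ? ii + TILE : n;
                int k_end = kk + TILE < n ? kk + TILE : n;
                int j_end = jj + TILE < n ? jj + TILE : n;
                for (int i = ii; i < i_end; i++)
                    for (int k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];
                        for (int j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}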


Figure 3. Results of Intel® MKL on systems equipped with Intel® Xeon® processor E5-2699 v4 and Intel® Xeon® Platinum 8180 processor.

Figure 3 shows the results of the matrix multiplications using the Intel MKL method on systems equipped with the Intel Xeon processor E5-2699 v4 and the Intel Xeon Platinum 8180 processor.


Figure 4. Results with and without Intel® MKL on system equipped with the Intel® Xeon® Platinum 8180 processor.

Figure 4 shows the results of the matrix multiplications with and without the Intel MKL method on a system equipped with the Intel Xeon Platinum 8180 processor.

Conclusion

Intel MKL greatly improves the performance of Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) functions since it takes advantage of special features in the new generation of Intel processors such as Intel® Advanced Vector Extensions 512 that greatly speed up matrix operations. With Intel MKL, you don’t need to modify your source code to take advantage of new features of Intel processors. Just make sure to link the code to the latest version of Intel MKL to automatically detect and make use of new features in Intel Xeon processors.

References

1. Intel® Math Kernel Library

Tuning SIMD vectorization when targeting Intel® Xeon® Processor Scalable Family


Introduction

The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake.

For best possible performance on the Intel Xeon Processor Scalable Family, applications should be compiled with the processor-specific option [Q]xCORE-AVX512 using the Intel® C++ and Fortran compilers. Note that applications built with this option will not run on non-Intel processors or on older processors that do not support these instruction sets.

Alternatively, applications may also be compiled for multiple instruction sets targeting multiple processors; for example, [Q]axCORE-AVX512,CORE-AVX2 generates a fat binary with code paths optimized for both CORE-AVX512 (codenamed Skylake server) and CORE-AVX2 (codenamed Haswell or Broadwell) target processors, along with the default Intel® SSE2 code path. To generate a common binary for the Intel Xeon Processor Scalable Family and the Intel® Xeon Phi™ x200 processor family, applications should be compiled with the option [Q]xCOMMON-AVX512.
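
For example, on Linux* such builds might look like the following (the compiler driver and source file name are illustrative):

    # Fat binary: Skylake server and Haswell/Broadwell optimized code paths, plus the default SSE2 path
    $ icpc -axCORE-AVX512,CORE-AVX2 -O3 -o app main.cpp

    # Common binary for the Intel Xeon Processor Scalable Family and the Intel Xeon Phi x200 family
    $ icpc -xCOMMON-AVX512 -O3 -o app main.cpp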

 

What has changed?

It is important to note that choosing the widest possible vector width, 512-bit on the Intel Xeon Processor Scalable Family, may not always result in the best vectorized code for all loops, especially for loops with low trip-counts commonly seen in non-HPC applications.

Based on a careful study of applications from several domains, it was decided to introduce flexibility in SIMD vectorization for the Intel Xeon Processor Scalable Family: 512-bit ZMM register usage defaults to low and can be tuned to higher usage where beneficial. Developers may use the Intel compilers' optimization reports or Intel® Advisor to understand the SIMD vectorization quality and look for further opportunities.

Starting with the 18.0 and 17.0.5 Intel compilers, a new compiler option [Q/q]opt-zmm-usage=low|high is added to enable a smooth transition from the Intel® Advanced Vector Extensions 2 (Intel® AVX2) with 256-bit wide vector registers to the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with 512-bit wide vector registers. This new option should be used in conjunction with the [Qa]xCORE-AVX512 option.

By default with [Qa]xCORE-AVX512, the Intel compilers will opt for more restrained ZMM register usage which works best for some types of applications. Other types of applications, such as those involving intensive floating-point calculations, may benefit from using the new option [Q/q]opt-zmm-usage=high for more aggressive 512-bit SIMD vectorization using ZMM registers.

 

What to do to achieve higher ZMM register usage for more 512-bit SIMD vectorization?

There are three potential paths to achieve this objective. Here is a trivial example code for demonstration purposes only:

$ cat Loop.cpp
#include <math.h>
void work(double *a, double *b, int size)
{
    #pragma omp simd
    for (int i=0; i < size; i++)
    {
        b[i]=exp(a[i]);
    }
}

 

1. The first option, starting with the 18.0 and 17.0.5 compilers, is to use the new compiler option [Q/q]opt-zmm-usage=high in conjunction with [Qa]xCORE-AVX512 for higher usage of ZMM registers for potentially full 512-bit SIMD vectorization. Using this new option requires no source-code changes, and hence is much easier to use in achieving more aggressive ZMM usage for the entire compilation unit.

Compiling with the default options, the compiler emits a remark suggesting the new option:

    $ icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 4
    …
    remark #15321: Compiler has chosen to target XMM/YMM vector. Try using -qopt-zmm-usage=high to override
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 19.500
    remark #15478: estimated potential speedup: 5.260
    …

Compiling with the newly recommended option, the above remark goes away and the estimated speedup increases for this example, thanks to better SIMD gains with higher ZMM usage:

    $ icpc -c -xCORE-AVX512 -qopt-zmm-usage=high -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 8
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 9.870
    remark #15478: estimated potential speedup: 10.110
    …

2. As an alternative to using this new compiler option, applications may choose to use the simdlen clause with the OpenMP simd construct to specify a higher vector length and achieve 512-bit SIMD vectorization. Note that this type of change is localized to the loop in question and needs to be applied to other such loops as needed, following typical hotspot tuning practices. So, using this path requires modest source-code changes.

Using the simdlen clause we get better SIMD gains for this example:

    #pragma omp simd simdlen(8)
    for (int i=0; i < size; i++) …

    $ icpc -c -xCORE-AVX512 -qopenmp -qopt-report:5 Loop.cpp
    …
    remark #15305: vectorization support: vector length 8
    …
    remark #15476: scalar cost: 107
    remark #15477: vector cost: 9.870
    remark #15478: estimated potential speedup: 10.110
    …
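
In context, the complete modified function from the trivial example above is simply:

    #include <math.h>
    void work(double *a, double *b, int size)
    {
        // simdlen(8) asks for 8 doubles per SIMD iteration, i.e., 512-bit vectors
        #pragma omp simd simdlen(8)
        for (int i=0; i < size; i++)
        {
            b[i]=exp(a[i]);
        }
    }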

3. Applications built with the [Qa]xCOMMON-AVX512 option already get higher ZMM register usage and, therefore, don't need to take any additional action using either of the above two paths. Note, however, that while such applications have the advantage of being able to run on a common set of processors supporting Intel AVX-512, such as the Intel Xeon Processor Scalable Family and the Intel® Xeon Phi™ x200 processor family, they may miss out on the smaller subset of processor-specific Intel AVX-512 instructions not generated with [Qa]xCOMMON-AVX512. Note also that some types of applications may perform better with the default [Q/q]opt-zmm-usage=low option.

 

Conclusion

Developers building compute intensive applications for the Intel Xeon Processor Scalable Family may choose to benefit from higher ZMM register usage for more aggressive 512-bit SIMD vectorization using the options discussed above.

 

The 32-bit wrappers are deprecated in Intel® Compiler 18.0


Version:

  • Intel C++ Compiler 18.0
  • Intel Fortran Compiler 18.0

Operating System: Linux*

Support for the 32-bit wrappers is deprecated in Intel® Compiler 18.0 and will be removed in a future version. A deprecation message is issued when any wrapper program in the "bin/ia32" directory is invoked.

For example:

$ /opt/intel/compilers_and_libraries_2018/linux/bin/ia32/icc
icc: remark #10421: The IA-32 target wrapper binary 'icc' is deprecated.  Please use the compiler startup scripts or the proper Intel(R) 64 compiler binary with the '-m32' option to target the intended architecture

Compiler users need to make changes to their environment so that the 32-bit wrappers are no longer invoked. If the compiler startup scripts are not used, they also need to make sure the -m32 option is provided for generating 32-bit code.
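
For example, on Linux* either of the following avoids the deprecated IA-32 wrappers (installation paths shown are the defaults and may differ on your system):

# Option 1: use the compiler startup script to set up the IA-32 environment
$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh ia32
$ icc -o hello32 hello.c

# Option 2: call the Intel(R) 64 compiler binary directly with -m32
$ /opt/intel/compilers_and_libraries_2018/linux/bin/intel64/icc -m32 -o hello32 hello.c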

Refer to the Intel® Parallel Studio XE 2018 Composer Edition product documentation for additional details.

How to upgrade an existing floating license manager on Linux


As of version 2.5, the Intel® Software License Manager download package for Linux* uses an installer similar to Intel software development tools.  This is a change from older versions which were simply extracted to a folder.  If you already have a pre-2.5 version of the license manager installed and running on your server, follow the steps below to upgrade to version 2.5 or higher.  

Note that starting the license manager may change the port number used by the INTEL vendor daemon.  Be sure to note the port number and exclude it from your firewall if necessary.

To install the new version using the recommended defaults:

  1. Download the latest version of the Intel Software License Manager for Linux package here
    1. Also see the User's Guide for more detailed installation instructions
  2. Extract the package to a temporary folder
  3. Shut down the current license manager (the lmgrd and INTEL daemons should be stopped)
  4. From the new license manager folder, run the installer
    1. From the command-line, run install.sh
    2. From the GUI, run install_GUI.sh
  5. Enter the serial number for your license or path to existing license(s) when prompted
  6. Continue with the installation using the defaults

After the installation completes successfully, the new license manager will be started automatically.  The default installation folder is /opt/intel/licenseserver.  If you entered a serial number, it will create and use the license file in /opt/intel/serverlicenses.

If you prefer to keep the existing installation and update the files manually:

Note that by following these steps you will not be able to use the new uninstall process, and it may cause difficulties with future license manager upgrades.
  1. Download the latest version of the Intel Software License Manager for Linux package here
    1. Also see the User's Guide for more detailed installation instructions
  2. Extract the package to a temporary folder
  3. From the new license manager folder, run the installer as non-root user
  4. Ignore the warning about the lmgrd service running
  5. When prompted for the license, select the option to use the local license file.  Then enter the path to a valid license file.
  6. You may change the installation location in the next step
  7. When starting the installation, it will once again detect that lmgrd is running.  Ignore this message and continue.
  8. You may see an error that it could not create folder /opt/intel/serverlicenses.  You can ignore this message.
  9. After the installation completes, shut down the older license manager processes (lmgrd and INTEL daemons)
  10. Copy the contents of the new licenseserver folder to your existing license manager folder (such as flexlm)
  11. Restart lmgrd

Have questions?

Check out the Licensing FAQ
Please visit our Get Help page for support options.


Face It – The Artificially Intelligent Hairstylist


Abstract

Face It is a mobile application that uses computer vision to acquire data about a user’s facial structure as well as machine learning to determine the user’s face shape. This information is then combined with manually inputted information to give the user a personalized set of hair and beard styles that are guaranteed to make the user look his best. A personalized list of tips is also generated for the user to take into account when getting a haircut.

 

 

1. Introduction

To create this application, various procedures, tools and coding languages were utilized.


Procedures
  • Computer vision with Haar cascade files to detect a person’s face

  • Machine learning, specifically using a convolutional neural network and transfer learning to identify a person’s face shape

  • A preference sorting algorithm to determine what styles look best on a person based on collected data


Programs/Tools Used
  • Ubuntu v17.04

  • Android Studio*

  • Intel® Optimization for TensorFlow*

  • OpenCV

Programming Languages
  • Java
  • Python

 

 

2. Computer Vision

For this application we used Intel’s OpenCV library along with haar cascade files to detect a person’s face.

Haar-like features are digital features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector. [1] A large number of these Haar-like features are put together to detect an object with sufficient accuracy, and the files that describe these combinations are called Haar cascade classifier files. These methods were used and tested in the Viola-Jones object detection framework. [2]

In particular, the frontal face detection file is used to detect the user’s face. This file, along with various other Haar cascade files, can be found here: http://alereimondo.no-ip.org/OpenCV/34.

This library and file were incorporated into our application to ensure that the user’s face is detected, since the main objective is to determine the user’s face shape.
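
As a rough sketch of the detection step (shown here with OpenCV's C++ API for brevity; the application itself calls OpenCV from Java inside the camera2 pipeline, and the cascade file name is illustrative):

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Returns the bounding boxes of any faces found in a single camera frame.
std::vector<cv::Rect> detectFaces(const cv::Mat& frame)
{
    static cv::CascadeClassifier faceCascade("haarcascade_frontalface_default.xml");
    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);   // the cascade operates on grayscale images
    cv::equalizeHist(gray, gray);                    // improve contrast before detection
    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));
    return faces;
}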

Figure 1: Testing out the OpenCV Library as well as the Frontal Face Haar-Cascade Classifier file in real-time.

 

OpenCV was integrated into Android’s camera2 API in order for this real-time processing to occur. An Android device with an API level of 21 or higher is required to run tests and use the application because the camera2 API can only be used by phones at that level or greater.

 

 

 

3. Machine Learning

3.1 Convolutional Neural Networks

For the facial recognition aspect of our application, the process of using machine learning with a convolutional neural network (CNN) was used.

CNNs are very commonly associated with image recognition, and they can be trained with relatively little difficulty. The accuracy of a trained CNN is very high when it comes to correctly classifying an image.

CNN architectures are inspired by biological processes and are variations of multilayer perceptrons designed to require minimal amounts of preprocessing. [3] In a CNN, there are multiple layers that each have distinct functions to help us recognize an image. These layers include a convolutional layer, pooling layer, rectified linear unit (ReLU) layer, fully connected layer and loss layer.

Figure 2: A diagram of a convolutional neural network in action[4]

 

 

  • The Convolutional layer acts as the core of any CNN. The network of a CNN develops a 2-dimensional activation map that detects the spatial position of a feature at all the given spatial positions which are set by the parameters.
  • The Pooling layer acts as a form of down sampling. Max Pooling is the most common implementation of pooling. Max Pooling is ideal when dealing with smaller data sets which is why we are choosing to use it.
  • The ReLU layer is a layer of neurons which applies an activation function to increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolutional layer itself.
  • The Fully Connected layer, which occurs after several convolutional and max pooling layers, does the high-level reasoning in the neural network. Neurons in this layer have connections to all the activations in the previous layers. The activations for the Fully Connected layer are then computed by a matrix multiplication followed by a bias offset.
  • The Loss layer specifies how the network training penalizes the deviation between the predicted and true labels. Softmax loss is the best fit for this application, as it is ideal for predicting a single class among a set of mutually exclusive classes.

 

3.2 Transfer Learning with TensorFlow*

The layers of a CNN can be connected in various different orders and variations. The order depends on what type of data you are using and what kind of results you are trying to get back.

There are various well-known CNN models that have been created and released to the public for research and use. These models include AlexNet[5], which uses two GPUs to train the model with a combination of separate and merged layers. This model was entered in the ImageNet Large Scale Visual Recognition Competition [6] in 2012 and won. Another example is VGGNet[7], a very deep network that uses many convolutional layers in its architecture.

A very popular CNN architecture for image classification is the Inception v3, or GoogLeNet, model created by Google. An earlier version of this model was entered in the ImageNet Large Scale Visual Recognition Competition in 2014 and won.

Figure 3: A diagram of Google’s Inception v3 convolutional neural network model[8]

 

 

As you can see, there are various convolutional, pooling, ReLU, fully connected and loss layers being used in a specific order which will help output extremely accurate results when trying to classify an image.

This model is so well put together that many developers use a method called transfer learning with the Inception v3 model. Transfer learning is a technique that shortens the process of training a model from scratch by taking a fully-trained model from a set of categories like ImageNet and re-training it with the existing weights but for new classes.

Figure 4: Diagram showing the difference between Traditional Machine Learning and Transfer Learning[9]

 

 

To use the process of transfer learning for the application, TensorFlow was used along with a Docker image. This image had all the repositories needed for the process. The Inception v3 model was then loaded into TensorFlow, where we were able to retrain it with the dataset needed for our application to recognize face shapes.

Figure 5: How the Inception v3 model looks during the process of transfer learning[10]

 

During the process of transfer learning, only the last layer of the pre-trained model is dissected and modified. This is where the dataset for our application was inputted to be trained. The model uses all the knowledge it has acquired from the previous data to train on the new data as accurately as possible.

This is the beauty of transfer learning, and it is why using this technique can save so much time while remaining extremely accurate. Through a retraining algorithm, the images within the dataset were passed through the last layer of the model and the model was accurately retrained.

 

 

3.3 Dataset

There are many popular datasets that were created and collected by many individuals to help further the advancement and research of convolutional neural networks. One common dataset used is the MNIST dataset for recognizing handwritten digits.

Figure 6: Example of the MNIST dataset that is used for training and recognizing handwritten digits. [11]

 

This dataset consists of thousands of images of handwritten digits, and people can use it to train and test the accuracy of their own convolutional neural networks. Another popular dataset is the CIFAR-10[12] dataset, which consists of thousands of images of 10 different objects/animals: an airplane, an automobile, a bird, a cat, a deer, a dog, a frog, a horse, a ship and a truck.

It is good to have large amounts of data, but collecting it is very hard, which is why many ready-made collections are available for practice and training.

The objective of our CNN model was to recognize a user’s face shape and in order for it to do so, it was fed various images of people with different face shapes.

The face shapes were categorized into six different shapes: square, round, oval, oblong, diamond and triangular. A folder was created for each face shape and each folder contained various images of people with that certain face shape.

Figure 7: Example of the contents inside the folder for the square face shape

 

These images were gathered from various reliable articles about face shapes and hairstyles. We made sure to collect as accurate data as possible to get the best results. In total we had approximately 100 images of people with each type of face shape within each folder.

Figure 8: Example of a single image saved in the square face shape folder.[13]

 

These images were fed and trained through the model for 4000 iterations (steps) to get maximum accuracy.
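
A retraining run along these lines can be launched with TensorFlow's image-retraining example script (the script path and flags follow the TensorFlow 1.x image_retraining example; the folder names are illustrative):

$ python retrain.py \
    --image_dir ~/face_shapes \
    --how_many_training_steps 4000 \
    --bottleneck_dir /tmp/bottlenecks \
    --output_graph /tmp/retrained_graph.pb \
    --output_labels /tmp/retrained_labels.txt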

While these images were being trained, various bottleneck files were created. A bottleneck caches what the model has computed about an image (the output of the layer just before the final classification layer) so that it does not have to be recalculated every time that image passes through the model.

Figure 9: Various bottlenecks being created while re-training the Inception v3 CNN

 

A few other files are also created, including a retrained graph that contains all the new information needed to recognize the images the model has just been trained on.

This file is fine to use as-is for recognizing images on a computer, but to use it on a mobile device we have to compress it while keeping all the information necessary for it to remain accurate.

In order to do this, we optimize the file down to the size we need by modifying it as follows:

  1. We remove all nodes that aren't needed for a given set of input and output nodes
  2. We fold explicit batch normalization operations into the preceding convolution weights
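
One way to perform these two steps is TensorFlow's optimize_for_inference tool. A command along these lines might be used (the input and output node names shown are those commonly used with retrained Inception v3 graphs and should be treated as assumptions):

$ python -m tensorflow.python.tools.optimize_for_inference \
    --input /tmp/retrained_graph.pb \
    --output /tmp/optimized_graph.pb \
    --input_names Mul \
    --output_names final_result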

After this we are left with two main files that we will load into Android Studio to use with our application.

Figure 10: Files that need to be imported into Android Studio

These files consist of the information needed to identify an image that the model has been trained to recognize once it is seen through a camera.

 

3.4 Accuracy

The accuracy of the retrained model is very important since the face shape being determined should be as accurate as possible for the user.

To have a high level of accuracy we had to make sure that various factors were taken into account. We had to make sure that we had a sufficient amount of images for the model to be trained on. We also had to make sure that the model trained on the images a sufficient amount of iterations.

For the first few trials we were getting a lot of mixed results and the accuracy for a predicted face shape was all over the place. For one image we were getting an 82% accuracy while for another image we were getting a 62% accuracy. This was obviously not good and we wanted to have much more accurate and precise data.

Figure 11: An example of a low accuracy level that we were receiving with our initial dataset.

 

At first we were using approximately 50 images of each face shape but to improve our low accuracy we increased this number to approximately 100 images of each face shape. These images were carefully hand-picked to fit the needs of our application and face shape recognition software. We wanted to reach a benchmark average accuracy of approximately 90%.

Figure 12: An example of a high accuracy level we were receiving after the changes we made with the dataset.

 

After these adjustments we saw a huge difference with our accuracy level and reached the benchmark we were aiming for. When it came time to compress the files necessary for the face shape detection software to work, we made sure that the accuracy level was not affected.

For ease of use by the user, after testing the accuracy levels of our application, we adjusted the code to output only the highest-percentage face shape detected, in a simple and easy-to-read sentence, rather than having various percentages appear on the screen.

 

4. Application Functionality

4.1 User Interface

The user interface of the application consists of three main screens:

  1.  The face detection screen with the front-side camera. This camera screen will appear first so that the user can figure out his face shape right away with no hesitation. After the face shape detector has figured out the user’s face shape, the user can click on the “Preferences” button to go to the next screen.
  2. The next screen is the preferences screen where the user inputs information about himself. The preference screen will ask the user to input certain characteristics about himself including the user’s face shape that he just discovered through the first screen (square, round, oval, oblong, diamond or triangular), the user’s hair texture (straight, wavy or coiled), the user’s hair thickness (thick or thin), if the user has facial hair (yes or no), the acne level of the user (none, moderate, excessive or prefer not to answer), and the lifestyle of the user (business, athlete or student). After the user has selected all of his preferences he can click on the “Get Hairstyles!” button to go to the final screen.
  3. The final output screen is where a list of recommended hair/beard styles along with tips the user can use when getting a haircut will be presented. The preferences that the user selects will go through a sorting algorithm that was created for this application. Afterwards, the user will be able to swipe through the final recommendation screen and be able to choose from various hair/beard styles. An image will complement each style so the user has a better idea of how the style looks. Also a list of tips will be generated so that the user will know what to say to his barber when getting a haircut.

Figure 13: This is a display of all the screens of the application. From left to right: Face shape detection screen, preferences screen, final recommendation screen with tips that the user can swipe through.

The application was meant to have a very simplistic design so we chose very basic complementary colors and a simple logo that got the point of the application across. To integrate our ideas of how the application should look into Android Studio we made sure to create a .png file of our logo and to take down the hex color codes of the colors that we wanted to use. Once we had those, we used Android Studio’s easy-to-use user interface creator and added a layer for the toolbar and a layer for the logo.

 

 

4.2 Preference Sorting Algorithm

The preference screen was organized with six different spinners, one for every preference. Each option for each preference was linked to a specific array full of various different hair/beard styles that fit that one preference.

Figure 14: Snippet of the code used to assign each option of every preference an array of hairstyles.

 

These styles were categorized by doing extensive research on what styles fit every option within each preference. Then these arrays were sorted to find the hairstyles that were in common with every option the user chose.

For example, let’s say the user has a square face shape and straight hair. The hair styles that look good with a square face shape may be a fade, a combover and a crew cut. These three hairstyles would be put into an array for the square face shape option. The hairstyles that look good with straight hair may be a combover, a crew cut and a side part. These three hairstyles would be put into an array for the straight hair option. Now these two arrays would be compared and whatever hairstyles the two arrays have in common would be placed into a new and final array with the personalized hairstyles that look good for the user based on both the face shape and hair type preferences. In this case, the final array would consist of a combover and a crew cut since these are the two hairstyles that both preferences had in common. These hairstyles would then be outputted and recommended to the user.
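
The intersection step itself can be sketched as follows (shown in C++ for brevity; the application implements the same idea in Java, and the style names are only illustrative):

#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Keeps only the styles that appear in both preference arrays.
std::vector<std::string> commonStyles(std::vector<std::string> a,
                                      std::vector<std::string> b)
{
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common;
}

// Example from the text: commonStyles({"fade", "combover", "crew cut"},
// {"combover", "crew cut", "side part"}) yields {"combover", "crew cut"}.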

Figure 15: Snippet of the code used to compare the six different preference arrays so that one final personalized array of hairstyles can be formed.

 

Once the final list of hairstyles is created, an array of images is created to match the same hairstyles in the final list and this array of images is used to create a gallery of personalized hairstyles that the user can swipe through and see what he likes and what he doesn’t like.

In addition, a list of tips is outputted for the user to view and take into consideration. These tips are based on the preferences the user selected. For example, if the user selected excessive acne, a tip may be to go for a long hair style to keep the acne slightly hidden. These tips are generated by various if-statements and outputted on the final screen. Since this application cannot control every aspect of a user’s haircut, we are hoping that these tips will be taken into consideration by the user and hopefully used when describing to the barber what type of haircut the user is looking for.

Figure 16: An example of how the outputted tips would look for the user once he selects his preferences.

 

5. Programs and Hardware

5.1 Intel Optimized TensorFlow

TensorFlow was a key framework that made it possible for us to train our model and have our application actually detect a user’s face shape.

TensorFlow was installed onto the Linux operating system, Ubuntu by following this tutorial:
https://www.tensorflow.org/install/install_linux

Intel’s TensorFlow optimizations were installed by following this tutorial:
https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Intel has optimized the TensorFlow framework in various ways to help improve the results of training a neural network and using TensorFlow in general. They have made many modifications to help people use CPUs for this process through operations optimized with the Intel® Math Kernel Library (Intel® MKL). They have also developed a custom pool allocator and a faster way to perform back propagation to further improve results.

After all this had been installed, Python was used to write commands to facilitate with the transfer learning process and to re-train the convolutional neural network.

 

5.2 Android Studio

Android Studio is the main development kit used to create the application and make it come to life. Since both TensorFlow and Android are run under Google, they had various detailed tutorials explaining how to combine the trained data from TensorFlow and integrate it with Android Studio. [14] This made the process very simple as long as the instructions were followed.

Figure 17: Snippet of code that shows how the viewPager is used for sliding through various images

 

Android Studio also made it simple to create basic .xml files for the application. These .xml files were very customizable and allowed the original mock-ups of the application to come to life and take form. When creating these .xml files we were sure to click on the option to “infer constraints.” Without this option being checked, the various displays such as the text-view box or the spinners would be in random positions when the application is fully built. We also wanted the application to run very smoothly. Tutorials on how to connect two activities together[15] and how to create a ViewPager image gallery[16] were used to help make the application easy to use and smooth.

Figure 18: An example of inferring constraints to make sure everything appears properly during the full build.

 

5.3 Mobile Device

A countless number of tests were required to make sure certain parts of the code were working whenever a new feature was added to the application. These tests were done on an actual Android smartphone that was given to us by Intel.

The camera2 API that is used for this application requires an Android phone with an API level of 21 or higher, or of version 5.1 or higher, so we used a phone model with an API level of 23. Though the camera was slow at times, the overall functionality of this device was great.

Whenever a slight modification was done to the code for this application, a full build and test was always done on this smartphone to ensure that the application was still running smoothly.

Figure 19: The Android phone we used with an API level of 23. You can see the Face It application logo in the center of the screen.

 

 

6. Summary and Future Work

Using various procedures, programs, tools and languages, we were able to form an application that uses computer vision to acquire data about a user’s facial structure and machine learning, specifically transfer learning, to detect a person’s face shape. We then put this information as well as user-inputted information through a preference sorting algorithm to output a personalized gallery of hairstyles for the user to view and choose from, as well as personalized tips the user can tell his barber when getting a haircut or take into consideration when styling or growing out his hair.

There is always room for improvement and we definitely plan to improve many aspects of this application including even more accurate face shape detection results, an even cleaner looking user interface and many more hair and beard styles for the user to choose and select from.

 

Acknowledgements

I would like to personally thank the Intel Student Ambassador Program for AI for supporting us through the creation of this application and for the motivation to keep on adding to it. I would also like to thank Intel for providing us with the proper hardware and software that was necessary for us to create and test the application.

 

Online References

[1]   https://en.wikipedia.org/wiki/Haar-like_features

[2]   https://www.cs.ubc.ca/~lowe/425/slides/13-ViolaJones.pdf

[3]   https://en.wikipedia.org/wiki/Convolutional_neural_network

[4]   https://www.mathworks.com/help/nnet/convolutional-neural-networks.html

[5]   https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

[6]   http://www.image-net.org/

[7]   http://www.robots.ox.ac.uk/~vgg/research/very_deep/

[8]   https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html

[9]   https://www.slideshare.net/hunkim/transfer-defectlearningnew-completed

[10]  https://medium.com/@vinayakvarrier/significance-of-transfer-learning-in-the-domain-space-of-artificial-intelligence-1ebd7a1298f2

[11]  http://yann.lecun.com/exdb/mnist/

[12]  https://www.cs.toronto.edu/~kriz/cifar.html

[13]  http://shaverguru.com/finding-a-great-beard-style-for-your-face/

[14]  https://www.tensorflow.org/deploy/distributed

[15]  https://developer.android.com/training/basics/firstapp/starting-activity.html

[16]  https://developer.android.com/training/animation/screen-slide.html

 

Read the previous Week 6 Update or continue with the Week 10 Update

 

Usage of Intel® Performance Libraries with Intel® C++ Compiler in Visual Studio


 

Affected products: Intel® Parallel Studio XE 2016, 2017 or 2018; Intel® System Studio 2016, 2017
Visual Studio versions: all supported Visual Studio, see Intel C++ Compiler Release Notes for details.

Compilation of an application that uses the Intel® Performance Libraries with the Intel® C++ Compiler fails in Microsoft* Visual Studio and produces warnings like:
“Could not expand [MKL|TBB|DAAL|IPP] ProductDir variable. The registry information may be incorrect."

There can be two root causes:

  1. The library was not installed with the selected version of Intel® C++ Compiler.
    The “Use Intel® MKL”, “Use Intel® DAAL”, “Use Intel® IPP” and “Use Intel® TBB” properties in Visual Studio mimic the behavior of the /Qmkl, /Qdaal, /Qipp and /Qtbb compiler options: they set up the include and library paths to the performance library installed together with the selected Intel® C++ Compiler.
    To fix the compilation, please install the necessary performance libraries (Intel MKL, Intel DAAL, Intel IPP, and/or Intel TBB) from the same package from which the selected version of the Intel® C++ Compiler was installed.
    If you need to use different versions of the Intel® Performance Libraries with the Intel® C++ Compiler in Visual Studio, then instead of using the “Use Intel® MKL”, “Use Intel® DAAL”, “Use Intel® IPP” and “Use Intel® TBB” properties, manually specify the paths to the headers of the performance library in
    “Project” > “Properties” > “VC++ Directories” and the libraries in
    “Project” > “Properties” > “Linker” > “Input” > “Additional Dependencies”.
    For more information on correct paths and list of libraries, see the Intel® Math Kernel Library, Intel® DAAL, Intel® Integrated Performance Primitives, and Intel® Threading Building Blocks documentation.
  2. Installation failed and registry is incorrect
    Workaround: Repair the Intel® Parallel Studio XE/Intel® System Studio installation. If it still does not work, please report the issue to the Intel Online Service Center.

Introducing Batch GEMM Operations


The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently. When multiplying small matrices, however, individual GEMM calls may not optimally use all the cores. Developers wanting to improve utilization usually batch multiple independent small GEMM operations into a group and then spawn multiple threads for different GEMM instances within the group. While this is a classic example of an embarrassingly parallel approach, making it run optimally requires a significant programming effort that involves threads creation/termination, synchronization, and load balancing. That is, until now. 

Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM". This allows users to achieve the same objective described above with minimal programming effort. Users can specify multiple independent GEMM operations, which can be of different matrix sizes and different parameters, through a single call to the "Batch GEMM" API. At runtime, Intel MKL will intelligently execute all of the matrix multiplications so as to optimize overall performance. Here is an example that shows how "Batch GEMM" works:

Example

Let A0, A1 be two real double precision 4x4 matrices; Let B0, B1 be two real double precision 8x4 matrices. We'd like to perform these operations:

C0 = 1.0 * A0 * B0^T, and C1 = 1.0 * A1 * B1^T

where C0 and C1 are two real double precision 4x8 result matrices. 

Again, let X0, X1 be two real double precision 3x6 matrices; Let Y0, Y1 be another two real double precision 3x6 matrices. We'd like to perform these operations:

Z0 = 1.0 * X0 * Y0^T + 2.0 * Z0, and Z1 = 1.0 * X1 * Y1^T + 2.0 * Z1

where Z0 and Z1 are two real double precision 3x3 result matrices.

We could accomplish these multiplications using four individual calls to the standard DGEMM API. Instead, here we use a single "Batch GEMM" call for the same computations, with potentially improved overall performance. We illustrate this using the "cblas_dgemm_batch" function as an example below.

#define    GRP_COUNT    2

MKL_INT    m[GRP_COUNT] = {4, 3};
MKL_INT    k[GRP_COUNT] = {4, 6};
MKL_INT    n[GRP_COUNT] = {8, 3};

MKL_INT    lda[GRP_COUNT] = {4, 6};
MKL_INT    ldb[GRP_COUNT] = {4, 6};
MKL_INT    ldc[GRP_COUNT] = {8, 3};

CBLAS_TRANSPOSE    transA[GRP_COUNT] = {CblasNoTrans, CblasNoTrans};
CBLAS_TRANSPOSE    transB[GRP_COUNT] = {CblasTrans, CblasTrans};

double    alpha[GRP_COUNT] = {1.0, 1.0};
double    beta[GRP_COUNT] = {0.0, 2.0};

MKL_INT    size_per_grp[GRP_COUNT] = {2, 2};

// Total number of multiplications: 4
double    *a_array[4], *b_array[4], *c_array[4];
a_array[0] = A0, b_array[0] = B0, c_array[0] = C0;
a_array[1] = A1, b_array[1] = B1, c_array[1] = C1;
a_array[2] = X0, b_array[2] = Y0, c_array[2] = Z0;
a_array[3] = X1, b_array[3] = Y1, c_array[3] = Z1;

// Call cblas_dgemm_batch
cblas_dgemm_batch (
        CblasRowMajor,
        transA,
        transB,
        m,
        n,
        k,
        alpha,
        a_array,
        lda,
        b_array,
        ldb,
        beta,
        c_array,
        ldc,
        GRP_COUNT,
        size_per_grp);



The "Batch GEMM" interface resembles the GEMM interface. It is simply a matter of passing arguments as arrays of pointers to matrices and parameters, instead of as matrices and the parameters themselves. We see that it is possible to batch the multiplications of different shapes and parameters by packaging them into groups. Each group consists of multiplications of the same matrices shape (same m, n, and k) and the same parameters. 

Performance

While this example does not show performance advantages of "Batch GEMM", when you have thousands of independent small matrix multiplications then the advantages of "Batch GEMM" become apparent. The chart below shows the performance of 11K small matrix multiplications with various sizes using "Batch GEMM" and the standard GEMM, respectively. The benchmark was run on a 28-core Intel Xeon processor (Haswell). The performance metric is Gflops, and higher bars mean higher performance or a faster solution.

The second chart shows the same benchmark running on a 61-core Intel Xeon Phi co-processor (KNC). Because "Batch GEMM" is able to exploit parallelism using many concurrent multiple threads, its advantages are more evident on architectures with a larger core count. 

Summary

This article introduces the new API for batch computation of matrix-matrix multiplications. It is an ideal solution when many small independent matrix multiplications need to be performed. "Batch GEMM" supports all precision types (S/D/C/Z). It has Fortran 77 and Fortran 95 APIs, and also CBLAS bindings. It is available in Intel MKL 11.3 Beta and later releases. Refer to the reference manual for additional documentation.  


Boost Quality and Performance of Media Applications with the Latest Intel HEVC Encoder/Decoder


by Terry Deem, product marketing engineer, Intel Developer Products Division 

 

Media and video application developers can tune for even more brilliant visual quality and fast performance with new HEVC technology inside the just-released Intel® Media Server Studio Professional Edition (2017 R3). In this new edition, enhancements to the key analysis tools provide better and deeper insights into application performance characteristics, so developers can save time targeting and fixing optimization areas. The Intel Media Server Studio Professional Edition includes easy-to-use video encoding and decoding APIs, and visual quality and performance analysis tools that help media applications deliver higher resolutions and frame rates.

What's New

For this release, Intel improved the visual quality of its HEVC encoder and decoder. Developers can optimize for subjective video quality gains of 10% by using Intel’s HEVC encoder combined with Intel® Processor Graphics (GPU). Developers can also optimize for subjective quality gains of 5% by using the software-only HEVC encoder. Other feature enhancements include:

  • Improved compression efficiency for frequent Key Frames when using TU 1 / 2 encoding.
  • Support for MaxFrameSize for constrained variable bitrate encoding, which results in better quality streaming.
  • Advanced user-defined bitrate control. The software and GPU-accelerated HEVC encoders expose distortion metrics that can be used in concert with recently introduced external bitrate controls to support customized rate control for demanding customers.

Intel(r) Media Server Studio 2017 R3 - Professional Edition benchmark

Benchmark Source: Intel Corporation. See below for additional notes and disclaimers.1

 

Optimize Performance and More with Super Component Tools Inside

Intel® VTune™ Amplifier

The Intel Media Server Studio Professional Edition includes several other tools that help developers understand their code and take advantage of the advanced Intel on-chip GPU. One of the more powerful analysis tools is Intel® VTune™ Amplifier. This tool allows for GPU in-kernel profiling, which helps developers quickly find memory latency of inefficient kernel algorithms. This analysis tool also provides a GPU Hotspots Summary view, which includes histograms of Packet Queue Depth and Packet Duration for analyzing DMA packet execution.

With Intel® VTune™ Amplifier, developers can also:

  • Detect GPU stalled/idle issues to improve application performance.
  • Find GPU hotspots for determining compute bound tasks hindered by GPU L3 bandwidth or DRAM bandwidth.

 

Intel® SDK for OpenCL™ Applications

Intel® SDK for OpenCL™ Applications (2017 version) is also included in this release of Intel Media Server Studio Professional Edition. This SDK adds OpenCL support for additional operating systems and platforms. It's fully compatible on Ubuntu* 16.04 with the latest OpenCL™ 2.0 CPU/GPU driver package for Linux* OS (SRB5).

 

About Intel Media Server Studio

Intel Media Server Studio allows video delivery and cloud graphics solution providers to quickly create, optimize, and debug media applications, and to enable real-time video transcoding in media delivery, cloud gaming, remote desktop, and other solutions targeting the latest Intel® Xeon® and Intel® Core™ processor-based platforms.**

 

More Resources

 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Benchmark Source: Intel Corporation.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice revision #20110804. 

**Specific technical specifications apply. Review product site for more details.

Why won't the Intel® Software License Manager start on Windows*?


Problem:

There is a known issue that may prevent the latest version of the Intel® Software License Manager from starting in Windows*.  Although the license manager process (lmgrd) may be able to start during installation, you may get an error when trying to restart using the ilmconfig.exe or lmtools.exe utilities.  This is the error you will see:

License manager log error

Solution:

Although the error message implies that the problem is with the license file, this is not the case. The ilmconfig.exe and lmtools.exe utilities may try to write to a log file in a folder without write permission for users. By default, this file is in the Intel Software License Manager installation folder (C:\Program Files\Intel\licenseserver\Iflexlm.log), as defined in the lmtools.exe Config Services tab:

Lmtools configuration

The path to the debug log file must be set to a folder that has user write permission, as it does not use administrator privileges.  You can either change this value, or modify the permissions on the existing folder\file.

 

Have questions?

Check out the Licensing FAQ
Please visit our Get Help page for support options.

 

From UAV to VR in No Time: Autodesk® ReCap™ Photo Quickly Renders 3D Structures Thanks to Multicore Scalability


Ultimate Performance in VR Visualizations Powered by Intel® Xeon® Scalable Processors and Intel® Xeon® W Processors

Imagine snapping hundreds of photos with an unmanned aerial vehicle (UAV) as you guide it around a construction site. Imagine, afterwards, taking these new photos of the site and quickly building them into a photorealistic 3D model. This 3D model of the site could help you visualize a new, finished structure in that same location. No more drafting or surveying would be needed, and you’d have quick access to the context surrounding a planned building. Designing new construction on such a site could be made more efficient and effective than ever before.

This process describes the magic of photogrammetry as performed with the help of a UAV and Autodesk® ReCap™ Photo. Photogrammetry is the practice of using photographs to measure and model physical objects and scenes. Autodesk ReCap Photo makes use of the power of high-core-count Intel® Xeon® processors and cloud computing to push photogrammetry to a new level of ease, speed, realism, and accessibility. Consisting of a desktop app and a photo-to-3D web service, Autodesk ReCap Photo quickly converts photographs shot around a scene—normally with UAVs—into high-resolution 3D models that can be used in architecture, engineering, and construction. The models can be imported for use in engineering and design applications (such as Autodesk® Revit®, Autodesk® AutoCAD® Civil 3D®, or Autodesk® InfraWorks®), or they can be used for visualizations based in on-screen animations or virtual reality (VR), where the power of Intel Xeon processors delivers a seamless, high-quality experience.


Figure 1. A UAV performing reality capture by taking hundreds of photographs around a building

“Autodesk® ReCap™ Photo scales extremely well on high performance multi-core Intel® Xeon® processors to generate highly detailed and accurate 3D models. It provides smart features with smooth interaction in spite of the massive data generated by the reality capture sensors on the UAVs.”

— Murali Pappoppula
Senior Software Development Manager, Autodesk

Challenges with Photogrammetry and 3D Rendering

The advantages of using digital 3D models for design and engineering have long been understood. Using photographs to accurately create 3D models, however, has traditionally presented certain practical challenges, especially for large-scale scenes. One common hurdle has been the difficulty of taking enough photographs, at the proper angles required for photogrammetry, while remaining within budget. The increasingly widespread availability of affordable UAVs, however, has made the cost of gathering photos suitable for 3D modeling more approachable for many companies.

Another significant challenge for architecture, engineering, and construction companies has been the length of time required to process photos into a usable 3D model. This problem arises because the next step of the 3D modeling workflow—the model-computing phase—is especially CPU-intensive. During this stage, many complex calculations are used to extract 3D information (based on camera location, orientation, focal length, and distortion) from the overlapping portions of hundreds of photos. Hundreds of thousands of additional calculations can then be performed to complete the triangulations and correlations required to generate an accurate 3D representation of a captured scene or object, including its surface, elevation, volume, and texture.

On the typical desktop computers used in small- or medium-sized architecture, engineering, and construction companies, this processing can take many hours or even days. Companies who attempt to use insufficiently powerful computers might even witness their systems hanging indefinitely while a series of photographs is processed.

Powerful Workstations and Scalability in the Cloud Bring Fast Processing to All

Using a workstation with a high number of CPU cores is not always sufficient to overcome the lack of speed involved in creating a 3D model from photographs. After all, application performance does not automatically improve on multicore systems. Applications need to be specifically designed to take advantage of multicore processing. Fortunately, Autodesk ReCap Photo is just such an application, having been painstakingly engineered to scale well on high-core-count Intel Xeon processors.

One further hurdle remains for some companies, however, before they can enjoy the benefits of a 3D model built through reality capture. Businesses typically need to invest in powerful workstations that include many CPU cores to dramatically reduce the amount of time required to perform operations that scale well on multicore CPUs, and not every firm is prepared to make that kind of an investment. To solve this problem and bring photo-to-3D modeling within the reach of all construction, design, engineering, and architectural companies, Autodesk has made the photogrammetry process of Autodesk ReCap Photo available as a cloud service backed by multicore Intel Xeon processors.

Once the 3D model is created through the cloud service, other functions, such as visualization, editing, and exporting, are performed locally on the workstation.

Figure 2. With Autodesk® ReCap™ Photo, 3D models are built from photographs through cloud-based resources, but other functions rely on local resources

Different workstation configurations should be used to optimize performance for different design needs, as shown in Table 1.

Table 1. Suggested systems for various design needs

Workflow | Autodesk® Applications | Suggested Intel® Workstations

Entry-level computer-aided design (CAD) modeling, design, and simple models

Autodesk® AutoCAD®, Autodesk® Revit®

  • Intel® Xeon® processor E3 family–based system
  • CPU: 4 cores
  • Professional graphics card
  • 16–32 GB RAM
  • Intel Serial ATA (SATA) solid-state drive (SSD)

Complex CAD models, and VR visualizations

Autodesk Revit,
Autodesk® Inventor®,
Autodesk® Stingray,
Autodesk® AutoCAD® Civil 3D®

  • One Intel® Xeon® W processor or Intel® Xeon® Gold processor
  • CPU: 8–10 cores
  • Professional graphics card
  • 32–64 GB memory
  • Intel® SSD 750 Series (with Peripheral Component Interconnect Express* [PCIe*])

Simulations, rendering, photogrammetry, and VR design

Autodesk® Maya®,
Autodesk® 3ds Max®,
Autodesk Stingray, Autodesk® ReCap™ Photo

  • 2x Intel Xeon Gold processors
  • CPU: 14+ cores per processor (28 cores total)
  • Professional graphics card
  • Memory: 64–128 GB DDR4
  • Intel SSD 750 Series with PCIe

The Power of Intel® Xeon® Processor–based High-Performance Computing in the Cloud

The Autodesk® ReCap™ Photo photogrammetry process runs on cloud-based virtual server instances powered by the Intel® Xeon® processor E5 v4 family. Small jobs run on eight virtual cores while large jobs run on 32 virtual cores. The high core counts of the Intel Xeon processors that power this process allow the service to build 3D models much faster than would be possible on a two- or four-core workstation, with all other factors being equal (such as memory and graphics processing unit [GPU]). The time savings made possible by the high-core-count Intel Xeon processors are then passed along to designers, architects, and engineers, making the employees and their firms more productive.

Virtual Reality Visualizations on Intel Xeon Scalable Processors

Starting with photos taken by UAVs, Autodesk ReCap Photo can quickly create 3D models that can be exported into Autodesk® Stingray to create VR visualizations. Viewers can then use a head-mounted display (HMD) to step into the 3D model after the visualization is complete and experience the entire visual environment with six degrees of freedom (DOF), as if those viewers were at the original scene.

VR represents a compelling visualization method that is growing fast in popularity. A high-quality VR experience, however, requires heavy local processing power. 3D models created from UAV photos consist of high-density polygons, which are needed to preserve the models' lifelike appearance. The same is true for VR architectural visualizations created from Autodesk Revit and Autodesk® Revit LT™ models through the Autodesk Revit Live service. A powerful system is required to perform the fast calculations needed to simulate the lifelike 3D environment without latency as the viewer moves around in that environment.

Servers and workstations based on the latest Intel Xeon Scalable platform are excellent candidates for performing these high-quality VR visualizations through an HMD. Intel Xeon Scalable processor–based servers and workstations can pack up to 28 cores and 56 threads in a single processor and support up to 1.5 TB of DDR4 memory. In addition, all systems based on Intel Xeon Scalable processors include 48 PCIe 3.0* lanes per CPU, allowing you to add many high-performance graphics cards to a single system.

Autodesk ReCap Photo on High-Performance Workstations Results in Faster Performance

Benchmark testing performed by engineers at Intel and Autodesk demonstrates the scalability of Autodesk ReCap Photo on Intel Xeon processors on a local workstation. By upgrading to the latest workstation with SSD storage, users can experience both faster load and processing times. Creating a 3D model from 164 test photos was found to be about 44 percent faster when performed on the latest Intel Xeon Scalable processor–based workstation, compared to a 3-year-old platform.1


Figure 3. Testing on a small dataset revealed a 44 percent speed improvement in rendering 3D photos when upgrading from a 3-year-old system (at top) to a workstation based on the latest server platform (at bottom, shorter bar is better)1


Figure 4. Testing on a large dataset revealed a 45 percent speed improvement in rendering 3D photos when upgrading from a 3-year-old system (at top) to a workstation based on the latest server platform (at bottom, shorter bar is better)1

Other testing showed the power of additional cores to boost performance. On a larger dataset, the most processor-intensive phase of the photogrammetry process—the texture phase—was completed 16.5 percent faster on a 28-core Intel Xeon processor–based system than on a 20-core Intel Xeon processor–based system.2


Figure 5. Completion times for the most processor-intensive phase of rendering (the texturing phase) showed a 16.5-percent improvement on the higher-core-count Intel® Xeon® processors (a lower bar is better)2

Storage is another important factor for performance on local workstations. Users can see significant load and save time improvements for 3D models by upgrading from spinning hard disk drives to Intel® SATA Solid State Drives and Intel® SSD 750 series with PCIe* NVM Express* (NVMe*).

UAVs and Autodesk ReCap Photo Lead to Better Design

Creating 3D models with a UAV and Autodesk ReCap Photo provides key advantages in engineering, construction, and design, advantages that are accessible to all businesses. First, the procedure saves time and money. It provides a fast and easy way to create an accurate 3D digital representation of a scene, sparing the expense of the many hours of engineering time that would otherwise be needed to build such a model from survey information. Perhaps more importantly, using a UAV with Autodesk ReCap Photo to quickly create a 3D model can lead to better designs and construction. With access to 3D models of sites now readily available, companies can design new buildings with an accurate representation of the surrounding context throughout the entire process.

Assisting Construction

Autodesk ReCap Photo is most often used to capture an outdoor scene before or after a change. For use cases such as conditions and contexts around buildings, ships, construction sites, quarries, mines, and infrastructure projects, companies can create multiple 3D models to compare construction progress as-built with the plans as originally designed. Backed by high-core-count Intel Xeon processors in the cloud, these useful 3D models can be created in an efficient way without interrupting workflow.


Figure 6. A mesh-based model created in Autodesk® ReCap™ Photo using photos taken by a UAV

Companies can also use Autodesk ReCap Photo to assist with design and construction by supplementing laser scanning. With laser scanning, a 3D model of a scene is created without photogrammetry; it instead uses lasers to measure distances from the scanner. Autodesk ReCap Photo can provide data that complements laser scans in areas that are difficult to reach with a laser scanner, such as rooftops and building facades.


Figure 7. 3D models created in Autodesk® ReCap™ Photo can be imported into other Autodesk programs and used to assist with design and engineering

Better Visualizations

Designers can export 3D models created in Autodesk ReCap Photo to other Autodesk applications. For example, as already described, they could help provide VR visualizations through Autodesk Stingray. But a 3D model can also be used to create an animation in the Autodesk® 3ds Max® application. In both cases, architects, designers, and construction firms can make use of the power of the Autodesk ecosystem to better test their own design visions and better communicate those visions to their customers.

Local Workstation Recommendations for 3D Modeling

The compute-intensive 3D model reconstruction phase of photogrammetry is performed in the cloud, which scales with Intel® Xeon® processor cores. To optimize performance of all other tasks in the 3D modeling workflow, it is recommended that you use a workstation based on an 8-to-10 core Intel Xeon W processor or Intel Xeon Gold processor with higher frequencies (4+ GHz), along with fast PCIe* or SATA Intel® SSD storage.

Conclusion: Multicore Intel Xeon Processors Accelerate Photogrammetry in Autodesk ReCap Photo

As proven by the test results, Autodesk ReCap Photo delivers quickly processed 3D models from multiple photos taken by UAVs around a scene or structure. The speed with which this operation is performed is a result of the application’s scaling efficiency on Intel Xeon processors, a feature for which the application code was specifically designed. And because Autodesk delivers the photogrammetry process of Autodesk ReCap Photo through a cloud-based service running on high-core-count Intel Xeon processor–based servers, this processing speed and efficiency is now available to businesses of all sizes. To provide high-quality visualizations of these models in VR, businesses will profit from the most powerful Intel Xeon processor–based workstations and servers, such as those based on Intel Xeon W processors and the latest Intel Xeon Scalable processors.

Creating 3D models of an outdoor scene for use in architecture, engineering, and construction is now easier and faster than ever with the help of UAVs, Autodesk ReCap Photo, and high-core-count Intel Xeon processors. Businesses in these sectors can take advantage of the new speed, ease, and availability of 3D modeling based on photogrammetry to meet more deadlines, speed construction time, improve design through better visualizations, and gain competitive advantage.

To learn more about Autodesk ReCap Photo, visit autodesk.com/products/recap/overview.

To learn more about Intel Xeon processors and workstations, visit intel.com/content/www/us/en/products/processors/xeon.html and Intel.com/workstations.

1Testing conducted by Intel and Autodesk in 2017. Configurations: Baseline: 2x Intel® Xeon® processor E5-2697 v3 (28 cores, 2.6 GHz), a 1.2 TB Intel® SSD 750 Series drive with NVMe*, and 2x NVIDIA Quadro P6000*. Newer system: 2x Intel Xeon processor E5-2640 v4 (20 cores, 2.4 GHz), an ATA SAMSUNG SSD SM87* SCSI disk, 128 GB RAM, and an NVIDIA GeForce GTX Titan X*. Newest system: Intel Xeon processor W-2156 (8 cores, 3.7 GHz).

2Testing conducted by Intel and Autodesk in 2017. 20-core system: 2x Intel Xeon processor E5-2640 v4 (20 cores, 2.4 GHz), an ATA SAMSUNG SSD SM87* SCSI disk, 128 GB RAM, and an NVIDIA GeForce GTX Titan X*. 28-core system: 2x Intel® Xeon® processor E5-2697 v3 (28 cores, 2.6 GHz), a 1.2 TB Intel® SSD 750 Series drive with NVMe*, and 2x NVIDIA Quadro P6000*.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit intel.com/benchmarks.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Consult with a legal expert prior to operating any UAV. Always comply with Federal Aviation Administration regulations or other applicable laws and regulations in your country.

Autodesk, ReCap, Revit, Revit LT, InfraWorks, AutoCAD, Civil 3D, Inventor, Maya, 3ds Max and the Autodesk logo are registered trademarks or trademarks of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. Autodesk reserves the right to alter product offerings and specifications at any time without notice, and is not responsible for typographical or graphical errors that may appear in this document.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

Gentle Introduction to PyDaal: Vol 2 of 3 Basic Operations on Numeric Tables


A wide range of classes is available in the Intel® Data Analytics Acceleration Library (Intel® DAAL) to create a numeric table accommodating various data layouts, dtypes, and frequent access methods. Volume 1 of this series covers numeric table creation under different scenarios. Once a table is created, DAAL provides operational methods for visualizing and mutating it; Volume 2 covers the usage of these operational methods. Volume 3 in this series subsequently gives a brief introduction to the Algorithms section of PyDAAL. Table 1 can be used as a quick reference for basic operations on DAAL's numeric tables.

Volumes in Gentle Introduction Series

  • Vol 1: Data Structures - Covers introduction to the Data Management component of Intel® DAAL and the available custom Data Structures (Numeric Table and Data Dictionary), with code examples.
  • Vol 2: Basic Operations on Numeric Tables - Covers introduction to possible operations that can be performed on Intel® DAAL's custom Data Structure (Numeric Table and Data Dictionary) with code examples.
  • Vol 3: Analytics Model Building and Deployment - Covers introduction to analytics model building and evaluation in Intel® DAAL with serialized deployment and distributed model fitting on large datasets.

Table 1. Quick reference table on available methods

  • *Print a numeric table as stored in memory to show the data layout (requires nT as the input argument): printNumericTable(nT)
  • *Quick visualization of multiple numeric tables: printNumericTables(nT1, nT2)
  • Check the shape of a numeric table: nT.getNumberOfRows() and nT.getNumberOfColumns()
  • Allocate a buffer to load a block of the numeric table for access and manipulation operations: block = BlockDescriptor_Float64() (allocates a memory block with double dtype)
  • Retrieve a block of rows or columns from the numeric table into the BlockDescriptor for visualization (setting rwflag to 'readOnly' enables read-only access to the buffer): nT.getBlockOfColumnValues(colIndex, firstRowIndex, lastRowIndex, rwflag, block) or nT.getBlockOfRows(firstRowIndex, lastRowIndex, rwflag, block)
  • Extract a numpy array from a BlockDescriptor object loaded with a block of values: block.getArray()
  • Release a block of rows from the buffer: nT.releaseBlockOfRows(block)
  • *Print the underlying array of a numeric table (requires an np.array as the input argument): printArray(block.getArray(), num_printed_cols, num_printed_rows, num_cols, message)
  • Check the feature type of each column in the numeric table's data dictionary: dict[colIndex].featureType

* denotes functions included in the ‘utils’ folder, which can be found in <install_root>/share/pydaal_examples/examples/python/source/.

Different phases of Numeric Table life cycle

1. Initiate

Let’s begin by constructing a numeric table (nT) directly from a NumPy array. We will use nT throughout the rest of the code examples in this volume.

import numpy as np
from daal.data_management import HomogenNumericTable
array =np.array([[1,2,3,4],
                [5,6,7,8]])
nT= HomogenNumericTable(array)
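
A quick inspection sketch (added for illustration; it assumes the nT built above is in scope and mirrors several entries from Table 1):

from daal.data_management import BlockDescriptor, readOnly

print(nT.getNumberOfRows())      # 2
print(nT.getNumberOfColumns())   # 4

#Read all rows into a buffer and view them as a numpy array
block = BlockDescriptor()
nT.getBlockOfRows(0, nT.getNumberOfRows(), readOnly, block)
print(block.getArray())          # expected: the 2x4 array [[1. 2. 3. 4.], [5. 6. 7. 8.]]
nT.releaseBlockOfRows(block)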

2. Operate

Once initialized, numeric tables provide various classes and member functions to access and manipulate data similar to a pandas DataFrame. We will dive next into DAAL’s operational methods, after an important note about DAAL’s bookkeeping object called Data Dictionary.

Data Dictionary:

As mentioned in Vol 1 of this series on the creation of Intel® DAAL numeric tables (link), these custom data structures must be accompanied by a Data Dictionary to perform operations. When raw data streams into memory to populate the numeric table structure, the table’s Data Dictionary concurrently records metadata. The dictionary is created automatically unless the user specifies that it should not be allocated. Various Data Dictionary methods are available to access and manipulate feature types, dtypes, etc. If a user creates a numeric table without memory allocation, the Data Dictionary values have to be set explicitly with feature types. An important note: DAAL’s Data Dictionary is a custom data structure, not a Python dictionary.

More details on working with Intel DAAL Data Dictionaries
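
As a brief illustration (the dictionary methods are covered in detail in section 2.1, item 6), the dictionary of the nT constructed in section 1 can be fetched and inspected directly:

dict = nT.getDictionary()    # Data Dictionary created automatically for nT
print(dict[0].featureType)   # feature type code of the first column (2 denotes Continuous)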

2.1 Data Mutation in Numeric Table:

  1. Standardization and Normalization:

    Data analysis work is usually preceded by a data preprocessing stage that includes data wrangling and quality checks to handle null values, outliers, etc. An important preprocessing activity is normalizing the input data. Intel® DAAL offers routines supporting two popular normalization techniques on numeric tables: Z-score standardization and Min-Max normalization.

    Currently, DAAL only supports rescaling for descriptive analytics. In the future, support will be added for predictive analytics with the addition of a “transform()” method to be applied to new data.

    • Z-score Standardization: Rescales numeric table values feature-wise to the number of standard deviations each value deviates from the mean. Below are the steps to use DAAL’s z-score standardization.

      import daal.algorithms.normalization.zscore as zscore
      
      # Create an algorithm
      algorithm = zscore.Batch(method=zscore.sumDense)
      
      # Set input object for the algorithm to nT
      algorithm.input.set(zscore.data, nT)
      
      # Compute Z-score normalization function
      res = algorithm.compute()
      
      #Retrieve normalized nT
      Norm_nT= res.get(zscore.normalizedData)
      
    • Min-Max Normalization: Rescales numeric table values feature-wise linearly to fit a [0, 1] or [-1, 1] range. Below are the steps to use DAAL’s Min-Max normalization.

      import daal.algorithms.normalization.minmax as minmax
      
      # Create an algorithm
      algorithm = minmax.Batch(method=minmax.defaultDense)
      
      # Set lower and upper bounds for the algorithm
      algorithm.parameter.lowerBound = -1.0
      algorithm.parameter.upperBound = 1.0
      
      # Set input object for the algorithm to nT
      algorithm.input.set(minmax.data, nT)
      
      # Compute Min-max normalization function
      res = algorithm.compute()
      
      # Get normalized numeric table
      Norm_nT = res.get(minmax.normalizedData)
  2. Block Descriptor for Visualization and Mutation:

    The contents of a numeric table cannot be accessed directly for visualization or manipulation. Instead, a user must first move a requested block of data values to a memory buffer. Once instantiated, this buffer is housed in an object called BlockDescriptor. A DAAL numeric table object has member functions to retrieve blocks of rows/columns and place them in the BlockDescriptor. The rwflag argument sets “readOnly” or “readWrite” mode, depending on whether the user intends to update values in the numeric table when releasing the block. Conveniently, the BlockDescriptor class allows block retrieval of data in specific rows and/or columns. Note: the dtype of data in the BlockDescriptor buffer is not required to match the numeric table that sourced the block.

    Access Modes:
    • “readOnly” argument sets rwflag to provide read only access to numeric table contents, thus performing no updates to the table when the block is released from buffer memory.

      Syntax and Usage:

      from daal.data_management import BlockDescriptor_Float64, readOnly
      #Allocate a readOnly memory block with double dtype
      block = BlockDescriptor_Float64()
      nT.getBlockOfRows(0,1, readOnly, block)
    • “readWrite” argument sets rwflag to write back any changes from block descriptor object to the numeric table when the block is released from buffer memory, thus enabling numeric table mutation with the help of block descriptor.

      Syntax and Usage:

      from daal.data_management import BlockDescriptor_Float64, readWrite
      #Allocate a memory block with double dtype (used here in readWrite mode)
      block = BlockDescriptor_Float64()
      nT.getBlockOfRows(0,1, readWrite, block)
  3. BlockDescriptor() in “readWrite” mode:

    When the rwflag argument is set to “readWrite” in getBlockOfRows()/getBlockOfColumnValues(), the contents of the BlockDescriptor object are written back to the numeric table when the block of rows is released, making edits possible on existing rows/columns in the numeric table.

    Let’s create a basic numeric table to explain the BlockDescriptor in “readWrite” mode in detail.

    import numpy as np
    from daal.data_management import HomogenNumericTable, readWrite, BlockDescriptor
    array =np.array([[1,2,3,4],
                    [5,6,7,8]])
    nT= HomogenNumericTable(array)
    • Edit numeric table Row-wise:
      #Create buffer object with ntype "double"
      doubleBlock = BlockDescriptor(ntype=np.float64)
      
      firstRow = 0
      lastRow = nT.getNumberOfRows()
      
      #getBlockOfRows() member function in "readWrite" mode to retrieve numeric table contents and populate "doubleBlock" object
      nT.getBlockOfRows(firstRow,lastRow, readWrite, doubleBlock)
      #Access array contents from "doubleBlock" object
      array = doubleBlock.getArray()
      print (array)
      #Mutate 1st row of array to reflect on "doubleBlock" object
      array[0] = [0,0,0,0]
      #Release buffer object and write changes back to numeric table
      nT.releaseBlockOfRows(doubleBlock)

      nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After the row mutation, the first row has been replaced through the buffer memory. The updated nT has data [[0,0,0,0],[5,6,7,8]].

    • Edit numeric table Column-wise:
      #Create a buffer object with ntype "intc" (the block dtype need not match the table's)
      doubleBlock = BlockDescriptor(ntype=np.intc)
      ColIndex = 2
      firstRow = 0
      lastRow = nT.getNumberOfRows()
      
      #getBlockOfColumnValues() member function in "readWrite" mode to retrieve numeric table ColIndex contents and populate "doubleBlock" object
      nT.getBlockOfColumnValues(ColIndex,firstRow,lastRow,readWrite,doubleBlock)
      
      #Access array contents from "doubleBlock" object
      array = doubleBlock.getArray()
      print (array)
      
      #Mutate array to reflect on "doubleBlock" object
      array[:][:] = 0
      
      #Release buffer object and write changes back to numeric table
      nT.releaseBlockOfColumnValues(doubleBlock)

      nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After the column mutation, the third column is replaced with [0,0] using the buffer memory. The updated nT has data [[1,2,0,4],[5,6,0,8]].

  4. Merge numeric table:

    Numeric tables can be appended along rows or columns, provided they share the same size along the axis being merged. RowMergedNumericTable() and MergedNumericTable() are the two classes available for merging numeric tables; the latter is used for column-wise merges.

    • Merge Row-wise:

      Syntax:

      mnT = RowMergedNumericTable()
      mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)

      Code Example:

      from daal.data_management import HomogenNumericTable, RowMergedNumericTable
      import numpy as np
      from utils import printNumericTable
      
      
      #nT1 and nT2 are 2 numeric tables having equal number of COLUMNS
      array =np.array([[1,2,3,4],
                       [5,6,7,8]], dtype = np.intc)
      nT1= HomogenNumericTable(array)
      array =np.array([[9,10,11,12],
                       [13,14,15,16]],dtype = np.intc)
      nT2= HomogenNumericTable(array)
      
      #Create merge numeric table object
      mnT = RowMergedNumericTable()
      
      #add numeric tables to merged numeric table object
      mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
      printNumericTable(mnT)
      

       Output:

      1.000     2.000     3.000     4.000    
      5.000     6.000     7.000     8.000    
      9.000     10.000    11.000    12.000   
      13.000    14.000    15.000    16.000  

    • Merge Column-wise:

      Syntax:

      mnT = MergedNumericTable()
      mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)

      Code Example:

      from daal.data_management import HomogenNumericTable, MergedNumericTable
      import numpy as np
      from utils import printNumericTable
      
      
      #nT1 and nT2 are 2 numeric tables having equal number of ROWS
      array =np.array([[1,2,3,4],
                       [5,6,7,8]], dtype = np.intc)
      nT1= HomogenNumericTable(array)
      array =np.array([[9,10,11,12],
                       [13,14,15,16]],dtype = np.intc)
      nT2= HomogenNumericTable(array)
      
      #Create merge numeric table object
      mnT = MergedNumericTable()
      
      #add numeric tables to merged numeric table object
      mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
      printNumericTable(mnT)


      Output:

      1.000     2.000     3.000     4.000     9.000     10.000    11.000    12.000    
      5.000     6.000     7.000     8.000     13.000    14.000    15.000    16.000

  5. Split Numeric table:

    See Table 1 for a quick reference on the getBlockOfRows() and getBlockOfColumnValues() methods, which extract sections of a numeric table by row or column. Additionally, the helper function getBlockOfNumericTable() provided below extracts a contiguous subset of the table for a selected range of rows and columns. getBlockOfNumericTable() accepts int or list keyword arguments for the row and column ranges, using conventional Python 0-based indexing.

    Syntax and Usage: getBlockOfNumericTable(nT, Rows = 'All', Columns = 'All')

    Helper Function:
    def getBlockOfNumericTable(nT, Rows = 'All', Columns = 'All'):
        from daal.data_management import HomogenNumericTable, HomogenNumericTable_Float64, \
        MergedNumericTable, readOnly, BlockDescriptor
        import numpy as np
        import warnings
    #------------------------------------------------------
        # Get First and Last Row indexes
        lastRow = nT.getNumberOfRows()
        if type(Rows)!= str:
            if type(Rows) == list:
                firstRow = Rows[0]
                if len(Rows) == 2: lastRow = min(Rows[1], lastRow)
            else:firstRow = 0; lastRow = Rows
        elif Rows== 'All':firstRow = 0
        else:
            warnings.warn('Type error in "Rows" arguments, Can be only int/list type')
            raise SystemExit
    #------------------------------------------------------
        # Get First and Last Column indexes
        nEndDim = nT.getNumberOfColumns()
        if type(Columns)!= str:
            if type(Columns) == list:
                nStartDim = Columns[0]
                if len(Columns) == 2: nEndDim = min(Columns[1], nEndDim)
            else: nStartDim = 0; nEndDim = Columns
        elif Columns == 'All': nStartDim = 0
        else:
            warnings.warn ('Type error in "Columns" arguments, Can be only int/list type')
            raise SystemExit
    #------------------------------------------------------
        #Retrieve block of Columns Values within First & Last Rows
        #Merge all the retrieved block of Columns Values
        #Return merged numeric table
        mnT = MergedNumericTable()
        for idx in range(nStartDim,nEndDim):
            block = BlockDescriptor()
            nT.getBlockOfColumnValues(idx,firstRow,(lastRow-firstRow),readOnly,block)
            mnT.addNumericTable(HomogenNumericTable_Float64(block.getArray()))
            nT.releaseBlockOfColumnValues(block)
        block = BlockDescriptor()
        mnT.getBlockOfRows (0, mnT.getNumberOfRows(), readOnly, block)
        mnT = HomogenNumericTable (block.getArray())
        return mnT 



    There are four different ways of passing arguments to this function (a short usage sketch follows the list):

    1. getBlockOfNumericTable(nT) - Extracts a block of the numeric table containing all rows and columns of nT.
    2. getBlockOfNumericTable(nT, Rows = 4, Columns = 5) - Retrieves the first 4 rows and first 5 column values of nT.
    3. getBlockOfNumericTable(nT, Rows=[2,4], Columns = [1,3]) - Slices the numeric table along the row and column directions using the lower and upper bounds passed in the lists.
    4. getBlockOfNumericTable(nT, Rows=[1,], Columns = [1,]) - Extracts all rows and columns from the lower bound through the last index.
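
    As a quick usage sketch (an illustrative addition; it applies the helper to the 2x4 nT from section 1 and uses printNumericTable from the bundled 'utils' examples folder):

    from utils import printNumericTable

    sub_nT = getBlockOfNumericTable(nT, Rows=[0, 2], Columns=[1, 3])
    printNumericTable(sub_nT)   # should print the 2x2 block formed by columns 1 and 2: rows (2, 3) and (6, 7)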
  6. Change feature type:

    Numeric table objects have dictionary manipulation methods to get and set feature types in the Data Dictionary for each column. Categorical (0), Ordinal (1), and Continuous (2) are the feature types supported by Intel® DAAL's Data Dictionary.

    • Get dictionary object associated with nT :

      Syntax: nT.getDictionary()

      Code Example:

      # Assumed import for the DAAL_* feature-type constants used below
      from daal.data_management import data_feature_utils
      
      dict = nT.getDictionary() # nT is the numeric table created in section 1
      # 'dict' holds the data dictionary of nT and can be used to update metadata;
      # the most common use case is to modify the default feature type of nT's columns.
      
      # Print default feature type of 3rd feature (example feature is continuous):
      print(dict[2].featureType) #outputs "2" (denotes Continuous feature type)
      
      # Modify feature type from Continuous to Categorical:
      dict[2].featureType = data_feature_utils.DAAL_CATEGORICAL
      print(dict[2].featureType) #outputs "0" (denotes Categorical feature type)
           
    • Set dictionary object associated with nT:

      This method is used to replace the current Data Dictionary values or to create a new Data Dictionary if needed. For batch updates, an existing Data Dictionary can be overwritten in full using the setDictionary() method.

      When tables are created without allocating memory for the Data Dictionary, the setDictionary() method must be used to construct metadata for the features in the table. Let us again consider the nT created in section 1, which has 4 features.

      Syntax: nT.setDictionary()

      Code Example:

      # Assumed imports for this snippet: NumericTableDictionary and the
      # data_feature_utils feature-type constants from daal.data_management
      from daal.data_management import NumericTableDictionary, data_feature_utils
      
      nFeatures = 4   # nT from section 1 has 4 features
      
      #Create a dictionary object using the numeric table dictionary class with the number of features
      dict = NumericTableDictionary(nFeatures)
      #Allocate a feature type for each feature
      dict[0].featureType = data_feature_utils.DAAL_CONTINUOUS
      dict[1].featureType = data_feature_utils.DAAL_CATEGORICAL
      dict[2].featureType = data_feature_utils.DAAL_CONTINUOUS
      dict[3].featureType = data_feature_utils.DAAL_CATEGORICAL
      
      #Set the nT numeric table dictionary with "dict"
      nT.setDictionary(dict)
      
      

2.2 Export Numeric Table to disk:

Numeric tables can be exported and saved to disk as a numpy binary (.npy) file. The following two sections contain helper functions to complete the task of saving in binary form, as well as compressing the data on disk.

  1. Serialization

    Intel® DAAL provides interfaces to serialize numeric table objects into a data archive that can be converted to a numpy array object. The resulting Numpy array, which houses the serialized form of the data, can be saved to disk and subsequently reloaded in the future to reconstruct the source numeric table.

    To automate this process, the following helper function can be used to serialize and save to disk.

    Helper Function:
    def Serialize(nT):
       #Construct an input data archive object
       #Serialize nT contents into the data archive object
       #Copy data archive contents to a numpy array
       #(Optionally) save the numpy array as .npy to the given path
       from daal.data_management import InputDataArchive
       import numpy as np
    
       dataArch = InputDataArchive()
    
       nT.serialize(dataArch)
    
       length = dataArch.getSizeOfArchive()
       buffer_array = np.zeros(length, dtype=np.ubyte)
       dataArch.copyArchiveToArray(buffer_array)
    
       return buffer_array
    buffer_array = Serialize(nT) # call helper function
    #np.save(<path>, buffer_array) # this step is optional: write the serialized array to disk
  2. Compression

    Compressor methods are also available in Intel® DAAL to achieve reduced memory footprint when large datasets must be stored to disk. A serialized array (see “Serialization” section) representation of a DAAL numeric table can be compressed before saving it to disk, hence achieving optimal storage.

    To automate this process, the following helper function can be used to serialize, then compress the resulting serialized array.

    The CompressToDisk(nT, path) helper below incorporates the Serialize(nT) helper to compress and write numeric tables to disk.

    Helper Function:
    def CompressToDisk(nT, path):
        # Serialize nT contents
        # Create a compressor object
        # Create a stream for compression
        # Write numeric table to the compression stream
        # Allocate memory to store the compressed data
        # Store compressed data
        # Save compressed data to disk
        from daal.data_management import Compressor_Zlib, level9, CompressionStream
        import numpy as np
    
        buffer = Serialize (nT)
        compressor = Compressor_Zlib ()
        compressor.parameter.gzHeader = True
        compressor.parameter.level = level9
        comprStream = CompressionStream (compressor)
        comprStream.push_back (buffer)
        compressedData = np.empty (comprStream.getCompressedDataSize (), dtype=np.uint8)
        comprStream.copyCompressedArray (compressedData)
        np.save (path, compressedData)
    CompressToDisk(nT, <path>) # call helper function

2.3 Import Numeric Table from disk:

As mentioned in the previous sections, numeric tables can be stored on disk as either serialized or compressed numpy files. Decompression and deserialization methods are available to reconstruct the numeric table.

  1. Deserialization

    The helper function below is available to reconstruct a numeric table from serialized array objects.

    Helper Function:
    def DeSerialize(buffer_array):
        from daal.data_management import OutputDataArchive, HomogenNumericTable
        #Load serialized contents to construct output data archive object
        #De-serialize into nT object and return nT
    
        dataArch = OutputDataArchive(buffer_array)
        nT = HomogenNumericTable()
        nT.deserialize(dataArch)
        return nT
    #buffer_array = np.load(path) # this step is optional, used only when the serialized contents were saved to disk
    nT = DeSerialize(buffer_array)
    
  2. Decompression

    Since the compression stage involves serializing the numeric table object, the decompression stage includes deserialization; the DeSerialize helper function above is used to recover the numeric table. A quick decompression helper function is shown below.

    Helper Function:
    def DeCompressFromDisk(path):
        import numpy as np
        from daal.data_management import Decompressor_Zlib, DecompressionStream
        # Create a decompressor
        decompressor = Decompressor_Zlib()
        decompressor.parameter.gzHeader = True
    
        # Create a stream for decompression
        deComprStream = DecompressionStream(decompressor)
    
        # Write the compressed data to the decompression stream and decompress it
        deComprStream.push_back(np.load(path))
    
        # Allocate memory to store the decompressed data
        deCompressedData = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)
    
        # Store the decompressed data
        deComprStream.copyDecompressedArray(deCompressedData)
    
        #Deserialize
        return DeSerialize(deCompressedData)
    
    nT = DeCompressFromDisk(<path>) # path must be a '.npy' file

    Intel® DAAL also implements several other generic compression and decompression methods, including ZLIB, LZO, RLE, and BZIP (reference).
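
    A brief end-to-end sketch combining the helpers above (an illustrative addition): it serializes the nT from section 1 and rebuilds it, then compresses it to disk and rebuilds it again; the path '/tmp/nT_compressed.npy' is purely illustrative.

    buffer_array = Serialize(nT)
    nT_restored = DeSerialize(buffer_array)
    print(nT_restored.getNumberOfRows(), nT_restored.getNumberOfColumns())   # expected: 2 4

    CompressToDisk(nT, '/tmp/nT_compressed.npy')
    nT_restored = DeCompressFromDisk('/tmp/nT_compressed.npy')
    print(nT_restored.getNumberOfRows(), nT_restored.getNumberOfColumns())   # expected: 2 4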

Conclusion

Intel® DAAL’s data management component provides classes and methods to perform common operations on numeric table contents. Basic numeric table operations include access, mutation, export to disk, and import from disk. The helper functions covered in this document help automate the creation of numeric table subsets, as well as the serialization and compression processes.

The next volume (Volume 3) in the Gentle Introduction series gives a brief introduction to the Algorithms section of PyDAAL. Volume 3 focuses on the workflow of important descriptive and predictive algorithms available in Intel® DAAL. Advanced features such as setting hyperparameters, distributing fit calculations, and deploying models as serialized objects will all be covered.


Using Intel® Math Kernel Library Compiler Assisted Offload in Intel® Xeon Phi™ Processor


Introduction

Besides native execution, another usage model for the Intel® Math Kernel Library (Intel® MKL) on an Intel® Xeon Phi™ processor is compiler assisted offload (CAO). The CAO usage model allows users to offload Intel MKL functions and data to an Intel Xeon Phi processor by using the Intel® compiler and its offload pragma support to manage functions and offloaded data.

This document shows how users can offload Intel MKL functions and data to the Intel Xeon Phi processor from an Intel® Xeon® processor-based machine. In order to use Intel MKL CAO in an Intel Xeon Phi processor, users need to set up Offload over Fabric software first.

Part 1 – Installing Offload over Fabric Software

In this example, Intel® Omni-Path Architecture (Intel® OPA) was used to connect an Intel Xeon processor machine and an Intel Xeon Phi processor machine. For details on installation and configuring IP over Fabric, please refer to the paper How to install the Intel® Omni-Path Architecture Software.

In this article, an Intel® Xeon® processor E5-2698 v3 @ 2.30GHz server is the host machine and an Intel Xeon Phi Processor is the target machine. Both machines run on Red Hat Enterprise Linux* 7.2. Each machine has an Intel® Omni-Path Host Fabric Interface PCIx16 card and they are connected with an Intel® Omni-Path Cable.

  • Install the Intel® Omni-Path Architecture Fabric Host Software, IntelOPA-IFS.RHEL72-x86_64.10.4.2.0.7.tgz, which can be downloaded from the Intel download center, on both machines. Note that this version 10.4.2.07 of the Intel OPA Fabric Host Software requires the library libfabric version 1.4 or greater. libfabric is a core component of OpenFabrics Interfaces*. Therefore, you need to recompile and install a newer version of libfabric, as shown in the next step (see A BKM for Working with libfabric* on a Cluster System when using Intel® MPI Library).
  • Download libfabric-1.4.2.tar.bz2 and rebuild libfabric:
    # rpmbuild -ta libfabric-1.4.2.tar.bz2 --define 'configopts --enable-verbs=yes'
    # cd /root/rpmbuild/RPMS/x86_64
    # yum install libfabric-1.4.2-1.el7_2.x86_64.rpm libfabric-debuginfo-1.4.2-1.el7_2.x86_64.rpm libfabric-devel-1.4.2-1.el7_2.x86_64.rpm
    # fi_info
    provider: psm2
  • After installing the Intel OPA Fabric Host Software on both host and target machines, reboot them.
  • Configure Intel OFA IP over Infiniband (IPoIB). In this example, the IP addresses on the host and target are 192.168.100.101 and 192.168.100.102, respectively.
  • Bring the IP over Fabric interface up on both machines:
    On the host machine:
    [host]# ifup ib0
    [host]# ifconfig ib0
    ib0: flags=4163  mtu 65520
            inet 192.168.100.101  netmask 255.255.255.0  broadcast 192.168.100.255
            inet6 fe80::211:7501:179:311  prefixlen 64  scopeid 0x20
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
            infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
            RX packets 5415223  bytes 47440566267 (44.1 GiB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 5850844  bytes 47481697417 (44.2 GiB)
            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    Similarly, on the target machine:
    [target]# ifup ib0
    [target]# ifconfig ib0
    ib0: flags=4163  mtu 65520
            inet 192.168.100.102  netmask 255.255.255.0  broadcast 192.168.100.255
            inet6 fe80::211:7501:174:44e0  prefixlen 64  scopeid 0x20
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
            infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
            RX packets 11370  bytes 1989607 (1.8 MiB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 11551  bytes 5639588 (5.3 MiB)
            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
  • Run the Intel OPA Fabric Manager
    [host]# opaconfig -E opafm
    [host]# service opafm start
    
    [target]# opaconfig -E opafm
    [target]# service opafm start
    
  • Set up Secure Shell (SSH) password-less for the offload testing. To do this, first generate a pair of authentication keys on the host without entering a passphrase:
    [host]$ ssh-keygen -t rsa
    Then append the host machine new public key to the target machine public key using the command ssh-copy-id:
    [host]$ ssh-copy-id @192.168.100.102
  • Download and install the Offload over Fabric software for host version 1.5.2 from the Intel Xeon Phi Processor Software page on the host machine (follow the instructions in the User Guide).
  • Similarly, download the Offload over Fabric for target software version 1.5.2 from the Intel Xeon Phi Processor Software page and install it on the target machine.
  • Finally, install the latest version of Intel® Parallel Studio XE (in this example the Intel Parallel Studio XE 2018 is used) on the host machine.

Part 2 – Using Intel® Math Kernel Library Compiler Assisted Offload

In the second part of this article, an example of using the Intel MKL CAO feature to offload an Intel MKL function to the target machine is shown. The original code from Multiplying Matrices Using dgemm was modified to add the offload capability.

You can use the offload pragma to initiate an offload from the host to the target. The in specifier defines a variable as strictly an input to the target; the value is not copied back to the host. The inout specifier defines a variable that is both copied from the host to the target and then from the target back to the host. The program offloads the same function many times. To retain data between different offloads, you can specify alloc_if(1) to perform a fresh memory allocation in the first iteration and free_if(0) to retain the memory. In subsequent offloads, you can specify alloc_if(0) to reuse the memory and free_if(0) to retain it. In the last offload, you can specify alloc_if(0) to reuse the memory and free_if(1) to free it. In the sample code in the appendix, the program iterates the offload process three times: in the first iteration, memory is allocated on the Intel Xeon Phi processor to store the matrices and is retained; that memory is reused in the subsequent iterations; and in the last iteration, the memory is freed.

The code sample offloads the cblas_dgemm function for matrix multiplication to the Intel Xeon Phi processor via Offload over Fabric. The cblas_dgemm function computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product:

C := α*A*B + β*C

where A, B, and C are double-precision matrices, and α and β are double-precision scalars.

The program calls the interface and passes the following arguments:

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, k, B, n, beta, C, n);

Where:

  • CblasRowMajor indicates that the elements of each row of the matrices are stored contiguously.
  • CblasNoTrans indicates that the matrix A and B should not be transposed or conjugate transposed before the matrix multiplication.
  • m, n, k are integers that indicate the matrix dimensions. In this case, m specifies the number of rows of matrix A and of matrix C, n specifies the number of columns of matrix B, and k specifies the number of columns of A and the number of rows of B. Thus, matrix A is m rows by k columns, matrix B is k rows by n columns, and matrix C is m rows by n columns.
  • alpha and beta are scalars used in the multiplication as shown above in the formula.
  • A, B, and C are arrays used to store the matrices, respectively.
  • The leading dimensions of the matrices A, B, and C: in this example, k is the leading dimension (number of columns) of matrix A, and n is the leading dimension of both matrix B and matrix C.

To run the application, set the proper compiler environment variables for the Intel® Parallel Studio XE 2018 and compile the code sample from the host machine:

[host]$ source /opt/intel/parallel_studio_xe_2018.0.033/psxevars.sh intel64
[host]$ icc -mkl -qopenmp mkl-cao.c -o mkl-cao.out

Prior to executing the program, you need to set the environment variable OFFLOAD_NODES to the IP address (over the high-speed network Intel OPA) of the target machine, in this case 192.168.100.102, to indicate that the target is available for offloading.

[host]$ export OFFLOAD_NODES=192.168.100.102

Optionally, to generate offload execution time and the amount of data transferred, one can set the environment variable OFFLOAD_REPORT to 2 (value 1 reports the offload computation time only, while value 3 reports the offload computation time, the amount of data transferred, device initialization, and individual variable transfers).

[host]$ export OFFLOAD_REPORT=2

To run the application, one must pass the values of m, n, k, alpha, and beta to the application running on the host machine. For example, the following command line triggers a matrix multiplication where m=n=k=14906, alpha=1.0, and beta=2.0. Note that for simplicity, all elements of matrix A are initialized to 1.0 and all elements of matrix B are initialized to 2.0. The application allocates the memory for the matrices on the host, then offloads the MKL matrix multiplication function and the matrix arrays to the Intel Xeon Phi processor three times. The target machine allocates memory in the first iteration, performs the matrix multiplication, and sends the results back to the host. In the second iteration, the target machine reuses the allocated memory and performs the matrix multiplication. In the last iteration, the target machine deallocates the memory after performing the matrix multiplication and sending back the result.

[host]$./mkl-cao.out 14906 14906 14906 1.0 2.0
m:14906 n:14906 k:14906 alpha:    1.00 beta:    2.00
[Offload] [MIC 0] [File]                    mkl-cao.c
[Offload] [MIC 0] [Line]                    76
[Offload] [MIC 0] [Tag]                     Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        8.436242(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   5332532092 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        3.096220(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   1777510688 (bytes)

[Offload] [MIC 0] [File]                    mkl-cao.c
[Offload] [MIC 0] [Line]                    76
[Offload] [MIC 0] [Tag]                     Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        3.002104(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   5332532116 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        2.669768(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   1777510688 (bytes)

[Offload] [MIC 0] [File]                    mkl-cao.c
[Offload] [MIC 0] [Line]                    76
[Offload] [MIC 0] [Tag]                     Tag 2
[Offload] [HOST]  [Tag 2] [CPU Time]        3.233454(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data]   5332532116 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time]        2.668376(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data]   1777510688 (bytes)

 Top left corner of matrix A(m x k):
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00

 Top left corner of matrix B(k x n):
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00

 Top left corner of matrix C(m x n):
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00

Conclusion

With Intel OPA, a host computer can take advantage of Intel® Many Integrated Core Architecture by using the Intel compiler’s offload pragma support for the Intel Xeon Phi processor. Moreover, Intel MKL CAO allows users to use the highly optimized Intel MKL functions on an Intel Xeon Phi processor. This paper shows users, step by step, how to set up and configure Intel OPA and enable Offload over Fabric. The code samples illustrate how to use the offload pragma to offload an Intel MKL function from a host to an Intel Xeon Phi processor.

References

A BKM for Working with libfabric* on a Cluster System when using Intel® MPI Library

How to Install the Intel® Omni-Path Architecture Software

Developer Reference for Intel Math Kernel Library 2018 - C

Effective Use of the Intel Compiler’s Offload Features

Download sample code [1.59KB]

Appendix A

The sample code is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
#include "mkl.h"

#define min(x,y) (((x) < (y)) ? (x) : (y))

int offload(int m, int n, int k, double alpha, double beta)
{
   int i, j;

   /* Allocate memory using MKL function to aligned on 64-byte boundary */
   double *A = mkl_malloc(sizeof(double) * m * k, 64);
   if (A == NULL)
      return (-1);
   else
   {
      /* Initialize matrix A */
      for (i = 0; i < m*k; i++)
         A[i] = 1.0;
   }

   double *B = mkl_malloc(sizeof(double) * k * n, 64);
   if (B == NULL)
   {
      mkl_free(A);
      return (-1);
   }
   else
   {
      /* Initialize matrix B */
      for (i = 0; i < k*n; i++)
         B[i] = 2.0;
   }

   double *C = mkl_malloc(sizeof(double) * m * n, 64);
   if (C == NULL)
   {
      mkl_free(A);
      mkl_free(B);
      return (-1);
   }
   else
   {
      /* Initialize matrix C */
      for (i = 0; i < m*n; i++)
         C[i] = 0.0;
   }

   const int NITERS = 3;

   for (i = 0; i < NITERS; i++)
   {
      static int first_run = 1, last_run = 0;
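      /* first_run selects alloc_if(1) on the first offload so the target-side
         buffers are allocated once; last_run selects free_if(1) on the final
         offload so they are released (both flags are updated after each offload). */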

#pragma offload target(mic:0) in(m, n, k, alpha, beta) \
		in(A: length(m*k) alloc_if(first_run) free_if(last_run)) \
		in(B: length(k*n) alloc_if(first_run) free_if(last_run)) \
		inout(C: length(m*n) alloc_if(first_run) free_if(last_run))
      {
         cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
	   m, n, k, alpha, A, k, B, n, beta, C, n);
      }


      first_run = 0;
      if (i == NITERS-2)
         last_run = 1;
   }

   // Verify
   printf (" Top left corner of matrix A(m x k): \n");
   for (i=0; i<min(m,6); i++)
   {
      for (j=0; j<min(k,6); j++)
         printf ("%12.2f", A[j+i*k]);

      printf ("\n");
   }

   printf ("\n Top left corner of matrix B(k x n): \n");
   for (i=0; i<min(k,6); i++)
   {
      for (j=0; j<min(n,6); j++)
         printf ("%12.2f", B[j+i*n]);

      printf ("\n");
   }

   printf ("\n Top left corner of matrix C(m x n): \n");
   for (i=0; i<min(m,6); i++)
   {
      for (j=0; j<min(n,6); j++)
         printf ("%12.2f", C[j+i*n]);

      printf ("\n");
   }

   mkl_free(A);
   mkl_free(B);
   mkl_free(C);

   return 0;
}

int main(int argc, char **argv)
{
   int rc, m, n, k;
   double alpha, beta;

   if (argc != 6)
   {
      printf("Usage: ./mkl-cao.out m n k alpha beta \n");
      printf("Where m is the number of row of matrix A \n");
      printf("      n is the number of column of matrix B \n");
      printf("      k is the number of column of matrix A \n");
      printf("      alpha and beta are scale factors  \n");

      return argc;
   }

   m = atoi(argv[1]);
   n = atoi(argv[2]);
   k = atoi(argv[3]);
   alpha = atof(argv[4]);
   beta = atof(argv[5]);

   printf("m:%d n:%d k:%d alpha:%8.2f beta:%8.2f\n", m, n, k, alpha, beta);
   rc = offload(m, n, k, alpha, beta);

   return rc;
}

Webinar: Better Threaded Performance and Scalability With Intel(R) Vtune Amplifier + OpenMP*


Pre-requisites:

  1. Intel® Parallel Studio Professional or Ultimate Edition installed on Linux machines (provides the Intel® C++ Compiler, Intel® VTune™ Amplifier, and Intel® Advisor, which we will use in this lab).
  2. Install OpenCV latest version:
    1. Download the source from github (https://github.com/opencv/opencv) using git clone command.
    2. Build OpenCV libraries using instructions documented at http://docs.opencv.org/trunk/d7/d9f/tutorial_linux_install.html.
  3. Make sure that you have a copy of the source code for your lab which includes the lab documentation.

Introduction:

This lab will help you understand how to use Intel® VTune™ Amplifier and Intel® Advisor to look for tuning opportunities and tune the code by enabling threading (using OpenMP or Intel® Threading Building Blocks [Intel® TBB]) and enabling vectorization (using OpenMP 4.0 SIMD constructs).

Detailed document is here.

Intel® System Studio 2018 Beta Documentation


Release Notes and What's New

System Requirements and Installation

Getting Started Guides

User Guides

Developer, User, and Reference Guides

Compilers:

Threading and Performance Libraries:

Performance Analyzers:

Debuggers:

Downloadable Documentation

For your offline use, you can download the full documentation package. Click one of these links to start the download:
.tar.gz or .zip


Robot and Me: A night in


Our new comic strip introduces a software developer's home life with her friendly robot. In this snapshot of the future, we imagine what life might be like when artificial intelligence enables us all to have robots at home. As they become smarter and develop more life-like behaviors, we might start to think of them more as flatmates than machines. If you're developing artificial intelligence applications, take a look at the resources in the Intel® Nervana™ AI Academy.

Next comic strip: Robot and Me: A night out

Robot and Me: A night out


Our second cartoon strip in the Robot and Me series sees software developer Sarah going outside with her robot buddy. It invites us to think about whether machines could ever appreciate or understand beauty. What would that mean? How would that change our relationship with technology?

If you'd like to develop applications using artificial intelligence, take a look at the resources available in the Intel® Nervana™ AI Academy. It has a range of tools and training to help you make AI software today.

Next comic strip: Robot and Me: Baking a cake
