
Robot and Me: Baking a cake


Perhaps the most visible application of artificial intelligence today is the recommendation engine that suggests other products you might like when shopping online. It's not always perfect, so Sarah's attempt to bake a cake with her robot buddy in our latest strip might feel familiar to you. If you're engineering today's artificial intelligence solutions, or you're looking to the future, Intel has resources to help you. Visit the Intel® Nervana™ AI Academy.

First comic strip: Robot and Me: A night in


Face It – The Artificially Intelligent Hairstylist


Abstract

Face It is a mobile application that uses computer vision to acquire data about a user’s facial structure and machine learning to determine the user’s face shape. This information is then combined with manually entered information to give the user a personalized set of hair and beard styles designed to make him look his best. A personalized list of tips is also generated for the user to take into account when getting a haircut.

 

 

1. Introduction

To create this application, various procedures, tools and coding languages were utilized.


Procedures
  • Computer vision with Haar cascade files to detect a person’s face

  • Machine learning, specifically a convolutional neural network with transfer learning, to identify a person’s face shape

  • A preference sorting algorithm to determine which styles look best on a person based on the collected data


Programs/Tools Used
  • Ubuntu v17.04

  • Android Studio*

  • Intel® Optimization for TensorFlow*

  • OpenCV

Programming Languages
  • Java
  • Python

 

 

2. Computer Vision

For this application we used the OpenCV library (originally developed by Intel) along with Haar cascade files to detect a person’s face.

Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector. [1] A large number of these Haar-like features are combined to detect an object with sufficient accuracy; the resulting files are called Haar cascade classifier files. These methods were used and tested in the Viola-Jones object detection framework. [2]

In particular, the frontal face detection file is used to detect the user’s face. This file, along with various other Haar cascade files, can be found here: http://alereimondo.no-ip.org/OpenCV/34.

This library and file were incorporated into our application to ensure that the user’s face is detected, since the main objective is to determine the user’s face shape.

Figure 1: Testing the OpenCV library and the frontal face Haar cascade classifier file in real time.

 

OpenCV was integrated with Android’s camera2 API for this real-time processing to occur. An Android device with an API level of 21 or higher is required to run tests and use the application, because the camera2 API is only available on devices at that level or greater.
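Outside of Android, the same detection flow can be sketched in a few lines of Python with OpenCV. This is an illustrative sketch only; the application itself does this in Java through the camera2 API, and the cascade file path below is an assumption that should point to a local copy of haarcascade_frontalface_default.xml.

import cv2

# Path to the frontal face cascade file (adjust to your local OpenCV install)
CASCADE_PATH = "haarcascade_frontalface_default.xml"

face_cascade = cv2.CascadeClassifier(CASCADE_PATH)
cap = cv2.VideoCapture(0)  # default camera

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detectMultiScale returns an (x, y, w, h) box for each detected face
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("Face detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()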

 

 

 

3. Machine Learning

3.1 Convolutional Neural Networks

For the facial recognition aspect of our application, we used machine learning with a convolutional neural network (CNN).

CNNs are commonly associated with image recognition, and they can be trained with relatively little difficulty. A well-trained CNN classifies images with high accuracy.

CNN architectures are inspired by biological processes and consist of variations of multilayer perceptrons designed to require minimal preprocessing. [3] A CNN has multiple layers, each with a distinct function that helps recognize an image. These layers include a convolutional layer, pooling layer, rectified linear unit (ReLU) layer, fully connected layer, and loss layer.

Figure 2: A diagram of a convolutional neural network in action[4]

 

 

  • The convolutional layer acts as the core of any CNN. The network develops a 2-dimensional activation map that detects the spatial position of each feature at all the given spatial positions set by the parameters.
  • The pooling layer acts as a form of downsampling. Max pooling is the most common implementation of pooling; it is ideal when dealing with smaller data sets, which is why we chose to use it.
  • The ReLU layer is a layer of neurons that applies an activation function to increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolutional layer itself.
  • The fully connected layer, which follows several convolutional and max pooling layers, does the high-level reasoning in the neural network. Neurons in this layer have connections to all the activations in the previous layer. The activations for the fully connected layer are then computed by a matrix multiplication and a bias offset.
  • The loss layer specifies how network training penalizes the deviation between the predicted and true labels. Softmax loss is best for this application, as it is ideal for detecting a single class in a set of mutually exclusive classes. A minimal sketch of how these layers compose follows this list.
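To make the layer ordering concrete, below is a minimal, illustrative Keras sketch that wires these layers together. The input shape and filter counts are assumptions chosen for demonstration; this is not the application's actual model (the application instead uses transfer learning on Inception v3, described next), though the six output classes match the face shapes used later.

# A minimal, illustrative CNN wiring the layers described above:
# convolution -> ReLU -> max pooling -> fully connected -> softmax loss.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),  # convolutional + ReLU
    layers.MaxPooling2D((2, 2)),                                              # max pooling (downsampling)
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),   # fully connected layer
    layers.Dense(6, activation="softmax"),  # one output per face-shape class
])

# Softmax cross-entropy serves as the loss layer during training
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])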

 

3.2 Transfer Learning with TensorFlow*

The layers of a CNN can be connected in various orders and combinations. The order depends on the type of data you are using and the kind of results you are trying to get back.

Various well-known CNN models have been created and released to the public for research and use. These include AlexNet[5], which uses two GPUs to train the model and various separate and combined layers; it won the ImageNet Large Scale Visual Recognition Challenge [6] in 2012. Another example is VGGNet[7], a very deep network that uses many convolutional layers in its architecture.

A very popular CNN architecture for image classification is Google's Inception model, whose first version (GoogLeNet) won the ImageNet Large Scale Visual Recognition Challenge in 2014. Our application uses its third iteration, Inception v3.

Figure 3: A diagram of Google’s Inception v3 convolutional neural network model[8]

 

 

As the diagram shows, various convolutional, pooling, ReLU, fully connected, and loss layers are used in a specific order, which helps the model output extremely accurate results when classifying an image.

This model is so well put together that many developers use a method called transfer learning with the Inception v3 model. Transfer learning shortens the process of training a model from scratch by taking a model fully trained on a set of categories like ImageNet and re-training it from the existing weights for new classes.

Figure 4: Diagram showing the difference between Traditional Machine Learning and Transfer Learning[9]

 

 

To use transfer learning for the application, TensorFlow was run with a Docker image that had all the repositories needed for the process. The Inception v3 model was then loaded into TensorFlow, where we were able to re-train it with the dataset our application needs to recognize face shapes.

Figure 5: How the Inception v3 model looks during the process of transfer learning[10]

 

During transfer learning, only the last layer of the pre-trained model is removed and retrained. This is where the dataset for our application was fed in for training. The model uses all the knowledge it acquired from the previous data to train on the new data as accurately as possible.

This is the beauty of transfer learning, and it is why using this technique can save so much time while remaining extremely accurate. Through a re-train script, the images in the dataset were passed through the last layer of the model, and the model was accurately re-trained.
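To make the result of retraining concrete, here is a hedged sketch of classifying a single image with the retrained graph using the TensorFlow 1.x Python API. The file names (retrained_graph.pb, retrained_labels.txt) and tensor names (final_result, DecodeJpeg/contents) are the defaults produced by TensorFlow's retrain example and are assumptions here, not the application's exact code.

import tensorflow as tf

# Load the retrained graph produced by the retrain script
with tf.gfile.GFile("retrained_graph.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# One label per line, in training order (e.g. square, round, oval, ...)
labels = [line.strip() for line in tf.gfile.GFile("retrained_labels.txt")]

with tf.Session() as sess:
    tf.import_graph_def(graph_def, name="")
    image_data = tf.gfile.GFile("test_face.jpg", "rb").read()
    softmax = sess.graph.get_tensor_by_name("final_result:0")
    # DecodeJpeg/contents:0 accepts raw JPEG bytes in the Inception v3 graph
    preds = sess.run(softmax, {"DecodeJpeg/contents:0": image_data})[0]
    for i in preds.argsort()[::-1]:
        print(labels[i], preds[i])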

 

 

3.3 Dataset

There are many popular datasets that were created and collected by many individuals to help further the advancement and research of convolutional neural networks. One common dataset used is the MNIST dataset for recognizing handwritten digits.

Figure 6: Example of the MNIST dataset that is used for training and recognizing hand written digits. [11]

 

This dataset consists of thousands of images of handwritten digits, and people can use it to train and test the accuracy of their own convolutional neural networks. Another popular dataset is CIFAR-10[12], which consists of thousands of images of 10 different objects/animals: an airplane, an automobile, a bird, a cat, a deer, a dog, a frog, a horse, a ship, and a truck.

Large amounts of data are good to have but very hard to collect, which is why many ready-made collections are available for practice and training.

The objective of our CNN model was to recognize a user’s face shape and in order for it to do so, it was fed various images of people with different face shapes.

The face shapes were categorized into six different shapes: square, round, oval, oblong, diamond and triangular. A folder was created for each face shape and each folder contained various images of people with that certain face shape.

Figure 7: Example of the contents inside the folder for the square face shape

 

These images were gathered from various reliable articles about face shapes and hairstyles. We made sure to collect data that was as accurate as possible to get the best results. In total we had approximately 100 images of people with each type of face shape in each folder.

Figure 8: Example of a single image saved in the square face shape folder.[13]

 

These images were fed through the model and trained for 4,000 iterations (steps) to get maximum accuracy.

While these images were being trained, various bottleneck files were created. A bottleneck caches, for each image, the output of the layer just before the final classification layer, so those values do not have to be recomputed on every training pass.

Figure 9: Various bottlenecks being created while re-training the Inception v3 CNN

 

A few other files are also created, including a retrained graph that holds all the new information needed to recognize the images the model has just been trained on.

This file is fine to use as-is for recognizing images on a computer, but to use it on a mobile device we have to compress it while keeping all the information necessary for it to remain accurate.

In order to do this we optimize the file to fit the size that we need. To do this we modify the following features of the file (a sketch of these steps follows the list):

  1. We remove all nodes that aren't needed for a given set of input and output nodes
  2. We merge explicit batch normalization operations
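These two steps correspond to the strip_unused_nodes and fold_batch_norms transforms in TensorFlow 1.x's graph transform tool; the sketch below is a hedged illustration of that path, and the input/output tensor names ("Mul", "final_result") follow the Inception v3 retrain defaults rather than anything specific to our code.

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

graph_def = tf.GraphDef()
with tf.gfile.GFile("retrained_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# Step 1: drop nodes unused by the input/output set; step 2: fold batch norms
optimized = TransformGraph(graph_def, ["Mul"], ["final_result"],
                           ["strip_unused_nodes", "fold_batch_norms"])

with tf.gfile.GFile("optimized_graph.pb", "wb") as f:
    f.write(optimized.SerializeToString())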

After this we are left with two main files that we will load into Android Studio to use with our application.

Figure 10: Files that need to be imported into Android Studio

These files consist of the information needed to identify an image that the model has been trained to recognize once it is seen through a camera.

 

3.4 Accuracy

The accuracy of the retrained model is very important since the face shape being determined should be as accurate as possible for the user.

To have a high level of accuracy we had to make sure that various factors were taken into account. We had to make sure that we had a sufficient amount of images for the model to be trained on. We also had to make sure that the model trained on the images a sufficient amount of iterations.

For the first few trials we were getting mixed results, and the accuracy for a predicted face shape was all over the place: one image would get an 82% accuracy while another would get a 62% accuracy. This was not good enough, and we wanted much more accurate and precise results.

Figure 11: An example of a low accuracy level that we were receiving with our initial dataset.

 

At first we used approximately 50 images of each face shape, but to improve our low accuracy we increased this number to approximately 100 images per face shape. These images were carefully hand-picked to fit the needs of our application and its face shape recognition software. We aimed for a benchmark average accuracy of approximately 90%.

Figure 12: An example of a high accuracy level we were receiving after the changes we made with the dataset.

 

After these adjustments we saw a huge difference in our accuracy level and reached the benchmark we were aiming for. When it came time to compress the files needed for the face shape detection to work, we made sure that the accuracy level was not affected.

For ease of use, after testing the accuracy levels of our application, we adjusted the code to output only the highest-probability face shape in a simple, easy-to-read sentence rather than displaying various percentages on the screen.

 

4. Application Functionality

4.1 User Interface

The user interface of the application consists of three main screens:

  1. The face detection screen with the front-facing camera. This camera screen appears first so that the user can determine his face shape right away. After the face shape detector has determined the user’s face shape, the user can tap the “Preferences” button to go to the next screen.
  2. The next screen is the preferences screen, where the user inputs information about himself. It asks the user to select certain characteristics, including the face shape he just discovered on the first screen (square, round, oval, oblong, diamond, or triangular), his hair texture (straight, wavy, or coiled), his hair thickness (thick or thin), whether he has facial hair (yes or no), his acne level (none, moderate, excessive, or prefer not to answer), and his lifestyle (business, athlete, or student). After the user has selected all of his preferences he can tap the “Get Hairstyles!” button to go to the final screen.
  3. The final output screen presents a list of recommended hair/beard styles along with tips the user can use when getting a haircut. The preferences that the user selects go through a sorting algorithm created for this application. Afterwards, the user can swipe through the final recommendation screen and choose from various hair/beard styles. An image accompanies each style so the user has a better idea of how it looks. A list of tips is also generated so that the user knows what to say to his barber when getting a haircut.

Figure 13: This is a display of all the screens of the application. From left to right: Face shape detection screen, preferences screen, final recommendation screen with tips that the user can swipe through.

The application was meant to have a very simplistic design, so we chose basic complementary colors and a simple logo that gets the point of the application across. To bring our design ideas into Android Studio, we created a .png file of our logo and noted the hex color codes of the colors we wanted to use. Once we had those, we used Android Studio’s easy-to-use user interface creator and added a layer for the toolbar and a layer for the logo.

 

 

4.2 Preference Sorting Algorithm

The preference screen was organized with six different spinners, one for every preference. Each option of every preference was linked to a specific array of hair/beard styles that fit that one preference.

Figure 14: Snippet of the code used to assign each option of every preference an array of hairstyles.

 

These styles were categorized by doing extensive research on which styles fit every option within each preference. These arrays were then compared to find the hairstyles common to every option the user chose.

For example, let’s say the user has a square face shape and straight hair. The hairstyles that look good with a square face shape may be a fade, a combover, and a crew cut; these three hairstyles would be put into an array for the square face shape option. The hairstyles that look good with straight hair may be a combover, a crew cut, and a side part; these three would be put into an array for the straight hair option. The two arrays would then be compared, and whatever hairstyles they have in common would be placed into a new, final array of personalized hairstyles that look good for the user based on both the face shape and hair type preferences. In this case, the final array would consist of a combover and a crew cut, since these are the two hairstyles both preferences had in common. These hairstyles would then be output and recommended to the user, as sketched below.
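The intersection step can be sketched in a few lines of Python; this is illustrative only, since the application implements the same logic in Java with arrays.

# Style lists for two example preferences (from the scenario above)
square_face = ["fade", "combover", "crew cut"]
straight_hair = ["combover", "crew cut", "side part"]

def recommend(*preference_lists):
    # Intersect the style lists of every selected preference; only styles
    # common to all preferences survive into the final recommendation.
    common = set(preference_lists[0])
    for styles in preference_lists[1:]:
        common &= set(styles)
    return sorted(common)

print(recommend(square_face, straight_hair))  # ['combover', 'crew cut']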

Figure 15: Snippet of the code used to compare the six different preference arrays so that one final personalized array of hairstyles can be formed.

 

Once the final list of hairstyles is created, a matching array of images is built, and this array is used to create a gallery of personalized hairstyles that the user can swipe through to see what he likes and what he doesn’t like.

In addition, a list of tips is output for the user to view and take into consideration. These tips are based on the preferences the user selected; for example, if the user selected excessive acne, a tip may be to go for a longer hairstyle to keep the acne slightly hidden. The tips are generated by various if-statements and shown on the final screen (a sketch follows). Since this application cannot control every aspect of a user’s haircut, we hope these tips will be taken into consideration and used when describing to the barber what type of haircut the user is looking for.
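Here is an illustrative Python sketch of that tip generation; the preference keys and tip wording are hypothetical stand-ins for the application's Java if-statements.

def generate_tips(prefs):
    # Each selected preference can contribute a tip for the barber
    tips = []
    if prefs.get("acne") == "excessive":
        tips.append("Consider a longer style to keep acne slightly hidden.")
    if prefs.get("hair_thickness") == "thin":
        tips.append("Ask for layers to add the appearance of volume.")
    if prefs.get("lifestyle") == "athlete":
        tips.append("Shorter sides keep maintenance low between workouts.")
    return tips

for tip in generate_tips({"acne": "excessive", "lifestyle": "athlete"}):
    print(tip)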

Figure 16: An example of how the outputted tips would look for the user once he selects his preferences.

 

5. Programs and Hardware

5.1 Intel Optimized TensorFlow

TensorFlow was a key framework that made it possible for us to train our model and have our application actually detect a user’s face shape.

TensorFlow was installed on the Linux operating system Ubuntu* by following this tutorial:
https://www.tensorflow.org/install/install_linux

Intel’s TensorFlow optimizations were installed by following this tutorial:
https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Intel has optimized the TensorFlow framework in various ways to improve the results of training a neural network and of using TensorFlow in general. Many of the modifications help people use CPUs for this process through operations optimized with the Intel® Math Kernel Library (Intel® MKL). Intel has also developed a custom pool allocator and a faster way to perform backpropagation, further improving results.

After all this had been installed, Python was used to write the commands that drive the transfer learning process and re-train the convolutional neural network.

 

5.2 Android Studio

Android Studio is the main development kit used to create the application and make it come to life. Since both TensorFlow and Android are Google projects, various detailed tutorials explain how to take the trained data from TensorFlow and integrate it with Android Studio. [14] This made the process very simple as long as the instructions were followed.

Figure 17: Snippet of code that shows how the viewPager is used for sliding through various images

 

Android Studio also made it simple to create the application's basic .xml layout files. These .xml files are very customizable and allowed the original mock-ups of the application to come to life and take form. When creating these .xml files we made sure to select the option to “infer constraints”; without it, displays such as the text-view box or the spinners would appear in random positions when the application is fully built. Tutorials on how to connect two activities together[15] and how to create a view-pager image gallery[16] were used to help make the application smooth and easy to use.

Figure 18: An example of inferring constraints to make sure everything appears properly during the full build.

 

5.3 Mobile Device

A countless number of tests were required to make sure certain parts of the code were working whenever a new feature was added to the application. These tests were done on an actual Android smartphone provided to us by Intel.

The camera2 API used by this application requires an Android phone with an API level of 21 or higher (Android 5.0 or later), so we used a phone model with an API level of 23. Though the camera was slow at times, the overall functionality of this device was great.

Whenever a slight modification was done to the code for this application, a full build and test was always done on this smartphone to ensure that the application was still running smoothly.

Figure 19: The Android phone we used with an API level of 23. You can see the Face It application logo in the center of the screen.

 

 

6. Summary and Future Work

Using various procedures, programs, tools, and languages, we were able to build an application that uses computer vision to acquire data about a user’s facial structure and machine learning, specifically transfer learning, to detect the user’s face shape. We then put this information, along with user-entered information, through a preference sorting algorithm to output a personalized gallery of hairstyles for the user to view and choose from, as well as personalized tips the user can tell his barber when getting a haircut or take into consideration when styling or growing out his hair.

There is always room for improvement, and we plan to improve many aspects of this application, including even more accurate face shape detection results, an even cleaner user interface, and many more hair and beard styles for the user to choose from.

 

Acknowledgements

I would like to personally thank the Intel Student Ambassador Program for AI for supporting us through the creation of this application and for the motivation to keep on adding to it. I would also like to thank Intel for providing us with the proper hardware and software that was necessary for us to create and test the application.

 

Online References

[1]   https://en.wikipedia.org/wiki/Haar-like_features

[2]   https://www.cs.ubc.ca/~lowe/425/slides/13-ViolaJones.pdf

[3]   https://en.wikipedia.org/wiki/Convolutional_neural_network

[4]   https://www.mathworks.com/help/nnet/convolutional-neural-networks.html

[5]   https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

[6]   http://www.image-net.org/

[7]   http://www.robots.ox.ac.uk/~vgg/research/very_deep/

[8]   https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html

[9]   https://www.slideshare.net/hunkim/transfer-defectlearningnew-completed

[10]  https://medium.com/@vinayakvarrier/significance-of-transfer-learning-in-the-domain-space-of-artificial-intelligence-1ebd7a1298f2

[11]  http://yann.lecun.com/exdb/mnist/

[12]  https://www.cs.toronto.edu/~kriz/cifar.html

[13]  http://shaverguru.com/finding-a-great-beard-style-for-your-face/

[14]  https://www.tensorflow.org/deploy/distributed

[15]  https://developer.android.com/training/basics/firstapp/starting-activity.html

[16]  https://developer.android.com/training/animation/screen-slide.html

 

Read the previous Week 6 Update or continue with the Week 10 Update

 

Usage of Intel® Performance Libraries with Intel® C++ Compiler in Visual Studio


 

Affected products: Intel® Parallel Studio XE 2016, 2017 or 2018; Intel® System Studio 2016, 2017
Visual Studio versions: all supported versions of Visual Studio; see the Intel C++ Compiler Release Notes for details.

Compiling an application that uses the Intel® Performance Libraries with the Intel® C++ Compiler fails in Microsoft* Visual Studio and produces warnings like:
“Could not expand [MKL|TBB|DAAL|IPP] ProductDir variable. The registry information may be incorrect.”

There can be two root causes:

  1. The library was not installed with the selected version of Intel® C++ Compiler.
    The “Use Intel® MKL”, “Use Intel® DAAL”, “Use Intel® IPP”, and “Use Intel® TBB” properties in Visual Studio mimic the behavior of the /Qmkl, /Qdaal, /Qipp, and /Qtbb compiler options: they set up the include and library paths to the performance library installed together with the selected Intel® C++ Compiler.
    To fix the compilation, install the necessary performance libraries (Intel MKL, Intel DAAL, Intel IPP, and/or Intel TBB) from the same package from which the selected version of Intel® C++ Compiler was installed.
    If you need to use different versions of the Intel® Performance Libraries with the Intel® C++ Compiler in Visual Studio, then instead of using the “Use Intel® MKL”, “Use Intel® DAAL”, “Use Intel® IPP”, and “Use Intel® TBB” properties, manually specify the paths to the performance library headers in
    “Project” > “Properties” > “VC++ Directories” and the libraries in
    “Project” > “Properties” > “Linker” > “Input” > “Additional Dependencies”.
    For the correct paths and lists of libraries, see the Intel® Math Kernel Library, Intel® DAAL, Intel® Integrated Performance Primitives, and Intel® Threading Building Blocks documentation.
  2. Installation failed and the registry is incorrect.
    Workaround: Repair the Intel® Parallel Studio XE/Intel® System Studio installation. If it still does not work, report the issue to the Intel Online Service Center.

Introducing Batch GEMM Operations


The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like the Intel® Math Kernel Library (Intel® MKL) typically offer parallel, high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices, because all cores are used efficiently. When multiplying small matrices, however, individual GEMM calls may not use all the cores optimally. Developers wanting to improve utilization usually batch multiple independent small GEMM operations into a group and then spawn multiple threads for the different GEMM instances within the group. While this is a classic example of an embarrassingly parallel approach, making it run optimally requires significant programming effort involving thread creation/termination, synchronization, and load balancing. That is, until now.

Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM". This allows users to achieve the same objective described above with minimal programming effort. Users can specify multiple independent GEMM operations, which can be of different matrix sizes and different parameters, through a single call to the "Batch GEMM" API. At runtime, Intel MKL will intelligently execute all of the matrix multiplications so as to optimize overall performance. Here is an example that shows how "Batch GEMM" works:

Example

Let A0, A1 be two real double precision 4x4 matrices; Let B0, B1 be two real double precision 8x4 matrices. We'd like to perform these operations:

C0 = 1.0 * A0 * B0^T, and C1 = 1.0 * A1 * B1^T

where C0 and C1 are two real double precision 4x8 result matrices. 

Again, let X0, X1 be two real double precision 3x6 matrices; Let Y0, Y1 be another two real double precision 3x6 matrices. We'd like to perform these operations:

Z0 = 1.0 * X0 * Y0^T + 2.0 * Z0, and Z1 = 1.0 * X1 * Y1^T + 2.0 * Z1

where Z0 and Z1 are two real double precision 3x3 result matrices.
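For reference, the same four products can be written out with NumPy to make the two groups explicit. This is purely illustrative (random placeholder matrices for checking shapes and semantics), not part of the Intel MKL API:

import numpy as np

A0, A1 = (np.random.rand(4, 4) for _ in range(2))
B0, B1 = (np.random.rand(8, 4) for _ in range(2))
X0, X1, Y0, Y1 = (np.random.rand(3, 6) for _ in range(4))
Z0, Z1 = (np.random.rand(3, 3) for _ in range(2))

C0 = 1.0 * A0 @ B0.T             # group 1: alpha = 1.0, beta = 0.0
C1 = 1.0 * A1 @ B1.T
Z0 = 1.0 * X0 @ Y0.T + 2.0 * Z0  # group 2: alpha = 1.0, beta = 2.0
Z1 = 1.0 * X1 @ Y1.T + 2.0 * Z1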

We could accomplish these multiplications using four individual calls to the standard DGEMM API. Instead, here we use a single "Batch GEMM" call for the same computation, with potentially improved overall performance. We illustrate this using the "cblas_dgemm_batch" function in the example below.

#define    GRP_COUNT    2

MKL_INT    m[GRP_COUNT] = {4, 3};
MKL_INT    k[GRP_COUNT] = {4, 6};
MKL_INT    n[GRP_COUNT] = {8, 3};

MKL_INT    lda[GRP_COUNT] = {4, 6};
MKL_INT    ldb[GRP_COUNT] = {4, 6};
MKL_INT    ldc[GRP_COUNT] = {8, 3};

CBLAS_TRANSPOSE    transA[GRP_COUNT] = {CblasNoTrans, CblasNoTrans};
CBLAS_TRANSPOSE    transB[GRP_COUNT] = {CblasTrans, CblasTrans};

double    alpha[GRP_COUNT] = {1.0, 1.0};
double    beta[GRP_COUNT] = {0.0, 2.0};

MKL_INT    size_per_grp[GRP_COUNT] = {2, 2};

// Total number of multiplications: 4
double    *a_array[4], *b_array[4], *c_array[4];
a_array[0] = A0, b_array[0] = B0, c_array[0] = C0;
a_array[1] = A1, b_array[1] = B1, c_array[1] = C1;
a_array[2] = X0, b_array[2] = Y0, c_array[2] = Z0;
a_array[3] = X1, b_array[3] = Y1, c_array[3] = Z1;

// Call cblas_dgemm_batch
cblas_dgemm_batch (
        CblasRowMajor,
        transA,
        transB,
        m,
        n,
        k,
        alpha,
        a_array,
        lda,
        b_array,
        ldb,
        beta,
        c_array,
        ldc,
        GRP_COUNT,
        size_per_grp);



The "Batch GEMM" interface resembles the GEMM interface. It is simply a matter of passing arguments as arrays of pointers to matrices and parameters, instead of as matrices and the parameters themselves. We see that it is possible to batch the multiplications of different shapes and parameters by packaging them into groups. Each group consists of multiplications of the same matrices shape (same m, n, and k) and the same parameters. 

Performance

While this example does not show the performance advantages of "Batch GEMM", they become apparent when you have thousands of independent small matrix multiplications. The chart below shows the performance of 11K small matrix multiplications of various sizes using "Batch GEMM" and the standard GEMM, respectively. The benchmark was run on a 28-core Intel® Xeon® processor (Haswell). The performance metric is Gflops; higher bars mean higher performance or a faster solution.

The second chart shows the same benchmark running on a 61-core Intel® Xeon Phi™ coprocessor (KNC). Because "Batch GEMM" exploits parallelism using many concurrent threads, its advantages are more evident on architectures with a larger core count.

Summary

This article introduces the new API for batch computation of matrix-matrix multiplications. It is an ideal solution when many small independent matrix multiplications need to be performed. "Batch GEMM" supports all precision types (S/D/C/Z). It has Fortran 77 and Fortran 95 APIs, and also CBLAS bindings. It is available in Intel MKL 11.3 Beta and later releases. Refer to the reference manual for additional documentation.  


Gentle Introduction to PyDAAL: Vol 2 of 3 Basic Operations on Numeric Tables


Previous: Vol 1: Data Structures

A wide range of classes is available in the Intel® Data Analytics Acceleration Library (Intel® DAAL) to create numeric tables accommodating various data layouts, dtypes, and frequent access methods. Volume 1 of this series covers numeric table creation under different scenarios. Once a table is created, Intel® DAAL provides operational methods for visualizing and mutating it; Volume 2 covers the usage of these methods. Volume 3 then gives a brief introduction to the Algorithm section of PyDAAL. Table 1 can be used as a quick reference for basic operations on Intel® DAAL numeric tables.

Volumes in Gentle Introduction Series

  • Vol 1: Data Structures - Covers introduction to Data Management component of Intel® DAAL and available custom Data Structures(Numeric Table and Data Dictionary) with code examples.
  • Vol 2: Basic Operations on Numeric Tables - Covers introduction to possible operations that can be performed on Intel® DAAL's custom Data Structure (Numeric Table and Data Dictionary) with code examples.
  • Vol 3: Analytics Model Building and Deployment - Covers introduction to analytics model building and evaluation in Intel® DAAL with serialized deployment and distributed model fitting on large datasets.

Table 1. Quick reference table on available methods

  • *Print a numeric table as stored in memory to represent the data layout (requires ‘nT’ as input argument):
    printNumericTable(nT)
  • *Quick visualization of multiple numeric tables:
    printNumericTables(nT1, nT2)
  • Check the shape of a numeric table:
    nT.getNumberOfRows()      # number of rows
    nT.getNumberOfColumns()   # number of columns
  • Allocate a buffer to load a block of the numeric table for access and manipulation operations:
    block = BlockDescriptor_Float64()   # allocates a memory block with double dtype
  • Retrieve a block of rows or columns from the numeric table into the BlockDescriptor for visualization (setting rwflag to ‘readOnly’ enables read-only access to the buffer):
    nT.getBlockOfColumnValues(colIndex, firstRowIndex, lastRowIndex, rwflag, block)
    nT.getBlockOfRows(firstRowIndex, lastRowIndex, rwflag, block)
  • Extract a numpy array from the BlockDescriptor object when loaded with a block of values:
    block.getArray()
  • Release a block of rows from the buffer:
    nT.releaseBlockOfRows(block)
  • *Print the underlying array of a numeric table (requires ‘np.array’ as input argument):
    printArray(block.getArray(), num_printed_cols, num_printed_rows, num_cols, message)
  • Check feature types for each column of the numeric table’s data dictionary:
    dict[colIndex].featureType

* denotes functions included in the ‘utils’ folder, which can be found in <install_root>/share/pydaal_examples/examples/python/source/.

Different phases of Numeric Table life cycle

1. Initiate

Let’s begin by constructing a numeric table (nT) directly from a Numpy array. We will use this nT throughout the rest of the code examples in this volume.

import numpy as np
from daal.data_management import HomogenNumericTable
array =np.array([[1,2,3,4],
                [5,6,7,8]])
nT= HomogenNumericTable(array)

2. Operate

Once initialized, numeric tables provide various classes and member functions to access and manipulate data, similar to a pandas DataFrame. We will dive into Intel DAAL’s operational methods next, after an important note about Intel DAAL’s bookkeeping object, the Data Dictionary.

Data Dictionary:

As mentioned in Vol 1 of this series on the creation of Intel DAAL’s numeric tables, these custom data structures must be accompanied by a Data Dictionary to perform operations. When raw data streams into memory to populate the numeric table structure, the table’s Data Dictionary concurrently records metadata. The dictionary is created automatically unless the user specifies that it should not be allocated. Various Data Dictionary methods are available to access and manipulate feature types, dtypes, etc. If a user creates a numeric table without memory allocation, the Data Dictionary values have to be explicitly set with feature types. An important note: Intel DAAL’s Data Dictionary is a custom data structure, not a Python dictionary.

More details on working with Intel DAAL Data Dictionaries

2.1 Data Mutation in Numeric Table:

2.1.1 Standardization and Normalization:

Data analysis work is usually preceded by a data preprocessing stage that includes data wrangling and quality checks to handle null values, outliers, etc. An important preprocessing activity is to normalize input data. Intel DAAL offers routines supporting two popular normalization techniques on numeric tables: Z-score standardization and Min-Max normalization.

Currently, Intel DAAL supports rescaling only for descriptive analytics. In the future, support will be added for predictive analytics with a “transform()” method that can be applied to new data.

  • Z-score Standardization: Rescales numeric table values feature-wise to the number of standard deviations each value deviates from the mean. Below are the steps to use Intel DAAL’s z-score standardization.

    import daal.algorithms.normalization.zscore as zscore
    
    # Create an algorithm
    algorithm = zscore.Batch(method=zscore.sumDense)
    
    # Set input object for the algorithm to nT
    algorithm.input.set(zscore.data, nT)
    
    # Compute Z-score normalization function
    res = algorithm.compute()
    
    #Retrieve normalized nT
    Norm_nT= res.get(zscore.normalizedData)
    
  • Min-Max Normalization: Rescales numeric table values feature-wise and linearly to fit the [0, 1] or [-1, 1] range. Below are the steps to use Intel DAAL’s Min-Max normalization.

    import daal.algorithms.normalization.minmax as minmax
    
    # Create an algorithm
    algorithm = minmax.Batch(method=minmax.defaultDense)
    
    # Set lower and upper bounds for the algorithm
    algorithm.parameter.lowerBound = -1.0
    algorithm.parameter.upperBound = 1.0
    
    # Set input object for the algorithm to nT
    algorithm.input.set(minmax.data, nT)
    
    # Compute Min-max normalization function
    res = algorithm.compute()
    
    # Get normalized numeric table
    Norm_nT = res.get(minmax.normalizedData)

2.1.2 Block Descriptor for Visualization and Mutation:

The contents of a numeric table cannot be accessed directly for visualization or manipulation. Instead, a user must first move a requested block of data values into a memory buffer; once instantiated, this buffer is housed in an object called a BlockDescriptor. An Intel DAAL numeric table object has member functions to retrieve blocks of rows/columns and add them to the BlockDescriptor. The rwflag argument sets “readOnly”/“readWrite” mode, depending on whether the user intends to update values in the numeric table when releasing the block. Conveniently, the BlockDescriptor class allows block retrieval of data in specific rows and/or columns. Note: the dtype of data in the BlockDescriptor buffer is not required to match that of the numeric table that sourced the block.

Access Modes:
  • “readOnly” argument sets rwflag to provide read only access to numeric table contents, thus performing no updates to the table when the block is released from buffer memory.

    Syntax and Usage:

    from daal.data_management import BlockDescriptor_Float64, readOnly
    #Allocate a readOnly memory block with double dtype
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0,1, readOnly, block)
  • “readWrite” argument sets rwflag to write back any changes from block descriptor object to the numeric table when the block is released from buffer memory, thus enabling numeric table mutation with the help of block descriptor.

    Syntax and Usage:

    from daal.data_management import BlockDescriptor_Float64, readWrite
    #Allocate a readWrite memory block with double dtype
    block = BlockDescriptor_Float64()
    nT.getBlockOfRows(0,1, readWrite, block)

2.1.3 BlockDescriptor() in “readWrite” mode:

When the rwflag argument is set to “readWrite” in getBlockOfRows()/getBlockOfColumnValues(), the contents of the BlockDescriptor object are written back to the numeric table when the block of rows is released, making edits possible on existing rows/columns in the numeric table.

Let’s create a basic numeric table to explain the BlockDescriptor in “readWrite” mode in detail.

import numpy as np
from daal.data_management import HomogenNumericTable, readWrite, BlockDescriptor
from utils import printNumericTable
array =np.array([[1,2,3,4],
                [5,6,7,8]])
nT= HomogenNumericTable(array)
  • Edit numeric table Row-wise:
    printNumericTable(nT,"Original nT: ")
    #Create buffer object with ntype "double"
    doubleBlock = BlockDescriptor(ntype=np.float64)
    
    firstRow = 0
    lastRow = nT.getNumberOfRows()
    
    #getBlockOfRows() member function in "readWrite" mode to retrieve numeric table contents and populate "doubleBlock" object
    nT.getBlockOfRows(firstRow,lastRow, readWrite, doubleBlock)
    #Access array contents from "doubleBlock" object
    array = doubleBlock.getArray()
    #Mutate 1st row of array to reflect on "doubleBlock" object
    array[0] = [0,0,0,0]
    #Release buffer object and write changes back to numeric table
    nT.releaseBlockOfRows(doubleBlock)
    printNumericTable(nT,"Updated nT: ")

    nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After the row mutation, the first row is replaced using buffer memory. The updated nT has data [[0,0,0,0],[5,6,7,8]].

  • Edit numeric table Column-wise:
    printNumericTable(nT,"Original nT: ")
    #Create buffer object with ntype "double"
    doubleBlock = BlockDescriptor(ntype=np.float64)
    ColIndex = 2
    firstRow = 0
    lastRow = nT.getNumberOfRows()
    
    #getBlockOfColumnValues() member function in "readWrite" mode to retrieve numeric table ColIndex contents and populate "doubleBlock" object
    nT.getBlockOfColumnValues(ColIndex,firstRow,lastRow,readWrite,doubleBlock)
    
    #Access array contents from "doubleBlock" object
    array = doubleBlock.getArray()
    
    #Mutate array to reflect on "doubleBlock" object
    array[:] = 0
    
    #Release buffer object and write changes back to numeric table
    nT.releaseBlockOfColumnValues(doubleBlock)
    printNumericTable(nT, "Updated nT: ")

    nT was originally created with data [[1,2,3,4],[5,6,7,8]]. After the column mutation, the third column is replaced with [0,0] using buffer memory. The updated nT has data [[1,2,0,4],[5,6,0,8]].

2.1.4 Merge numeric table:

Numeric tables can be appended along rows or columns, provided they share the same size along the relevant axis. RowMergedNumericTable() and MergedNumericTable() are the two classes available to merge numeric tables; the latter is used for column-wise merges.

  • Merge Row-wise:

    Syntax:

    mnT = RowMergedNumericTable()
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)

    Code Example:

    from daal.data_management import HomogenNumericTable, RowMergedNumericTable
    import numpy as np
    from utils import printNumericTable
    
    
    #nT1 and nT2 are 2 numeric tables having equal number of COLUMNS
    array =np.array([[1,2,3,4],
                     [5,6,7,8]], dtype = np.intc)
    nT1= HomogenNumericTable(array)
    array =np.array([[9,10,11,12],
                     [13,14,15,16]],dtype = np.intc)
    nT2= HomogenNumericTable(array)
    
    #Create merge numeric table object
    mnT = RowMergedNumericTable()
    
    #add numeric tables to merged numeric table object
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
    printNumericTable(nT1, "Numeric Table nT1: ")
    printNumericTable(nT2, "Numeric Table nT2: ")
    printNumericTable(mnT, "Merged Numeric Table nT1 and nT2: ")

     Output:

    1.000     2.000     3.000     4.000    
    5.000     6.000     7.000     8.000    
    9.000     10.000    11.000    12.000   
    13.000    14.000    15.000    16.000  

  • Merge Column-wise:

    Syntax:

    mnT = MergedNumericTable()
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2); mnT.addNumericTable(nT3)

    Code Example:

    from daal.data_management import HomogenNumericTable, MergedNumericTable
    import numpy as np
    from utils import printNumericTable
    
    #nT1 and nT2 are 2 numeric tables having equal number of ROWS
    array =np.array([[1,2,3,4],
                     [5,6,7,8]], dtype = np.intc)
    nT1= HomogenNumericTable(array)
    
    array =np.array([[9,10,11,12],
                     [13,14,15,16]],dtype = np.intc)
    nT2= HomogenNumericTable(array)
    
    #Create merge numeric table object
    mnT = MergedNumericTable()
    
    #add numeric tables to merged numeric table object
    mnT.addNumericTable(nT1); mnT.addNumericTable(nT2)
    
    printNumericTable(nT1, "Numeric Table nT1: ")
    printNumericTable(nT2, "Numeric Table nT2: ")
    printNumericTable(mnT, "Merged Numeric Table nT1 & nT2: ")


    Output:

    1.000     2.000     3.000     4.000     9.000     10.000    11.000    12.000    

    5.000     6.000     7.000     8.000     13.000    14.000    15.000    16.000

2.1.5 Split Numeric table:

See Table 1 for the getBlockOfRows() and getBlockOfColumnValues() entries, used to extract sections of a numeric table by row or column values. Additionally, the helper function getBlockOfNumericTable() provided below implements the capability to extract a contiguous subset of the table with a selected range of rows and columns. getBlockOfNumericTable() accepts int or list keyword arguments for ranges of rows and columns, using conventional Python 0-based indexing.

Syntax and Usage: getBlockOfNumericTable(nT, Rows = ‘All’, Columns = ‘All’)

Helper Function:
def getBlockOfNumericTable(nT, Rows='All', Columns='All'):
    import warnings
    import numpy as np
    from daal.data_management import HomogenNumericTable, \
        HomogenNumericTable_Float64, MergedNumericTable, readOnly, BlockDescriptor
#------------------------------------------------------
    # Get First and Last Row indexes
    lastRow = nT.getNumberOfRows()
    if type(Rows)!= str:
        if type(Rows) == list:
            firstRow = Rows[0]
            if len(Rows) == 2: lastRow = min(Rows[1], lastRow)
        else:firstRow = 0; lastRow = Rows
    elif Rows== 'All':firstRow = 0
    else:
        warnings.warn('Type error in "Rows" arguments, Can be only int/list type')
        raise SystemExit
#------------------------------------------------------
    # Get First and Last Column indexes
    nEndDim = nT.getNumberOfColumns()
    if type(Columns)!= str:
        if type(Columns) == list:
            nStartDim = Columns[0]
            if len(Columns) == 2: nEndDim = min(Columns[1], nEndDim)
        else: nStartDim = 0; nEndDim = Columns
    elif Columns == 'All': nStartDim = 0
    else:
        warnings.warn ('Type error in "Columns" arguments, Can be only int/list type')
        raise SystemExit
#------------------------------------------------------
    #Retrieve block of Columns Values within First & Last Rows
    #Merge all the retrieved block of Columns Values
    #Return merged numeric table
    mnT = MergedNumericTable()
    for idx in range(nStartDim,nEndDim):
        block = BlockDescriptor()
        nT.getBlockOfColumnValues(idx,firstRow,(lastRow-firstRow),readOnly,block)
        mnT.addNumericTable(HomogenNumericTable_Float64(block.getArray()))
        nT.releaseBlockOfColumnValues(block)
    block = BlockDescriptor()
    mnT.getBlockOfRows (0, mnT.getNumberOfRows(), readOnly, block)
    mnT = HomogenNumericTable (block.getArray())
    return mnT 



There are four different ways of passing arguments to this function (a short usage check follows the list):

  1. getBlockOfNumericTable(nT) - Extracts block of numeric table having all rows and columns of nT.
  2. getBlockOfNumericTable(nT, Rows = 4, Columns = 5) - Retrieves the first 4 rows and first 5 columns of nT.
  3. getBlockOfNumericTable(nT, Rows = [2,4], Columns = [1,3]) - Slices the numeric table along the row and column directions using the lower and upper bounds passed in the lists.
  4. getBlockOfNumericTable(nT, Rows = [1,], Columns = [1,]) - Extracts all rows and columns from the lower bound through the last index.
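As a quick illustrative check, the helper can be exercised on the 2x4 table nT built in section 1; printNumericTable comes from the ‘utils’ examples folder, and the slice bounds below follow the list semantics above (upper bounds exclusive).

from utils import printNumericTable

# Rows=[0, 2] selects rows 0-1; Columns=[1, 3] selects columns 1-2
sub = getBlockOfNumericTable(nT, Rows=[0, 2], Columns=[1, 3])
printNumericTable(sub, "Columns 1 and 2 of both rows: ")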

2.1.6 Change feature type:

Numeric table objects have dictionary manipulation methods to get and set feature types in the Data Dictionary for each column. Categorical (0), Ordinal (1), and Continuous (2) are the feature types supported by Intel DAAL's Data Dictionary.

  • Get dictionary object associated with nT :

    Syntax: nT.getDictionary()

    Code Example:

    dict = nT.getDictionary()  # nT is the numeric table created in section 1
    # The 'dict' object holds the data dictionary of numeric table nT. It can be
    # used to update metadata about the data; the most common use case is to
    # modify the default feature type of nT's columns.
    # Print default feature type of 3rd feature (example feature is continuous):
    print(dict[2].featureType) #outputs “2” (denotes Continuous feature type)
    
    # Modify feature type from Continuous to Categorical:
    dict[2].featureType = data_feature_utils.DAAL_CATEGORICAL
    print(dict[2].featureType) #outputs “0” (denotes Categorical feature type)
         
  • Set dictionary object associated with nT:

    This method is used to replace the current Data Dictionary values or to create a new Data Dictionary if needed. Also, for batch updates, an existing Data Dictionary can be overwritten in full using the setDictionary() method.

    When tables are created without allocating memory for the Data Dictionary, the setDictionary() method must be used to construct metadata for the features in the table. Let us again consider the nT created in section 1, which has 4 features.

    Syntax: nT.setDictionary()

    Code Example:

    nFeatures = 4  # nT from section 1 has 4 features
    
    #Create a dictionary object using the numeric table dictionary class with the number of features
    dict = NumericTableDictionary(nFeatures)
    #Allocate a feature type for each feature
    dict[0].featureType = data_feature_utils.DAAL_CONTINUOUS
    dict[1].featureType = data_feature_utils.DAAL_CATEGORICAL
    dict[2].featureType = data_feature_utils.DAAL_CONTINUOUS
    dict[3].featureType = data_feature_utils.DAAL_CATEGORICAL
    
    #set the nT numeric table dictionary with “dict”
    nT.setDictionary(dict)
    
    

2.2 Export Numeric Table to disk:

Numeric tables can be exported and saved to disk as a numpy binary (.npy) file. The following two sections contain helper functions that save the data in binary form and compress it on disk.

2.2.1 Serialization

Intel DAAL provides interfaces to serialize a numeric table object into a data archive that can be converted to a numpy array. The resulting numpy array, which houses the serialized form of the data, can be saved to disk and later reloaded to reconstruct the source numeric table.

To automate this process, the following helper function can be used to serialize and save to disk.

Helper Function:
def Serialize(nT):
#Construct input data archive Object
#Serialize nT contents into data archive Object
#Copy data archive contents to numpy array
#Save numpy array as .npy in the path
   from daal.data_management import InputDataArchive
   import numpy as np

   dataArch = InputDataArchive()

   nT.serialize(dataArch)

   length = dataArch.getSizeOfArchive()
   buffer_array = np.zeros(length, dtype=np.ubyte)
   dataArch.copyArchiveToArray(buffer_array)

   return buffer_array
buffer_array = Serialize(nT) # call helper function
#np.save(<path>, buffer_array)  # this step is optional

2.2.2 Compression

Compressor methods are also available in Intel DAAL to achieve reduced memory footprint when large datasets must be stored to disk. A serialized array representation of an Intel DAAL numeric table can be compressed before saving it to disk, hence achieving optimal storage.

To automate this process, the following helper function can be used to serialize, then compress the resulting serialized array.

Combine the helper functions Serialize(nT) and CompressToDisk(nT, path) to compress and write numeric tables to disk.

Helper Function:
def CompressToDisk(nT, path):
    # Serialize nT contents
    # Create a compressor object
    # Create a stream for compression
    # Write the serialized data to the compression stream
    # Allocate memory to store the compressed data
    # Store compressed data
    # Save compressed data to disk
    from daal.data_management import Compressor_Zlib, level9, CompressionStream
    import numpy as np

    buffer = Serialize(nT)
    compressor = Compressor_Zlib()
    compressor.parameter.gzHeader = True
    compressor.parameter.level = level9
    comprStream = CompressionStream(compressor)
    comprStream.push_back(buffer)
    compressedData = np.empty(comprStream.getCompressedDataSize(), dtype=np.uint8)
    comprStream.copyCompressedArray(compressedData)
    np.save(path, compressedData)

CompressToDisk(nT, <path>)

2.3 Import Numeric Table from disk:

As mentioned in the previous sections, numeric tables can be stored in the form of either serialized or compressed numpy files. Decompression/ Deserialization methods are available to reconstruct the numeric table.

2.3.1 Deserialization

The helper function below is available to reconstruct a numeric table from serialized array objects.

Helper Function:
def DeSerialize(buffer_array):
    from daal.data_management import OutputDataArchive, HomogenNumericTable
    #Load serialized contents to construct output data archive object
    #De-serialize into nT object and return nT

    dataArch = OutputDataArchive(buffer_array)
    nT = HomogenNumericTable()
    nT.deserialize(dataArch)
    return nT
#buffer_array = np.load(path)  # optional: only needed when the serialized contents were previously saved to disk
nT = DeSerialize(buffer_array)

2.3.2 Decompression

Since the compression stage involves serializing the numeric table object, the decompression stage includes deserialization; the DeSerialize helper function recovers the numeric table. A quick decompression helper function is given below.

Combine the helper functions DeSerialize(buffer_array) and DeCompressFromDisk(path) to decompress and read numeric tables from disk.

Helper Function:
def DeCompressFromDisk(path):
    import numpy as np
    from daal.data_management import Decompressor_Zlib, DecompressionStream
    # Create a decompressor
    decompressor = Decompressor_Zlib()
    decompressor.parameter.gzHeader = True

    # Create a stream for decompression
    deComprStream = DecompressionStream(decompressor)

    # Write the compressed data to the decompression stream and decompress it
    deComprStream.push_back(np.load(path))

    # Allocate memory to store the decompressed data
    deCompressedData = np.empty(deComprStream.getDecompressedDataSize(), dtype=np.uint8)

    # Store the decompressed data
    deComprStream.copyDecompressedArray(deCompressedData)

    #Deserialize
    return DeSerialize(deCompressedData)

nT = DeCompressFromDisk(<path>)  # path must be a ‘.npy’ file

Intel® DAAL also implements several other generic compression and decompression methods, including ZLIB, LZO, RLE, and BZIP2.

Conclusion

Intel® DAAL’s data management component provides classes and methods to perform common operations on numeric table contents, including access, mutation, export to disk, and import from disk. The helper functions covered in this document help automate the creation of numeric table subsets, as well as the serialization and compression processes.

The next volume (Volume 3) in the Gentle Introduction series gives a brief introduction to Algorithm section of PyDAAL. Volume 3 focuses on the workflow of important descriptive and predictive algorithms available in Intel® DAAL. Advanced features such as setting hyperparameters, distributing fit calculations, and deploying models as serialized objects will all be covered.

 


Potential issues with the RDMA translation cache in the Intel® MPI Library


The Intel® MPI Library comes with a cache that helps speed up the translation of memory addresses between the MPI library and the underlying DAPL fabric. As the Intel MPI Library documentation mentions: "The cache substantially increases performance, but may lead to correctness issues in certain situations."

While the performance benefit of the cache on real-world applications is not very large, the RDMA translation cache has been deactivated by default starting with the Intel MPI Library 2018.

Users of older Intel MPI Library versions may still experience issues with the cache, which could result either in numerical correctness problems or in crashes inside the Intel MPI Library. Such crashes typically look like the following:

Assertion failed in file ../../i_rtc_cache.c at line ...

Therefore, users of older Intel MPI Library versions are advised to switch off the RDMA translation cache. This can be accomplished by setting the following Intel MPI Library environment variables:

export I_MPI_OFA_TRANSLATION_CACHE=0
export I_MPI_DAPL_TRANSLATION_CACHE=0
export I_MPI_DAPL_UD_TRANSLATION_CACHE=0

Users of the Intel MPI Library 2018 and newer will already have these settings by default.

Introducing the new Graphics Multiframe Analyzer tool for Metal* Beta


Say "Hello!"

to the newest member of the

Intel® Graphics Performance Analyzers tool suite

 

Perform live analysis using the System Analyzer HUD

Using the new System Analyzer HUD UI, view live metrics during gameplay to determine areas that need attention, and gather useful information for the offline analysis phase. Choose from over 200 metrics to display in the HUD. 

 

Pause on a frame during live analysis

During live analysis, have the freedom to easily pause and investigate any frame. Analyze the resources that go into creating that scene and identify bottlenecks in the rendering pipeline.

 

Capture and Share a Stream

Save a Multiframe Stream to use during offline analysis. Replay this stream as many times as you want - pausing on whatever frame interests you - and easily share it with others. Use the optional "Pause on frame" feature to isolate a specific frame within your saved stream. 

 

Download for free on the Intel® GPA site!

 


Getting started with IoTivity* on Intel Devices


Introduction

As more IoT devices are deployed into the marketplace, there is a growing need for devices that are intelligent and able to connect with each other in a common, standardized way.  IoTivity is a step in this direction: it provides a cross-platform, architecture-independent, open source framework for developers to use with their IoT devices.  This tutorial shows how to set up a host build system, build the framework, and run a client/server example.

Prerequisites

01:  Download and Install Ubuntu* 16 - http://www.ubuntu.com

02:  Learn more about IoTivity - https://www.iotivity.org/documentation

Tutorial

01:  Clone Repositories

git clone https://github.com/iotivity/iotivity.git
cd iotivity   # the extlibs paths below are relative to the iotivity tree
git clone https://github.com/01org/tinycbor.git extlibs/tinycbor/tinycbor -b v0.4.1
git clone https://github.com/ARMmbed/mbedtls.git extlibs/mbedtls/mbedtls -b

02:  Install Host Packages

sudo apt-get install \
build-essential git scons libtool autoconf \
valgrind doxygen wget unzip \
libboost-dev libboost-program-options-dev libboost-thread-dev \
uuid-dev libexpat1-dev libglib2.0-dev libsqlite3-dev libcurl4-gnutls-dev

03:  Build IoTivity for Intel Devices

scons TARGET_PLATFORM=x86

04:  Run the Simple Client/Server Example

            04a:  Open a new Terminal and run the Server        

export LD_LIBRARY_PATH=~/iotivity/out/linux/x86_64/release
cd ~/iotivity/out/linux/x86_64/release/resource/examples
./simpleserver

Deploying the Simple Server

 

          04b:  Open a new Terminal and run the Client

export LD_LIBRARY_PATH=~/iotivity/out/linux/x86_64/release
cd ~/iotivity/out/linux/x86_64/release/resource/examples
./simpleclient

Running the Simple Client that discovers the server resources

 

Summary

This tutorial got you started using the IoTivity framework on Intel devices.  It described how to set up the host build environment, build the source code, and run a client and server example project.  You are now ready to work with the framework examples and APIs for your next IoT project.

About the Author

Mike Rylee is a Software Engineer at Intel Corporation with a background in developing embedded systems and apps for Android*, Windows*, iOS*, and Mac*.  He currently works on Internet of Things projects.        

Notices

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, Intel RealSense, Intel Edison, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation.            

Intel® Accelerates Hardware and Software Performance for Server-Side Java* Applications


Intel® contributes significantly to both software and hardware optimizations for Java*. These optimizations can deliver performance advantages for Java applications that run using the optimized Java Virtual Machine (JVM), and which are powered by Intel® Xeon® processors and Intel® Xeon Phi™ processors. Developers do not need to recompile to get the benefit of these optimizations. The capabilities are already integrated into the latest Intel® processors and the latest Java Development Kit (JDK) release.

This paper describes:

  • Key architectural advancements that benefit Java applications running on the latest Intel Xeon processors and Intel® Xeon Phi™ processors.
  • Some of Intel’s software contributions in team efforts with Oracle* and the OpenJDK* community. The collaboration includes enabling platform capabilities, improving performance, optimizing the microarchitecture for Java applications, improving libraries, and tuning Java virtual machines (JVMs) for specific JVM frameworks and applications, such as Apache Hadoop* and Apache Spark*.
  • How both hardware and software optimizations can benefit your Java applications.
  • Techniques and strategies you can use to take advantage of these features.

 Download complete paper (PDF)

Vector API Developer Program for Java* Software


Introduction

Big data applications, distributed deep learning, and artificial intelligence solutions today can run directly on top of existing Apache Spark* or Apache Hadoop* clusters and can benefit from efficient scale-out. To provide the data parallelism these applications need, OpenJDK Project Panama offers the Vector API. The Vector API Developer Program for Java* software provides a broad set of methods to enrich the machine learning and deep learning experience for Java developers.

This article introduces the Vector API to Java developers. It shows how to start using the API in Java programs, provides examples of vector algorithms, and gives step-by-step details on how to build the Vector API and build Java applications that use it. Finally, it points to a detailed tutorial on implementing vector code for your own algorithms in Java for faster performance.

 

What is SIMD?

Single Instruction, Multiple Data (SIMD) allows the same operation to be performed on multiple data points simultaneously, exploiting the data-level parallelism in an application. Modern CPUs provide advanced SIMD support, such as the Intel® AVX2 and Intel® AVX-512 instruction sets, which accelerate such data-parallel operations.

Big data applications (e.g., Apache Flink*, the Apache Spark* machine learning libraries, and Intel BigDL), data analytics, and deep learning training workloads run highly data-parallel algorithms. Having robust SIMD support in Java opens up ways to expand in these areas.

 

What is Vector API?

The Vector API Developer Program for Java* software makes it possible to write compute-intensive applications, machine learning, and artificial intelligence algorithms in Java without Java Native Interface (JNI) performance overhead or the maintenance burden of non-portable native code. It introduces a set of methods for data-parallel operations on sized vector types for programming directly in Java, without any required knowledge of the underlying CPU. These low-level APIs are then efficiently mapped to SIMD instructions on modern CPUs by the JVM JIT compiler for the desired performance acceleration; otherwise, the default VM implementation is used to map Java byte codes to hardware instructions.

Vector Interface

The Vector API interface is outlined below.

Vector Type

The vector type (Vector<E,S>) takes an element type 'E' and a shape 'S', the bitwise length of the vector. As of recent development, Project Panama supports creating vectors with the following element types and shapes.

 Element types:  Byte, Short, Integer, Long, Float, and Double
 Shape types (bit-size): 128, 256, and 512

Vector shapes are chosen to map closely to the largest SIMD registers available on the CPU platform.
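As a quick concrete illustration (a minimal sketch; the class name SpeciesLength is ours, and it assumes the jdk.incubator.vector package from the Panama build described later in this article), a 256-bit species of 32-bit floats exposes 256 / 32 = 8 lanes:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class SpeciesLength {
    public static void main(String[] args) {
        // Create a species for 256-bit vectors of floats.
        FloatVector.FloatSpecies<Shapes.S256Bit> species =
                (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(
                        Float.class, Shapes.S_256_BIT);
        // A 256-bit shape holds 256 / 32 = 8 float lanes; length() reports this.
        System.out.println("float lanes per 256-bit vector: " + species.length());
    }
}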

Vector Operations

Basic vector-vector functionality is available for all of these vector types. Typical arithmetic and trigonometric vector operations are also available in masked form; a mask is used for conditional (if-else style) operations.

Example section shows how to use Vector mask in the program. 

	public abstract class DoubleVector<S extends Vector.Shape<Vector<?,?>>> implements Vector<Double,S> {
	    Vector<Double, S> add (Vector<Double, S> v2);
	    Vector<Double, S> add (Vector<Double, S> o, Mask<Double, S> m);
	    Vector<Double, S> mul (Vector<Double, S> v2);
	    Vector<Double, S> mul (Vector<Double, S> o, Mask<Double, S> m); ...
	    Vector<Double, S> sin ();
	    Vector<Double, S> sin (Mask<Double, S> m);
	    Vector<Double, S> sqrt (); ...
	}

Vector API also provides more advanced vector operations, often needed in Financial Services Industry (FSI) and Machine Learning applications.

	public abstract class IntVector<S extends Vector.Shape<Vector<?,?>>> implements Vector<Integer,S> {
	    int sumAll ();
	    void intoArray (int[] a, int ix);
	    void intoArray (int[] is, int ix, Mask<Integer, S> m);
	    Vector<Integer, S> fromArray (int[] fs, int ix);
	    Vector<Integer, S> blend (Vector<Integer, S> o, Mask<Integer, S> m);
	    Vector<Integer, S> shuffle (Vector<Integer, S> o, Shuffle<Integer, S> s);
	    Vector<Integer, S> fromByte (byte f); ...
	}

Performance Speed Up in Machine Learning 

Basic Linear Algebra Subprograms (BLAS)

Using a vector implementation, BLAS level-I, II, and III routines can achieve a 3-4x performance speed-up.

BLAS level-I and II routines are commonly used in the Apache Spark machine learning libraries. They are applicable to classification and regression with linear models and decision trees, collaborative filtering, clustering, and dimensionality reduction problems. BLAS-III routines such as GEMM are widely used in solving the deep learning and neural network problems behind artificial intelligence.

*OpenJDK Project Panama source build 09182017. Java HotSpot 64-bit Server VM (mixed mode). OS version: CentOS 7.3 64-bit.

Intel® Xeon® Platinum 8180 processor (using 512-byte and 1024-byte chunks of floating-point data).

JVM options: -XX:+UnlockDiagnosticVMOptions -XX:-CheckIntrinsics -XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics

Image Processing Filtering

Using the Vector API, sepia filtering can be done up to 6x faster.
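For illustration, below is a minimal sketch of such a sepia kernel over planar float channels, written with the species/fromArray/broadcast/intoArray operations shown in this article. The class name, method signature, and planar r/g/b layout are our assumptions, and the coefficients are the commonly used sepia weights; this is not the code behind the measured speed-up.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class SepiaSketch {
    // Applies the standard sepia weights to planar r/g/b channels.
    static void sepia(float[] r, float[] g, float[] b,
                      float[] outR, float[] outG, float[] outB) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec =
                (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(
                        Float.class, Shapes.S_256_BIT);
        int i = 0;
        for (; i + spec.length() < r.length; i += spec.length()) {
            FloatVector<Shapes.S256Bit> rv = spec.fromArray(r, i);
            FloatVector<Shapes.S256Bit> gv = spec.fromArray(g, i);
            FloatVector<Shapes.S256Bit> bv = spec.fromArray(b, i);
            // Each output channel is a weighted sum of the three input channels.
            rv.mul(spec.broadcast(0.393f)).add(gv.mul(spec.broadcast(0.769f)))
              .add(bv.mul(spec.broadcast(0.189f))).intoArray(outR, i);
            rv.mul(spec.broadcast(0.349f)).add(gv.mul(spec.broadcast(0.686f)))
              .add(bv.mul(spec.broadcast(0.168f))).intoArray(outG, i);
            rv.mul(spec.broadcast(0.272f)).add(gv.mul(spec.broadcast(0.534f)))
              .add(bv.mul(spec.broadcast(0.131f))).intoArray(outB, i);
        }
        for (; i < r.length; i++) { // scalar tail
            outR[i] = 0.393f * r[i] + 0.769f * g[i] + 0.189f * b[i];
            outG[i] = 0.349f * r[i] + 0.686f * g[i] + 0.168f * b[i];
            outB[i] = 0.272f * r[i] + 0.534f * g[i] + 0.131f * b[i];
        }
    }
}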

 

Writing Vector Code

Using Vector API in Java*

The vector interface is bundled as part of the jdk.incubator.vector package, so we begin by importing the following in our program. Depending on the vector type, users can choose to import FloatVector, IntVector, and so on.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.Shapes;

Vector type (Vector<E, S>) takes two parameters:

‘E’: the element type, broadly supporting the int, float, and double primitive types.

‘S’: the shape, or bitwise size, of the vector.

Before using vector operations, the programmer must create an initial species instance that captures both the element type and the vector shape. Vectors of that particular size and shape can then be created from it.

	private static final FloatVector.FloatSpecies<Shapes.S256Bit> species = (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance (Float.class, Shapes.S_256_BIT);
	IntVector.IntSpecies<Shapes.S512Bit> ispec = (IntVector.IntSpecies<Shapes.S512Bit>) Vector.speciesInstance(Integer.class, Shapes.S_512_BIT);

From here on, users can create vector instances of the FloatVector<Shapes.S256Bit> and IntVector<Shapes.S512Bit> types.

Simple Vector Loops

In this section we provide a flavor of Vector API programming. Detailed tips and tricks on how to write vector algorithms are provided in the white paper Vector API: writing own-vector algorithms in Java* for performance. Sample vector code examples for BLAS and FSI routines can be found in the subsequent sections.

The first example shows vector addition of two arrays. The program uses vector operations such as fromArray() and intoArray() to load and store the vectors from and into arrays, and the vector add() operation for the arithmetic.

	public static void AddArrays (float[] left, float[] right, float[] res, int i) {
	    FloatVector.FloatSpecies<Shapes.S256Bit> species = (FloatVector.FloatSpecies<Shapes.S256Bit>)
	        Vector.speciesInstance (Float.class, Shapes.S_256_BIT);
	    FloatVector<Shapes.S256Bit> l  = species.fromArray (left, i);
	    FloatVector<Shapes.S256Bit> r  = species.fromArray (right, i);
	    FloatVector<Shapes.S256Bit> lr = l.add(r);
	    lr.intoArray (res, i);
	}

Vector loops should be written by querying the vector size with species.length(). Consider the scalar loop below, which adds arrays A and B and stores the result into array C.

                for (int i = 0; i < C.length; i++) {
                    C[i] = A[i] + B[i];
                }

The vectorized loop looks like the one below:

	public static void add (int[] C, int[] A, int[] B) {
	    IntVector.IntSpecies<Shapes.S256Bit> species = (IntVector.IntSpecies<Shapes.S256Bit>)
	        Vector.speciesInstance(Integer.class, Shapes.S_256_BIT);
	    int i;
	    for (i = 0; (i + species.length()) < C.length; i += species.length()) {
	        IntVector<Shapes.S256Bit> av = species.fromArray(A, i);
	        IntVector<Shapes.S256Bit> bv = species.fromArray(B, i);
	        av.add(bv).intoArray(C, i);
	    }
	    for (; i < C.length; i++) { // Cleanup loop
	        C[i] = A[i] + B[i];
	    }
	}

One can also write this program in a length-agnostic manner, independent of the vector size. The following program parameterizes the vector code via the shape without providing it explicitly.

	public class AddClass<S extends Vector.Shape<Vector<?, ?>>> {
	    private final FloatVector.FloatSpecies<S> spec;
	    AddClass (FloatVector.FloatSpecies<S> v) { spec = v; }
	    // vector routine for add
	    void add (float[] A, float[] B, float[] C) {
	        int i = 0;
	        for (; i + spec.length() < C.length; i += spec.length()) {
	            FloatVector<S> av = spec.fromArray(A, i);
	            FloatVector<S> bv = spec.fromArray(B, i);
	            av.add(bv).intoArray(C, i);
	        }
	        // cleanup loop
	        for (; i < C.length; i++) C[i] = A[i] + B[i];
	    }
	}

Operations in a conditional statement can be written in vector form using a mask; we grab a mask for the condition first.

The scalar routine is below:

	for (int i = 0; i < SIZE; i++) {
	    float res = b[i];
	    if (a[i] > 1.0) {
	        res = res * a[i];
	    }
	    c[i] = res;
	}

The vector routine with a mask:

	public void useMask (float[] a, float[] b, float[] c, int SIZE) {
	  FloatVector.FloatSpecies<Shapes.S256Bit> species = (FloatVector.FloatSpecies<Shapes.S256Bit>)
	      Vector.speciesInstance(Float.class, Shapes.S_256_BIT);
	  FloatVector<Shapes.S256Bit> tv = species.broadcast(1.0f);
	  int i = 0;
	  for (; i + species.length() < SIZE; i += species.length()) {
	    FloatVector<Shapes.S256Bit> rv = species.fromArray(b, i);
	    FloatVector<Shapes.S256Bit> av = species.fromArray(a, i);
	    Vector.Mask<Float, Shapes.S256Bit> mask = av.greaterThan(tv);
	    rv.mul(av, mask).intoArray(c, i);
	  }
	  //tail processing
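	  // Added sketch (assumption): scalar cleanup for the remaining elements,
	  // mirroring the scalar routine above (i carries over from the vector loop).
	  for (; i < SIZE; i++) {
	    float res = b[i];
	    if (a[i] > 1.0f) {
	      res = res * a[i];
	    }
	    c[i] = res;
	  }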
	}

Tutorial: writing own-vector algorithms

The white paper Vector API: writing own-vector algorithms in Oracle Java* for faster performance provides several tips and tricks on writing Java code using the Vector API and also goes over some ways to increase performance.

These examples should give you some guidelines and best practices for vector programming in Oracle Java*, to help you write successful vector versions of your own compute-intensive algorithms.

See the attached PDF for more information.

Tutorial: all you need to know about Vector API

 

Getting Started

Building Vector API

This section assumes users are familiar with basic Linux utilities.

Set up a JDK8 binary as JAVA_HOME

Project Panama requires JDK8 on the system. The JDK8 binaries can be downloaded from Oracle's Java SE download site.

# export JAVA_HOME=/pathto/jdk1.8-u91
# export PATH=$JAVA_HOME/bin:$PATH

Download and build Panama Sources

One can download the Project Panama sources using the Mercurial source control management tool.

# hg clone http://hg.openjdk.java.net/panama/panama/
# source get_source.sh
# ./configure
# make all

Building your own application using Panama JDK

To build your own application against the Panama JDK, copy the vector.jar file (found under the parent directory containing the Panama sources) to the location of your Java application. A minimal example application looks like this:

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class HelloVectorApi {
    public static void main(String[] args) {
        IntVector.IntSpecies<Shapes.S128Bit> species =
                (IntVector.IntSpecies<Shapes.S128Bit>) Vector.speciesInstance(
                        Integer.class, Shapes.S_128_BIT);
        int val = 1;
        IntVector<Shapes.S128Bit> hello = species.broadcast(val);
        if (hello.sumAll() == val * species.length()) {
            System.out.println("Hello Vector API!");
        }
    }
}

Run your application

Currently, Project Panama uses the following set of experimental flags:

/pathto/panama/build/linux-x86_64-normal-server-release/images/jdk/bin/java --add-modules=jdk.incubator.vector -XX:TypeProfileLevel=121 HelloVectorApi

IDE Configurations

Configuring IntelliJ for development with OpenJDK Panama

1) Create a new project. If this is a fresh installation of IntelliJ, or you have no projects open, press the "Create New Project" button on the window that comes up (shown below).

Otherwise, File > New > Project... will have the same effect.

2) In the "New Project" window that comes up, make sure to select Java on left side. You will also have to select the Panama build as your Project SDK.

If Panama build has not been set up as a Project SDK, press the "New..." button on right side. Otherwise, go to step 4.

3) The window that comes up will be named "Select Home Directory for JDK". The path you want to select is /path/to/panama/build/linux-x86_64-normal-server-release/images/jdk. Press OK.

4) Press Next. At this point you can select Create project from template. Go ahead and select "Command Line App" and click Next again.

5) Give your project a name and location and click Finish.

6) Once the project is created, a few more steps are needed to use the Vector API successfully. Go to File > Project Structure...

 

7) Make sure that "Project" is selected in the left pane. Change "Project language level:" to "9 - Modules, private methods in interfaces etc.". Finally, press OK.

8) In the left pane that shows the directory structure, right-click the "src" folder. Navigate to New > module-info.java.

 

9) Inside this file, add the following line: "requires jdk.incubator.vector;". Save the file.

 

10) Go back to Main.java. Add your desired code that uses the API. For an example, see HelloVectorApi.java above.

 

11) Before running the application, you will need to edit the run configuration. Press the button with the class name next to the "play" button. You should see "Edit Configurations...". Click on that.

 

12) In the VM options, you will need to add "-XX:TypeProfileLevel=121 -XX:+UseVectorApiIntrinsics". Both of these are likely to become optional in the future. If you want to turn off the optimization that converts Vector API calls into optimized x86 intrinsics (for stability reasons), use

"-XX:-UseVectorApiIntrinsics" instead.

13) Press the "play" button to build and run the application. At bottom of screen in terminal window, you should see "Hello Vector API!" or whatever output you made application print.

 

 

Vector Examples

BLAS Machine Learning

Copyright 2017 Intel Corporation

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

BLAS-I

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.FloatVector; // needed by the float variants below
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.Shapes;

public class BLAS  {



    static void VecDaxpy(double[] a, int a_offset, double[] b, int b_offset, double alpha) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        DoubleVector<Shapes.S512Bit> alphaVec = spec.broadcast(alpha);
        int i = 0;
        for (; (i + a_offset+ spec.length()) < a.length && (i + b_offset + spec.length()) < b.length; i += spec.length()) {
            DoubleVector<Shapes.S512Bit> bv = spec.fromArray(b, i + b_offset);
            DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i + a_offset);
            bv.add(av.mul(alphaVec)).intoArray(b, i + b_offset);
        }

        for (; i+a_offset < a.length && i+b_offset<b.length; i++) b[i + b_offset] += alpha * a[i + a_offset]; //tail
    }

	static void VecDaxpyFloat(float[] a, int a_offset, float[] b, int b_offset, float alpha) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec= (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(Float.class, Shapes.S_256_BIT);

        int i = 0;
        for (; (i + a_offset+spec.length()) < a.length && (i+b_offset+spec.length())<b.length; i += spec.length()) {

            FloatVector<Shapes.S256Bit> bv = spec.fromArray(b, i + b_offset);
            FloatVector<Shapes.S256Bit> av = spec.fromArray(a, i + a_offset);
            FloatVector<Shapes.S256Bit> alphaVec = spec.broadcast(alpha);
            bv.add(av.mul(alphaVec)).intoArray(b, i + b_offset);
        }

        for (; i+a_offset < a.length && i+b_offset<b.length; i++) b[i + b_offset] += alpha * a[i + a_offset];
    }


    static double VecDdot(double[] a, int a_offset, double[] b, int b_offset) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);

        int i = 0; double sum = 0;
        for (; (i + a_offset + spec.length()) < a.length && (i + b_offset+ spec.length()) < b.length; i += spec.length()) {
            DoubleVector<Shapes.S512Bit> l = spec.fromArray(a, i + a_offset);
            DoubleVector<Shapes.S512Bit> r = spec.fromArray(b, i + b_offset);
            sum+=l.mul(r).sumAll();
        }
        for (; (i + a_offset < a.length) && (i + b_offset < b.length); i++) sum += a[i+a_offset] * b[i+b_offset]; //tail
        return sum; // return the dot product (the original listing discarded it)
    }

	static float VecDdotFloat(float[] a, int a_offset, float[] b, int b_offset) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec= (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(Float.class, Shapes.S_256_BIT);

        int i = 0; float sum = 0;
        for (; i+a_offset + spec.length() < a.length && i+b_offset+spec.length()<b.length; i += spec.length()) {
            FloatVector<Shapes.S256Bit> l = spec.fromArray(a, i + a_offset);
            FloatVector<Shapes.S256Bit> r = spec.fromArray(b, i + b_offset);
            sum+=l.mul(r).sumAll();
        }
        for (; i+a_offset < a.length && i+b_offset<b.length; i++) sum += a[i+a_offset] * b[i+b_offset]; //tail
        return sum; // return the dot product (the original listing discarded it)
    }
}
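For reference, here is a minimal hypothetical driver (our own addition, not part of the original listing) that exercises VecDaxpy; it must be compiled alongside the BLAS class above, in the same package, since the routines are package-private. It computes b := b + alpha*a, with the first eight doubles going through the 512-bit vector loop and the last two through the scalar tail:

public class BlasDriver {
    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        double[] b = new double[10]; // starts as zeros
        // b := b + 2.5 * a, starting at offset 0 in both arrays.
        BLAS.VecDaxpy(a, 0, b, 0, 2.5);
        // Prints [2.5, 5.0, 7.5, ..., 25.0]
        System.out.println(java.util.Arrays.toString(b));
    }
}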

BLAS-II (DSPR)

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Vector;
import jdk.incubator.vector.Shapes;

public class BLAS_II {

    public static void VecDspr(String uplo, int n, double alpha, double[] x, int _x_offset, int incx, double[] ap, int _ap_offset) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);

        double temp = 0.0;
        int i = 0;
        int ix = 0;
        int j = 0;
        int jx = 0;
        int k = 0;
        int kk = 0;
        int kx = 0;
        kk = 1;
        if (uplo.equals("U")) {
            // *        Form  A  when upper triangle is stored in AP.
            if (incx == 1) {
                for (j=0; j<n; j++) {
                    if (x[j+_x_offset] != 0.0) {
                        temp = alpha*x[j+_x_offset];
                        DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                        for (i=0, k=kk; i+spec.length()<=j && i + _x_offset + spec.length() < x.length && k + _ap_offset + spec.length() < ap.length; i+= spec.length(), k+=spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(ap, k+_ap_offset);
                            DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x, i+_x_offset);
                            av.add(xv.mul(tv)).intoArray(ap,k+_ap_offset);
                        }
                        for (; i<=j && i + _x_offset < x.length && k + _ap_offset <ap.length; i++, k++) {
                            ap[k+_ap_offset]=ap[k+_ap_offset]+x[i+_x_offset]*temp;
                        }
                    }
                    kk = kk + j;
                }
            }
        } else {
            // *        Form  A  when lower triangle is stored in AP.
            if (incx == 1) {
                for (j=0; j<n; j++) {
                    if (x[j+_x_offset] != 0.0) {
                        temp = alpha*x[j+_x_offset];
                        DoubleVector<Shapes.S512Bit> tv=spec.broadcast(temp);
                        k = kk;
                        for (i=j; i+spec.length()<n && i + _x_offset + spec.length() < x.length && k + _ap_offset + spec.length() < ap.length; i+=spec.length(), k+=spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(ap, k+_ap_offset);
                            DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x, i+_x_offset);
                            av.add(xv.mul(tv)).intoArray(ap,k+_ap_offset);
                        }
                        for (; i<n && i + _x_offset < x.length && k + _ap_offset <ap.length; i++, k++) {
                            ap[k+_ap_offset] = ap[k+_ap_offset]+x[i+_x_offset]*temp;
                        }
                    }
                    kk = kk+n-j;
                }
            }
        }
    }

}

BLAS-II (DSYR)

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class BLAS2DSYR {


    public static void VecDsyr(String uplo, int n, double alpha, double[] x, int _x_offset, int incx, double[] a, int _a_offset, int lda) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double temp = 0.0;
        int i = 0;
        int ix = 0;
        int j = 0;
        int jx = 0;
        int kx = 0;

        if (uplo.equals("U") && incx == 1) {
            for (j=0; j<n; j++) {
                if (x[j+_x_offset] != 0.0) {
                    temp=alpha*x[j+_x_offset];
                    DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                    for (i=0; (i+spec.length())<=j && i+_x_offset+spec.length()<x.length && i+j*lda+_a_offset+spec.length()<a.length; i+= spec.length()) {
                        DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x, i+_x_offset);
                        DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i+j*lda+_a_offset);
                        av.add(xv.mul(tv)).intoArray(a,i+j*lda+_a_offset);
                    }
                    for (; i<=j && i+j*lda+_a_offset<a.length && i+_x_offset<x.length; i++) {
                        a[i+j*lda+_a_offset] = a[i+j*lda+_a_offset]+x[i+_x_offset]*temp;
                    }
                }
            }

        } else if (uplo.equals("L") && incx == 1) {
            for (j = 0; j < n; j++) {
                if (x[j+_x_offset] != 0.0) {
                    temp=alpha*x[j+_x_offset];
                    DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                    for (i=j; (i+spec.length())<n && i+_x_offset+spec.length()<x.length && i+j*lda+_a_offset+spec.length()<a.length; i+=spec.length()) {
                        DoubleVector<Shapes.S512Bit> xv = spec.fromArray(x,i+_x_offset);
                        DoubleVector<Shapes.S512Bit> av = spec.fromArray(a,i+j*lda+_a_offset);
                        av.add(xv.mul(tv)).intoArray(a,i+j*lda+_a_offset);
                    }
                    for (; i<n && i+j*lda+_a_offset<a.length && i+_x_offset<x.length; i++) {
                        a[i+j*lda+_a_offset]=a[i+j*lda+_a_offset]+x[i+_x_offset]*temp;
                    }
                }
            }

        }

    }
}

BLAS-III(DSYR2K)

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class BLAS3DSYR2K {


    public void VecDsyr2k(String uplo, String trans, int n, int k, double alpha, double[] a, int _a_offset, int lda, double[] b, int _b_offset, int ldb, double beta, double[] c, int _c_offset, int Ldc) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double temp1 = 0.0;
        double temp2 = 0.0;
        int i = 0;
        int info = 0;
        int j = 0;
        int l = 0;
        int nrowa = 0;
        boolean upper = false;
        if (trans.equals("N")) {
            nrowa = n;
        } else {
            nrowa = k;
        }              //  Close else.
        DoubleVector<Shapes.S512Bit> zeroVec = spec.broadcast(0.0D);
        DoubleVector<Shapes.S512Bit> betaVec = spec.broadcast(beta);
        upper = uplo.equals("U");
        if (alpha == 0.0) {
            if (upper) {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) < j; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) < j; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }

            //lower
            else {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) { // tail loop; moved inside the j loop
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }
            return; // when alpha == 0, only the beta scaling of C is needed
        }
        //start operations
        if (trans.equals("N")) {
            // *        Form  C := alpha*A*B**T + alpha*B*A**T + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = 0;
                        for (; i + spec.length() < j; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }

                    } else if (beta != 1.0) {
                        i = 0;
                        for (; i + spec.length() < j; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }

                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset]; DoubleVector<Shapes.S512Bit> tv1 = spec.broadcast(temp1);
                            temp2 = alpha * a[j + l * lda + _a_offset]; DoubleVector<Shapes.S512Bit> tv2 = spec.broadcast(temp2);
                            i = 0;
                            for (; (i + spec.length()) < j; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                DoubleVector<Shapes.S512Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                DoubleVector<Shapes.S512Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i < j; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            } else {

                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0;
                        }
                    } else if (beta != 1.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset]; DoubleVector<Shapes.S512Bit> tv1 = spec.broadcast(temp1);
                            temp2 = alpha * a[j + l * lda + _a_offset]; DoubleVector<Shapes.S512Bit> tv2 = spec.broadcast(temp2);
                            i = j;
                            for (; i + spec.length() < n; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                DoubleVector<Shapes.S512Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                DoubleVector<Shapes.S512Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i < n; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            }
        } else {

// *        Form  C := alpha*A**T*B + alpha*B**T*A + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    for (i = 0; i < j; i++) {
                        temp1 = 0.0;
                        temp2 = 0.0;
                        l = 0;
                        for (; l + spec.length() < k; l += spec.length()) {
                            DoubleVector<Shapes.S512Bit> aV1 = spec.fromArray(a, l + i * lda + _a_offset);
                            DoubleVector<Shapes.S512Bit> bV1 = spec.fromArray(b, l + j * ldb + _b_offset);
                            DoubleVector<Shapes.S512Bit> aV2 = spec.fromArray(a, l + j * lda + _a_offset);
                            DoubleVector<Shapes.S512Bit> bV2 = spec.fromArray(b, l + i * ldb + _b_offset);
                            temp1 += aV1.mul(bV1).sumAll();
                            temp2 += aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            } else {
                for (j = 0; j < n; j++) {
                    for (i = j; i < n; i++) {
                        temp1 = 0.0;
                        temp2 = 0.0;
                        l = 0;
                        for (; l+spec.length() < k; l+=spec.length()) {
                            DoubleVector<Shapes.S512Bit> aV1=spec.fromArray(a,l + i * lda + _a_offset);
                            DoubleVector<Shapes.S512Bit> bV1=spec.fromArray(b,l + j * ldb + _b_offset);
                            DoubleVector<Shapes.S512Bit> bV2=spec.fromArray(b,l + i * ldb + _b_offset);
                            DoubleVector<Shapes.S512Bit> aV2=spec.fromArray(a,l + j * lda + _a_offset);
                            temp1+=aV1.mul(bV1).sumAll();
                            temp2+=aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            }
        }

    }

    static public void VecDsyr2kFloat(String uplo, String trans, int n, int k, float alpha, float[] a, int _a_offset, int lda, float[] b, int _b_offset, int ldb, float beta, float[] c, int _c_offset, int Ldc) {
        FloatVector.FloatSpecies<Shapes.S256Bit> spec= (FloatVector.FloatSpecies<Shapes.S256Bit>) Vector.speciesInstance(Float.class, Shapes.S_256_BIT);
        float temp1 = 0.0f;
        float temp2 = 0.0f;
        int i = 0;
        int info = 0;
        int j = 0;
        int l = 0;
        int nrowa = 0;
        boolean upper = false;
        if (trans.equals("N")) {
            nrowa = n;
        } else {
            nrowa = k;
        }              //  Close else.
        //FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
       // FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
        upper = uplo.equals("U");
        if (alpha == 0.0) {
            if (upper) {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) < j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = 0;
                        for (; (i + spec.length()) <= j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }

            //lower
            else {
                if (beta == 0.0) {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }
                    }
                } else {
                    for (j = 0; j < n; j++) {
                        i = j;
                        for (; i + spec.length() < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) { // tail loop; moved inside the j loop
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                }
            }
            return; // when alpha == 0, only the beta scaling of C is needed
        }
        //start operations
        if (trans.equals("N")) {
            // *        Form  C := alpha*A*B**T + alpha*B*A**T + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = 0;
                        for (; i + spec.length() <= j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }

                    } else if (beta != 1.0) {
                        i = 0;
                        for (; i + spec.length() <= j; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i <= j; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }

                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset];
                            temp2 = alpha * a[j + l * lda + _a_offset];
                            i = 0;
                            for (; (i + spec.length()) <= j; i += spec.length()) {
                                FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                FloatVector<Shapes.S256Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                FloatVector<Shapes.S256Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                FloatVector<Shapes.S256Bit> tv1 = spec.broadcast(temp1);
                                FloatVector<Shapes.S256Bit> tv2 = spec.broadcast(temp2);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i <= j; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            } else {

                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> zeroVec = spec.broadcast(0.0f);
                            zeroVec.intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = 0.0f;
                        }
                    } else if (beta != 1.0) {
                        i = j;
                        for (; (i + spec.length()) < n; i += spec.length()) {
                            FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                            FloatVector<Shapes.S256Bit> betaVec = spec.broadcast(beta);
                            cV.mul(betaVec).intoArray(c, i + j * Ldc + _c_offset);
                        }
                        for (; i < n; i++) {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset];
                        }
                    }
                    for (l = 0; l < k; l++) {
                        if ((a[j + l * lda + _a_offset] != 0.0) || (b[j + l * ldb + _b_offset] != 0.0)) {
                            temp1 = alpha * b[j + l * ldb + _b_offset];
                            temp2 = alpha * a[j + l * lda + _a_offset];
                            i = j;
                            for (; i + spec.length() < n; i += spec.length()) {
                                FloatVector<Shapes.S256Bit> cV = spec.fromArray(c, i + j * Ldc + _c_offset);
                                FloatVector<Shapes.S256Bit> aV = spec.fromArray(a, i + l * lda + _a_offset);
                                FloatVector<Shapes.S256Bit> bV = spec.fromArray(b, i + l * ldb + _b_offset);
                                FloatVector<Shapes.S256Bit> tv1 = spec.broadcast(temp1);
                                FloatVector<Shapes.S256Bit> tv2 = spec.broadcast(temp2);
                                cV.add(aV.mul(tv1)).add(bV.mul(tv2)).intoArray(c, i + j * Ldc + _c_offset);
                            }
                            for (; i < n; i++) {
                                c[i + j * Ldc + _c_offset] = c[i + j * Ldc + _c_offset] + a[i + l * lda + _a_offset] * temp1 + b[i + l * ldb + _b_offset] * temp2;
                            }
                        }
                    }
                }
            }
        } else {

// *        Form  C := alpha*A**T*B + alpha*B**T*A + C.
            if (upper) {
                for (j = 0; j < n; j++) {
                    for (i = 0; i < j; i++) {
                        temp1 = 0.0f;
                        temp2 = 0.0f;
                        l = 0;
                        for (; l + spec.length() < k; l += spec.length()) {
                            FloatVector<Shapes.S256Bit> aV1 = spec.fromArray(a, l + i * lda + _a_offset);
                            FloatVector<Shapes.S256Bit> bV1 = spec.fromArray(b, l + j * ldb + _b_offset);
                            FloatVector<Shapes.S256Bit> aV2 = spec.fromArray(a, l + j * lda + _a_offset);
                            FloatVector<Shapes.S256Bit> bV2 = spec.fromArray(b, l + i * ldb + _b_offset);
                            temp1 += aV1.mul(bV1).sumAll();
                            temp2 += aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            } else {
                for (j = 0; j < n; j++) {
                    for (i = j; i < n; i++) {
                        temp1 = 0.0f;
                        temp2 = 0.0f;
                        l = 0;
                        for (; l+spec.length() < k; l+=spec.length()) {
                            FloatVector<Shapes.S256Bit> aV1=spec.fromArray(a,l + i * lda + _a_offset);
                            FloatVector<Shapes.S256Bit> bV1=spec.fromArray(b,l + j * ldb + _b_offset);
                            FloatVector<Shapes.S256Bit> bV2=spec.fromArray(b,l + i * ldb + _b_offset);
                            FloatVector<Shapes.S256Bit> aV2=spec.fromArray(a,l + j * lda + _a_offset);
                            temp1+=aV1.mul(bV1).sumAll();
                            temp2+=aV2.mul(bV2).sumAll();
                        }
                        for (; l < k; l++) {
                            temp1 = temp1 + a[l + i * lda + _a_offset] * b[l + j * ldb + _b_offset];
                            temp2 = temp2 + b[l + i * ldb + _b_offset] * a[l + j * lda + _a_offset];
                        }
                        if (beta == 0.0) {
                            c[i + j * Ldc + _c_offset] = alpha * temp1 + alpha * temp2;
                        } else {
                            c[i + j * Ldc + _c_offset] = beta * c[i + j * Ldc + _c_offset] + alpha * temp1 + alpha * temp2;
                        }
                    }
                }
            }
        }

    }

} // End class.

BLAS-III (DGEMM)

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class BLAS3GEMM {



    void VecDgemm(String transa, String transb, int m, int n, int k, double alpha, double[] a, int a_offset, int lda, double[] b, int b_offset, int ldb, double beta, double[] c, int c_offset, int ldc) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double temp = 0.0;
        int i = 0;
        int info = 0;
        int j = 0;
        int l = 0;
        int ncola = 0;
        int nrowa = 0;
        int nrowb = 0;
        boolean nota = transa.equals("N"); // derived from transa/transb; never assigned in the original listing
        boolean notb = transb.equals("N");
        DoubleVector<Shapes.S512Bit> zeroVec = spec.broadcast(0.0);

        if (m == 0 || n == 0 || ((alpha == 0 || k == 0) && beta == 1.0))
            return;
        //double temp=0.0;
        if (alpha == 0.0) {
            if (beta == 0.0) {
                for (j = 0; j < n; j++) {
                    for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                        zeroVec.intoArray(c, i + j * ldc + c_offset);
                    }
                    for (; i < m && i + j * ldc + c_offset<c.length; i++) {
                        c[i + j * ldc + c_offset] = 0.0;
                    }
                }
            }

            //beta!=0.0
            else {
                for (j = 0; j < n; j++) {
                    DoubleVector<Shapes.S512Bit> bv = spec.broadcast(beta);
                    for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                        DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                        cv.mul(bv).intoArray(c, i + j * ldc + c_offset);
                    }
                    for (; i < m && i + j * ldc + c_offset<c.length; i++) c[i + j * ldc + c_offset] = beta * c[i + j * ldc + c_offset];
                }
            }

            return; // when alpha == 0, only the beta scaling of C is needed
        }

        if (notb) {
            if (nota) {

// *           Form  C := alpha*A*B + beta*C.

                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m; i++) {
                            c[i + j * ldc + c_offset] = 0.0;
                        }
                    } else if (beta != 1.0) {
                        DoubleVector<Shapes.S512Bit> bv = spec.broadcast(beta);
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                            cv.mul(bv).intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m && (i + j * ldc + c_offset)<c.length; i++) c[i + j * ldc + c_offset] = beta * c[i + j * ldc + c_offset];
                    }

                    for (l = 0; l < k; l++) {
                        if (b[l + j * ldb + b_offset] != 0.0) {
                            temp = alpha * b[l + j * ldb + b_offset];
                            DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                            for (i = 0; (i + spec.length()) < m && (i + l * lda + a_offset+spec.length())<a.length && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i + l * lda + a_offset);
                                DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                                cv.add(av.mul(tv)).intoArray(c, i + j * ldc + c_offset); //tv.fma(av, cv).toDoubleArray(c, i+j*ldc+c_offset);
                            }
                            for (; i < m && (i + l * lda + a_offset)<a.length && (i + j * ldc + c_offset)<c.length; i++)
                                c[i + j * ldc + c_offset] = c[i + j * ldc + c_offset] + temp * a[i + l * lda + a_offset];
                        }
                    }
                }
            } else {
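                // *           Form  C := alpha*A**T*B + beta*C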
                for (j = 0; j < n; j++) {
                    for (i = 0; i < m; i++) {
                        temp = 0.0;
                        for (l = 0; (l + spec.length()) < k && (l + i * lda + a_offset+spec.length())<a.length && (l + j * ldb + b_offset+spec.length())<b.length; l += spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, l + i * lda + a_offset);
                            DoubleVector<Shapes.S512Bit> bv = spec.fromArray(b, l + j * ldb + b_offset);
                            temp += av.mul(bv).sumAll();
                        }
                        for (; l < k && l + i * lda + a_offset<a.length && l + j * ldb + b_offset<b.length; l++) temp = temp + a[l + i * lda + a_offset] * b[l + j * ldb + b_offset];

                        if (beta == 0.0) {
                            c[i + j * ldc + c_offset] = alpha * temp;
                        } else {
                            c[i + j * ldc + c_offset] = alpha * temp + beta * c[i + j * ldc + c_offset];
                        }
                    }
                }

            }
        } else {
            if (nota) {
                // *           Form  C := alpha*A*B**T + beta*C
                for (j = 0; j < n; j++) {
                    if (beta == 0.0) {
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            zeroVec.intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m && (i + j * ldc + c_offset)<c.length; i++) {
                            c[i + j * ldc + c_offset] = 0.0;
                        }
                    } else if (beta != 1.0) {
                        DoubleVector<Shapes.S512Bit> bv = spec.broadcast(beta);
                        for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length; i += spec.length()) {
                            DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                            cv.mul(bv).intoArray(c, i + j * ldc + c_offset);
                        }
                        for (; i < m && i + j * ldc + c_offset<c.length; i++) {
                            c[i + j * ldc + c_offset] = beta * c[i + j * ldc + c_offset];
                        }
                    }

                    for (l = 0; l < k; l++) {
                        if (b[j + l * ldb + b_offset] != 0.0) {
                            temp = alpha * b[j + l * ldb + b_offset];
                            DoubleVector<Shapes.S512Bit> tv = spec.broadcast(temp);
                            for (i = 0; (i + spec.length()) < m && (i + j * ldc + c_offset+spec.length())<c.length && (i + l * lda + a_offset+spec.length())<a.length; i += spec.length()) {
                                DoubleVector<Shapes.S512Bit> cv = spec.fromArray(c, i + j * ldc + c_offset);
                                DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, i + l * lda + a_offset);
                                cv.add(tv.mul(av)).intoArray(c, i + j * ldc + c_offset); //tv.fma(av, cv).toDoubleArray(c, i + j * ldc + c_offset);
                            }
                            for (; i < m && (i + j * ldc + c_offset)<c.length && (i + l * lda + a_offset)<a.length; i++)
                                c[i + j * ldc + c_offset] = c[i + j * ldc + c_offset] + temp * a[i + l * lda + a_offset];
                        }
                    }
                }
            } else {
                // *           Form  C := alpha*A**T*B**T + beta*C
                for (j = 0; j < n; j++) {
                    for (i = 0; i < m; i++) {
                        temp = 0.0;
                        for (l = 0; (l + spec.length()) < k && (l + i * lda + a_offset+spec.length())<a.length && (j + l * ldb + b_offset+spec.length())<b.length; l += spec.length()) {
                            DoubleVector<Shapes.S512Bit> av = spec.fromArray(a, l + i * lda + a_offset);
                            DoubleVector<Shapes.S512Bit> bv = spec.fromArray(b, j + l * ldb + b_offset);
                            temp += av.mul(bv).sumAll();
                        }
                        for (; l < k && (l + i * lda + a_offset)<a.length && (j + l * ldb + b_offset)<b.length; l++) {
                            temp = temp + a[l + i * lda + a_offset] * b[j + l * ldb + b_offset];
                        }

                        if (beta == 0.0) {
                            c[i + j * ldc + c_offset] = alpha * temp;
                        } else {
                            c[i + j * ldc + c_offset] = alpha * temp + beta * c[i + j * ldc + c_offset];
                        }

                    }
                }
            }
        }

    }

}

Financial Services Industry (FSI) algorithms

GetOptionPrice 

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class FSI_getOptionPrice  {


    public static double getOptionPrice(double Sval, double Xval, double T, double[] z, int numberOfPaths, double riskFree, double volatility)
    {
        double val = 0.0, val2 = 0.0;
        double VBySqrtT = volatility * Math.sqrt(T);
        double MuByT = (riskFree - 0.5 * volatility * volatility) * T;

        //Simulate Paths
        for(int path = 0; path < numberOfPaths; path++)
        {
            double callValue  = Sval * Math.exp(MuByT + VBySqrtT * z[path]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val  += callValue;
            val2 += callValue * callValue;
        }

        double optPrice=0.0;
        optPrice = val / numberOfPaths;
        return (optPrice);
    }


    public static double VecGetOptionPrice(double Sval, double Xval, double T, double[] z, int numberOfPaths, double riskFree, double volatility) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        double val = 0.0, val2 = 0.0;

        double VBySqrtT = volatility * Math.sqrt(T);
        DoubleVector<Shapes.S512Bit> VByVec = spec.broadcast(VBySqrtT);
        double MuByT = (riskFree - 0.5 * volatility * volatility) * T;
        DoubleVector<Shapes.S512Bit> MuVec = spec.broadcast(MuByT);
        DoubleVector<Shapes.S512Bit> SvalVec = spec.broadcast(Sval);
        DoubleVector<Shapes.S512Bit> XvalVec = spec.broadcast(Xval);
        DoubleVector<Shapes.S512Bit> zeroVec =spec.broadcast(0.0D);

        //Simulate Paths
        int path = 0;
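        // Vector loop: each iteration prices spec.length() paths at once
        // (8 doubles per 512-bit vector); the loop below handles the tail.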
        for (; (path + spec.length()) < numberOfPaths; path += spec.length()) {
            DoubleVector<Shapes.S512Bit> zv = spec.fromArray(z, path);
            DoubleVector<Shapes.S512Bit> tv = MuVec.add(VByVec.mul(zv)).exp(); //Math.exp(MuByT + VBySqrtT * z[path])
            DoubleVector<Shapes.S512Bit> callValVec = SvalVec.mul(tv).sub(XvalVec);
            callValVec = callValVec.blend(zeroVec, callValVec.greaterThan(zeroVec));
            val += callValVec.sumAll();
            val2 += callValVec.mul(callValVec).sumAll();
        }
        //tail
        for (; path < numberOfPaths; path++) {
            double callValue = Sval * Math.exp(MuByT + VBySqrtT * z[path]) - Xval;
            callValue = (callValue > 0) ? callValue : 0;
            val += callValue;
            val2 += callValue * callValue;
        }
        double optPrice = 0.0;
        optPrice = val / numberOfPaths;
        return (optPrice);
    }
}

BinomialOptions

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.Shapes;
import jdk.incubator.vector.Vector;

public class FSI_BinomialOptions  {



    public static void VecBinomialOptions(double[] stepsArray, int STEPS_CACHE_SIZE, double vsdt, double x, double s, int numSteps, int NUM_STEPS_ROUND, double pdByr, double puByr) {
        DoubleVector.DoubleSpecies<Shapes.S512Bit> spec= (DoubleVector.DoubleSpecies<Shapes.S512Bit>) Vector.speciesInstance(Double.class, Shapes.S_512_BIT);
        IntVector.IntSpecies<Shapes.S512Bit> ispec = (IntVector.IntSpecies<Shapes.S512Bit>) Vector.speciesInstance(Integer.class, Shapes.S_512_BIT);

        //   double stepsArray [STEPS_CACHE_SIZE];
        DoubleVector<Shapes.S512Bit> sv = spec.broadcast(s);
        DoubleVector<Shapes.S512Bit> vsdtVec = spec.broadcast(vsdt);
        DoubleVector<Shapes.S512Bit> xv = spec.broadcast(x);
        DoubleVector<Shapes.S512Bit> pdv = spec.broadcast(pdByr);
        DoubleVector<Shapes.S512Bit> puv = spec.broadcast(puByr);
        DoubleVector<Shapes.S512Bit> zv = spec.broadcast(0.0D);
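        // Lane offsets 0..15: one value per lane of a 512-bit IntVector (16 x int32).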
        IntVector<Shapes.S512Bit> inc = ispec.fromArray(new int[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, 0);
        IntVector<Shapes.S512Bit> nSV = ispec.broadcast(numSteps);
        int j;
        for (j = 0; (j + spec.length()) < STEPS_CACHE_SIZE; j += spec.length()) {
            IntVector<Shapes.S512Bit> jv = ispec.broadcast(j);
            Vector<Double,Shapes.S512Bit> tv = jv.add(inc).cast(Double.class).mul(spec.broadcast(2.0D)).sub(nSV.cast(Double.class));
            DoubleVector<Shapes.S512Bit> pftVec=sv.mul(vsdtVec.mul(tv).exp()).sub(xv);
            pftVec.blend(zv,pftVec.greaterThan(zv)).intoArray(stepsArray,j);
        }
        for (; j < STEPS_CACHE_SIZE; j++) {
            double profit = s * Math.exp(vsdt * (2.0D * j - numSteps)) - x;
            stepsArray[j] = profit > 0.0D ? profit : 0.0D;
        }

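        // Backward induction through the binomial tree: each step combines
        // adjacent node values with the discounted factors pdByr and puByr.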
        for (j = 0; j < numSteps; j++) {
            int k;
            for (k = 0; k + spec.length() < NUM_STEPS_ROUND; k += spec.length()) {
                DoubleVector<Shapes.S512Bit> sv0 = spec.fromArray(stepsArray, k);
                DoubleVector<Shapes.S512Bit> sv1 = spec.fromArray(stepsArray, k + 1);
                pdv.mul(sv1).add(puv.mul(sv0)).intoArray(stepsArray, k); //sv0 = pdv.fma(sv1, puv.mul(sv0)); sv0.intoArray(stepsArray,k);
            }
            for (; k < NUM_STEPS_ROUND; ++k) {
                stepsArray[k] = pdByr * stepsArray[k + 1] + puByr * stepsArray[k];
            }
        }
    }

}

IoT Reference Implementation: How-to Build a Face Access Control Solution


The Face Access Control application is one of a series of IoT reference implementations aimed at instructing users on how to develop a working solution for a particular problem. The solution uses facial recognition as the basis of a control system for granting physical access. The application detects and registers the image of a person’s face into a database, recognizes known users entering a designated area and grants access if a person’s face matches an image in the database.

From this reference implementation, developers will learn to build and run an application that:

  • Detects and registers the image of a person’s face into a database
  • Recognizes known users entering a designated area
  • Grants access if a person’s face matches an image in the database

This article continues here on GitHub.

Writing own-vector algorithms in Oracle Java* for faster performance


In this paper, we discuss insights into Vector API, which is being developed as part of OpenJDK* under Project Panama. First, we’ll go over some Vector API fundamentals, basic functionalities, and tips. We’ll then show you some code samples of vector algorithms for standard Machine Learning routines and financial benchmarks, and go over some ways to increase performance. These examples should give you some guidelines and best practices for vector programming in Java*, to help you write successful vector versions of your own compute-intensive algorithms.

Download complete paper (PDF)
 

Data Bench: A New Proof-of-Concept Workload for Microservice Transactions


Data Bench is a new proof-of-concept workload that can be used to measure the response-time latency of microservice transactions. This is a new type of workload that places the focus and analysis of computing environments on the handling, processing, and movement of data. Currently in Phase 1 proof-of-concept, this open-source implementation delivers all the required components for an out-of-the-box workload, in a format that is easy to download and easy to use. Data Bench provides a solid starting point for testing the interoperability and delivery mechanisms of future transaction benchmarks for microservices.

We introduce a new open-source workload: Data Bench. Data Bench is designed to help you tune, optimize, develop, and evaluate data-centric computing environments. This workload places the focus and analysis on the handling, processing, and movement of data. Specifically, Data Bench measures the response-time latency for two transactions using Kafka, Apache Spark*, and Apache Cassandra*. Note that other benchmarks use Twitter*, click-stream, or other unstructured data types for big data processing. Unlike those benchmarks, Data Bench uses contemporary online transaction processing (OLTP) structured data for the transactions and the data stored in Cassandra.

Download complete paper (PDF)

 

Code Sample: Access Control


Introduction

This access control system application is part of a series of how-to Internet of Things (IoT) code sample exercises using the Intel® IoT Developer Kit and a compatible Intel-based platform, cloud platforms, APIs, and other technologies.

This code sample is available in C++, Java*, JavaScript*, and Python*.

From this exercise, developers will learn how to:

  • Interface with sensors using MRAA and UPM from the Intel® IoT Developer Kit, a complete hardware and software solution to help developers explore the IoT and implement innovative projects.
  • Set up a web application server to let users enter the access code to disable the alarm system, and store this alarm data using Azure Redis Cache* from Microsoft Azure*, Redis Store* from IBM Bluemix*, or ElastiCache* using Redis* from Amazon Web Services (AWS)*. These are different cloud services for connecting IoT solutions; they include data analysis, machine learning, and a variety of productivity tools to simplify the process of connecting your sensors to the cloud and getting your IoT project up and running quickly.
  • Connect to a server using IoT Hub from Microsoft Azure*, IoT from IBM Bluemix*, IoT from Amazon Web Services (AWS)*, AT&T M2X*, Predix* from GE, or SAP Cloud Platform* IoT; these are different cloud-based IoT platforms for machine-to-machine communication.

This article continues here on GitHub.


Improve Performance Using Vectorization and Intel® Xeon® Scalable Processors


Introduction

Modern CPUs include different levels of parallelism. High-performance software needs to take advantage of all opportunities for parallelism in order to fully benefit from modern hardware. These opportunities include vectorization, multithreading, memory optimization, and more.

The need for increased performance in software continues to grow, but instead of getting better performance from increasing clock speeds as in the past, now software applications need to use parallelism in the form of multiple cores and, in each core, an increasing number of execution units, referred to as single instruction, multiple data (SIMD) architectures, or vector processors.

To take advantage of both multiple cores and wider SIMD units, we need to add vectorization and multithreading to our software. Vectorization in each core is a critical step because of the multiplicative effect of vectorization and multithreading. To get good performance from multiple cores we need to extract good performance from each individual core.

Intel’s new processors support the rising demands in performance with Intel® Advanced Vector Extensions 512 (Intel® AVX-512), which is a set of new instructions that can accelerate performance for demanding computational workloads.

Intel AVX-512 may increase the vectorization efficiency of our codes, both for current hardware and also for future generations of parallel hardware.

Vectorization Basics

Figure 1 illustrates the basics of vectorization using Intel AVX-512.

In the grey box at the top of Figure 1, we have a very simple loop performing addition of the elements of two arrays containing single precision numbers. In scalar mode, without vectorization, every instruction produces one result. If the loop is vectorized, 16 elements from each operand are packed into 512-bit vector registers, and a single Intel AVX-512 instruction produces 16 results (or 8 results, if we use double-precision floating point numbers).

How do we know that this loop was vectorized? One way is to get an optimization report from the Intel® C compiler or the Intel® C++ compiler, as shown in the green box at the bottom of the figure. The next section in this article shows how to generate these reports.

However, not all loops can run in vector mode. This simple loop is vectorizable because different iterations of this loop can be processed independently from each other. For example, the iteration when the index variable i is 3 can be executed at the same time when i is 5 or 6, or has any other value in the iteration range, from 0 to N in this case. In other words, this loop has no data dependencies between iterations.
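
For reference, here is a minimal C++ sketch of the kind of loop shown in the grey box of Figure 1 (the array names a, b, c and the size N are placeholders):

// Independent iterations: with Intel AVX-512, the compiler can pack
// 16 single-precision additions into one 512-bit instruction.
void add_arrays(const float *a, const float *b, float *c, int N)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}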


Figure 1: Simple loop adding two arrays of single precision floating point numbers. Operations are performed simultaneously on 512-bit registers. No dependencies present. Loop is vectorized.

Figure 2 illustrates the case where a loop might not be directly vectorizable. The figure shows a loop where each iteration produces one value in the array c, which is computed using another value of c just produced in the previous iteration.

If this loop is executed in vector mode, the old values of the array c (16 of those values in this case) are loaded into the vector register (as shown by the red circles in Figure 2), so the result will be different from the execution in scalar mode.

And so, this loop cannot be vectorized. This type of data dependency is called a read-after-write (RAW), or flow, dependency, and we can see that the compiler detects it and will not vectorize the loop.

There are other types of data dependencies; this is just an example to illustrate one that will prevent the compiler from automatically vectorizing this loop. In these cases we need to modify the loop or the data layout to come up with a vectorizable loop.
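
A representative form of such a loop, in a minimal C++ sketch (the exact statement appears in Figure 2; the names here are placeholders):

// Flow (read-after-write) dependency: iteration i reads c[i - 1],
// which was written by the previous iteration, so the compiler
// will not auto-vectorize this loop.
void running_sum(const float *a, float *c, int N)
{
    for (int i = 1; i < N; i++)
        c[i] = c[i - 1] + a[i];
}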


Figure 2: Simple loop, yet not vectorized because a data dependency is present.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512): Enhanced Vector Processing Capabilities

The Intel AVX-512 instruction set increases vector processing capabilities because the new instructions can operate on 512-bit registers. There is support in Intel AVX-512 for 32 of these vector registers, and each register can pack either 16 single-precision floating point numbers or 8 double-precision numbers.

Compared to the Intel® Advanced Vector Extensions 2 instruction set (Intel® AVX2), Intel AVX-512 doubles the number of vector registers, and each vector register can pack twice the number of floating point or double-precision numbers. Intel AVX2 offers 256-bit support. This means more work can be achieved per CPU cycle, because the registers can hold more data to be processed simultaneously.
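
To make these register widths concrete, the following minimal sketch performs the same 16-float addition using AVX-512 intrinsics, a lower-level alternative to automatic vectorization (it assumes a CPU supporting the AVX512-F subset):

#include <immintrin.h>

// One 512-bit register holds 16 packed single-precision values,
// so a single add instruction produces 16 results.
void add16(const float *a, const float *b, float *c)
{
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    _mm512_storeu_ps(c, _mm512_add_ps(va, vb));
}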

As of today, Intel AVX-512 is available on Intel® Xeon Phi™ processors x200, and on the new Intel® Xeon® Scalable processors.

The full specification of the Intel AVX-512 instruction set consists of several separate subsets:

A. Some are present in both the Intel Xeon Phi processors x200 and in the Intel Xeon Scalable processors:

  • Intel AVX-512 Foundation Instructions (AVX512-F)
  • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)

B. Some are supported by Intel Xeon Phi processors:

  • Intel AVX-512 Exponential and Reciprocal Instructions (AVX512-ER)
  • Intel AVX-512 Prefetch Instructions (AVX512-PF)

C. Some are supported by Intel Xeon Scalable processors:

  • Intel AVX-512 Byte (char/int8) and Word (short/int16) Instructions (AVX512-BW)
  • Intel AVX-512 Double-word (int32/int) and Quad-word (int64/long) Instructions (AVX512-DQ)
  • Intel AVX-512 Vector Length Extensions (AVX512-VL)

The subsets shown above can be accessed in different ways. The easiest way is to use a compiler option. As an example, Intel C++ compiler options that control which subsets to use are as follows:

  • Option -xCOMMON-AVX512 will use:
    • Intel AVX-512 Foundation Instructions (AVX512-F)
    • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)
    • Instructions enabled with -xCORE-AVX2
  • Option -xMIC-AVX512 will use:
    • Intel AVX-512 Foundation Instructions (AVX512-F)
    • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)
    • Intel AVX-512 Exponential and Reciprocal Instructions (AVX512-ER)
    • Intel AVX-512 Prefetch Instructions (AVX512-PF)
    • Instructions enabled with -xCORE-AVX2
  • Option -xCORE-AVX512 will use:
    • Intel AVX-512 Foundation Instructions (AVX512-F)
    • Intel AVX-512 Conflict Detection Instructions (AVX512-CD)
    • Intel AVX-512 Byte and Word Instructions (AVX512-BW)
    • Intel AVX-512 Double-word and Quad-word Instructions (AVX512-DQ)
    • Intel AVX-512 Vector Length Extensions (AVX512-VL)
    • Instructions enabled with -xCORE-AVX2
  • Option -xCORE-AVX2 will use:
    • Intel Advanced Vector Extensions 2 (Intel AVX2), Intel® Advanced Vector Extensions (Intel® AVX), Intel SSE 4.2, Intel SSE 4.1, Intel SSE 3, Intel SSE 2, Intel SSE, and Supplemental Streaming SIMD Extensions 3 instructions for Intel® processors

For example, if it is necessary to keep binary compatibility for both Intel Xeon Phi processors x200 and Intel Xeon Scalable processors, code can be compiled using the Intel C++ compiler as follows:

icpc Example.cpp -xCOMMON-AVX512 <more options>

But if the executable will run on Intel Xeon Scalable processors, the code can be compiled as follows:

icpc Example.cpp -xCORE-AVX512 <more options>

Vectorization First

The combination of vectorization and multithreading can be much faster than either one alone, and this difference in performance is growing with each generation of hardware.

In order to use the high degree of parallelism present on modern CPUs, like the Intel Xeon Phi processors x200 and the new Intel Xeon Scalable processors, we need to write new applications in such a way that they take advantage of vector processing on individual cores and multithreading on multiple cores. And we want to do that in a way that guarantees that the optimizations will be preserved as much as possible for future generations of processors, to preserve code and optimization efforts, and to maximize software development investment.

When optimizing code, the first efforts should be focused on vectorization. Data parallelism in the algorithm/code is exploited in this stage of the optimization process.

There are several ways to take advantage of vectorization capabilities on a single core on Intel Xeon Phi processors x200 and the new Intel Xeon Scalable processors:

  1. The easiest way is to use libraries that are already optimized for Intel processors. An example is the Intel® Math Kernel Library, which is optimized to take advantage of vectorization and multithreading. In this case we can get excellent improvements in performance just by linking with this library. Another example is if we are using Python*. Using the Intel® Distribution for Python* will automatically increase performance, because this distribution accelerates computational packages in Python, like NumPy* and others.
  2. On top of using optimized libraries, we can also write our code in a way that the compiler will automatically vectorize it, or we can modify existing code for this purpose. This is commonly called automatic vectorization. Sometimes we can also add keywords or directives to the code to help or guide automatic vectorization. This is a very good methodology because code optimized in this way will likely be optimized for future generations of processors, probably with minor or no modifications. Only recompilation might be needed.
  3. A third option is to directly call vector instructions using intrinsic functions or in assembly language.

In this article, we will focus on examples of optimizing applications using the first two methods shown above.

A typical vectorization workflow is as follows:

  1. Compile + Profile
  2. Optimize
  3. Repeat

The first step above can be performed using modern compilers, profilers, and optimization tools. In this article, an example of an optimization flow will be shown using the optimization reports generated by the Intel® compilers, as well as advanced optimization tools like Intel® Advisor and Intel® VTune™ Amplifier, which are part of the new Intel® Parallel Studio XE 2018.

Example: American Option Pricing

This example is taken from the book by James Reinders and Jim Jeffers titled High Performance Parallelism Pearls Volume Two [1]. In chapter 8, Shuo Li (the author of that chapter and of the code used in this article) describes a possible solution to the problem of American option pricing. It consists of finding an approximate solution to partial differential equations based on the Black-Scholes model using the Newton-Raphson method (based on work by Barone-Adesi and Whaley [2]). They have implemented this solution as C++ code, and the source code is freely available at the authors’ website (http://lotsofcores.com/).

Figure 3 shows two fragments of the source code am_opt.cpp (downloaded from the link shown above), containing the two main loops in their code. The first loop initializes arrays with random values. The second loop performs the pricing operation for the number of options indicated in the variable OptPerThread, which in this case has a value of about 125 million options. In the rest of this article we will focus on Loop 2, which uses most of the CPU time. In particular, we will focus on the call to function baw_scalaropt (line 206), which performs option pricing for one option.


Figure 3: Fragments of the source code from the author's book and website showing the two main loops in the program.

The following code snippet shows the definition of function baw_scalaropt:

 90 float baw_scalaropt( const float S,
 91                  const float X,
 92                  const float r,
 93                  const float b,
 94                  const float sigma,
 95                  const float time)
 96 {
 97     float sigma_sqr = sigma*sigma;
 98     float time_sqrt = sqrtf(time);
 99     float nn_1 = 2.0f*b/sigma_sqr-1;
100     float m = 2.0f*r/sigma_sqr;
101     float K = 1.0f-expf(-r*time);
102     float rq2 = 1/((-(nn_1)+sqrtf((nn_1)*(nn_1) +(4.f*m/K)))*0.5f);
103
104     float rq2_inf = 1/(0.5f * ( -(nn_1) + sqrtf(nn_1*nn_1+4.0f*m)));
105     float S_star_inf = X / (1.0f - rq2_inf);
106     float h2 = -(b*time+2.0f*sigma*time_sqrt)*(X/(S_star_inf-X));
107     float S_seed = X + (S_star_inf-X)*(1.0f-expf(h2));
108     float cndd1 = 0;
109     float Si=S_seed;
110     float g=1.f;
111     float gprime=1.0f;
112     float expbr=expf((b-r)*time);
113     for ( int no_iterations =0; no_iterations<100; no_iterations++) {
114         float c  = european_call_opt(Si,X,r,b,sigma,time);
115         float d1 = (logf(Si/X)+
116                    (b+0.5f*sigma_sqr)*time)/(sigma*time_sqrt);
117         float cndd1=cnd_opt(d1);
118         g=(1.0f-rq2)*Si-X-c+rq2*Si*expbr*cndd1;
119         gprime=( 1.0f-rq2)*(1.0f-expbr*cndd1)+rq2*expbr*n_opt(d1)*
120                (1.0f/(sigma*time_sqrt));
121         Si=Si-(g/gprime);
122     };
123     float S_star = 0;
124     if (fabs(g)>ACCURACY) { S_star = S_seed; }
125     else { S_star = Si; };
126     float C=0;
127     float c  = european_call_opt(S,X,r,b,sigma,time);
128     if (S>=S_star) {
129         C=S-X;
130     }
131     else {
132         float d1 = (logf(S_star/X)+
133                    (b+0.5f*sigma_sqr)*time)/(sigma*time_sqrt);
134         float A2 =  (1.0f-expbr*cnd_opt(d1))* (S_star*rq2);
135         C=c+A2*powf((S/S_star),1/rq2);
136     };
137     return (C>c)?C:c;
138 };

Notice the following in the code snippet shown above:

  • There is a loop performing the Newton-Raphson optimization on line 113.
  • There is a call to function european_call_opt on line 114 (inside the loop) and on line 127 (outside the loop). This function performs a pricing operation for European options, which is needed for the pricing of American options (see details of the algorithm in [1]).

For reference, the following code snippet shows the definition of the european_call_opt function. Notice that this function only contains computation (and calls to math functions), but no loops:

 75 float european_call_opt( const float S,
 76                 const float X,
 77                 const float r,
 78                 const float q,
 79                 const float sigma,
 80                 const float time)
 81 {
 82     float sigma_sqr = sigma*sigma;
 83     float time_sqrt = sqrtf(time);
 84     float d1 = (logf(S/X) + (r-q + 0.5f*sigma_sqr)*time)/(sigma*time_sqrt);
 85     float d2 = d1-(sigma*time_sqrt);
 86     float call_price=S*expf(-q*time)*cnd_opt(d1)-X*expf(-r*time)* cnd_opt(d2);
 87     return call_price;
 88 };

To better visualize the code, Figure 4 shows the structure of functions and loops in the am_opt.cpp program. Notice that we have labeled the two loops we will be focusing on as Loop 2.0 and Loop 2.1.


Figure 4: Structure of functions and loops in the program.

As a first approach, we can compile this code with the options shown at the top of the yellow section in Figure 5 (the line indicated by label 1). Notice that we have included the option “-O2”, which specifies a moderate level of optimization, and also the option “-qopt-report=5”, which asks the compiler to generate an optimization report with the maximum level of information (possible levels range from 1–5).
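
The full command line appears in Figure 5; a representative invocation, assuming the source file am_opt.cpp and the executable name am_call used later in this article, would be:

icpc am_opt.cpp -o am_call -O2 -qopt-report=5 -xCORE-AVX512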

A fragment of the optimization report is shown in the yellow section in Figure 5. Notice the following:

  • Loop 2.0 on line 201 was not vectorized. The compiler suggests a loop interchange and the use of the SIMD directive.
  • The compiler also reports Loop 2.1 was not vectorized because of an assumed dependence, and also reports that function baw_scalaropt has been inlined.
  • Inlining the function baw_scalaropt presents both loops (2.0 and 2.1) as a nested loop, which the compiler reports as an “Imperfect Loop Nest” (a loop nest with extra instructions between the two loops), and for which it suggests a loop interchange.


Figure 5: Compiling the code with optimization option "O2" and targeting the Intel® AVX-512 instructions.

Before trying the SIMD option (which would force vectorization), we can try compiling this code using the optimization flag “-O3”, which specifies a higher level of optimization from the compiler. The result is shown in Figure 6. We can observe in the yellow section that:

  • The compiler reports that the outer loop (Loop 2.0) has been vectorized, using a vector length of 16.
  • The compiler’s estimated potential speedup for the vectorized loop 2.0 is 7.46.
  • The compiler reports a number of vectorized math library calls and one serialized function call.

Given that the vector length used for this loop was 16, we would expect a potential speedup close to 16 from this vectorized loop. Why was it only 7.46?

The reason seems to come from the serialized function call reported by the compiler, which refers to the call to the function european_call_opt inside the inner loop (Loop 2.1). One possible way to fix this would be to ask the compiler to recursively inline all the function calls. For this we can use the directive “#pragma inline recursive” right before the call to the function baw_scalaropt.

After compiling the code (using the same compiler options as in the previous experiment), we get a new optimization report showing that the compiler’s estimated potential speedup for the vectorized Loop 2.0 (Figure 7) is now 14.180, much closer to the ideal speedup of 16.
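
As a sketch, the directive is placed immediately before the call site inside Loop 2.0 (the loop is abridged here, and the argument arrays are placeholders for the data set up in Loop 1):

for (int opt = 0; opt < OptPerThread; opt++) {
#pragma inline recursive
    result[opt] = baw_scalaropt(S[opt], X[opt], r, b, sigma, T[opt]);
}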


Figure 6: Compiling the code with optimization option "O3" and targeting the Intel® AVX-512 instructions.


Figure 7: Adding the "#pragma inline recursive" directive.


Figure 8: Fragment of optimization report showing both loops vectorized.

Figure 8 shows the sections of the optimization report confirming that both loops have been vectorized. More precisely, as we can see in the line with label 2, the inner loop (Loop 2.1) has been vectorized along with the outer loop, which means that the compiler has generated efficient code, taking into account all the operations involved in the nested loops.

Once the compiler reports a reasonable level of vectorization, we can perform a runtime analysis. For this, we can use Intel Advisor 2018.

Intel Advisor 2018 is one of the analysis tools in the Intel Parallel Studio XE 2018 suite that lets us analyze our application and gives us advice on how to improve vectorization in our code.

Intel Advisor analyzes our application and reports not only the extent of vectorization but also possible ways to achieve more vectorization and increase the effectiveness of the current vectorization.

Although Intel Advisor works with any compiler, it is particularly effective when applications are compiled using Intel compilers, because Intel Advisor will use the information from the reports generated by Intel compilers.

The most effective way to use Intel Advisor is via the graphical user interface. This interface gives us access to all the information and recommendations that Intel Advisor collects from our code. Detailed information can be found at Intel Advisor Support, where documentation, training materials, and code samples can be found. Product support and access to the Intel Advisor community forum can also be found there.

Intel Advisor also offers a command-line interface (CLI) that lets the user work on remote hosts and/or generate information in a way that makes it easy to automate analysis tasks, for example, using scripts.

As an example, let us run the Intel Advisor CLI to perform a basic analysis of the example code am_opt.cpp.


Figure 9: Running Intel® Advisor to collect survey information.

The first step for a quick analysis is to create an optimized executable that will run on the Intel processor (in this case, an Intel Xeon Scalable processor 81xx). Figure 9 shows how we can run the CLI version of the Intel Advisor tool. The survey analysis is a good starting point because it provides information that lets us identify how our code uses vectorization and where the hotspots for analysis are.

The command labeled as 1 in Figure 9 runs the Intel Advisor tool and creates a project directory VecAdv-01. Inside this directory, Intel Advisor creates, among other things, a directory named e000, containing the results of the analysis. The command is:

$ advixe-cl --collect survey --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

The next step is to view the results of the survey analysis performed by the Intel Advisor tool. We will use the CLI to generate the report. To do this, we replace the -collect option with the -report one (as shown in the command with label 2 in Figure 9), making sure we refer to the project directory where the data is collected. We can use the following command to generate a survey report from the data contained in the results directory in our project directory:

$ advixe-cl -report survey -format=xml --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

The above command creates a report named advisor-survey.xml in the project directory. If we do not use the -format=xml option, a text-formatted report will be generated. The text report has a multi-column layout, so it can be a little difficult to read on a console. One option for a quick read is to limit the number of columns displayed using the filter option.

Another option is to create an XML-formatted report. We can do this if we change the value for the -format option from text to XML, which is what we did in Figure 9.

The XML-formatted report might be easier to read on a small screen because the information in the report file is condensed into one column. Figure 9 (the area labeled with 4) shows a fragment of the report corresponding to the results of Loop 2.0.

The survey option in the Intel Advisor tool generates a performance overview of the loops in the application. For example, Figure 9 shows that the loop starting on line 200 in the source code has been vectorized using Intel AVX-512. It also shows an estimate of the improvement of the loop’s performance (compared to a scalar version) and timing information.

Also notice that the different loops have been assigned a loop ID, which is the way the Intel Advisor tool labels the loops in order to keep track of them in future analysis (for example, after looking at the performance overview shown above, we might want to generate more detailed information about a specific loop by including the loop ID in the command line).

Once we have looked at the performance summary reported by the Intel Advisor tool using the Survey option, we can use other options to add more specific information to the reports. One option is to run the Trip counts analysis to get information about the number of times loops are executed.

To add this information to our project, we can use the Intel Advisor tool to run a Trip Counts analysis on the same project we used for the survey analysis. Figure 10 shows how this can be done. The commands we used are:

$ advixe-cl --collect tripcounts --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call
$ advixe-cl -report tripcounts -format=xml --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

Now the XML-formatted report contains information about the number of times the loops have been executed. Specifically, the Trip_Counts fields in the XML report will be populated, while the information from the survey report will be preserved. This is shown in Figure 10 in the area labeled 5.

In a similar way, we can generate other types of reports that give us other useful information about our loops. The -help collect and -help report options in the command-line Intel Advisor tool show what types of collections and reports are available. For example, to obtain memory access pattern details in our source code, we can run a Memory Access Patterns (MAP) analysis using the -map option. This is shown in Figure 11.

Figure 11 shows the results of running Advisor using the -map option to collect MAP information. The MAP analysis is an in-depth analysis of memory access patterns and, depending on the number of loops and complexity in the program, it can take some time to finish the data collection. The report might also be very long. For that reason, notice that in Figure 11 we selected only the loop we are focusing on in this example, which has a loop ID of 3. In this way, data will be collected and reported only for that loop. The commands we are using (from Figure 11) to collect and report MAP information for this loop are:

$ advixe-cl --collect map --project-dir ./VecAdv-01 --mark-up-list=3 --search-dir src:r=./. -- ./am_call
$ advixe-cl -report map -format=xml --project-dir ./VecAdv-01 --search-dir src:r=./. -- ./am_call

Also notice that the stride distribution for all patterns found in this loop is reported, along with information for every memory access pattern. In Figure 11, only one of them is shown (the pattern with pattern ID “186”, showing unit-stride access).


Figure 10: Adding the "tripcounts" analysis.


Figure 11: Adding the "MAP" analysis.

Summary

Software needs to take advantage of all opportunities for parallelism present in modern processors to obtain top performance. As modern processors continue to increase the number of cores and to widen SIMD registers, the combination of vectorization, threading and efficient memory use will make our code run faster. Vectorization in each core is going to be a critical step because of the multiplicative effect of vectorization and multithreading.

Extracting good performance from each individual core will be a necessary step to efficiently use multiple cores.

This article presented an overview of vectorization using Intel compilers and Intel optimization tools, in particular Intel Advisor 2018. The purpose was to illustrate a methodology for vectorization performance analysis that can be used as an initial step in a code modernization effort to get software running faster on a single core.

The code used in this example is from chapter 8 of [1], which implements a Newton-Raphson algorithm to approximate American call option prices. The source code is available for download from the book’s website (http://www.lotsofcores.com/).

References

1. J. Reinders and J. Jeffers, High Performance Parallelism Pearls, vol. 2, Morgan Kaufmann, 2015.

2. G. Barone-Adesi and R. E. Whaley, Efficient analytic approximation of American option values, The Journal of Finance, vol. 42, no. 2, 1987.

Building Large-Scale Image Feature Extraction with BigDL at JD.com


This article shares the experience and lessons learned from Intel and JD teams in building a large-scale image feature extraction framework using deep learning on Apache Spark* and BigDL*.

Background

Image feature extraction is widely used in image-similarity search, picture deduplication, and so on. Before adopting BigDL [1], the JD team tried very hard to build the feature extraction application on both multi-graphics processing unit (GPU) servers and GPU cluster settings; however, our experience shows that there are many disadvantages in the above GPU solutions, including:

  • The resource management and allocation of individual GPU cards in the GPU cluster is very complex and error-prone (for example, frequent out of memory (OOM) errors and program crashes due to insufficient GPU memory).
  • In multi-GPU servers, developers need to put a lot of effort into manually managing data partitioning, task balancing, fault tolerance, and so on.
  • Applications that are based on GPU solutions (for example, Caffe*) have many dependencies, such as CUDA*, which greatly increases the complexities in production deployment and operations; for instance, one often needs to rebuild the entire environment for different versions of operating systems or different versions of the GNU Compiler Collection (GCC).

As a result, there are many architectural and technical challenges in building the GPU application pipelines.

Let’s examine the architecture of the image feature extraction application. As the background of many images can be very complex, and the main object in the image is often small, the main object needs to be separated from the picture’s background for correct feature extraction. Naturally, the framework of image feature extraction can be divided into two steps. First, the object detection algorithm is used to detect the main object, and then the feature extraction algorithm is used to extract the features of the identified object. Here, we use the Single Shot MultiBox Detector* (SSD) [2] for object detection, and the DeepBit* model [3] for feature extraction.

JD has a massive number (hundreds of millions) of merchandise pictures, which are stored in mainstream open-source distributed database storage; therefore, efficient data retrieval and processing on this large-scale, distributed infrastructure is a key requirement of the image feature extraction pipeline. GPU-based solutions have some additional challenges to meeting the requirement:

  • Reading out the image data takes a very long time, which cannot be easily optimized in the GPU solutions.
  • The image preprocessing on distributed database storage can be very complex in the GPU solutions, as in this case no existing software frameworks can be used for resource management, data processing, fault tolerance, and so on.
  • It is challenging to scale out the GPU solutions to analyze the massive number of pictures due to the software and hardware infrastructure constraints.

BigDL* Solutions

In the production environment, using existing software and hardware infrastructure can substantially improve efficiency (for example, time-to-product) and reduce cost. As the image data are already stored in the big data cluster (distributed database storage) in this case, the challenges mentioned above can be easily addressed if existing big data clusters (such as Hadoop* or Spark clusters) can be reused for deep learning applications.

BigDL [1], an open source, distributed deep learning framework from Intel, provides comprehensive deep learning algorithm support on Apache Spark. Built on the highly scalable Apache Spark platform, BigDL can be easily scaled out to hundreds or thousands of servers. In addition, BigDL uses Intel® Math Kernel Library (Intel® MKL) and parallel computing techniques to achieve very high performance on Intel® Xeon® processor-based servers (comparable to mainstream GPU performance).

In this use case, BigDL can provide support for various deep learning models (for example, object detection, classification, and so on); in addition, it also lets us reuse and migrate pre-trained models (in Caffe, Torch*, TensorFlow*, and so on), which were previously tied to specific frameworks and platforms, to the general purpose big data analytics platform through BigDL. As a result, the entire application pipeline can be fully optimized to deliver significantly accelerated performance.

We built the end-to-end image feature extraction pipeline in Apache Spark and BigDL as follows (see Figure 1):

  1. Read hundreds of millions of pictures from a distributed database in Spark as resilient distributed datasets (RDDs).
  2. Preprocess the RDD of images (including resizing, normalization, and batching) in Spark.
  3. Use BigDL to load the SSD model for large scale, distributed object detection on Spark, which will generate the coordinates and scores for the detected objects in the images.
  4. Keep the object with highest score as the target, and crop the original picture based on the object coordinates to get the target picture.
  5. Preprocess the RDD of target images (including resizing and batching).
  6. Use BigDL to load the DeepBit model for distributed feature extraction of the target images on Spark, which will generate the corresponding features.
  7. Store the result (RDD of extracted object features) in the Hadoop Distributed File System (HDFS).


Figure 1: Image feature extraction pipeline based on BigDL.

The entire data analytics pipeline, including data loading, partitioning, preprocessing, prediction, and storing the results, can be easily implemented on Spark and BigDL. By using BigDL, users can directly use existing big data (Hadoop/Spark) clusters to run deep learning applications without any changes to the cluster; in addition, the BigDL applications can easily take advantage of the scalability of the Spark platform to scale out to a large number of nodes and tasks, to greatly speed up the data analytics process.

In addition to distributed deep learning support, BigDL also provides a number of useful tools, such as image preprocessing libraries, model loading utilities (including loading models from third-party, deep learning frameworks), which make it convenient for users to build the entire deep learning pipeline.

Image Preprocessing

BigDL provides an image preprocessing library [4] built on top of OpenCV* [5], which supports common image transformation and augmentation operations, so that users can easily construct an image-preprocessing pipeline from these operations. In addition, users can also build customized image transformation functions with the OpenCV operations provided in the library.

val preProcessor =
   BytesToMat() ->
   Resize(300, 300) ->
   MatToFloats(meanRGB = Some(123, 117, 104)) ->
   RoiImageToBatch(10)

val transformed = preProcessor(dataRdd)

The sample image preprocessing pipeline above converts a raw RDD of byte arrays through a series of transformations. Among them, BytesToMat transforms each byte picture into Mat (the storage format used in OpenCV), Resize adjusts the picture size to 300 x 300, and MatToFloats transforms the pixels of Mat into a Float array and subtracts the corresponding channel mean. Finally, RoiImageToBatch batches the data as the input to model prediction or training.

Load the Model

Users can easily use BigDL to load pretrained models, which can then be directly used in the Spark program. Given a BigDL model file, one can call Module.load to load the model:

val model = Module.load[Float](bigdlModelPath)

In addition, BigDL also allows users to load models from third-party deep learning frameworks such as Caffe, Torch, and TensorFlow.

Users can easily load pretrained models for data prediction, feature extraction, fine-tuning, and so on. For instance, a Caffe model consists of two files, namely, the model prototxt definition file and the model parameter file. Users easily load pretrained Caffe models into the Spark and BigDL program as follows:

val model = Module.loadCaffeModel(caffeDefPath, caffeModelPath)

Performance

Performance benchmarking for both the Caffe-based GPU solutions and BigDL-based Intel Xeon processor solutions were conducted in JD’s internal clusters.

Test Process

The end-to-end image processing and analytics pipeline includes:

  1. Reading the pictures from the distributed database storage.
  2. Inputting to the object detection model and feature extraction model for feature extraction.
  3. Saving the results (image paths and features) to the distributed file system.

Note that the first step (reading the pictures from the distributed database storage) can take a lot of time in the end-to-end performance measurement. In this case, the first step takes about half of the total processing time (including image reading, object detection, and feature extraction) in GPU solutions, which cannot be easily optimized on multi-GPU servers or GPU clusters.

Test Environment

  • GPU: 20 * NVIDIA Tesla* K40
  • CPU: Intel® Xeon® processor E5-2650 v4 @ 2.20GHz, 1200 logical cores (each server has 24 physical cores with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, and is configured to support 50 logical cores in Yet Another Resource Negotiator (YARN)).

Test Result

Figure 2 shows that the image processing throughput of Caffe on 20 * K40 is about 540 images per second, while the throughput of BigDL is about 2070 images per second on an Intel Xeon processor cluster with 1200 logical cores in YARN. The throughput of BigDL on the Intel Xeon processor cluster is ~3.83X that of the GPU cluster, which greatly speeds up the image processing and analytics tasks.

The test results show that BigDL provides much better support for the large-scale image feature extraction application. The high scalability, high performance, and ease of use of BigDL-based solutions make it easy for JD to deal with the massive and ever-growing number of images. As a result, JD is in the process of upgrading the GPU solution to the BigDL on the Intel Xeon processor solution, and deploying it to the production Spark cluster in JD.


Figure 2: Compares the throughput of K40 and Intel® Xeon® processors in the image feature extraction pipeline.

Conclusion

The high scalability, high performance, and ease of use of BigDL make it easy for JD to analyze the massive number of images using deep learning technologies. JD will continue to apply BigDL to a wider range of deep learning applications, including distributed model training.

References

  1. BigDL, https://github.com/intel-analytics/BigDL
  2. Liu, Wei, et al., SSD: Single Shot MultiBox Detector, European conference on computer vision. Springer, Cham, 2016.
  3. Lin, Kevin, et al., Learning compact binary descriptors with unsupervised deep neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  4. Vision: Image Augmentation, https://github.com/intel-analytics/analytics-zoo/tree/master/transform/vision
  5. Open Source Computer Vision Library, http://opencv.org/

Intel® RealSense™ D400 and Marshal Point launched!


Slightly delayed, but I am pleased to announce that Intel® RealSense™ Depth Camera D400-series launched on Thursday, September 28. In addition, the Intel® Secure Device Onboard (aka Marshal Point) launched on Friday, September 29.

Without the rock stars involved in these launches, the sites would not have gone live as smoothly! I am very blessed to be able to work with an awesome team, and I wanted to take this opportunity to thank everyone who worked hard and went above and beyond!

 

Intel® RealSense™ Depth Camera D400-series launch

Claire Conant

Richard Clark

Renee Scheltiens

Jeremy Crawford

Tracy Johnson

--------------------- Details ---------------------------

3 new pages, 1 new PDF, 1 page update, and 1 navigation update

New: https://software.intel.com/en-us/realsense/d400

New: https://software.intel.com/en-us/realsense/d400/get-started

New: https://software.intel.com/en-us/realsense/sdk

New PDF: https://software.intel.com/en-us/realsense/d400/intel-realsense-depth-camera-d400-series-datasheet

Update: https://software.intel.com/en-us/realsense

 

Intel® Secure Device Onboard (aka Marshal Point) launch

Amy Neises

Oksana Benedick

Jeremy Crawford

Tracy Johnson

--------------------- Details ---------------------------

4 new pages, 5 new PDFs, 4 source download file packages, 1 new navigation, 1 license agreement, forced EULA setup, and one entitlement

https://software.intel.com/en-us/secure-device-onboard

https://software.intel.com/en-us/secure-device-onboard/request-access

https://software.intel.com/en-us/secure-device-onboard/downloads

https://software.intel.com/en-us/secure-device-onboard/partner-sign-up

 

Please let me know if you have any questions.

 

Thank you!

Nam Ngo | Project Manager

 

 

How to upgrade an existing floating license manager on Windows*


As of version 2.5, the Intel® Software License Manager download package for Windows* uses an installer similar to the one used by Intel software development tools. This is a change from older versions. If you already have a pre-2.5 version of the license manager installed and running on your server, follow the steps below to upgrade to version 2.5 or higher.

Note: Starting the license manager may change the port number used by the INTEL vendor daemon. Be sure to note the port number and, if necessary, add a firewall exception so it is not blocked.

To install the new version using the recommended defaults:

  1. Download the latest version of the Intel Software License Manager for Windows package here
    1. Also see the User's Guide for more detailed installation instructions
  2. Run the self-extracting exe file you downloaded
  3. Enter the serial number for your license or path to existing license(s) when prompted
    1. If you see a warning that there are invalid license files in C:\Program Files (x86)\Common Files\Intel\ServerLicenses, make sure to delete all invalid license files.  If the folder is empty, ignore this warning.
  4. If you have not shut down the current license manager, the installer will detect this and offer to close it for you.  It should not require a reboot.

After the installation completes successfully, the new license manager will be started automatically.  The default installation folder is C:\Program Files\Intel\LicenseServer.  If you entered a serial number, it will create and use the license file in C:\Program Files (x86)\Common Files\Intel\ServerLicenses.  The default path for the license manager log file is C:\Program Files (x86)\Common Files\Intel\ServerLicenses\IFLEXlmLog.txt for version 2.6, and C:\Program Files\Intel\LicenseServer\IFLEXlm.log for version 2.5.  Check this log file to find the vendor daemon port and ensure it is not blocked.

Troubleshooting

If the license manager doesn't start, see this article.

Have questions?

Check out the Licensing FAQ
Please visit our Get Help page for support options.

Intel® Software Guard Extensions Tutorial Series: Part 10, Enclave Analysis and Debugging


In part 10 of the Intel® Software Guard Extensions (Intel® SGX) tutorial series we’ll examine two utilities in the Intel SGX Software Development Kit (SDK): the Intel SGX Debugger, and the Enclave Memory Measurement Tool (EMMT). First, we’ll learn how to use the Intel SGX Debugger to inspect enclaves in mixed-mode applications such as our Tutorial Password Manager. Then, we’ll use the EMMT to analyze the memory usage of our enclave in order to optimize its configuration. In the process of doing that, we’ll also find and fix a program bug that has been in the code since Part 3 of the series!

You can find a list of all the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series. Source code is provided with this installment of the series.

The Intel® SGX Debugger

Intel SGX enclaves are intended to keep secrets from being exposed to unauthorized users and programs. Even code running in Ring 0 cannot bypass an enclave’s protections because the security perimeter is enforced by the CPU, regardless of the execution environment. These protections ensure that no untrusted code—whether that be another user, a superuser, the BIOS, or even a virtual machine manager—can inspect enclave memory. Attempts to map enclave memory outside the enclave will yield only the encrypted memory pages. The inability to inspect production enclaves is a key component of the Intel SGX security model.

Software developers, however, frequently use a debugger to analyze a program’s execution in order to find and fix runtime errors in their code, and a debugger can’t function if it doesn’t have access to an application’s memory space. This creates a conflict between security and usability for Intel SGX: The security of enclaves must be preserved, but software developers need the ability to run a debugger when building and testing applications.

Intel’s solution to this problem is twofold:

  • Only the Intel SGX Debugger can debug enclaves.
  • Only debug-mode enclaves can be debugged.

The Intel SGX Debugger is required to debug enclaves because Intel SGX defines a specific interface for debugging: It requires the execution of special Intel SGX supervisor instructions. Debuggers without Intel SGX support will not be able to inspect enclaves. The CPU enforces the requirement that only debug-mode enclaves can be inspected with a debugger. This allows developers to create debuggable enclaves in their development environment, and non-debuggable enclaves for distribution.

An enclave’s state as a debug-mode or release-mode enclave is set at build time, and this state is included in the enclave’s metadata. The enclave and its metadata are cryptographically signed using the developer’s signing key. This prevents malicious users from turning a release-mode enclave into a debug-mode enclave: Attempts to manipulate the enclave’s state will change its signature, and the CPU will not launch an enclave if the runtime signature does not match the one that was generated at build time. As an additional protection, debug-mode and release-mode enclaves derive different encryption keys for the Intel SGX sealing functions. A debug-mode enclave cannot access, and thus expose, secrets that were sealed by a release-mode enclave.

Running the Intel SGX Debugger on Mixed-Mode Applications

When building a native application on Windows*, using the Intel SGX Debugger is as easy as changing the debugger from the Local Windows Debugger to the Intel SGX Debugger, as shown in Figure 1, and setting the working directory to the same location as the enclave object files (in typical solutions, setting the working path to $(OutDir) will suffice, as shown in Figure 2).


Figure 1.  Selecting the Intel® SGX Debugger in Microsoft Visual Studio*.


Figure 2.  Setting the working directory for the Intel® SGX Debugger.

When running managed applications, however, you are not given a choice of which debugger to run, and the Intel SGX Debugger can’t debug managed code, anyway. That means you cannot use the Start Debugging feature in Microsoft Visual Studio* to directly debug enclaves that are part of a mixed-mode application. Instead, you need to use the Attach to Process command in the Debug menu. The procedure is as follows:

  1. In Visual Studio, go to the Debug menu and select Attach to Process…
  2. In the Attach to Process dialog box, where it says Attach to… click the Select… button. See Figure 3. This will bring up the Select Code Type dialog box.

    Figure 3.  The Attach to Process window.

  3. In the Select Code Type dialog box, click Debug these code types and choose Intel® SGX, then click OK.


    Figure 4.  Selecting the code type to debug.

  4. Execute the application from the output directory via the Windows shell or Windows File Explorer. (Do not run it from within Visual Studio, as this will launch it under the Windows debugger.)
  5. Return to Visual Studio. In the Attach to Process dialog, hit Refresh to update the process list, then select the application, and click Attach.

    Figure 5.  Selecting the process to debug.

Note that the Intel SGX Debugger can only inspect native code, whether it’s inside or outside the enclave. You cannot use it to debug managed code.

The Enclave Memory Measurement Tool

All enclaves are instantiated in a region of memory known as the enclave page cache, or EPC. The EPC is a shared resource, which means that all running enclaves on the system must fit inside of it. On Windows, the size of the EPC is fixed in the BIOS and cannot be changed (on Linux* systems, the EPC supports paging, which allows the operating system to expand it as needed, but this has performance implications and should be avoided). Because of this restriction, it is extremely important that enclaves be sized to their actual memory usage, and not be allocated more memory than they will use.

The amount of memory allocated to an enclave is set in the enclave’s configuration file; it defaults to 256 KB of stack space per thread and 1 MB of global heap space. The developer can change these values either by editing the .xml file directly, or via the user interface in Visual Studio. Though these values are entered in bytes, they must be 4 KB aligned, since EPC pages are made up of 4 KB chunks.
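
For reference, the relevant entries in the enclave configuration file with the default sizes look something like the following (a representative sketch; sizes are in hex bytes, and the exact set of tags varies by SDK version):

<EnclaveConfiguration>
  <ProdID>0</ProdID>
  <ISVSVN>0</ISVSVN>
  <StackMaxSize>0x40000</StackMaxSize>  <!-- 256 KB of stack per thread -->
  <HeapMaxSize>0x100000</HeapMaxSize>   <!-- 1 MB of global heap -->
  <TCSNum>1</TCSNum>
  <TCSPolicy>1</TCSPolicy>
</EnclaveConfiguration>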

Figure 6.  The enclave settings for the Tutorial Password Manager.

The purpose of the EMMT is to give Windows developers the ability to measure the real memory use of their enclaves so that they can adjust the allocated values accordingly.

Running the EMMT

The EMMT application is part of the Intel SGX SDK. Execution is straightforward:

sgx_emmt [ --enclave=enclaves ] program [ arguments ... ]

The --enclave parameter is optional: If your application has multiple enclaves, you can list one or more that you want to target; otherwise, the EMMT measures every enclave in your application.
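
For example, to measure only the tutorial's enclave while running the native test application described later in this article (the invocation follows the usage line above):

> sgx_emmt --enclave=Enclave.signed.dll "CLI Native Test App.exe"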

After the application exits, the EMMT prints a memory usage summary for each measured enclave. Figure 7 shows results for an enclave that used 2 KB of stack space and 4 KB from the heap during execution.

Enclave: "Enclave.signed.dll"
[Peak stack use]: 0x2KB
[Peak heap use]: 0x4KB

Figure 7.  Output from the Enclave Memory Measurement Tool.

There are two important caveats that must be kept in mind when using the EMMT:

  • The results show the actual memory used during that specific execution of the application and its enclave. These values may not be typical for the enclave, or even a worst-case usage. It is important that the test run where the measurements are made be as representative of real-world usage as possible, and cover all possible functions. It may be necessary to write a dedicated front-end whose sole job is to stress-test the enclave by exercising its API.
  • The EMMT does not work on mixed-mode applications. If the main application is written in .NET*, you will not be able to use the EMMT directly on the final application. You will need to write a native front-end application to stress-test the enclave per the above.

Measuring the Tutorial Password Manager

The Tutorial Password Manager is a mixed-mode application: The front-end is written in C#, and it connects to the enclave via C++/CLI. Since the EMMT can’t be run on mixed-mode applications, we need to write a native application to call the enclave interface functions in order to measure the memory usage.

To do this, we’ll embed our reference password vault in the application directly as byte arrays (the disk input/output routines in the Tutorial Password Manager are in the C++ .NET layer, so they aren’t available to us to use) and call the wrapper functions in the EnclaveBridge DLL. For demonstration purposes, we’ll perform the following tasks:

  • Unlock the vault
  • Get account info for each account
  • Get the password for each account
  • Add an account
  • Change all the account passwords, randomly generating a new password each time
  • Change the master password

The main program is shown in Figure 8.

int main()
{
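	// header/header_size, vault_data, mpass, and new_mpass are the
	// reference vault and master passwords embedded as byte arrays,
	// as described above; their definitions are omitted from this listing.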
	int rv;
	uint32_t count;

	rv = ew_initialize();
	if (rv != 0) {
		fprintf(stderr, "ew_initialize: 0x%08x\n", rv);
		Exit(1);
	}

	rv = ew_initialize_from_header(header, header_size);
	if (rv != 0) {
		fprintf(stderr, "ew_initialize_from_header: 0x%08x\n", rv);
		Exit(1);
	}

	rv = ew_unlock(mpass);
	if (rv != 0) {
		fprintf(stderr, "ew_unlock: 0x%08x\n", rv);
		Exit(1);
	}

	rv = ew_load_vault(vault_data);
	if (rv != 0) {
		fprintf(stderr, "ew_load_vault: 0x%08x\n", rv);
		Exit(1);
	}

	rv = ew_accounts_get_count(&count);
	if (rv != 0) {
		fprintf(stderr, "ew_accounts_get_count: 0x%08x\n", rv);
		Exit(1);
	}
	for (uint32_t i = 0; i < count; ++i) {
		uint16_t sname, slogin, surl, spass;
		char *name, *login, *url, *pass;

		printf("Account %u:\n", i);

		rv = ew_accounts_get_info_sizes(i, &sname, &slogin, &surl);
		if (rv != 0) {
			fprintf(stderr, "ew_accounts_get_info_sizes[%u]: 0x%08x\n", i, rv);
			continue;
		}

		name = new char[sname+1];
		login = new char[slogin+1];
		url = new char[surl+1];

		rv = ew_accounts_get_info(i, name, sname, login, slogin, url, surl);
		if (rv != 0) {
			fprintf(stderr, "ew_accounts_get_info[%u]: 0x%08x\n", i, rv);
			continue;
		}
		name[sname] = 0;
		login[slogin] = 0;
		url[surl] = 0;

		printf("\tname= %s\n", name);
		printf("\tlogin= %s\n", login);
		printf("\turl= %s\n", url);

		rv = ew_accounts_get_password_size(i, &spass);
		if (rv != 0) {
			fprintf(stderr, "ew_accounts_get_password_size[%u]: 0x%08x\n", i, rv);
			continue;
		}

		pass = new char[spass + 1];

		rv = ew_accounts_get_password(i, pass, spass);
		if (rv != 0) {
			fprintf(stderr, "ew_accounts_get_password[%u]: 0x%08x\n", i, rv);
			continue;
		}

		pass[spass] = 0;

		printf("\tpass= %s\n", pass);

		delete[] name;
		delete[] login;
		delete[] url;
		delete[] pass;
	}

	// Start changing things

	// Add an account

	rv = ew_accounts_set_info(3, "Acme", 4, "wileye", 6, "http://acme.nodomain/", 21);
	if (rv != 0) {
		fprintf(stderr, "ew_accounts_set_info[3]: 0x%08x\n", rv);
		Exit(1);
	}

	// Change the passwords many times

	for (int i = 0; i < 4; ++i) {
		uint32_t idx = i % 4;
		char newpass[32];
		int rv;

		rv = ew_accounts_generate_password(31, 0xffff, newpass);
		if (rv != 0) {
			fprintf(stderr, "ew_accounts_generate_password[%d]: 0x%08x\n", i, rv);
			Exit(1);
		}

		rv = ew_accounts_set_password(idx, newpass, 31);
		if (rv != 0) {
			fprintf(stderr, "ew_accounts_set_password[%d]: 0x%08x\n", i, rv);
			Exit(1);
		}

	}

	// Change the master password

	rv = ew_change_master_password(mpass, new_mpass);
	if (rv != 0) {
		fprintf(stderr, "ew_change_master_password: 0x%08x\n", rv);
		Exit(1);
	}

	Exit(0);
}

Figure 8. Program listing for the native test application.

Initially, we’ll change each account password once, just to get some output and see where our memory usage stands.

After building and executing the test application under the EMMT, we get the following:

> sgx_emmt "CLI Native Test App.exe"
The command line is: "cli native Test App.exe ".
Enclave: "Enclave.signed.dll"
[Peak stack use]: 0x2KB
[Peak heap use]: 0x4KB

Figure 9. EMMT output for the Tutorial Password Manager's enclave.

Because the size of the password vault is fixed, we don’t expect the Tutorial Password Manager to use a lot of enclave memory, and according to the EMMT it doesn’t. However, changing passwords just once isn’t much of a stress test, and the encryption and decryption functions may need more memory than is shown in this simple test. So, we’ll modify the test program to change each password 1000 times and see how that affects our results.
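
This is a one-line change to the loop bound in the test program; with four accounts, 4,000 iterations change each password 1,000 times:

	// Change each of the four account passwords 1000 times
	for (int i = 0; i < 4000; ++i) {
		uint32_t idx = i % 4;
		/* ... generate and set a new password, as before ... */
	}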

After recompiling and executing, we get the output shown in Figure 10:

The command line is: "cli native Test App.exe ".
Enclave: "Enclave.signed.dll"
[Peak stack use]: 0x2KB
[Peak heap use]: 0xc0KB

Figure 10.  EMMT output after increasing the number of password changes.

This result, however, is unexpected. Originally, we used only 4 KB of heap space, but usage has jumped to 192 KB. A small increase might make sense, but a 48x increase does not! Increasing the number of password changes to 5000 causes yet another increase in the heap usage:

[Peak heap use]: 0x3acKB
Figure 11. EMMT output after further increasing the number of password changes.

This suggests we have a memory leak in the enclave.

The loop responsible for changing the password is invoking these two ECALLs:

  • ve_accounts_generate_password()
  • ve_accounts_set_password()

A quick review of E_Vault.cpp turns up the culprit. In E_Vault::accounts_set_password(), we are calling new without a corresponding delete. Thus, each call to change a password leaks a few bytes of memory. Since this code was derived from the non-Intel SGX code path, a review of Vault.cpp reveals the exact same problem.
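
In outline, the bug looks like this (a simplified sketch with illustrative names, not the literal code from E_Vault.cpp):

int E_Vault::accounts_set_password(uint32_t idx, const char *pass, uint16_t len)
{
	// A scratch buffer is allocated on every password change...
	char *scratch = new char[len];

	/* ... re-encrypt the new password into the vault entry ... */

	// ...but before the fix it was never released, so each call
	// leaked len bytes. The fix is the matching delete[]:
	delete[] scratch;

	return 0;
}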

After fixing the memory leak, we run our test again and obtain the results shown in Figure 12:

The command line is: "cli native Test App.exe ".
Enclave: "Enclave.signed.dll"
[Peak stack use]: 0x2KB
[Peak heap use]: 0x4KB

Figure 12.  EMMT output after fixing the memory leak.

This takes us back to our original heap usage of 4 KB.

We aren’t done, however. This is the base usage for a handful of entries, all with reasonable string lengths, but our vault can actually store much longer strings—up to 64 KB each, for the account name, URL, login name, and password. That means each account could require 256 KB, and with eight accounts total that means the vault could be up to 2 MB in memory.

How much memory we want to allocate to the enclave is now a judgement call. Realistically, users are probably not going to enter 65,000+ characters into any of the fields, so 2 MB of heap space for the highly unlikely, worst-case scenario seems excessive. If each field only stored 80 bytes, which is still a lot, the total storage need would be under 3 KB. To be safe, we’ll allocate an additional 4 KB to the heap above what the EMMT measures, which brings it to 8 KB. The stack is given the minimum of 4 KB. The final enclave settings are shown in Figure 13:

Figure 13. Final enclave settings for the Tutorial Password Manager.
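
In the configuration file, those choices correspond to entries along these lines (4 KB = 0x1000, 8 KB = 0x2000):

<StackMaxSize>0x1000</StackMaxSize>
<HeapMaxSize>0x2000</HeapMaxSize>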

Our enclave is now using only a fraction of the EPC that was allocated to it by the default configuration.

If we were building a commercially viable password manager, one that could have an unlimited number of accounts, we’d need a different design strategy for the application since the fixed heap size means an enclave’s memory footprint cannot grow without bound. Our enclave would not be able to store the entire password database unencrypted in memory, and would instead have to implement some form of just-in-time processing, where accounts were only decrypted as they were needed.

Summary

The Intel SGX SDK includes tools to aid in the development and debugging of Intel SGX enclaves, and in particular to address the unique challenges presented by the security model. Production applications must not be inspectable, but debuggers are a critical tool in the software development lifecycle, and Intel provides a solution in the Intel SGX Debugger. Special instructions in the Intel SGX instruction set allow developers to use the Intel SGX Debugger on debug-mode enclaves while denying the same to production applications. Additionally, developers are encouraged to minimize the memory footprint of their enclaves because the EPC is a shared resource. The EMMT gives developers the information they need to appropriately size their stack and heap allocations. The Intel SGX SDK provides a complete development ecosystem with libraries, APIs, and tools for aiding development, debugging, and management of enclaves.

Sample Code

The code sample for this part of the series builds against the Intel SGX SDK version 1.8 using Microsoft Visual Studio 2015.

Coming Up Next

In Part 11 of the series, we’ll prepare our Tutorial Password Manager for deployment by creating an installation package. Stay tuned!
