
Diagnostic 15520: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria


This diagnostic message is emitted by the Intel(R) C++ Compiler 15.0 and above.

remark #15520: loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria

Cause:

The loop has more than one exit point. In most cases, a loop must have a single entry point and a single exit point in order to be auto-vectorized. 

Example:

  int no_vec(float a[], float b[], float c[])
  {
        int i = 0;
        while (i < 100) {
          a[i] = b[i] * c[i];
          //  this is a data-dependent exit condition:
          if (a[i] < 0.0)
             break;
          ++i;
        }
        return i;
  }

$   icc -c -qopt-report-file=stderr -qopt-report-phase=vec d15520.c
...
LOOP BEGIN at d15520.c(4,9)
   remark #15520: loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria   [ d15520.c(7,11) ]
LOOP END
===========================================================================

Workarounds:

Simple search loops are recognized by the compiler and can be vectorized if they are written as a "for" loop, e.g.:

int foo(float *a, int n){
  int i;
  for (i=0;i<n;i++){
    if (a[i] == 0){
      break;
    }
  }
  return i;
}

$   icc -c -qopt-report-file=stderr -qopt-report-phase=vec d15524.c
...
LOOP BEGIN at d15524.c(3,3)
   remark #15300: LOOP WAS VECTORIZED
LOOP END  

More complicated loops such as the original example can be vectorized using a "for" loop preceded by an OpenMP SIMD directive along with an early_exit clause:

int no_vec(float a[], float b[], float c[])
   {
        int i;
#pragma omp simd early_exit
        for(i=0; i<100; i++) {
          a[i] = b[i] * c[i];
   //  this is a data-dependent exit condition:
          if (a[i] < 0.0)
             break;
        }
        return i;
   }

$   icc -c -qopt-report-file=stderr -qopt-report-phase=vec d15520_b.c
...
LOOP BEGIN at d15520_b.c(5,13)
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
LOOP END


Cute Cats and Spaceships: How Stardrop's Creator Conquered Burnout


The original article is published by Intel Game Dev on VentureBeat*: Cute cats and spaceships: How Stardrop’s creator conquered burnout. Get more game dev news and related topics from Intel on VentureBeat.

Cat in Spaceship

Two years into making Stardrop, Joure Visser hit an impasse. He wanted to crank out the next chapter of the game's story, but he was having trouble staying focused. Even though the Singapore-based indie developer rarely took breaks, he knew that it was time to take a step back.

Stardrop is a first-person sci-fi adventure game that follows Aryn Vance, a salvage and rescue operator who strips down abandoned spaceships for resources. One day, she and her partner decide to investigate a strange distress signal, and they soon get caught up in trying to solve the mystery behind it. The first three chapters of Stardrop are available now via Steam's early access program, and Visser and his team hope to complete it later this year. That's good news because Stardrop was awarded Best Game with 3D Graphics at last year's Intel® Level Up Game Developer Contest.

While narrative inspirations include games like Portal* and Firewatch*, it was SEGA*'s Alien*: Isolation that really pushed him to make Stardrop in the first place. Playing the survival horror game made him wonder what the experience would be like if it didn't have any of the violence or tension from being hunted by a predatory creature.

From there, he came up with a story and universe that he thought would work well for the project. Interestingly, you won't find any bad guys or fighting mechanics in the game. One of Visser's goals from the beginning was to tell "a more human story."

"The most important aspect was to have a game that at the end of the day, would hopefully stick with you and make you appreciate certain things in life a little more," he added.

Stardrop isn't Visser's first attempt at making a narrative-heavy game. He got his start in the PC modding scene, most notably working on 1187, a Half-Life* 2 mod that told a new story within Valve's popular franchise. That helped him land a job at Monochrome, where he worked on the zombie shooter Contagion* for four years before leaving the company to pursue his own projects.

Despite the challenges that come with being an indie dev, Visser knew he made the right choice for his career. He mostly works on Stardrop himself, doing a lot of the programming, design work, and writing. But he does have a small team that supports him, including a 3D modeler, a composer, and a few voice actors.

With their help, he launched a successful crowdfunding campaign on Kickstarter* in 2016, raising almost $14,000. While that wasn't nearly enough money to fund the entire development, it helped Visser scrape by before launching Stardrop as a paid early access game.

"It's a real fight to get this done because as an indie, especially now, it's so damn difficult," Visser said.

Spaceship

Battling Burnout

It was the fall of 2017 when Visser faced his biggest obstacle yet: burnout. The only solution was to take a lengthy break.

During this time, he noticed that his four-year-old daughter didn't really have anything to play on PC. So he decided to make something she could enjoy: a simple sandbox-style platformer where players can run around the world as a cute little cat. His daughter loved it.

After working on the game for a few weeks, Visser's wife suggested that he should spruce it up so that he could sell it on Steam*. The side project eventually became Play with Gilbert, which came out later that November.

"It was something I needed without realizing it. Working on Play with Gilbert was pure fun because I had nothing to lose. … I just did it for my daughter, and then I turned it into something that other kids could play," said Visser.

The response to Play with Gilbert was much better than he expected (it currently has a "Mostly Positive" rating from players), and the sales provided some much-needed funding. Tinkering with digital kittens had another unintended side effect: The experience taught him a few valuable lessons about Unreal Engine* 4, which is the same engine powering Stardrop.

Some of the technical problems he couldn't figure out on the sci-fi game suddenly became a lot easier when he returned to work on it.

"I did some work back in Stardrop and was like, 'Wow! I really do know a little bit more about Unreal*. That's pretty cool,'" said Visser.

After the release of Play with Gilbert, Visser and his family packed their bags to visit his parents in Holland and Spain, a trip that he described as being long overdue. They stayed abroad for three months. Visser couldn't completely remove himself from Stardrop, however: He ended up doing some prototyping work on the next chapter. But the time off helped immensely.

When the newly re-energized developer came back home, he was ready to tackle Stardrop once more.

Stardrop Game

Looking to the Future

For Visser, Stardrop is meaningful on a number of levels. He's trying to tell an emotionally resonant story that isn't too common in games. He wants to kick-off his indie dev career on a high note, something that'll help establish his name. And he hopes that Stardrop will also help his colleagues expand their own careers.

At the very least, Visser hopes that Stardrop will sell well enough so that he can continue to make games for a living.

"If I'm lucky enough, and I do my job well enough, I think that's a very realistic possibility," he said.

Icons: Combat Arena* Aims to Usher in a New Era of Platform Fighters


The original article is published by Intel Game Dev on VentureBeat*: Icons: Combat Arena aims to usher in a new era of platform fighters. Get more game dev news and related topics from Intel on VentureBeat.

Screen of the Game Combat Arena

For Wavedash* CEO Matt Fairchild and creative director Jason Rice, building a new platform fighter makes more than just business sense: it's their way of giving back to the games and communities that brought them together.

"I literally owe my entire life — my career, my friends — to this genre. … I owe so much, and I have seen the transformative power that this genre has and its ability to draw in people from across the board and get them into a competitive community and build relationships there," said Fairchild.

"I have friends all over the world who — growing up as this east-coast, middle-class, white Protestant boy — I would not have made friends with if I had stayed in my little bubble, if gaming hadn't pulled me out of that and made me more open to new stuff. … But the platform fighting community is where my home was, where my people were at," Rice added.

The two friends first met more than a decade ago when Rice, who was working at Major League Gaming* at the time, hired Fairchild to host Super Smash Bros.* tournaments in Texas. Those early experiences were invaluable to them, and now they're leveraging that expertise for Icons: Combat Arena*.

Icons's gameplay is similar to Super Smash Bros. Players can choose from a number of different characters to fight with, and the goal is to deplete your opponents' lives by knocking them off the stage. One big difference from Nintendo*'s juggernaut series, however, is that Icons will be free-to-play when it comes out on PC (it's currently in an invite-only closed beta), which will enable it to grow and evolve alongside its players.

It's the kind of game Wavedash's founders wish existed 10 years ago. But it wasn't until a series of industry-wide changes — including the rise of esports, online streaming, and Amazon's $970 million acquisition of Twitch* — that Fairchild and Rice finally decided to make their dream a reality.

They recruited other like-minded developers. The team is a mix of triple-A veterans (some coming from big studios like Riot Games* and Blizzard*) and talented amateurs from the Super Smash Bros. modding community.

"So when we looked at all of these things converging together, we saw an opportunity to take the things that Riot did with League of Legends*, the things that Blizzard did with Hearthstone*, and more recently, the things that Bluehole* has done with Player Unknown's Battlegrounds*— to take a genre that's supported by passionate players and make it accessible and exciting to watch, and then take it to new audiences," said Rice.

The developers believe that fighting games haven't had their "League of Legends moment" yet, where a game breaks through the noise to reach mainstream levels of awareness and success. And they hope Icons can fill that gap. But Wavedash faces several challenges, not the least of which is: how do you create a new platform fighter that's accessible to casual players, yet deep enough for hardcore fans?

Screen of the Game Combat Arena

Getting Out of Smash's Shadow

One way that the Oakland, California-based studio is wrestling with that question is through Icons's design philosophy. It involves three main goals: make the game easy to learn, hard to master, and an endlessly watchable experience. With the first part, Wavedash is staying true to the pick-up-and-play nature that made platform fighters so popular in the first place. You don't need to master complicated move sets to start having fun in Icons.

But if you do want to dive deeper, the company is planning on adding a series of tutorials that'll help beginners get comfortable with the more nuanced strategies in the game.

Seasoned platform fighting players will also feel at home in Icons's world. It has familiar character archetypes or roles, like Kidd the Space Goat (the swift "space animal" archetype) and the empress Zhurong (the beefy "sword fighter" archetype). Wavedash cited Super Smash Bros. Melee as an inspiration for the kind of gameplay depth it wants Icons to have: players are still trying to master the GameCube* fighter more than 16 years after its release.

The last goal is perhaps the most difficult one to engineer. The developer won't really know how Icons fares as a spectator sport until it's out in the wild. However, the team has seen some positive signs, like the time Fairchild took his fiancée, who hadn't played a platform fighter before, to an Icons tournament.

Screen of the Game Combat Arena

Toward the end of the competition, she grew more excited about the matches, and she even started to pick out players to root for based on their play style.

"Building a game that is inherently watchable makes it not only more fun to play, but it also means that it's super digestible as content for a platform like Twitch or YouTube*. And it makes it something that is 'esports ready,'" said Rice. "Now, we don't get to decide if our game is an esport or not in the end. The players will. You can host the tournaments, but they have to show up and support you."

While similarities do exist between Icons and other platform fighters, the team hopes that a mix of new and familiar features will give the game its own identity. One example of bringing something new to the genre is the elemental duo Afi and Galu. They represent what Rice called a duo archetype, where players can switch between them at any time (the inactive partner stays on the stage as a statue).

Both of them share the same basic moves. But they also have their own special attacks: Afi has fire powers and Galu can manipulate water. From private playtesting sessions, Wavedash found that players quickly grew fond of Afi and Galu, not only because of their strategic potential, but also because of their quirky personalities.

"I'm most proud of the work that Jason and the design team — Adam Oliver and Wes Ruttle — did on taking a familiar idea and making it new, and we're seeing that in the feedback with people that have gotten to beta test Afi and Galu. They say, 'This is what I wanted from a duo character. This is familiar and yet it's also unlike any duo character I've ever played,'" said Fairchild.

Icons at Game Dev Exposition

Enlisting the Help of the Community

Gathering community feedback so early in the development process has always been a part of the developers' plans. Those responses help them identify problems they should work on, as well as which aspects of the Icons roster need improvement. Wavedash regularly travels to different gaming events and tournaments — like TwitchCon*, EVO* (the Evolution Championship Series), and CEO* (Community Effort Orlando) — to collect even more data by asking players (some of whom compete in fighting games for a living) to fill out surveys.

"Ultimately, Wavedash is a small team that's punching above its weight class. And we've only been able to do that because we've taken this kind of unorthodox approach to game development. But our players have helped us shape who these characters should be and what the game should feel like," said Rice.

Of course, the studio can't react to every individual piece of feedback; it still has a vision it wants to maintain. Rice and his colleagues have to carefully sift through comments (whether good or bad) to see if any of them actually helps reinforce the original design goals.

"Something Jason says a lot is we are in this business to create joy. Our job is to get people together playing games, having fun, and to always be that uniquely positive force in their lives. So as long as we're going towards that and people are invested in the outcome, then we're in good shape," said Fairchild.

Liver Patient Dataset Classification Using the Intel® Distribution for Python*


Abstract

This paper focuses on the implementation of the Indian Liver Patient Dataset classification using the Intel® Distribution for Python* on the Intel® Xeon® Scalable processor. Various preprocessing steps were applied to analyze their effect on the machine learning classification problem. Using the available features, the liver patient classification aims to predict whether or not a person has liver disease. Early detection of the disease without manual effort could be a great support for people in the medical field. Good results were obtained by using SMOTE as the preprocessing method and the Random Forest algorithm as the classifier.

Introduction

The liver, which is the largest solid organ in the human body, performs several important functions. Its major functions include manufacturing essential proteins and blood clotting factors, metabolizing fat and carbohydrates, eliminating harmful waste products and detoxifying alcohol and certain drugs, and secreting bile to aid digestion and internal absorption. Disorders of the liver can affect the smooth functioning of these activities.

Excessive consumption of alcohol, viruses, the intake of contaminated food and drugs and so on are the major causes of liver diseases. The symptoms may or may not be visible in the early stages. If not attended to properly, liver diseases can lead to life-threatening conditions. It is always better to diagnose the disease in an early stage in order to help ensure a high rate of survival for the patient.

Classification is an effective technique for handling this kind of problem in the medical field. Using the available feature values, the classifier can predict whether or not a person has liver disease, which helps doctors identify the disease in advance. It is always recommended to reduce Type I error (rejecting the null hypothesis when it is actually true), because it is better to identify a non-liver patient as a patient than to fail to identify a liver patient as a patient.

In this experiment, various preprocessing methods were tried prior to model building and training for comparison. Computational libraries like scikit-learn*, numpy, and scipy* from the Intel Distribution for Python on the Intel Xeon Scalable Processor were used for predictive model creation.

Environment Setup

Table 1 describes the environment setup that was used to conduct the experiment.

Table 1. Environment setup.

Setup                             Version
Processor                         Intel® Xeon® Gold 6128 processor, 3.40 GHz
System                            CentOS* (7.4.1708)
Core(s) per socket                6
Anaconda* with Intel channel      4.3.21
Intel® Distribution for Python*   3.6.3
Scikit-learn*                     0.19.0
Numpy                             1.13.3
Pandas                            0.20.3

Dataset Description

The Indian Liver Patient dataset was collected from the northeast area of the Andhra Pradesh state in India. This is a binary classification problem with the class labeled as liver patient (represented as 1 in the dataset) and not-liver patient (represented as 2). There are 10 features, which are listed in table 2.

Table 2. Dataset description.

Attribute Name   Attribute Description
V1               Age of the patient. Any patient whose age exceeded 89 is listed as being age 90.
V2               Gender of the patient
V3               Total bilirubin
V4               Direct bilirubin
V5               Alkphos (alkaline phosphatase)
V6               Sgpt (alanine aminotransferase)
V7               Sgot (aspartate aminotransferase)
V8               Total proteins
V9               Albumin
V10              A/G ratio (albumin and globulin ratio)
Class            Liver patient or not

Methodology

Methodology
Figure 1. Methodology.

Data Analysis

Before performing any processing on the available data, a data analysis is recommended. This process includes visualization of the data, identifying the outliers, and skewed predictors. These tasks help to inspect the data and thereby spot the missing values and irrelevant information in the dataset. A data cleanup process is performed to handle these issues and to ensure data quality. Gaining a better understanding of the dataset helps to identify useful information and supports decision making.

The Indian Liver Patient dataset consists of 583 records, of which 416 are records of people with liver disease and the remaining are records of people without liver disease. The dataset has 10 features, of which only one is categorical (V2, gender of the patient). The last column of the dataset represents the class to which each sample belongs (liver patient or not): a value of 1 indicates the person has liver disease and a 2 indicates the person does not have the disease. There are no missing values in the dataset.

Liver patient dataset
Figure 2. Visualization: liver patient dataset class.

Male and female population
Figure 3. Visualization: male and female population.

Figure 2 shows a visualization of the number of liver patients and non-liver patients in the dataset, whereas figure 3 represents a visualization of the male and female population in the dataset. Histograms of numerical variables are represented by figure 4.

Numerical variables
Figure 4. Visualization of numerical variables in the dataset.

Data Preprocessing

Some datasets contain irrelevant information, noise, missing values, and so on. These datasets should be handled properly to get a better result for the data mining process. Data preprocessing includes data cleaning, preparation, transformation, and dimensionality reduction, which convert the raw data into a form that is suitable for further processing.

The major objective of the experiment is to show the effect of various preprocessing methods on the dataset prior to classification. Different classification algorithms were applied to compare the results.

Some of the preprocessing steps include the following (a minimal code sketch appears after the list):

  • Normalization: This process scales each feature into a given range. The preprocessing.MinMaxScaler() function in the sklearn package is used to perform this action.
  • Assigning quantiles ranges: The pandas.qcut function is used for quantile-based discretization. Based on the sample quantiles or rank, the variables are discretized and assigned some categorical values.
  • Oversampling: This technique handles the unbalanced dataset by generating new samples in the under-represented class. SMOTE (Synthetic Minority Over-sampling Technique) is used for oversampling the data; it creates synthetic minority-class samples rather than simply duplicating existing ones. The SMOTE() function from imblearn.over_sampling is used to implement this.
  • Undersampling: Another technique to deal with unbalanced data is undersampling. This method is used to reduce the number of samples in the targeted class. ClusterCentroids is used for undersampling. The K-means algorithm is used in this method to reduce the number of samples. The ClusterCentroids() function from the imblearn.under_sampling package is used.
  • Binary encoding: This method converts the categorical data into a numerical form. It is used when the feature column has a binary value. In the liver patient dataset, column V2 (gender) has the values male/female, which is binary encoded into “0” and “1”.
  • One hot encoding: Categorical features are mapped onto a set of columns that have values “1” or “0” to represent the presence or absence of that feature. Here, after assigning the quantile ranges to some features (V1, V3, V5, V6, V7), one hot encoding is applied to represent the same in the form of 1s and 0s.
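
The sketch below pulls these steps together with the packages named above (pandas, scikit-learn, and imbalanced-learn). The file name, column labels, and random seed are assumptions for illustration only; the article's actual scripts may differ.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import ClusterCentroids

df = pd.read_csv("indian_liver_patient.csv")   # hypothetical file name

# Binary encoding of the categorical gender column (V2): Male/Female -> 0/1
df["V2"] = (df["V2"] == "Female").astype(int)

# Quantile-based discretization of a continuous feature, then one hot encoding
df["V5_bin"] = pd.qcut(df["V5"], q=4, labels=False)
df = pd.get_dummies(df, columns=["V5_bin"])

X = df.drop("Class", axis=1)
y = df["Class"]

# Normalization: scale every feature into a given range (here [0, 1])
X_scaled = MinMaxScaler().fit_transform(X)

# Oversampling the minority class with SMOTE
# (older imbalanced-learn releases name this method fit_sample)
X_over, y_over = SMOTE(random_state=50).fit_resample(X_scaled, y)

# Undersampling the majority class with cluster centroids
X_under, y_under = ClusterCentroids(random_state=50).fit_resample(X_scaled, y)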

Feature Selection

Feature selection is mainly applied to large datasets to reduce high dimensionality. This helps to identify the most important features in the dataset that can be given for model building. In the Indian Liver Patient dataset, the random forest algorithm is applied in order to visualize feature importance. The ExtraTreesClassifier() function from the sklearn.ensemble package is used for calculation. Figure 5 shows the feature importance with forests of trees. From the figure, it is clear that the most important feature is V5 (alkphos alkaline phosphatase) and the least important is V2 (gender).

Removing the least significant features helps to reduce processing time. Here V2 (gender of the patient), V8 (total proteins), V10 (A/G ratio albumin and globulin ratio), and V9 (albumin) are dropped in order to reduce the number of features for model building.

Feature importances
Figure 5. Feature importance with forests of trees.
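
As a rough illustration of this step, the sketch below ranks feature importance with ExtraTreesClassifier and then drops the least significant columns. X and y are assumed to be the feature matrix (a pandas DataFrame) and class labels from the preprocessing step, and the tree count and seed are placeholder values, not the article's settings.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=250, random_state=50)
forest.fit(X, y)

# Rank features from most to least important (figure 5 visualizes these values)
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order:
    print(X.columns[idx], forest.feature_importances_[idx])

# Drop the least significant features before model building
X_reduced = X.drop(columns=["V2", "V8", "V9", "V10"])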

Model Building

A list of classifiers was used for creating various classification models, which can be further used for prediction. A part of the whole dataset was given for training the model and the rest was given for testing. In this experiment, 90 percent of the data was given for training and 10 percent for testing. Since StratifiedShuffleSplit (a function in scikit-learn) was applied to split the train-test data, the percentage of samples for each class was preserved; that is, 90 percent of the samples from each class were taken for training and the remaining 10 percent from each class were used for testing. Classifiers from the scikit-learn package were used for model building.
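
A minimal sketch of that split and of fitting one of the classifiers follows, assuming X_reduced and y from the feature-selection step; RandomForestClassifier stands in here for any of the classifiers listed in the tables below.

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier

# 90/10 split that preserves the class proportions in both subsets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=50)
train_idx, test_idx = next(sss.split(X_reduced, y))
X_train, X_test = X_reduced.iloc[train_idx], X_reduced.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

clf = RandomForestClassifier(random_state=50)
clf.fit(X_train, y_train)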

Prediction

The label of a new input can be predicted using the trained model. The accuracy and F1 score were analyzed to understand how well the model has learned during training.

Evaluation of the model

Several methods can be used to evaluate the performance of the model. Cross validation, confusion metrics, accuracy, precision, recall, and so on are some of the popular performance evaluation measures.

The performance of a model cannot be assessed by considering only the accuracy, because accuracy alone can be misleading. Therefore, this experiment considers the F1 score along with the accuracy for evaluation.
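
A minimal sketch of computing these measures on the held-out test set is shown below, using clf, X_train/X_test, and y_train/y_test from the model-building sketch above.

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score

pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("F1 per class:", f1_score(y_test, pred, average=None))  # patient / non-patient
print(confusion_matrix(y_test, pred))

# 5-fold cross-validation accuracy on the training data
print(cross_val_score(clf, X_train, y_train, cv=5))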

Observation and Results

In order to find out the effect of feature selection on the liver patient dataset, accuracy and F1 score were analyzed with and without feature selection (see table 3).

After analyzing the results, it was inferred that removing the least significant features produced no remarkable change in the result, except in the case of the Random Forest Classifier. Because feature selection helps to reduce processing time, it was applied before the further processing techniques.

Table 3. Performance with and without feature selection.

                              Without Feature Selection                    With Feature Selection
Classifier                    Accuracy   F1 Patient   F1 Non-Patient       Accuracy   F1 Patient   F1 Non-Patient
Random Forest Classifier      71.1186    0.81         0.37                 74.5762    0.84         0.44
Ada Boost Classifier          74.5762    0.83         0.52                 72.8813    0.82         0.43
Decision Tree Classifier      66.1016    0.76         0.41                 67.7966    0.77         0.49
Multinomial Naïve Bayes       47.4576    0.47         0.47                 49.1525    0.5          0.48
Gaussian Naïve Bayes          62.7118    0.65         0.61                 61.0169    0.62         0.6
K-Neighbors Classifier        72.8813    0.83         0.33                 72.8813    0.83         0.33
SGD Classifier                71.1864    0.83         0                    67.7966    0.81         0
SVC                           71.1864    0.83         0                    71.1864    0.83         0
OneVsRest Classifier          62.7118    0.77         0.08                 32.2033    0.09         0.46

After feature selection, some preprocessing techniques were applied, including normalization. Here, each feature was scaled and translated such that it is in the given range on the training set. Another preprocessing was done by assigning quantile ranges to some of the feature values. One hot encoding was done after this to represent each column in terms of 1s and 0s. The classification result after performing normalization and quantile assigning is given in table 4. After analysis, it was clear that the preprocessing could not improve the performance of the model. But one hot encoding of the column helped in faster model building and prediction.

Table 4. Performance with normalization and quantile ranges.

                              Normalization                                Assigning quantile ranges
Classifier                    Accuracy   F1 Patient   F1 Non-Patient       Accuracy   F1 Patient   F1 Non-Patient
Random Forest Classifier      72.8813    0.82         0.43                 71.1864    0.82         0.32
Ada Boost Classifier          72.8813    0.82         0.43                 76.2711    0.85         0.36
Decision Tree Classifier      67.7966    0.77         0.49                 74.5762    0.84         0.35
Multinomial Naïve Bayes       71.1864    0.83         0                    67.7966    0.75         0.56
Gaussian Naïve Bayes          57.6271    0.58         0.58                 37.2881    0.21         0.48
K-Neighbors Classifier        72.8813    0.83         0.33                 71.1864    0.78         0.59
SGD Classifier                71.1864    0.83         0                    71.1864    0.83         0
SVC                           71.1864    0.83         0                    71.1864    0.83         0
OneVsRest Classifier          71.1864    0.83         0                    71.1864    0.83         0

Another inference is that the F1 score for non-patients is zero in some cases, which is a major challenge. In such cases, the accuracy may be high, but the model will not be reliable because the classifier classifies the whole data into one class. The major reason for this could be data imbalance. To address this issue undersampling and oversampling techniques were introduced. Cluster centroids were used for undersampling and the SMOTE algorithm was used for oversampling. The results are shown in table 5.

Table 5. Performance with undersampling and SMOTE.

                              Cluster Centroids (undersampling)            SMOTE (oversampling)
Classifier                    Accuracy   F1 Patient   F1 Non-Patient       Accuracy   F1 Patient   F1 Non-Patient
Random Forest Classifier      67.7966    0.73         0.6                  86.4406    0.91         0.75
Ada Boost Classifier          66.1016    0.71         0.58                 74.5762    0.81         0.63
Decision Tree Classifier      57.6271    0.65         0.47                 72.8813    0.79         0.6
Multinomial Naïve Bayes       45.7627    0.41         0.5                  49.1525    0.5          0.48
Gaussian Naïve Bayes          59.3220    0.6          0.59                 62.7118    0.65         0.61
K-Neighbors Classifier        67.7966    0.72         0.63                 71.1864    0.8          0.51
SGD Classifier                33.8983    0.13         0.47                 69.4915    0.81         0.18
SVC                           66.1016    0.69         0.63                 66.1016    0.71         0.6
OneVsRest Classifier          52.5423    0.55         0.5                  40.6779    0.29         0.49

Table 5 shows that undersampling and oversampling could handle the data imbalance problem. Using cluster centroids as the undersampling technique did not improve the accuracy, whereas SMOTE gave a tremendous improvement in accuracy. The best accuracy was obtained for the Random Forest Classifier and the Ada Boost Classifier. Processing was accelerated by running the machine learning workload on the Intel® Xeon® Scalable processor, making use of computational libraries from the Intel Distribution for Python.

Random Forest
Figure 6. ROC of Random Forest 5-fold cross validation.

ROC curve for various
Figure 7. ROC curve for various classifiers.

Figure 6 shows the ROC curve of the best classifier (Random Forest Classifier) for 5-fold cross validation. Higher accuracy was obtained during the cross validation as the validation samples were taken from the training sample that was subjected to oversampling (SMOTE). The expected accuracy during cross-validation was not attained during testing because the test data was isolated from the train data before performing SMOTE.

The ROC curves for various classifiers are given in figure 7. They can be used to compare the output quality of the different classifiers.

Conclusion

The preprocessing and classification methods did not improve the accuracy of the model. Handling the data imbalance using SMOTE gave better accuracy for the Random Forest and Ada Boost Classifier. A good model was created using the computational libraries from the Intel Distribution for Python on the Intel Xeon Scalable processor.

Author

Aswathy C is a technical consulting engineer working with the Intel® AI Academy Program.

C++ Extensions for Persistent Memory Programming


Overview

Interest is growing in persistent memory technologies. In addition to currently available products like nonvolatile dual in-line memory module (NVDIMM-N) and dynamic random-access memory (DRAM) with NAND flash-based storage, there are new technologies emerging, including 3D XPoint™ memory NVDIMMs from Intel. These new hardware types offer interesting new possibilities to developers, while at the same time introducing programming challenges.

Persistent memory programming is fundamentally different from traditional programming to volatile memory due to its requirement to ensure data retention after program completion, an application or system crash, or a power failure. Intel developed and open-sourced a set of libraries called the Persistent Memory Developer Kit (PMDK) to make it easier to convert an application to use persistent memory. This paper describes the C++ API for the libpmemobj library of the PMDK, along with other proposed changes to the C++ standard.

Download Technical Article (PDF)  

Resources

Persistent Memory Programming on Intel Developer Zone

pmem.io - for programming with the Persistent Memory Developer Kit (PMDK)

Github site for persistent memory programming

Google Group for persistent memory programming

CppCon 2017: Tomasz Kapela's session C++ and Persistent Memory Technologies, Like Intel's 3D-XPoint


Machine Learning and Mammography


Abstract

This article, Machine Learning and Mammography, shows how existing deep learning technologies can be utilized to train artificial intelligence (AI) to be able to detect invasive ductal carcinoma (IDC) [1] (breast cancer) in unlabeled histology images. More specifically, I show how to train a convolutional neural network [2] using TensorFlow* [3] and transfer learning [4] using a dataset of negative and positive histology images. In addition to showing how artificial intelligence can be used to detect IDC, I also show how the Internet of Things (IoT) can be used in conjunction with AI to create automated systems that can be used in the medical industry.

Breast cancer is an ongoing concern and one of the most common forms of cancer in women. In 2018, an estimated 266,120 new diagnoses are expected in the United States alone. The use of artificial intelligence can drastically reduce the need for medical staff to examine mammography slides manually, saving not only time, but money, and ultimately lives. In this article I show how we can use Intel technologies to create a deep learning neural network that is able to detect IDC.

Introducing the IDC Classifier

To create the IDC classifier, I use the Intel® AI DevCloud [5] to train the neural network, an Intel® Movidius™ product [6] for carrying out inference on the edge, an UP Squared* [7] device to serve the trained model and make it accessible via an API, and an IoT connected alarm system built using a Raspberry Pi* [8] device that demonstrates the potential of using the IoT (via the IoT JumpWay* [9]) combined with AI to create intelligent, automated medical systems.

The project evolved from a computer vision project that I have been developing for a number of years named TASS [10]. TASS is an open source facial recognition project that has been implemented using a number of different techniques, frameworks, and software development kits (SDKs).

Invasive Ductal Carcinoma

IDC is one of the most common forms of breast cancer. The cancer starts in the milk duct of the breast and invades the surrounding tissue. This form of cancer makes up around 80 percent of all breast cancer diagnoses, with more than 180,000 women a year in the United States alone being diagnosed with IDC, according to the American Cancer Society.

Convolutional Neural Networks

Inception v3 architecture diagram

Figure 1. Inception v3 architecture (Source).

Convolutional neural networks are a type of deep learning [11] neural network. These types of neural nets are widely used in computer vision and have pushed the capabilities of computer vision over the last few years, performing far better than older, more traditional neural networks; however, studies show [12] that there are trade-offs related to training times and accuracy.

Transfer Learning

Inception v3 model diagram

Figure 2. Inception V3 Transfer Learning (Source)

Transfer learning allows you to retrain the final layer of an existing model, resulting in a significant decrease in not only training time, but also the size of the dataset required. One of the most famous models that can be used for transfer learning is the Inception V3 model created by Google* [13]. This model was trained on thousands of images from 1,001 classes on some very powerful devices. Being able to retrain the final layer means that you can maintain the knowledge that the model had learned during its original training and apply it to your smaller dataset, resulting in highly accurate classifications without the need for extensive training and computational power. In one version of TASS, I retrained the Inception V3 model using transfer learning on a Raspberry Pi 3 device, so that should give you some idea of the capabilities of transfer learning.
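
To make the idea concrete, here is a minimal transfer-learning sketch using tf.keras: the pre-trained Inception V3 base is frozen and only a new two-class head is trained. The project's own training scripts (referenced in the GitHub repository) work differently, so the layer choices and hyperparameters below are illustrative assumptions.

import tensorflow as tf

# Pre-trained Inception V3 base without its original classification top layer
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # keep the pre-trained weights; train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # IDC negative / positive
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_dataset, epochs=60)  # train_dataset is a hypothetical tf.data pipeline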

Intel® AI DevCloud

The Intel AI DevCloud is a platform for training machine learning and deep learning models. The platform is made up of a cluster of servers using Intel® Xeon® Scalable processors. The platform is free and provides a number of frameworks and tools including TensorFlow, Caffe*, Keras*, and Theano*, as well as the Intel® Distribution for Python*. The Intel AI DevCloud is great for people getting started with learning how to train machine learning and deep learning models, as graphics processing units (GPUs) can be quite expensive, and access to the DevCloud is free.

In this project I use the Intel AI DevCloud to sort the data, train the model, and evaluate it. To accompany this article I created a full tutorial and provided all of the code you need to replicate the entire project; read the full tutorial and access the source code.

Intel® Movidius™ Neural Compute Stick

The Intel® Movidius™ Neural Compute Stick is a fairly new piece of hardware used for enhancing the inference process of computer vision models on low-powered edge devices. The Intel Movidius product is a USB appliance that can be plugged into devices such as Raspberry Pi and UP Squared; it offloads processing from the host device onto the Intel Movidius chip, making the classification process a lot faster. Developers can train their models using their existing TensorFlow and Caffe scripts and, by installing the Intel Movidius Neural Compute Stick SDK on their development machine, can compile a graph that is compatible with the Intel Movidius product. A lighter-weight API can be installed on the lower-powered device, allowing inference to be carried out via the Intel Movidius product.

Ready to Code

Hopefully, by now you are eager to get started with the technical walkthrough of creating your own computer vision program for classifying negative and positive breast cancer cells, so let’s get to the nitty gritty. Here I walk you through the steps for training and compiling the graph for the Intel Movidius product. For the full walkthrough, including the IoT connected device, please follow the GitHub* repository. Before following the rest of this tutorial, please follow the steps in the repository regarding setting up your IoT JumpWay device, as this step is required before the classification test happens.

Installing the Intel Movidius Neural Compute Stick SDK on Your Development Device

The first thing you need to do is to install the Intel Movidius Neural Compute Stick SDK on your development device. This is used to convert the trained model into a format that is compatible with the Intel Movidius product.

 $ mkdir -p ~/workspace
 $ cd ~/workspace
 $ git clone https://github.com/movidius/ncsdk.git
 $ cd ~/workspace/ncsdk
 $ make install

Next, plug your Intel Movidius product into your device and issue the following commands:

$ cd ~/workspace/ncsdk
$ make examples

Installing the Intel Movidius Neural Compute Stick SDK on Your Inference Device

Next, you need to install the Intel Movidius Neural Compute Stick SDK on your Raspberry Pi 3/UP Squared device. This is used by the classifier to carry out inference on local images or images received via the API we will create. Make sure you have the Intel Movidius product plugged in.

 $ mkdir -p ~/workspace
 $ cd ~/workspace
 $ git clone https://github.com/movidius/ncsdk.git
 $ cd ~/workspace/ncsdk/api/src
 $ make
 $ sudo make install
 $ cd ~/workspace
 $ git clone https://github.com/movidius/ncappzoo
 $ cd ncappzoo/apps/hello_ncs_py
 $ python3 hello_ncs.py

Preparing Your Training Data

For this tutorial, I used a dataset from Kaggle* (Predict IDC in Breast Cancer Histology Images), but you are free to use any dataset you like. I have uploaded the collection I used for positive and negative images that you will find in the model/train directory. Once you decide on your dataset you need to arrange your data into the model/train directory. Each subdirectory should be named with integers; I used 0 and 1 to represent positive and negative. In my testing I used 4400 positive and 4400 negative examples, giving an overall training accuracy of 0.8596 (See Training Results below) and an average confidence of 0.96 on correct identifications. The data provided is 50px x 50px; as Inception V3 was trained on images of size 299px x 299px, the images are resized to 299px x 299px. Ideally the images would be that size already so you may want to try different datasets and see how your results vary.
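
If your own dataset is not already 299px x 299px, a simple way to resize patches ahead of time is sketched below with Pillow; the file paths are hypothetical.

from PIL import Image

# Upscale a 50x50 histology patch to the 299x299 input size Inception V3 expects
img = Image.open("model/train/1/example.png").convert("RGB")
img = img.resize((299, 299), Image.BILINEAR)
img.save("model/train/1/example_299.png")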

Fine-Tuning Your Parameters

You can fine-tune the settings of the network at any time by editing the classifier settings in the model/confs.json file.

"ClassifierSettings":{
    "dataset_dir":"model/train/",
    "log_dir":"model/_logs",
    "log_eval":"model/_logs_eval",
    "classes":"model/classes.txt",
    "labels":"labels.txt",
    "labels_file":"model/train/labels.txt",
    "validation_size":0.3,
    "num_shards":2,
    "random_seed":50,
    "tfrecord_filename":"200label",
    "file_pattern":"200label_%s_*.tfrecord",
    "image_size":299,
    "num_classes":2,
    "num_epochs":60,
    "dev_cloud_epochs":60,
    "test_num_epochs":1,
    "batch_size":10,
    "test_batch_size":36,
    "initial_learning_rate":0.0001,
    "learning_rate_decay_factor":0.96,
    "num_epochs_before_decay":10,
    "NetworkPath":"",
    "InceptionImagePath":"model/test/",
    "InceptionThreshold": 0.54,
    "InceptionGraph":"igraph"
}

Time to Start Training

Now you are ready to upload the files and folders outlined below to the Intel AI DevCloud.

model
tools
DevCloudTrainer.ipynb
DevCloudTrainer.py
Eval.py

Once uploaded, follow the instructions in DevCloudTrainer.ipynb; this notebook will help you sort your data, train your model, and evaluate it.

Training Results

Training Accuracy Tensorboard graph

Figure 3. Training Accuracy Tensorboard

Training Total Loss graph

Figure 4. Training Total Loss

Evaluate Your Model

Once you have completed your training on the Intel AI DevCloud, complete the notebook by running the evaluation job.

Evaluation Results

INFO:tensorflow:Global Step 1: Streaming Accuracy: 0.0000 (2.03 sec/step)
INFO:tensorflow:Global Step 2: Streaming Accuracy: 0.8889 (0.59 sec/step)
INFO:tensorflow:Global Step 3: Streaming Accuracy: 0.8750 (0.67 sec/step)
INFO:tensorflow:Global Step 4: Streaming Accuracy: 0.8981 (0.65 sec/step)
INFO:tensorflow:Global Step 5: Streaming Accuracy: 0.8681 (0.76 sec/step)
INFO:tensorflow:Global Step 6: Streaming Accuracy: 0.8722 (0.64 sec/step)
INFO:tensorflow:Global Step 7: Streaming Accuracy: 0.8843 (0.64 sec/step)

-------------------------------------------------------------------------

INFO:tensorflow:Global Step 68: Streaming Accuracy: 0.8922 (0.81 sec/step)
INFO:tensorflow:Global Step 69: Streaming Accuracy: 0.8926 (0.70 sec/step)
INFO:tensorflow:Global Step 70: Streaming Accuracy: 0.8921 (0.63 sec/step)
INFO:tensorflow:Global Step 71: Streaming Accuracy: 0.8929 (0.84 sec/step)
INFO:tensorflow:Global Step 72: Streaming Accuracy: 0.8932 (0.75 sec/step)
INFO:tensorflow:Global Step 73: Streaming Accuracy: 0.8935 (0.61 sec/step)
INFO:tensorflow:Global Step 74: Streaming Accuracy: 0.8942 (0.67 sec/step)
INFO:tensorflow:Final Streaming Accuracy: 0.8941

So here we can see that the evaluation shows a final streaming accuracy of 0.8941.

evaluation accuracy graph

Figure 5. Evaluation Accuracy

evaluation total loss graph

Figure 6. Evaluation Total Loss


Download Your Model

When the training completes you need to download model/DevCloudIDC.pb and model/classes.txt to the model directory on your development machine. Ensure that the Intel Movidius product is set up and connected, and then run the following commands on your development machine:

$ cd ~/IoT-JumpWay-Intel-Examples/master/Intel-Movidius/IDC-Classification
$ ./DevCloudTrainer.sh

The contents of DevCloudTrainer.sh are as follows:

#IDC Classification Trainer
mvNCCompile model/DevCloudIDC.pb -in=input -on=InceptionV3/Predictions/Softmax -o igraph
python3.5 Classifier.py InceptionTest

This script does two things:

  1. Compiles the model for the Intel Movidius product
  2. Runs the test (Classifier.py in InceptionTest mode)

Testing on Unknown Images

Once the shell script has finished the testing program will start. In my example I had two classes, 0 and 1 (IDC negative and IDC positive); a classification of 0 shows that the AI thinks the image is not IDC positive, and a classification of 1 is positive.

-- Loaded Test Image model/test/negative.png

-- DETECTION STARTING
-- STARTED: :  2018-04-24 14:14:26.780554

-- DETECTION ENDING
-- ENDED:  2018-04-24 14:14:28.691870
-- TIME: 1.9114031791687012

*******************************************************************************
inception-v3 on NCS
*******************************************************************************
0 0 0.9873
1 1 0.01238
*******************************************************************************

-- Loaded Test Image model/test/positive.png

-- DETECTION STARTING
-- STARTED: :  2018-04-24 14:14:28.699254

-- DETECTION ENDING
-- ENDED:  2018-04-24 14:14:30.577683
-- TIME: 1.878432035446167

TASS Identified IDC with a confidence of 0.945

-- Published to Device Sensors Channel

*******************************************************************************
inception-v3 on NCS
*******************************************************************************
1 1 0.945
0 0 0.05542
*******************************************************************************

-- INCEPTION V3 TEST MODE ENDING
-- ENDED:  2018-04-24 14:14:30.579247
-- TESTED:  2
-- IDENTIFIED:  1
-- TIME(secs): 3.984593152999878

So, on the development machine you should see results similar to the ones above. We can see in my results that the program has successfully classified both the negative and the positive. Now it is time to test this out on the edge.

Inference on the Edge

Now that it is all trained and tested, it is time to set up the server that will serve the API. For this I have provided Server.py and Client.py.

The following instructions will help you set up your server and test a positive and negative prediction:

  1. If you used the Predict IDC in Breast Cancer Histology Images dataset, you can use the positive.png and negative.png as they are from that dataset; if not, you should choose a positive and negative example from your testing set and replace these images.
  2. The server is currently set to start up on localhost. If you would like to change this you need to edit line 281 of Server.py and line 38 of Client.py to match your desired host. Once you have things working, if you are going to be leaving this running and access it from the outside world, you should secure it with Let's Encrypt* or similar.
  3. Upload the following files and folders to the UP Squared or Raspberry Pi 3 device that you are going to use for the server.
    model/test/
    model/classes.txt
    model/confs.json
    tools
    igraph
    Server.py
  4. Open up a terminal, navigate to the folder containing Server.py, and issue the following command. This starts the server and waits to receive images for classification:

     $ python3.5 Server.py

  5. If you have followed all of the above steps, you can now start the client on your development machine with the following command:

     $ python3.5 Client.py

This sends a positive and negative histology slide to the Raspberry Pi 3 or UP Squared device, which will return the predictions.
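
For reference, a request to the inference API might look roughly like the sketch below, based on the /api/infer endpoint and port 7455 visible in the output further down; the exact payload format expected by the project's Client.py may differ, so treat this as an assumption rather than the project's actual client code.

import requests

# Hypothetical host/port of the UP Squared or Raspberry Pi 3 server
API_URL = "http://192.168.1.40:7455/api/infer"

with open("model/test/positive.png", "rb") as sample:
    response = requests.post(API_URL, files={"file": sample})

print(response.json())  # e.g. {'Response': 'OK', 'ResponseMessage': 'IDC Detected!', 'Results': 1}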

!! Welcome to IDC Classification Client, please wait while the program initiates !!

-- Running on Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]

-- Imported Required Modules
-- IDC Classification Client Initiated

{'Response': 'OK', 'ResponseMessage': 'IDC Detected!', 'Results': 1}
{'Response': 'OK', 'ResponseMessage': 'IDC Not Detected!', 'Results': 0}
* Running on http://0.0.0.0:7455/ (Press CTRL+C to quit)

-- IDC CLASSIFIER LIVE INFERENCE STARTING
-- STARTED: :  2018-04-24 14:25:36.465183

-- Loading Sample
-- Loaded Sample
-- DETECTION STARTING
-- STARTED: :  2018-04-24 14:25:36.476371

-- DETECTION ENDING
-- ENDED:  2018-04-24 14:25:38.386121
-- TIME: 1.9097554683685303

TASS Identified IDC with a confidence of 0.945

-- Published: 2
-- Published to Device Warnings Channel

-- Published: 3
-- Published to Device Sensors Channel

*******************************************************************************
inception-v3 on NCS
*******************************************************************************
1 1 0.945
0 0 0.05542
*******************************************************************************

-- IDC CLASSIFIER LIVE INFERENCE ENDING
-- ENDED:  2018-04-24 14:25:38.389217
-- TESTED:  1
-- IDENTIFIED:  1
-- TIME(secs): 1.9240257740020752

192.168.1.40 - - [24/Apr/2018 14:25:38] "POST /api/infer HTTP/1.1" 200 -

-- IDC CLASSIFIER LIVE INFERENCE STARTING
-- STARTED: :  2018-04-24 14:25:43.422319

-- Loading Sample
-- Loaded Sample
-- DETECTION STARTING
-- STARTED: :  2018-04-24 14:25:43.432647

-- DETECTION ENDING
-- ENDED:  2018-04-24 14:25:45.310354
-- TIME: 1.877711534500122

-- Published: 4
-- Published to Device Warnings Channel

-- Published: 5
-- Published to Device Sensors Channel

*******************************************************************************
inception-v3 on NCS
*******************************************************************************
0 0 0.9873
1 1 0.01238
*******************************************************************************

-- IDC CLASSIFIER LIVE INFERENCE ENDING
-- ENDED:  2018-04-24 14:25:45.313174
-- TESTED:  1
-- IDENTIFIED:  0
-- TIME(secs): 1.89084792137146

192.168.1.40 - - [24/Apr/2018 14:25:45] "POST /api/infer HTTP/1.1" 200 -

Here we can see that, using the Intel Movidius product on an UP Squared device, there is no difference in classification accuracy compared to the development machine (in my case, a Linux* device with an NVIDIA* GTX 750ti), and only a slight difference in the time it took the classification process to complete. It is interesting to note that the results above were actually more accurate than training the model on my GPU.

IoT Connectivity

To set up the IoT device you are welcome to complete the tutorial on the GitHub repo, but I will go through in some detail here on exactly what this part of the project does, and explain how the proof of concept provided could be used in other medical applications.

The device we create is an IoT connected alarm system built on a Raspberry Pi device. Once set up, the results that are captured from the classification of images sent to the server trigger actions on the IoT that communicate with the Raspberry Pi device. In this case, the actions are turning on a red LED and a buzzer when cancer is detected, and turning on a blue LED when the classification results in no cancer being detected. Obviously this is a very simple proof of concept, but it shows a possibility for powerful applications that can save time for medical staff and, hopefully, in the right hands could help save lives through early and accurate detection.

References

  1. Invasive Ductal Carcinoma
  2. Convolutional Neural Network
  3. TensorFlow
  4. Transfer learning
  5. Intel AI DevCloud
  6. Intel Movidius Brand
  7. Intel UP2
  8. Raspberry Pi
  9. IoT JumpWay
  10. TASS
  11. Deep Learning
  12. Comparing Deep Neural Networks and Traditional Vision Algorithms in Mobile Robotics
  13. Rethinking the Inception Architecture for Computer Vision

How Netrolix AI-WAN* Broke the SD-WAN Barrier


Introduction – Fulfilling WAN Demand

Demand for lower cost wide-area networks (WANs) has increased dramatically in recent years. Industry research shows that in 2016 alone, WAN traffic growth ranged from more than 150 percent in the Americas to nearly 250 percent in APAC. The same data shows that the greatest growth was in new 100-Mbps services, and it reveals a major shift toward Software-Defined Wide-Area Network (SD-WAN) solutions.

It's not difficult to see what is driving this demand:

  • More organizations are moving their IT from on-premises to cloud and hybrid infrastructures.
  • Even the largest business software providers are turning to SaaS delivery models for their flagship products.
  • There are more data-intensive applications that routinely use big data analytics, video, and connected Internet of Things (IoT) devices.
  • Changing WAN topologies are placing more computing power at the network edge, to the point where many connected devices are becoming mini data centers.

Although the cost of dedicated Multiprotocol Label Switching (MPLS) lines has dropped in recent years, it remains prohibitively high for meeting all the demand. This has led to a boom in SD-WAN services that offer a much lower cost and more agile approach to WAN connectivity. But SD-WAN is not an ideal solution. It suffers from performance, reliability, and security issues that make it less suitable for critical business operations.

Working to address these limitations, Netrolix* has created an entirely new kind of WAN. Their AI-WAN* delivers MPLS reliability and security with SD-WAN agility and cost advantages, all while guaranteeing exceptional end-to-end throughput. Adopting an AI-WAN solution is like trading in your economical Chevy for a self-driving, bullet-proof Ferrari, at no extra cost.

How does Netrolix's AI-WAN accomplish this? To understand that, let's look at why typical SD-WANs fall short.

The Promise and Limitations of SD-WAN

MPLS circuits have distinct advantages. Like private roads running directly between branch offices, they are fast, secure, and reliable. To get from one office to another, you just hop in your Ferrari and go as fast as the road will take you. But it takes a long time and a lot of money to build that private road, and it's hard to change once you've built it.

As WAN usage has grown, the idea of replacing high-cost, dedicated MPLS connections with low-cost SD-WANs has appeal that goes beyond just the cost savings. Compared to MPLS services, SD-WANs are easy to set up and configure, which simplifies the task of adding WAN segments to an existing network infrastructure. SD-WANs also centrally manage how applications use the network, enabling some optimization at the network edge for given network conditions.

SD-WANs are great because the roads are already built. They not only run between your branch offices, they go everywhere. You can open a new office anywhere and quickly set up an SD-WAN connection. To get from one place to another, you just hop in your Ferrari and, … well… , maybe you sit in traffic. Or maybe you get robbed while you're sitting in traffic. And that's the downside of SD-WANs. Compared to MPLS, they introduce new security issues, and they have performance limitations.

Because SD-WANs sit at the network edge, they have no control over traffic flow in the cloud. They rely on Internet service providers (ISPs) whose business models are based on over-subscribing capacity and best-effort delivery services. This means there can be, and often is, network congestion somewhere along the data path. Furthermore, ISPs often share network infrastructure. Poor connections across this infrastructure often result in jitter, packet loss, and latency issues. For all of these reasons, SD-WANs are unable to guarantee a quality of service.

From a security perspective, SD-WANs offer built-in security features such as native support for encryption and easy application-specific WAN segmentation. However, their use of encryption is often limited by computing power, and many SD-WAN appliances are not adequately hardened against unauthorized access.

Because of these limitations, many businesses see an SD-WAN as a low-cost supplement to their existing MPLS connections rather than a replacement. So, does this mean organizations are trapped into living with their costly MPLS connections?

Not according to Netrolix.

Netrolix's Unique WAN Solution: AI-WAN*

To understand Netrolix's AI-WAN, it's best to start with the story of how they created it.

It began in 2014 with the idea of solving the challenge of large-scale, centralized firewall, Internet-based networking. Netrolix envisioned building a high-performance Internet WAN that would be compatible with any existing connection protocol or appliance, and it would work by optimizing traffic across the Internet. They wanted to create a solution that service providers and businesses could use to build their own networks. "In fact, that was our original goal. We wanted to empower the end user to architect and build their own network. When we began, we wanted to give complete access and control to that end user," says Wes Jensen, CEO of Netrolix.

Netrolix began by building a network between host data centers and monitoring a multitude of performance metrics, which they put into their own proprietary algorithm for optimizing flow between the data centers. By using IP transit connections, they could also look at to and from downstream service providers.

"We initially deployed on six data centers in Seattle, Los Angeles, Chicago, Dallas, New York, and Atlanta. That allowed us to leverage just about every ISP in the U.S.," says Jensen. "By early 2015, our network grew to about 18 data centers, 9 of which were in the U.S. At that point, we realized our own algorithms weren't going to cut it. We were looking at things statically, on a per data center basis."

That's when they had the idea of applying machine learning and artificial intelligence (AI) to the mass of Internet performance data they were collecting. Suddenly, they were able to analyze and correlate the Internet traffic in all the data centers simultaneously, in real time, and that was a total game changer.

Netrolix used their AI capability to build a model that would look at millions of data points from every ISP on multiple performance factors – latency, jitter, packet loss, throughput, and availability – for both real-time and historic events, and how these changed at specific times of day. They developed a suite of low-cost endpoint devices to connect to their AI network, and they extended their analysis to these endpoints. Continuously monitoring and analyzing all the data paths across this AI fabric became the foundation for using proprietary algorithms to optimize Internet traffic.

Today, Netrolix hardware and software is in 65 data centers globally, leveraging 20,000 nodes just to collect data on the global Internet. "We're collecting data on all the ISPs on the planet to determine optimal paths not only to any endpoint, but also across our core. That is the AI fabric itself. That is the foundation over the Internet that we have created. We have eliminated the whole ‘best-effort' mantra and solved for Internet performance issues, and we're seeing performance that is on a par or better than traditional private networks from your global service providers," Jensen says. And that is what the patent-protected AI-WAN from Netrolix is all about.

So, say you have a Netrolix AI-WAN connection and you want to go from one branch office to another. You hop in your Ferrari, sit back, and let it take you there. With its eyes on the entire global Internet, the AI-WAN has already determined your best route. You take off at top speed. All the lights turn green just as you hit the intersections. There's no congestion. And bam! You're there, every time, at an SD-WAN cost and with much higher security.

How does Netrolix do this? Let's see what's inside the AI-WAN.

Inside Netrolix's AI-WAN

The Netrolix AI-WAN consists of the AI-WAN fabric, which is a vast network of ISPs and host data centers around the globe whose traffic is continuously analyzed and monitored by a proprietary deep-learning analytical engine. To connect to this AI-WAN fabric, Netrolix has developed a suite of low-cost endpoint devices, which are software-defined gateways (SDGs) that run on either their own bare-metal based Intel® architecture platforms or appropriate client-owned equipment.

The AI engine monitors the global Internet while monitoring and communicating with every endpoint device connected to the AI-WAN fabric. All of Netrolix's services, including MPLS, Virtual Private LAN Service (VPLS), Ethernet private line, SD-WAN, global Virtual Private Network (VPN), cloud services, and other offerings are layered over the AI-WAN fabric.

Netrolix SDGs

Netrolix offers a suite of SDGs that are built on Intel® chipsets. They differ from one another based on their rated throughputs and the network functions they perform. They can provide simple connections between existing network appliances and the AI-WAN fabric. They can also act as routers, switches, firewalls, and other edge compute devices, and they can be configured to deliver MPLS, VPLS, and Virtual Private Enterprise (VPE) connections.

All Netrolix SDGs share similar physical characteristics in that they use low-power Intel® components and they don't have any moving parts, such as fans, which enables them to operate in complete silence. Figure 1 and the accompanying descriptions show different ways standard Netrolix SDGs connect to the Netrolix AI-WAN fabric for optimum network performance.

Different ways a Netrolix SDG connects to the AI-WAN
Figure 1. Netrolix software-defined gateways (SDGs) connect to the Netrolix AI-WAN* fabric in many ways.

1. The Netrolix SDG is a simple network interface device (NID) that basically terminates a circuit. If a Netrolix AI-WAN user wants to keep their existing Fortinet*, Juniper*, Cisco*, or whatever network devices they have in place today, they can do so.

2. The Netrolix SDG can be more than just an NID. It can also be the network access point plus a router, a switch, and a firewall. It can provide all those functions in one solution.

3. The Netrolix SDG can also be a software-defined multi-access and mobile edge compute device (SD-MEC). This combines network access, router, switch, firewall, and edge compute capabilities into one solution.

4. In this scenario, the Netrolix SDG provides cloud access, allowing direct connections to cloud infrastructure. Rather than paying a lot of money for a Microsoft Azure* ExpressRoute or an Amazon AWS* Direct Connect product, a user can spin up a virtual machine (VM) immediately, deploy appropriate Netrolix virtual router software, and immediately connect to a private global network. Jensen explains, "You don't have to go through the hard-to-understand firewall guys, or fight with the VM guy, or have the VM guy pointing to the firewall guy. And by the way, that costs about $5,000 more than our solution. The fact is, we just want to simplify that."

5. Here, the Netrolix SDG is being used as a hub to move aggregated IoT device data and autonomous application data over the AI-WAN. This is an important capability because many IoT applications and remote devices that involve data and control functions, such as drones and industrial control systems, are being built with little knowledge or regard for security. Being able to aggregate sensor and control data and then send it to a control center with absolute security and reliability becomes incredibly important. That is now possible over the AI-WAN at speeds that enable real-time control and the highest levels of data protection possible.

6. This is the user portal that enables users to see and control everything in their AI-WAN. Netrolix's goal is to eliminate centrally configured stacks that require vendor and equipment manufacturer intervention to set up. Netrolix empowers users to do it themselves. "We want you to do everything from a portal," Jensen says. "A device shows up and you plug it in. You have multiple configuration templates you're pushing to different types of network elements and pieces. It just becomes so simple."

Easy setup and automatic, continuous data path optimization.

When a Netrolix SDG is connected to the user's Internet service, the Netrolix AI-WAN detects and identifies that Netrolix device and immediately determines the six most optimal data centers for connection. Then, from those six data centers, the AI-WAN further selects the three most optimal data centers. It automatically connects the newly installed Netrolix SDG to those three data centers using three separate Netrolix gateways that are part of the Netrolix AI-WAN fabric.

Once connected, the new device shows up on the user's AI-WAN portal. The user then configures the SDG with the functionality required by their application.

During normal operation, the AI-WAN fabric ranks a device's three active data center connections from most to least optimal and moves data over the most optimal path. If that path becomes impaired, the AI-WAN continues to operate seamlessly over the other two connections. Whether you have one Internet connection or four, the AI-WAN platform views them as a single port into the network. As long as the Netrolix device is connected, the AI-WAN fabric goes through a complete path re-optimization process every five minutes. With three different continuously optimized data center connections, that network link becomes more reliable than a dedicated private line.
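
To make the path-ranking idea more concrete, here is a purely illustrative Java sketch of how a set of measured candidate paths might be scored and the best three selected. This is not Netrolix code: the real AI-WAN path selection is proprietary and model-driven, and the PathMetrics class, the weights, and the scoring formula below are assumptions used only for illustration.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: rank candidate data center paths by measured latency,
// jitter, and packet loss, keeping the best three. The weights are arbitrary.
public class PathRanker {

    static class PathMetrics {
        final String dataCenter;
        final double latencyMs;   // round-trip latency in milliseconds
        final double jitterMs;    // jitter in milliseconds
        final double lossRate;    // fraction of packets lost (0.0 to 1.0)

        PathMetrics(String dataCenter, double latencyMs, double jitterMs, double lossRate) {
            this.dataCenter = dataCenter;
            this.latencyMs = latencyMs;
            this.jitterMs = jitterMs;
            this.lossRate = lossRate;
        }

        // Lower score is better.
        double score() {
            return latencyMs + 4.0 * jitterMs + 1000.0 * lossRate;
        }
    }

    // Return the three best candidates, best first.
    static List<PathMetrics> bestThree(List<PathMetrics> candidates) {
        List<PathMetrics> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(PathMetrics::score));
        return sorted.subList(0, Math.min(3, sorted.size()));
    }

    public static void main(String[] args) {
        List<PathMetrics> candidates = new ArrayList<>();
        candidates.add(new PathMetrics("Chicago-1", 12.0, 1.5, 0.000));
        candidates.add(new PathMetrics("Chicago-2", 14.0, 0.8, 0.000));
        candidates.add(new PathMetrics("Dallas-1",  28.0, 2.0, 0.001));
        candidates.add(new PathMetrics("NewYork-1", 22.0, 1.2, 0.000));
        candidates.add(new PathMetrics("Seattle-1", 55.0, 3.5, 0.002));
        candidates.add(new PathMetrics("Atlanta-1", 30.0, 1.0, 0.000));

        // Re-running this ranking periodically (for example, every five minutes)
        // mirrors the continuous re-optimization described above.
        for (PathMetrics p : bestThree(candidates)) {
            System.out.println(p.dataCenter + " -> " + p.score());
        }
    }
}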

Guaranteed throughput at service provider connection speeds.

Netrolix has architected their AI-WAN to guarantee end-to-end throughput at full-duplex Internet service provider connection speeds. This is different than the way SD-WANs operate.

For example, an SD-WAN provider might provide you with a box that is licensed for 200-Mbps throughput. You will pay for the box, the connection speed, and the gateway they set up for you. But in reality, you are not guaranteed 200-Mbps end-to-end throughput because once you reach the network edge (for example, the cloud), the SD-WAN has no control over traffic.

The Netrolix AI-WAN uses its placement in host data center locations and Internet traffic monitoring to optimize data paths based on two basic principles:

  • The "last mile" of Internet connectivity is where congestion is likely to occur due to ISPs over-subscribing their service. Conversely, traffic between host data centers happens in extremely high-bandwidth connections.
  • Based on AI traffic analysis, most disruption in Internet connections happens where the connections jump between major data paths or between service providers.

To most effectively optimize data paths and minimize latency, Netrolix has strategically located its AI-WAN system in key data centers based on actual traffic flow rather than geography. For example, Netrolix placed AI-WAN components in multiple data centers around Chicago rather than putting nodes in surrounding cities that all route through Chicago.

By carefully selecting host data centers, analyzing downstream ISP traffic, and having full knowledge of traffic between data centers, Netrolix is able to guarantee end-to-end throughput at wire speeds. If you have a 200-Mbps full-duplex Internet service, the Netrolix AI-WAN will deliver 200 Mbps of secure end-to-end throughput.

Making the Netrolix AI-WAN as secure as an MPLS connection.

For many users who are considering WAN options, throughput is their primary consideration. But security is just as important, especially in today's environment of non-stop intrusion.

Netrolix has built security into the Netrolix AI-WAN fabric in the following ways:

  • Data encryption – All data passing through the Netrolix AI-WAN is encrypted using IKEv2, which is the most powerful encryption standard currently in use.
  • Key management – The Netrolix AI-WAN uses a robust Key Management System (KMS) to generate encryption keys for every device, every element of the AI-WAN network, every storage instance, and every network configuration. Unlike typical SD-WAN solutions, these are not shared keys. Every network element has its own key, and every key in the global AI-WAN is automatically swapped every five minutes.
  • Hardware Security Module (HSM) authentication – This is the same hardware-based authentication used in credit and debit card chips. It prevents reconfiguration of any Netrolix SDG unless the device is connected over the AI-WAN to a Netrolix management console, which prohibits unauthorized access.
  • RADIUS attributes – These are used to authenticate any devices connecting to the AI-WAN.
  • The AI analytics engine – The same AI engine that monitors and optimizes Internet traffic is also continuously monitoring every device connected to the AI-WAN for any anomalous data patterns. It not only monitors the AI-WAN fabric itself, but also data coming from or going to IoT devices, for example.

Intel Inside® – Why Netrolix Chose Intel® Technology for Its Bare-Metal Platform

Netrolix had several choices when designing their AI-WAN hardware. One was whether to build their solution around another company's hardware platform versus building from the ground up using a bare-metal solution. They chose the bare-metal approach because standard commercial equipment was too heavy, too expensive, and did not offer the flexibility or computing power they needed to deliver the many different network functions they had in mind.

Next came the choices for platform hardware, and that also was a quick decision. They chose Intel chipsets because of their broad compatibility in the networking world, the consistency in the Linux* kernel across different chip sets, and their ability to run x86 software.

There were other factors, too. From an engineering perspective, Intel offered chipsets with low power consumption, which helped enable Netrolix to build rigid boxes with no moving parts. Selecting the interface was another big factor. When choosing a standard router, users are bound by whatever interfaces that company manufactures. Netrolix was already doing a lot of work with the IoT and unmanned aerial vehicles (UAVs), which require nonstandard interfaces. Having this flexibility was a key consideration, and the availability of Intel's open software library became critical.

Ultimately, the flexibility of Intel chipsets in supporting Netrolix's architectural needs and the supporting software was the deciding factor. "We are an open stack shop," says Jensen. "All of our services are software-driven on an open platform, and Intel just became a very easy-to-use and reliable chipset."

The following table lists the different Netrolix hardware platforms, the Intel chipsets that power them, and their performance and use cases.

Table 1. Netrolix hardware platforms, Intel® chipsets, and use cases.

Netrolix Platform | Intel® Chipset | Performance and Use Cases
Mobile Rigid (OBD) MR1 | Broadwell | A rigid network interface device used for mobile connectivity.
SDG100 | Gemini Lake | Standard software-defined gateway providing up to 100-Mbps secured throughput.
SDG400 | Apollo Lake | Up to 400 Mbps of secured throughput, often used for network headends.
SDG PE/Core | Kaby Lake | Used in data centers as a network edge device but also as a network core device. It also acts as a VPN gateway. Up to 2-Gbps secured throughput.
SDG1000 SMRP Gateway | Coffee Lake | VNF cluster used as a customer headend. Often used by large organizations that need to run multiple separate networks in every location to isolate different types of data falling under different regulatory or security regimes.
SDG2000 | Intel® Xeon® Scalable processor | Used in large network edge or data center deployments where it's necessary to interface to multiple 10-, 40-, or 100-Gbps connections.
UAV | Cherry Trail | Used in very lightweight, low power consumption applications such as drones. Provides full SD-MEC capabilities such as multiple connection failover between multiple providers and edge compute, but on a drone.

In addition to these Intel chipsets, Netrolix used the following features to support virtualization, secure hardware sharing, and hardware-based encryption:

  • Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
  • Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d)
  • Intel® AES New Instructions (Intel® AES-NI)

Netrolix AI-WAN Delivers Higher Throughput More Securely at a Lower Cost

"That was our goal, and that's what we achieved," Jensen says, who points out that a 10-Mbps private line between New York and London would cost about $1,500 per month. A similar Netrolix AI-WAN connection delivering guaranteed secure throughput of 10 Mbps would cost about $300 per month. The AI-WAN connection will also be something users themselves can quickly set up and configure, without relying on vendors or service providers.

And whereas the private line will only have one fixed data path, the AI-WAN connection can take many paths. At any given time, it will always have three optimum data paths, and those will be re-tested every five minutes. This means the AI-WAN connection will be much more reliable over the long term compared to a dedicated private line.

The AI-WAN will also be more reliable, more secure, and deliver higher throughput than an SD-WAN, which has little or no visibility into data paths beyond its own network edge. Some SD-WAN vendors attempt to optimize data routing based on a limited view of performance metrics. "Most devices that do look at performance metrics are statically configured by architects," Jensen notes. "With our neural network, we don't have to statically architect that at all. The fabric itself is what self-heals and maintains and optimizes constantly. Once we realized that the AI-WAN architecture was more powerful than anything we could humanly architect, that's when the light went on for us."

To learn more about the Netrolix AI-WAN and Netrolix's many networking services built on the AI-WAN fabric, visit the Netrolix website.

Also, visit the Intel® Network Builders website and explore a vibrant resource center that provides support to anyone interested in software-defined networking and network function virtualization.

Code Sample: Introduction to Java* API for Persistent Memory Programming


File(s): Download
License: 3-Clause BSD License
Optimized for...
OS: Linux* kernel version 4.3 or higher
Hardware: Emulated: See How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
Software (Programming Language, tool, IDE, Framework): C++ Compiler, JDK, Persistent Memory Development Kit (PMDK) libraries, and Persistent Collections for Java (PCJ)
Prerequisites: Familiarity with C++ and Java

Introduction

In this article, I present an introduction to the Persistent Collections for Java (PCJ) API for persistent memory programming. This API emphasizes persistent collections, since collection classes map well to the use cases observed for many persistent memory applications. I show how it is possible to instantiate and store a persistent collection (without serialization), as well as fetch it later after a power cycle. A full example, consisting of a persistent array of employees (where Employee is a custom persistent class implemented from scratch), is described in detail (including source code). I conclude the article by showing how Java programs using PCJ can be compiled and run.

Why Do We Need APIs?

At the core of the NVM Programming Model (NPM) standard, developed by key players in the industry through the Storage and Networking Industry Association (SNIA), we have memory-mapped files. This model was chosen primarily to avoid reinventing the wheel, given that the majority of problems they were trying to solve (such as how to collect, name, and find memory regions, or how to provide access control, permissions, and so on) were already solved by file systems (FS). In addition, memory-mapped files have been around for decades. Thus, they are stable, well understood, and widely supported. Using a special FS, processes running in user space can (after opening and mapping a file) access this mapped memory directly without involving the FS itself, in turn avoiding expensive block caching/flushing and context switching to/from the operating system (OS).

Programming directly against memory-mapped files, however, is not trivial. Even if we avoid block caching on dynamic random-access memory (DRAM), some of the most recent data writes may still reside—unflushed—in the CPU caches. Unfortunately, these caches are not protected against a sudden loss of power. If that were to happen while part of our writes are still sitting unflushed in the caches, we may end up with corrupted data structures. To avoid that, programmers need to design their data structures in such a way that temporary torn-writes are allowed, and make sure that the proper flushing instructions are issued at exactly the right time (too much flushing is not good either because it impacts performance).

Fortunately, Intel has developed the Persistent Memory Development Kit (PMDK), an open source collection of libraries and tools that provide low-level primitives as well as useful high-level abstractions to help persistent memory programmers overcome these obstacles. Although these libraries are implemented in C, there has been a significant effort to provide APIs for other popular languages, including C++, Java* (which is the topic of this article), and Python*. Although the APIs for Java and Python are still early (experimental) solutions, work is in progress and evolving quickly.

Persistent Collections for Java* (PCJ)

The API provided for persistent memory programming in Java emphasizes persistent collections. The reason is that collection classes map well to the use cases observed for many persistent memory applications. Instances of these classes persist (that is, remain reachable) beyond the life of the Java virtual machine (JVM) instance. In addition to the built-in classes, programmers can define their own persistent classes (as we will see next). It is even possible to create our own abstractions through a low-level accessor API (in the form of a MemoryRegion interface), but that is out of the scope of this article.

The following are the persistent collections supported in the API:

  • Persistent primitive arrays: PersistentBooleanArray, PersistentByteArray, PersistentCharArray, PersistentDoubleArray, PersistentFloatArray, PersistentIntArray, PersistentLongArray, PersistentShortArray
  • Persistent array: PersistentArray<AnyPersistent>
  • Persistent tuple: PersistentTuple<AnyPersistent, …>
  • Persistent array list: PersistentArrayList<AnyPersistent>
  • Persistent hash map: PersistentHashMap<AnyPersistent, AnyPersistent>
  • Persistent linked list: PersistentLinkedList<AnyPersistent>
  • Persistent linked queue: PersistentLinkedQueue<AnyPersistent>
  • Persistent skip list map: PersistentSkipListMap<AnyPersistent, AnyPersistent>
  • Persistent FP tree map: PersistentFPTreeMap<AnyPersistent, AnyPersistent>
  • Persistent SI hash map: PersistentSIHashMap<AnyPersistent, AnyPersistent>

Similar to the C/C++ libpmemobj library in PMDK, we need a common root object to anchor all the other objects created in the persistent memory pool. In the case of PCJ, this is accomplished through a singleton class called ObjectDirectory. Internally, this class is implemented using a hash map object of type PersistentHashMap<PersistentString, AnyPersistent>, which means that we can use human-readable names to store and fetch our objects, as shown in the following code snippet:

...
PersistentIntArray data = new PersistentIntArray(1024);
ObjectDirectory.put("My_fancy_persistent_array", data);   // no serialization
data.set(0, 123);
...

This code first allocates a persistent array of integers of size 1024. After that, it inserts a reference to it into ObjectDirectory under the name "My_fancy_persistent_array". Finally, the code writes an integer to the first position of the array. Note that, in this case, if we did not insert a reference into the object directory and then lost the last reference to the object (for example, by doing data = null), the Java garbage collector (GC) would collect the object and free its memory region from the persistent pool (that is, the object would be lost forever). This does not happen in the C/C++ libpmemobj library; in a similar situation, a persistent memory leak would occur (although leaked objects can be recovered).
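
To make that last point explicit, here is a minimal sketch (the variable name is hypothetical) of an object that would eventually be reclaimed by the GC because no reference to it was ever stored in the ObjectDirectory:

...
PersistentIntArray scratch = new PersistentIntArray(16);
// never added to ObjectDirectory
scratch = null;   // last reference lost; the persistent region can now be reclaimed
...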

The following code snippet shows how we can fetch our object after a power cycle:

...
PersistentIntArray old_data = ObjectDirectory.get("My_fancy_persistent_array", 
                                                        PersistentIntArray.class);
assert(old_data.get(0) == 123);
...

As you can see, there is no need to instantiate a new array. The variable old_data is directly assigned to the object named "My_fancy_persistent_array" stored in persistent memory. An assert() is used here to make sure that this is, in fact, the same array.
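
The other collection classes listed above follow the same store-and-fetch pattern. As a rough sketch (this assumes PersistentHashMap exposes the Map-style put() and get() accessors that ObjectDirectory itself is built on; the names used here are hypothetical), a keyed collection could be handled like this:

...
PersistentHashMap<PersistentString, PersistentString> capitals = new PersistentHashMap<>();
capitals.put(new PersistentString("France"), new PersistentString("Paris"));
ObjectDirectory.put("capitals", capitals);   // reachable after a power cycle

// later, possibly in a new JVM instance:
PersistentHashMap<PersistentString, PersistentString> old_capitals =
        ObjectDirectory.get("capitals", PersistentHashMap.class);
assert(old_capitals.get(new PersistentString("France")).equals(new PersistentString("Paris")));
...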

A Full Example

Let's take a look now at a full example to see how all the pieces fall into place (you can download the source from GitHub*).

import lib.util.persistent.*;

@SuppressWarnings("unchecked")
public class EmployeeList {
        static PersistentArray<Employee> employees;
        public static void main(String[] args) {
                // fetching back main employee list (or creating it if it is not there)
                employees = ObjectDirectory.get("employees", PersistentArray.class);
                if (employees == null) {
                        employees = new PersistentArray<Employee>(64);
                        ObjectDirectory.put("employees", employees);
                        System.out.println("Storing objects");
                        // creating objects
                        for (int i = 0; i < 64; i++) {
                                Employee employee = new Employee(i,
                                                   new PersistentString("Fake Name"),
                                                   new PersistentString("Fake Department"));
                                employees.set(i, employee);
                        }
                } else {                        
                        // reading objects
                        for (int i = 0; i < 64; i++) {
                                assert(employees.get(i).getId() == i);
                        }
                }
        }
}

The above code listing corresponds to the class EmployeeList (defined in the EmployeeList.java file), which contains the main() method for the program. This method tries to fetch a reference to the persistent array "employees". If the reference does not exist (that is, the return value is null), then a new PersistentArray object of size 64 is created and a reference to it is stored in the ObjectDirectory. Once that is done, the array is filled with 64 employee objects. If the array exists, we iterate over it, making sure the employee IDs are the ones we inserted before.

Some details regarding this code are worth mentioning here. First, we need to import the required classes from the package lib.util.persistent.*. These include not only the persistent collections themselves, but also basic classes for persistent memory such as PersistentString. If you have some experience with the C/C++ interface, you may be wondering where we pass the location of the pool file and its size to the library. In the case of PCJ, this is done with a configuration file called config.properties (which needs to reside in the current working directory). The following example sets the pool's path to /mnt/mem/persistent_heap and its size to 2 GB (this assumes that a persistent memory device—real or emulated using RAM—is mounted at /mnt/mem):

$ cat config.properties
path=/mnt/mem/persistent_heap
size=2147483648
$

As I mentioned above, we can define our own persistent classes for the cases where simple types (such as integers, strings, and so on), are not enough and more complex types are needed. This is exactly what we have done with the class Employee in this example. The class is shown in the following listing (you can find it in the file Employee.java):

import lib.util.persistent.*;
import lib.util.persistent.types.*;

public final class Employee extends PersistentObject {
        private static final LongField ID = new LongField();
        private static final StringField NAME = new StringField();
        private static final StringField DEPARTMENT = new StringField();
        private static final ObjectType<Employee> TYPE = 
                             ObjectType.withFields(Employee.class, ID, NAME, DEPARTMENT);

        public Employee (long id, PersistentString name, PersistentString department) {
                super(TYPE);
                setLongField(ID, id);
                setName(name);
                setDepartment(department);
        }
        private Employee (ObjectPointer<Employee> p) {
                super(p);
        }
        public long getId() {
                return getLongField(ID);
        }
        public PersistentString getName() {
                return getObjectField(NAME);
        }
        public PersistentString getDepartment() {
                return getObjectField(DEPARTMENT);
        }
        public void setName(PersistentString name) {
                setObjectField(NAME, name);
        }
        public void setDepartment(PersistentString department) {
                setObjectField(DEPARTMENT, department);
        }
        public int hashCode() {
                return Long.hashCode(getId());
        }
        public boolean equals(Object obj) {
                if (!(obj instanceof Employee)) return false;

                Employee emp = (Employee)obj;
                return emp.getId() == getId() && emp.getName().equals(getName());
        }
        public String toString() {
                return String.format("Employee(%d, %s)", getId(), getName());
        }
}

The first thing you may notice by looking at this code is that it looks very similar to any regular class definition in Java. First, we have the class fields, defined as private (as well as static final, but more on that below). There are also two constructors. The first one, which builds a new persistent object from the parameters id, name, and department, needs to pass its type definition (as an instance of ObjectType<Employee>) to its parent class PersistentObject (all custom persistent classes need to have this class as an ancestor in their inheritance path). The second constructor reconstructs an Employee from a pointer (p) to an object that already exists in persistent memory; in this case, it is possible to just pass the whole pointer p to the parent class. Finally, we have the getters and setters and all the other public methods.

You may have also noticed the strange way in which fields are declared. Why are we not declaring ID as a regular long? Or NAME as a string? Also, why are the fields declared as static final? The reason is that these are not fields in the traditional sense, but meta fields. They serve only as guidance to PersistentObject, which accesses the real fields as offsets in persistent memory. They are declared static final so that there is only one copy of the meta fields for all the objects of the same class.

This need for meta fields is just an artifact of persistent objects not being supported natively in Java. PCJ uses meta fields to lay out persistent objects on the persistent heap and relies on the PMDK libraries for memory allocation and transaction support. Native code from PMDK libraries is called using the Java Native Interface (JNI). For a high-level overview of the implementation stack, see Figure 1.

Overview of the persistent collections for Java
Figure 1. High-level overview of the Persistent Collections for Java* (PCJ) implementation stack.

I do not want to finish this section without talking about transactions. With PCJ, any modifications done to persistent fields through the provided accessor methods (such as setLongField() or setObjectField()) are automatically transactional. This means that if a power failure were to happen in the middle of a field write, changes would be rolled back (so data corruption, such as a torn write on a long string, would not happen). If atomicity for more than one field at a time is required, however, explicit transactions are needed. A detailed explanation of these transactions is out of the scope of this article. If you are interested in how transactions work in C/C++, you can read the following introduction to pmemobj transactions in C as well as the introduction to pmemobj transactions in C++.

The following snippet shows how PCJ transactions look:

...
Transaction.run(()->{
        // transactional code
});
...
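
For example, using the Employee class from the listing above, a sketch of updating two fields atomically (assuming emp holds a reference to an existing Employee) could look like this:

...
Transaction.run(()->{
        emp.setName(new PersistentString("Jane Doe"));
        emp.setDepartment(new PersistentString("Engineering"));
});   // either both fields are updated or neither is
...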

How to Run

If you download the sample from GitHub, a Makefile is provided that downloads the latest versions of both PCJ and PMDK from their respective repositories. All you need is a C++ compiler (and, of course, Java) installed on your system. Nevertheless, I will show you here the steps to follow in order to compile and run your persistent memory Java programs. For these, you need PMDK and PCJ installed on your system.

To compile the Java classes, you need to specify the PCJ class path. Assuming you have PCJ installed on your home directory, do the following:

$ javac -cp .:/home/<username>/pcj/target/classes Employee.java
$ javac -cp .:/home/<username>/pcj/target/classes EmployeeList.java
$

After that, you should see the generated *.class files. In order to run the main() method inside EmployeeList.class, you need to (again) pass the PCJ class path. You also need to set the java.library.path environment variable to the location of the compiled native library used as a bridge between PCJ and PMDK:

$ java -cp .:/…/pcj/target/classes -Djava.library.path=/…/pcj/target/cppbuild EmployeeList

Summary

In this article, I presented an introduction to the Java API for persistent memory programming. This API emphasizes persistent collections, since collection classes map well to the use cases observed for many persistent memory applications. I showed how it is possible to instantiate and store a persistent collection (without serialization), as well as fetch it later after a power cycle. A full example, consisting of a persistent array of employees (where Employee is a custom persistent class implemented from scratch), was described in detail. I concluded the article by showing how Java programs using PCJ can be compiled and run.

About the Author

Eduardo Berrocal joined Intel as a Cloud Software Engineer in July 2017 after receiving his PhD in Computer Science from the Illinois Institute of Technology (IIT) in Chicago, Illinois. His doctoral research interests were focused on (but not limited to) data analytics and fault tolerance for high-performance computing. In the past he worked as a summer intern at Bell Labs (Nokia), as a research aide at Argonne National Laboratory, as a scientific programmer and web developer at the University of Chicago, and as an intern in the CESVIMA laboratory in Spain.

Resources

  1. The Non-Volatile Memory (NVM) Programming Model (NPM)
  2. The Persistent Memory Development Kit (PMDK)
  3. Python bindings for PMDK
  4. Persistent Collections for Java
  5. Find Your Leaked Persistent Memory Objects Using the Persistent Memory Development Kit (PMDK)
  6. How to emulate Persistent Memory
  7. The Java Native Interface (JNI)
  8. An introduction to pmemobj (part 2) – transactions
  9. C++ bindings for libpmemobj (part 6) – transactions
  10. Link to sample code in GitHub

Installing the Intel® Computer Vision SDK 2018 on Linux, Without FPGA


Installing the Intel® Computer Vision SDK 2018 on Linux, without FPGA

NOTE: These steps apply to Ubuntu*, CentOS*, and Yocto*. 

The Intel® Computer Vision SDK (Intel® CV SDK) is a comprehensive toolkit for quickly deploying applications and solutions that emulate human vision. Based on Convolutional Neural Networks (CNN), the SDK extends CV workloads across Intel® hardware, maximizing performance. The Intel® CV SDK includes the Deep Learning Deployment Toolkit.

The version of the Intel® CV SDK that you downloaded:

  • Enables CNN-based deep learning inference on the edge
  • Supports heterogeneous execution across Intel CV accelerators: CPU, Intel® Integrated Graphics, Intel® Movidius™ Neural Compute Stick
  • Speeds time-to-market via an easy-to-use library of CV functions and pre-optimized kernels
  • Includes optimized calls for CV standards including OpenCV*, OpenCL™, and OpenVX*

The installation package is free and comes as an archive that contains the software and installation scripts.

These instructions describe:

  • What is included in the free download
  • System requirements
  • Software dependencies
  • Installing the Intel® CV SDK on Linux* OS
  • Next steps

What's Included

Component | Description
Deep Learning Model Optimizer | Model import tool. Imports trained models and converts to IR format for use by the Deep Learning Inference Engine. This is part of the Intel® Deep Learning Deployment Toolkit.
Deep Learning Inference Engine | Unified API to integrate the inference with application logic. This is part of the Intel® Deep Learning Deployment Toolkit.
Drivers and runtimes for OpenCL™ version 2.1 | Enables OpenCL 1.2 on the GPU/CPU for Intel® processors.
Intel® Media SDK | Offers access to hardware accelerated video codecs and frame processing.
OpenCV* version 3.4.1 | OpenCV* community version compiled for Intel® hardware. Includes PVL libraries for computer vision.
OpenVX* version 1.1 | Intel's implementation of OpenVX* 1.1 optimized for running on Intel® hardware (CPU, GPU, IPU).
Documents and tutorials | https://software.intel.com/en-us/computer-vision-sdk/documentation/featured

Where to Download This Release

https://software.intel.com/en-us/computer-vision-sdk/choose-download/free-download-linux

System Requirements

This guide covers the Linux* version of the Intel® Computer Vision SDK that does not include FPGA support. For the Intel Computer Vision SDK with FPGA support, see https://software.intel.com/en-us/articles/CVSDK-Install-FPGA.

Development and Target Platform

The development and target platforms have the same requirements, but you can select different components during the installation, based on your intended use.

Processors

  • 6th-8th Generation Intel® Core™
  • Intel® Xeon® v5 family, Xeon® v6 family
  • Intel® Pentium® processor N4200/5, N3350/5, N3450/5 with Intel® HD Graphics
  • Intel® Movidius™ Neural Compute Stick

Processor Notes:

  • Processor graphics are not included in all processors. See https://ark.intel.com/ for information about your processor.
  • A chipset that supports processor graphics is required for Intel® Xeon® processors.

Operating Systems:

  • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
  • CentOS* 7.4, 64 bit
  • Yocto Project* Poky Jethro* v2.0.3, 64-bit (intended for target only)

Pre-Installation

Use these steps to prepare your development machine for the Intel® CV SDK software.

  1. Download the Intel® CV SDK. By default, the file is saved as l_intel_cv_sdk_p_2018.0.<version>.tgz
  2. Unpack the .tgz file:
    tar -xf l_intel_cv_sdk_p_2018.0.<version>.tgz
  3. The files are unpacked to a directory named l_intel_cv_sdk_p_2018.0.<version>
  4. Go to the l_intel_cv_sdk_p_2018.0.<version> directory:
    cd l_intel_cv_sdk_p_2018.0.<version>

External Software Dependencies

These dependencies are the packages required for Intel-optimized OpenCV 3.4, the Deep Learning Inference Engine, and the Deep Learning Model Optimizer tools. Before installing the Intel® CV SDK, install these dependencies by running the script from the installation package directory l_intel_cv_sdk_p_2018.0.<version>:

./install_cv_sdk_dependencies.sh

Installation Steps

  1. Go to the l_intel_cv_sdk_p_2018.0.<version> directory and start the GUI-based installation wizard:
    ./install_GUI.sh

    Or the CLI installer:

    ./install.sh
  2. You will see either the GUI installation wizard or command-line installation instructions. The steps below are the same or similar; the only difference is that the command-line installer is text-based.
  3. The Prerequisites screen tells you if you are missing any required or recommended components, and the effect the missing component has on installing or using the product.
  4. Click Next to begin the installation, or make final changes to your component selections and choose your installation directory.
  5. The Installation summary screen shows you the options that will be installed if you make no changes.
  6. Click Install if you are ready to proceed with the standard installation, or click Customize if you want to change the selected components and/or specify the installation directory.
  7. The Complete screen indicates that the software is installed. Click Finish to close the wizard and open the Getting Started page in the browser.
  8. Go to the install directory. For example, for the default installation in sudo mode:
    cd /opt/intel/computer_vision_sdk_2018.0.<version>/

Set Environment Variables

Updates to several environment variables are required to compile and run Intel® CV SDK applications. You can set these variables permanently, following your system's conventions, or temporarily (lasting only as long as the shell session) by sourcing the script below. For a standard Intel CV SDK installation:

source /opt/intel/computer_vision_sdk_2018.0.<version>/bin/setupvars.sh

Post-Installation

Set External Software Dependencies (Processor Graphics)

Installation automatically creates the install_dependencies directory under /opt/intel/computer_vision_sdk with additional scripts to enable components to utilize processor graphics on your system:

  • The install_4_14_kernel.sh script contains steps to move your kernel forward to 4.14. This is the minimum kernel supported, and the configuration that is validated, but you can choose newer kernels as well. Please note that these are mainline kernels, not the kernels officially validated with your OS. To check if you need to run this script, check your kernel version with uname -r.
  • The install_NEO_OCL_driver.sh script installs the OpenCL™ NEO driver components needed to use the clDNN GPU plugin and write custom layers for GPU. For full functionality from this driver you must be running a 4.14 or newer kernel.
  • The ./MediaStack/install_media.sh script installs the Intel® Media SDK. This SDK offers access to hardware accelerated video codecs and frame processing. This version of the Intel Media SDK also requires a 4.14 or newer kernel. For more information, see https://github.com/Intel-Media-SDK/MediaSDK/releases.

NOTE: After running the scripts, a reboot is required.

Run these scripts and then reboot:

cd /opt/intel/computer_vision_sdk/install_dependencies/
sudo -E su
./install_4_14_kernel.sh
./install_NEO_OCL_driver.sh
MediaStack/install_media.sh
reboot

The OpenCL components here can be made more useful by installing header files to allow compiling new code. These can be obtained from https://github.com/KhronosGroup/OpenCL-Headers.git

To make libOpenCL easier to find you may also want to consider adding some symbolic links:

ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 /usr/lib/x86_64-linux-gnu/libOpenCL.so
ln -s /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 /opt/intel/opencl/libOpenCL.so

To run clDNN, libOpenCL.so.1 will need to be in the library search path (LD_LIBRARY_PATH). You can do this by the many standard ways of setting environment variables, or update the /opt/intel/computer_vision_sdk_2018.0.<version>/bin/setupvars.sh script to include the directory where libOpenCL.so.1 can be found.

USB Rules (Intel® Movidius™ Neural Compute Stick)

To perform inference on Intel® Movidius™ Neural Compute Stick, install USB rules by running the following commands:

cat <<EOF > 97-usbboot.rules
SUBSYSTEM=="usb", ATTRS{idProduct}=="2150", ATTRS{idVendor}=="03e7", GROUP="users", MODE="0666", ENV{ID_MM_DEVICE_IGNORE}="1"
SUBSYSTEM=="usb", ATTRS{idProduct}=="f63b", ATTRS{idVendor}=="03e7", GROUP="users", MODE="0666", ENV{ID_MM_DEVICE_IGNORE}="1"
EOF
sudo cp 97-usbboot.rules /etc/udev/rules.d/
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo ldconfig
rm 97-usbboot.rules

Next Steps

Learn About the Intel® CV SDK

Before using the Intel® CV SDK, read through the product overview to gain a better understanding of how the product works.

Compile the Extensions Library

Some topology-specific layers, like DetectionOutput used in the SSD*, are delivered in source code that assumes the extensions library is compiled and loaded. The extensions are required for pre-trained models inference. While you can build the library manually, the best way to compile the extensions library is to execute the demo scripts.

Run the Demonstration Applications

To verify the installation, run the demo apps in <INSTALL_DIR>/deployment_tools/demo. For demo app documentation, see README.txt in <INSTALL_DIR>/deployment_tools/demo.

The demo apps and their functions are:

  • demo_squeezenet_download_convert_run.sh. This demo illustrates the basic steps used to convert a model and run it. This enables the Intel® Deep Learning Deployment Toolkit to perform a classification task with the SqueezeNet model. This demo:
    • Downloads a public SqueezeNet model.
    • Installs all prerequisites to run the Model Optimizer.
    • Converts the model to an Intermediate Representation.
    • Builds the Inference Engine Image Classification Sample from the <INSTALL_DIR>/deployment_tools/inference_engine/samples/classification_sample
    • Runs the sample using cars.png from the demo folder.
    • Shows the label and confidence for the top-10 categories.
  • demo_security_barrier_camera_sample.sh. This demo shows an inference pipeline using three of the pre-trained models included with the Intel CV SDK. The region found by one model becomes the input to the next. Vehicle regions found by object recognition in the first phase become the input to the vehicle attributes model, which locates the license plate. The region identified in this step becomes the input to a license plate character recognition model. This demo:
    • Builds the Inference Engine Security Barrier Camera Sample from the <INSTALL_DIR>/deployment_tools/inference_engine/samples/security_barrier_camera_sample.
    • Runs the sample using car_1.bmp from the demo folder.
    • Displays the resulting frame with detections rendered as bounding boxes and text.

For documentation on the demo apps, see the README.txt file in the <INSTALL_DIR>/deployment_tools/demo folder.

Other Important Information

  • See the <INSTALL_DIR>/deployment_tools/inference_engine/samples/ folder and the samples overview documentation to learn about the range of samples available for the Inference Engine.
  • Before using the Model Optimizer to start working with your trained model, make sure your Caffe*, TensorFlow*, or MXNet* framework is prepared for any custom layers you have in place.
  • For developer guides, API references, tutorials, and other online documentation, see the Intel® CV SDK documentation.

Helpful Links

NOTE: Links open in a new window.

Intel® CV SDK Home Page: https://software.intel.com/en-us/computer-vision-sdk

Intel® CV SDK Documentation: https://software.intel.com/en-us/computer-vision-sdk/documentation/featured

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Pentium, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.

Challenges and Tradeoffs on the Road to AR


The original article is published by Intel Game Dev on VentureBeat*: Challenges and tradeoffs on the road to AR. Get more game dev news and related topics from Intel on VentureBeat.

Anyone involved in virtual reality over the course of the past few years, whether as a developer of VR, as a user of VR, or simply tracking the industry's progress, will agree there's a word they've heard a few times too many: Holodeck*. The well-trod Star Trek concept has become a threadbare metaphor for a supposed end-point for VR technology.

While aspirational visions serve a purpose, they can also do us a disservice. The reality is that we are a very long way from that Holodeck vision, and that's OK. VR is already serving many useful purposes with near-term solutions that don't attempt to fool all our senses to the point of a complete suspension of disbelief. Most of the industry, it seems, has come to accept this, as have most VR users. We have, collectively, come to terms with the fact that great products can exist in the near term that deliver some portion of the Holodeck promise, while leaving other portions to the fictions of Star Trek and other sci-fi.

It is surprising then, when looking at augmented reality1, that so many believe in the promise of a "Holodeck of AR"— sleek and stylish glasses delivered via hardware and software magic that rather than bringing us to any imaginable universe, instead bring any imaginable augmentation of the senses to our real world. Moreover, many believe this is deliverable in the near-term time horizon.

While solutions spanning the immersive technologies domain (AR, VR) will share dependence on common underlying technologies, augmented reality is in many ways a harder problem. AR can be thought of as a whole bouquet of thorny technical problems, each of which is its own very deep rabbit hole.

As with VR, AR involves an input-output loop that needs to execute sufficiently quickly to fool the conscious and subconscious to a degree where the results seem congruous with the surrounding world and the user's sense of what seems natural. What's more, in order to dovetail with the surrounding world, the solution may need to communicate with and draw from surrounding information sources. The sophistication of the processing that the solution may need to perform may vary by use case. And the solution needs to be embodied in something that a user can wear or carry in a manner suitable to their situation.

This is where the challenge becomes apparent. The sheer number of possible inputs and outputs that one can imagine, the depth of each that might be required, the sophistication of the processing that may be required for a given task, and the desired attributes for the embodiment of that solution (price, form factor, etc), make this a boundless problem.

Attributes of AR Solutions

For a sampling of the technical challenges facing AR, see the illustration below, which attempts to present the wide variety of attributes that an AR solution may embody. Titled the 'Attributes of Augmented Reality'2, this — while almost certainly incomplete — is meant to illustrate the breadth of challenging problems to address. I've divided them into four main areas:

  • Sensing: Seeing, hearing, sampling, and otherwise deriving the state of the world surrounding the user.
  • Processing: Making sense of all of that data, what it means in the context of the computational tasks, simulations, and/or applications at hand, and making decisions about what to do next.
  • Augmenting: Taking the output of this processing and presenting it back to the user's senses in a way that augments their sense of their environs.
  • Embodying: The attributes of the physical manifestation of the device or devices that deliver this solution.

This is an admittedly over-simplified division, and the sub-categories within each area are only a subset to which many working in the field could add. This, in a way, is the point: Solutions that do all of these things, let alone do them well, cheaply, and unobtrusively, are a long way off.

Attributes of Augmented Reality

Even more challenging still is the number of problems in the space that are ones for which solutions do not yet even exist. I like to think of the problems as falling within three distinct domains:

Problems at the intersection of power, performance, and time. For those of us that work in Silicon Valley, these are the easiest to understand. For known solutions to problems, they are simply a matter of "how long before Moore's Law allows us to do this in real-time, within a certain power envelope?"

Problems requiring breakthroughs in science. This is a more challenging category of problems, requiring breakthroughs in limitations of existing technologies — or more often — multiple breakthroughs. Examples in recent years include image-based inside-out 6DOF-tracking, or Waveguide display technologies. Lightfield displays are an example that feels further out on the edge of today's R&D. While predicting when these problems will be solved is much harder, there's a certain faith that people in the field have enough smart people in labs around the world working on these problems to make progress in solving them.

Problems requiring breakthroughs in design, user experience, and social norms. I sometimes encounter folks who believe that if we tackle problems in the two above categories, the third category of problems will be resolved in short order. Personally, I think this is the hardest category of the three. We can look at many technology transitions and see that there was a sort of "maximum rate of absorption" at which the design community could adapt to the new paradigm (e.g. the half-decade of attempts at 3-finger swipes, swirly gestures, and other touchscreen UI attempts before the dust settled on what most apps use today on smartphones).

Similarly, there's an analogous societal component — it takes time for people to get used to intrusions of technology (real or perceived) on their lives. (Google Glass* learned this lesson painfully.)

Specialization Versus Jack of All Trades

Until a point in the far future where we can deliver all of the attributes of AR at extremely high quality, inexpensively, and seamlessly, we're going to see interim solutions that are forced to make tradeoffs between them. This is a Good Thing. I hold a strong conviction that the path to success in this space is in doing fewer things extremely well, not many things in a compromised fashion.

It's likely we'll see AR solutions that tackle particular problems in point solution devices. We'll see solutions that make compromises on some attributes in order to exceed expectations on others. We'll see solutions that complement existing screens rather than replace them. And like with VR, we'll see solutions that leverage the power of existing devices (PCs, game consoles, smartphones, etc.).

Fostering an Environment for Progress

If we take the view that solutions will need to decide on different tradeoffs for different optimal solutions for particular problems, customer segments or form factors — and that we want many solutions to make attempts at different flavors of AR solutions — then how to encourage this?

The first step is to acknowledge that the "AR Holodeck" is not likely to arrive in the near term, and that interim, specialized solutions are not only OK, but may be preferred. Second is to foster an environment that allows a multitude of solutions to materialize — through open platforms and open standards. Finally, the industry requires collaboration — as entrants solve a problem in one domain, to share that solution with others to allow cross-pollination. Through these kinds of efforts, we may get our "holodeck of AR" eventually, but we'll have been using AR for years already by the time that happens.

About the Author

Kim Pallister manages the VR Center of Excellence at Intel. The opinions expressed in this article are his own and do not necessarily represent the views of Intel Corporation.

Footnotes

1. I'm going to avoid getting into the AR/MR nomenclature debate. For purposes of this article and the illustrative Attributes of AR poster – I'm covering the full spectrum of cases where a solution would supplement a user's environment with spatial elements, regardless of how seamlessly or realistically the solution attempts to integrate them into the environment.

2. To give credit where it's due: I owe thanks to the folks at Ziba Design for helping lay out the design in a far more cohesive way than I originally had it on my whiteboard. Also, a huge thanks to John Manoogian III for his creation of the *brilliant* Cognitive Bias Codex, from which I took inspiration.

Unreal Engine* 4/Intel® VTune™ Amplifier Usage Guide


Whether you’re tuning development code for the first time or conducting advanced performance optimizations, Intel® VTune™ Amplifier turns raw profiling data into performance insight. If you need to determine bottlenecks, sync points, and CPU hotspots in your PC game code developed with the Unreal* Engine, you can take advantage of the graphical user interface to sort, filter, and visualize data from a local or remote target, with low overhead.

In Unreal Engine* 4.19, Intel® software engineers worked with Unreal* to add support for Intel VTune Amplifier instrumentation and tracing technology (ITT) markers. This guide shows the user how to take advantage of the new integration to generate annotated traces of Unreal Engine 4 (UE4) inside the Intel VTune Amplifier 2018 UI. Download UE4 from Unreal Engine. Download a free trial of Intel VTune Amplifier.

Capturing Unreal Engine* 4 Traces

Scoped events are cumulative CPU timings of blocks of code, analyzed frame by frame. Scoped events at the function or "between braces" level can now be captured and viewed in Intel VTune Amplifier using ITT events. Setting up scoped events can help you track standard engine statistics.

To get started, run the Intel VTune profiler as an Administrator.

For the application, choose the UE4 Editor by including the entire path.

For application parameters, specify the game with any necessary settings, such as resolution. In the example below, the UE4 Particle Effects demo is profiled. Make sure you add “-VTune” at the end of the application parameters command line (see figure 1). If you need help with the command-line arguments in addition to the -VTune switch, refer to the Command-Line Arguments section of the UE4 documentation.

Select the checkbox to set the application directory as the working directory. If you need help with any of the other settings on this screen, use the F1 key to access VTune’s help system.

Analysis target tab
Figure 1. Setting up the application, game, and application parameters under the analysis target tab.
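
For example, a filled-in target setup might look like the following; the install path, project path, and resolution here are illustrative placeholders for a default Epic Games Launcher install, not required values:

Application: C:\Program Files\Epic Games\UE_4.19\Engine\Binaries\Win64\UE4Editor.exe
Application parameters: "C:\UnrealProjects\ParticleEffects\ParticleEffects.uproject" -game -ResX=1280 -ResY=720 -VTune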

Next, move to the Analysis Type tab and choose “Advanced Hotspots” under the Algorithm Analysis heading (see figure 2).

Set the CPU sampling interval to 1 ms.

For this example, to keep overhead down, at “Select a level of details provided with event-based sampling collection” click “Hotspots.”

Set the Event mode to “All.”

Select the checkbox for “Analyze user tasks, events, and counters.”

Setting advanced hotspots
Figure 2. Setting up advanced hotspots under the analysis type tab.

Next, start the game through the Intel VTune profiler.

In the Unreal Engine dev console, which you can open with the ~ (tilde) key while the workload is running, type “stat NamedEvents.” Scoped events will now be tracked. Note that you need a Development build to make this feature work. For more information, refer to the Build Configurations section of the UE4 help system.

When finished collecting statistics, stop the profiler.

Viewing Unreal Engine 4 Traces

After the results are processed, the summary will show statistics for the captured top task types, similar to figure 3.

Top task
Figure 3. Statistics gathered for the top tasks.

At the Advanced Hotspots screen, move to the “Bottom-up” tab (see figure 4). The Bottom-up view will show an in-depth look at the tasks. Use the “Grouping” pull-down menu to select the view for “Task Domain / Task Type / Task Duration Type / Function / Call Stack.”

Hotspots screen Bottom-up
Figure 4. The bottom-up view shows an in-depth look at the reported tasks.

You can keep exploring the report for your code profile from additional tabs on the Advanced Hotspot screen. For example, the “Platform” view will depict timing for named events (see figure 5).

Platform view
Figure 5. Timing for named events seen from the platform view.

There is a lot of information in these reports for you to inspect. For more information, see the documentation for Intel VTune Amplifier Tutorials. You’ll find HTML and PDF documents that walk you through examples, as well as sample code covering Windows*, Linux*, C++, Fortran, and OpenMP*; Android* energy-usage challenges; hotspot detection; identifying locks and waits that prevent parallelization; and more.

Custom Events

You can investigate any code snippet inside UE4 that you want to optimize by encapsulating it with cycle counters as described in this guide. This gives you the ability to define custom events and follow their execution on the thread timeline in the Intel VTune Amplifier UI.
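
As a minimal sketch of what that encapsulation can look like (the stat group, stat name, actor class, and RunExpensiveUpdate() function below are illustrative, not part of the engine), UE4's stat macros let you declare a counter and scope it around the code you care about:

DECLARE_STATS_GROUP(TEXT("MyGameplay"), STATGROUP_MyGameplay, STATCAT_Advanced);
DECLARE_CYCLE_STAT(TEXT("MyExpensiveUpdate"), STAT_MyExpensiveUpdate, STATGROUP_MyGameplay);

void AMyActor::Tick(float DeltaSeconds)
{
    Super::Tick(DeltaSeconds);
    {
        // Time spent in this scope is accumulated under the named stat above.
        SCOPE_CYCLE_COUNTER(STAT_MyExpensiveUpdate);
        RunExpensiveUpdate();
    }
}

With named events enabled as described earlier, the scoped block should then appear as its own task on the thread timeline.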

Conclusion

Performance on modern processors requires much more than optimizing single-thread performance. High-performing code must be:

  • Threaded and scalable to utilize multiple CPUs
  • Vectorized for efficient use of SIMD units
  • Tuned to take advantage of non-uniform memory architectures and caches

With Intel VTune Amplifier, you get advanced profiling capabilities with a single, user-friendly analysis interface. UE4 and Intel VTune Amplifier work together to let you investigate your code and profile it to run smoothly across multiple cores. In addition, the optimization tools allow you to create faster code, get more accurate data about the CPU and GPU, and investigate threading and memory usage—all with low overhead. Plus, you’ll get answers more quickly thanks to easy analysis that turns data into insights. Download the most recent versions of the Unreal Engine and the Intel VTune Amplifier today to get ready to take your game-dev efforts to the next level.

Additional Resources

Grove* Sensors, AWS Greengrass* Group and Device-to-Cloud Communication


Introduction

This article explores a method for monitoring the environment using sensor data rules and issuing alerts for abnormal readings. To that end, we will set up continuous MQTT communication for passing sensor data, using the UP Squared* board and Grove* shield, an AWS Greengrass* group, and device-to-cloud communication. The Grove sensors used in this article are the loudness sensor, the barometer sensor, and the IR distance interrupter. The data collected from the sensors will be the environment’s loudness, ambient temperature, barometric pressure, and altitude. Rules are set against the sensor readings to identify abnormal values and issue alerts. First, we will configure the Greengrass devices via the AWS* console, and then we will configure them on both the UP Squared board and another Linux Ubuntu* 16.04 machine. On the UP Squared board, we will run a Python* script to collect the sensor data and filter it based on the rules. The publisher device will then send the sensor data to the subscriber device (the Linux Ubuntu machine), and send alerts for abnormal readings to the AWS IoT cloud via MQTT topics.

Learn more about the AWS Greengrass

Learn more about the UP Squared board

Prerequisites

Hardware:

AWS Greengrass:

Grove* Sensors

On the UP Squared board, install the MRAA and UPM libraries to interface with Grove sensors:

sudo add-apt-repository ppa:mraa/mraa
sudo apt-get update
sudo apt-get install libmraa1 libmraa-dev mraa-tools python-mraa python3-mraa
sudo apt-get install libupm-dev libupm-java python-upm python3-upm node-upm upm-example

Code 1. Commands to install Grove dependencies

To enable non-privileged access to the Grove sensors, run the following commands:

sudo add-apt-repository ppa:ubilinux/up
sudo apt update
sudo apt install upboard-extras
sudo usermod -a -G i2c ggc_user
sudo usermod -a -G gpio ggc_user
sudo reboot

Code 2. Commands to enable non-privileged access to Grove sensors

AWS Greengrass* Setup

To install AWS Greengrass on the UP Squared board, follow these instructions

Check that you have installed all the needed dependencies:

sudo apt update
git clone https://github.com/aws-samples/aws-greengrass-samples.git
cd aws-greengrass-samples
cd greengrass-dependency-checker-GGCv1.3.0
sudo ./check_ggc_dependencies

Code 3. Commands to check AWS dependencies

Go to the AWS IoT console. Choose Greengrass from the left-side menu, select Groups underneath it, and select your group from the main window:

AWS Greengrass Groups view
Figure 1. AWS Greengrass Groups view

Select Devices from the left-side menu. Click the Add Device button in the top-right corner:

Greengrass devices view
Figure 2. Greengrass devices view

Choose Create New Device:

Creating a new device view
Figure 3. Creating a new device view

Enter the name pub in the field and click Next:

Creating a registry entry for a device
Figure 4. Creating a Registry entry for a device

Click the Use Defaults button:

Set up Security View
Figure 5. Set up security view

Download the security credentials; we will use them in a later module. Click Finish:

Download Security Credentials
Figure 6. Download security credentials

You should see the new device on the screen:

Greengrass Devices View
Figure 7. Greengrass Devices view

Add another new device and name it sub. When you’re done, you should see the following screen, with two new devices:

 Updated Greengrass Devices View
Figure 8. Updated Greengrass Devices view

On the left-side menu, select Subscriptions. Click on Add Subscription:

 Greengrass Subscriptions View
Figure 9. Greengrass Subscriptions view

For the source, go to the Devices tab and select pub. For the target, go to the Devices tab and select sub. Click Next:

Selecting Source and Target View
Figure 10. Creating subscription: selecting source and target view

Add the topic sensors/data/pubsub:

Adding Topic View
Figure 11. Creating subscription: adding topic view

Review the subscription and click Next:

Confirm and Save Subscription View
Figure 12. Confirm and save Subscription view

You can see the subscription:

Subscriptions View
Figure 13. Subscriptions view

Create another subscription by following the same steps. Choose pub as the source and IoT Cloud as the target. For the topic, enter sensors/data/alerts. When you are done, you should see a similar screen:

Updated Subscriptions View
Figure 14. Updated Subscriptions view

On the group header, click Actions, select Deploy, and wait until the deployment completes successfully:

Deploying the Greengrass Group View
Figure 15. Deploying the Greengrass Group view

Publisher Setup

In this module, we will configure the Greengrass device to be an MQTT publisher. In this case, we are using the UP Squared board both as the Greengrass core and as the publisher device, so we can collect the sensor data directly on it. The other Linux machine will be used as the subscriber device and is configured in the next module.

Copy pub’s tar file, which was saved in a previous module, and untar it. Save the files on the publisher device (the UP Squared board) and rename them for readability:

tar -xzvf <pub-credentials-id>-setup.tar.gz
mv <pub-credentials-id>.cert.pem pub.cert.pem
mv <pub-credentials-id>.private.key pub.private.key
mv <pub-credentials-id>.public.key pub.public.key

Code 4. Commands to save the publisher’s credentials

In the publisher folder, get a root certificate and save it as root-ca-cert.pem:

wget https://www.symantec.com/content/en/us/enterprise/verisign/roots/VeriSign-Class%203-Public-Primary-Certification-Authority-G5.pem -O root-ca-cert.pem

Code 5. Commands to get a root certificate

On the UP Squared board, install the AWS IoT SDK for Python:

python
>>> import ssl
>>> ssl.OPENSSL_VERSION
# output should be a version of OpenSSL 1.0.1+: 'OpenSSL 1.0.2g 1 Mar 2016'
>>> exit()
cd ~ 
git clone https://github.com/aws/aws-iot-device-sdk-python.git
cd aws-iot-device-sdk-python
python setup.py install

Code 6. Commands to install AWS IoT SDK for Python

The following rules will allow us to filter the sensor data values and capture messages when the values are abnormal. We first determine the range of normal readings. For temperature, we define the normal range to be between 20 and 25 degrees Celsius:

class TemperatureOver25(Rule):
    def predicate(self, sensorValue):
        return sensorValue > 25

    def action(self, sensorValue):
        message = "Temperature Over 25 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class TemperatureUnder20(Rule):
    def predicate(self, sensorValue):
        return sensorValue < 20

    def action(self, sensorValue):
        message = "Temperature Under 20 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

Code 7. Code snippet with temperature classes

We will filter the rules by sensor:

def filterBySensorId(sensorId, rules):
    "Filter a list of rules by sensorId"
    return [rule for rule in rules if rule.sensorId == sensorId]

Code 8. Code snippet for filtering rules by sensor

The rules will be imported and instantiated in the next script, greengrassCommunication.py. Save the following Python script as sensor_rules.py to the publisher folder:

class Rule:
    """
    A Base Class for defining IoT automation rules.
    """

    def __init__(self, sensorId):
        """
        Constructor function that takes an id
        that uniquely identifies a sensor.
        """
        self.sensorId = sensorId

    def predicate(self, sensorValue):
        "In the base Rule class, the predicate always returns False"
        return False

    def action(self, sensorValue):
        message = "Generic Rule activation on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class TemperatureOver25(Rule):
    def predicate(self, sensorValue):
        return sensorValue > 25

    def action(self, sensorValue):
        message = "Temperature Over 25 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class TemperatureUnder20(Rule):
    def predicate(self, sensorValue):
        return sensorValue < 20

    def action(self, sensorValue):
        message = "Temperature Under 20 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class PressureOver96540(Rule):
    def predicate(self, sensorValue):
        return sensorValue > 96540
    def action(self, sensorValue):
        message = "Pressure Over 96540 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class PressureUnder96534(Rule):
    def predicate(self, sensorValue):
        return sensorValue < 96534
    def action(self, sensorValue):
        message = "Pressure Under 96534 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class AltitudeOver1214(Rule):
    def predicate(self, sensorValue):
        return sensorValue > 1214
    def action(self, sensorValue):
        message = "Altitude Over 1214 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class AltitudeUnder1214(Rule):
    def predicate(self, sensorValue):
        return sensorValue < 1214
    def action(self, sensorValue):
        message = "Altitude Under 1214 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class ObjectDetected(Rule):
    def predicate(self, sensorValue):
        return sensorValue == True
    def action(self, sensorValue):
        message = "Object Detected Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class LoudnessOver3(Rule):
    def predicate(self, sensorValue):
        return sensorValue > 3
    def action(self, sensorValue):
        message = "Loudness Over 3 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

class LoudnessUnder05(Rule):
    def predicate(self, sensorValue):
        return sensorValue < 0.5
    def action(self, sensorValue):
        message = "Loudness Under 0.5 Rule activated on " + self.sensorId + " with sensor value " + str(sensorValue)
        return message

def filterBySensorId(sensorId, rules):
    "Filter a list of rules by sensorId"
    return [rule for rule in rules if rule.sensorId == sensorId]

Code 9. sensor_rules.py, Python script to create rules

We will need the following imports to get Grove sensor data:

from upm import pyupm_bmp280 as bmp280
from upm import pyupm_rfr359f as rfr359f
from upm import pyupm_loudness as loudness
import mraa

Code 10. Import statements for Grove sensors

We will interface with the Grove shield and instantiate the sensors. The loudness sensor is connected to pin A2, which needs to be offset by 512. The IR distance interrupter, which will be used for object detection, is connected to pin D2, which also needs the 512 offset:

mraa.addSubplatform(mraa.GROVEPI, "0")
# bmp is a barometer sensor
bmp = bmp280.BMP280(0, 0x76)
loudness_sensor = loudness.Loudness(514, 5.0)
# IR distance interruptor
object_detector = rfr359f.RFR359F(514)

Code 11. Interfacing with Grove shield and instantiating its sensors

A rule will be instantiated for each class and saved in the rules list:

r1 = sensor_rules.Rule("generic_Sensor")  # Generic Rule assigned to a sensor name "generic_Sensor"
r2 = sensor_rules.TemperatureOver25("temperature") # Derived rule assigned to a sensor name "temperature"
r3 = sensor_rules.TemperatureUnder20("temperature") # Derived rule assigned to a sensor name "temperature"
r4 = sensor_rules.PressureOver96540("pressure") # PressureOver96540 assigned to a sensor name "pressure"
r5 = sensor_rules.PressureUnder96534("pressure") # PressureUnder96534 assigned to a sensor name "pressure"
r6 = sensor_rules.AltitudeOver1214("altitude") # AltitudeOver1214 assigned to a sensor name "altitude"
r7 = sensor_rules.AltitudeUnder1214("altitude") # AltitudeUnder1214 assigned to a sensor name "altitude"
r8 = sensor_rules.ObjectDetected("object detection") # ObjectDetected is assigned to a sensor "object detection"
r9 = sensor_rules.LoudnessOver3("loudness") # LoudnessOver3 is assigned to a sensor "loudness"
r10 = sensor_rules.LoudnessUnder05("loudness") # LoudnessUnder05 is assigned to a sensor "loudness"

rules = [r1, r2, r3, r4, r5, r6, r7, r8, r9, r10]

Code 12. Instantiating rules

The following code snippet reads the sensor data and saves it in JSON format:

def get_sensor_data():
    # Getting new readings from barometer sensor
    bmp.update()
    pressure_value = bmp.getPressurePa()
    temperature_value = bmp.getTemperature()
    # Translating altitude value from meters to feet
    altitude_value = int(bmp.getAltitude() * 3.2808)
    # Get IR object detection data

    # returns True or False
    object_detection_value = object_detector.objectDetected()

    loudness_value = loudness_sensor.loudness()

    timestamp = time.time()

    sensor_data = {"values":[
                     {"sensor_id": "pressure", "value": pressure_value, "timestamp": timestamp},
                     {"sensor_id": "temperature", "value": temperature_value, "timestamp": timestamp},
                     {"sensor_id": "altitude", "value": altitude_value, "timestamp": timestamp},
                     {"sensor_id": "object detection", "value": object_detection_value, "timestamp": timestamp},
                     {"sensor_id": "loudness", "value": loudness_value, "timestamp": timestamp}
                    ]
                   }
    sensor_data_json = json.loads(json.dumps(sensor_data["values"]))

    return sensor_data_json

Code 13. Code snippet to get sensor data

This code snippet filters the JSON object with sensor data and creates alert messages for abnormal readings:

def apply_rules(sensor_data_json):
    rules_message = []
    
    for item in sensor_data_json:
        sensor_id = item['sensor_id']
        sensor_value = item['value']
        filteredRules = sensor_rules.filterBySensorId(sensor_id, rules)
        for r in filteredRules:            
            if r.predicate(sensor_value) == True:                
                rules_message.append(r.action(sensor_value))
    return rules_message

Code 14. Code snippet to filter rules by sensor

The while loop runs continuously, publishing the sensor data to the sensors/data/pubsub topic and alerts to the sensors/data/alerts topic:

while True:
    if args.mode == 'both' or args.mode == 'publish':
        message = {}
        sensor_data_json = get_sensor_data()
        # message['message'] = args.message
        message['message'] = get_message(sensor_data_json)
        message['alerts'] = apply_rules(sensor_data_json)
        message['sequence'] = loopCount
        messageJson = json.dumps(message)
        myAWSIoTMQTTClient.publish(topic, messageJson, 0)
        if args.mode == 'publish':
            print('Published topic %s: %s\n' % (topic, messageJson))
        cloud_topic = "sensors/data/alerts"
        alerts_json = json.dumps(message['alerts'])
        myAWSIoTMQTTClient.publish(cloud_topic, alerts_json, 0)
        loopCount += 1
    time.sleep(1)

Code 15. Continuous while loop for publishing sensor data messages and alerts

The Code 16 Python script will create the sensor rules, collect the sensor data, and send MQTT messages with the sensor data values to the Greengrass subscriber device as well as send alerts with abnormal data to the IoT cloud. Save this Python script as greengrassCommunication.py to the same folder:

from __future__ import print_function
import os
import sys
import time
import uuid
import json
import argparse
from AWSIoTPythonSDK.core.greengrass.discovery.providers import DiscoveryInfoProvider
from AWSIoTPythonSDK.core.protocol.connection.cores import ProgressiveBackOffCore
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient
from AWSIoTPythonSDK.exception.AWSIoTExceptions import DiscoveryInvalidRequestException
import signal, atexit
from upm import pyupm_bmp280 as bmp280
from upm import pyupm_rfr359f as rfr359f
from upm import pyupm_loudness as loudness
import mraa
import sensor_rules


mraa.addSubplatform(mraa.GROVEPI, "0")
# bmp is a barometer sensor
bmp = bmp280.BMP280(0, 0x76)
loudness_sensor = loudness.Loudness(514, 5.0)
# IR distance interruptor
object_detector = rfr359f.RFR359F(514)

r1 = sensor_rules.Rule("generic_Sensor")  # Generic Rule assigned to a sensor name "generic_Sensor"
r2 = sensor_rules.TemperatureOver25("temperature") # Derived rule assigned to a sensor name "temperature"
r3 = sensor_rules.TemperatureUnder20("temperature") # Derived rule assigned to a sensor name "temperature"
r4 = sensor_rules.PressureOver96540("pressure") # PressureOver96540 assigned to a sensor name "pressure"
r5 = sensor_rules.PressureUnder96534("pressure") # PressureUnder96534 assigned to a sensor name "pressure"
r6 = sensor_rules.AltitudeOver1214("altitude") # AltitudeOver1214 assigned to a sensor name "altitude"
r7 = sensor_rules.AltitudeUnder1214("altitude") # AltitudeUnder1214 assigned to a sensor name "altitude"
r8 = sensor_rules.ObjectDetected("object detection") # ObjectDetected is assigned to a sensor "object detection"
r9 = sensor_rules.LoudnessOver3("loudness") # LoudnessOver3 is assigned to a sensor "loudness"
r10 = sensor_rules.LoudnessUnder05("loudness") # LoudnessUnder05 is assigned to a sensor "loudness"

rules = [r1, r2, r3, r4, r5, r6, r7, r8, r9, r10]

def get_sensor_data():
    # Getting new readings from barometer sensor
    bmp.update()
    pressure_value = bmp.getPressurePa()
    temperature_value = bmp.getTemperature()
    # Translating altitude value from meters to feet
    altitude_value = int(bmp.getAltitude() * 3.2808)
    # Get IR object detection data

    # returns True or False
    object_detection_value = object_detector.objectDetected()

    loudness_value = loudness_sensor.loudness()

    timestamp = time.time()

    sensor_data = {"values":[
                     {"sensor_id": "pressure", "value": pressure_value, "timestamp": timestamp},
                     {"sensor_id": "temperature", "value": temperature_value, "timestamp": timestamp},
                     {"sensor_id": "altitude", "value": altitude_value, "timestamp": timestamp},
                     {"sensor_id": "object detection", "value": object_detection_value, "timestamp": timestamp},
                     {"sensor_id": "loudness", "value": loudness_value, "timestamp": timestamp}
                    ]
                   }
    sensor_data_json = json.loads(json.dumps(sensor_data["values"]))

    return sensor_data_json


def get_message(sensor_data_json):    
    sensor_data_message = []
    for item in sensor_data_json:
        sensor_id = item['sensor_id']
        sensor_value = item['value']
        sensor_data_message.append(item)
    return sensor_data_message

def apply_rules(sensor_data_json):
    rules_message = []
    
    for item in sensor_data_json:
        sensor_id = item['sensor_id']
        sensor_value = item['value']
        filteredRules = sensor_rules.filterBySensorId(sensor_id, rules)
        for r in filteredRules:            
            if r.predicate(sensor_value) == True:                
                rules_message.append(r.action(sensor_value))
    return rules_message

AllowedActions = ['both', 'publish', 'subscribe']

# General message notification callback
def customOnMessage(message):
    print('Received message on topic %s: %s\n' % (message.topic, message.payload))

MAX_DISCOVERY_RETRIES = 10
GROUP_CA_PATH = "./groupCA/"

# Read in command-line parameters
parser = argparse.ArgumentParser()
parser.add_argument("-e", "--endpoint", action="store", required=True, dest="host", help="Your AWS IoT custom endpoint")
parser.add_argument("-r", "--rootCA", action="store", required=True, dest="rootCAPath", help="Root CA file path")
parser.add_argument("-c", "--cert", action="store", dest="certificatePath", help="Certificate file path")
parser.add_argument("-k", "--key", action="store", dest="privateKeyPath", help="Private key file path")
parser.add_argument("-n", "--thingName", action="store", dest="thingName", default="Bot", help="Targeted thing name")
parser.add_argument("-m", "--mode", action="store", dest="mode", default="both",
                    help="Operation modes: %s" % str(AllowedActions))

args = parser.parse_args()
host = args.host
rootCAPath = args.rootCAPath
certificatePath = args.certificatePath
privateKeyPath = args.privateKeyPath
clientId = args.thingName
thingName = args.thingName
topic = "sensors/data/pubsub"

if args.mode not in AllowedActions:
    parser.error("Unknown --mode option %s. Must be one of %s" % (args.mode, str(AllowedActions)))
    exit(2)

if not args.certificatePath or not args.privateKeyPath:
    parser.error("Missing credentials for authentication.")
    exit(2)

# Progressive back off core
backOffCore = ProgressiveBackOffCore()

# Discover GGCs
discoveryInfoProvider = DiscoveryInfoProvider()
discoveryInfoProvider.configureEndpoint(host)
discoveryInfoProvider.configureCredentials(rootCAPath, certificatePath, privateKeyPath)
discoveryInfoProvider.configureTimeout(10)  # 10 sec

retryCount = MAX_DISCOVERY_RETRIES
discovered = False
groupCA = None
coreInfo = None
while retryCount != 0:
    try:
        discoveryInfo = discoveryInfoProvider.discover(thingName)
        caList = discoveryInfo.getAllCas()
        coreList = discoveryInfo.getAllCores()

        # We only pick the first ca and core info
        groupId, ca = caList[0]
        coreInfo = coreList[0]
        print("Discovered GGC: %s from Group: %s" % (coreInfo.coreThingArn, groupId))

        print("Now we persist the connectivity/identity information...")
        groupCA = GROUP_CA_PATH + groupId + "_CA_" + str(uuid.uuid4()) + ".crt"
        if not os.path.exists(GROUP_CA_PATH):
            os.makedirs(GROUP_CA_PATH)
        groupCAFile = open(groupCA, "w")
        groupCAFile.write(ca)
        groupCAFile.close()

        discovered = True
        print("Now proceed to the connecting flow...")
        break
    except DiscoveryInvalidRequestException as e:
        print("Invalid discovery request detected!")
        print("Type: %s" % str(type(e)))
        print("Error message: %s" % e.message)
        print("Stopping...")
        break
    except BaseException as e:
        print("Error in discovery!")
        print("Type: %s" % str(type(e)))
        print("Error message: %s" % e.message)
        retryCount -= 1
        print("\n%d/%d retries left\n" % (retryCount, MAX_DISCOVERY_RETRIES))
        print("Backing off...\n")
        backOffCore.backOff()

if not discovered:
    print("Discovery failed after %d retries. Exiting...\n" % (MAX_DISCOVERY_RETRIES))
    sys.exit(-1)

# Iterate through all connection options for the core and use the first successful one
myAWSIoTMQTTClient = AWSIoTMQTTClient(clientId)
myAWSIoTMQTTClient.configureCredentials(groupCA, privateKeyPath, certificatePath)
myAWSIoTMQTTClient.onMessage = customOnMessage

connected = False
for connectivityInfo in coreInfo.connectivityInfoList:
    currentHost = connectivityInfo.host
    currentPort = connectivityInfo.port
    print("Trying to connect to core at %s:%d" % (currentHost, currentPort))
    myAWSIoTMQTTClient.configureEndpoint(currentHost, currentPort)
    try:
        myAWSIoTMQTTClient.connect()
        connected = True
        break
    except BaseException as e:
        print("Error in connect!")
        print("Type: %s" % str(type(e)))
        print("Error message: %s" % e.message)

if not connected:
    print("Cannot connect to core %s. Exiting..." % coreInfo.coreThingArn)
    sys.exit(-2)

# Successfully connected to the core
if args.mode == 'both' or args.mode == 'subscribe':
    myAWSIoTMQTTClient.subscribe(topic, 0, None)
time.sleep(2)

loopCount = 0
while True:
    if args.mode == 'both' or args.mode == 'publish':
        message = {}
        sensor_data_json = get_sensor_data()
        # message['message'] = args.message
        message['message'] = get_message(sensor_data_json)
        message['alerts'] = apply_rules(sensor_data_json)
        message['sequence'] = loopCount
        messageJson = json.dumps(message)
        myAWSIoTMQTTClient.publish(topic, messageJson, 0)
        if args.mode == 'publish':
            print('Published topic %s: %s\n' % (topic, messageJson))
        cloud_topic = "sensors/data/alerts"
        alerts_json = json.dumps(message['alerts'])
        myAWSIoTMQTTClient.publish(cloud_topic, alerts_json, 0)
        loopCount += 1
    time.sleep(1)

Code 16. greengrassCommunication.py, Python script to get Sensor Data and Publish the MQTT Messages

Subscriber Setup

In this module, we will configure the Greengrass device to be an MQTT subscriber. On the subscriber device, the non-UP Squared Linux machine, do the same as before: copy sub’s tar file, which was saved in a previous module, and untar it; save the files on the subscriber device; and rename them for readability:

tar -xzvf <sub-credentials-id>-setup.tar.gz
mv <sub-credentials-id>.cert.pem sub.cert.pem
mv <sub-credentials-id>.private.key sub.private.key
mv <sub-credentials-id>.public.key sub.public.key

Code 17. Commands to save subscriber credentials

In the subscriber folder, get a root certificate and save it as root-ca-cert.pem:

wget https://www.symantec.com/content/en/us/enterprise/verisign/roots/VeriSign-Class%203-Public-Primary-Certification-Authority-G5.pem -O root-ca-cert.pem

Code 18. Commands to get a root certificate

On the subscriber device, install AWS IoT SDK for Python:

python
>>> import ssl
>>> ssl.OPENSSL_VERSION
# output should be a version of OpenSSL 1.0.1+: 'OpenSSL 1.0.2g 1 Mar 2016'
>>> exit()
cd ~ 
git clone https://github.com/aws/aws-iot-device-sdk-python.git
cd aws-iot-device-sdk-python
python setup.py install

Code 19. Commands to install AWS IoT SDK for Python

The subscriber device will listen to the sensors/data/pubsub topic continuously, printing "Waiting for the message." every 10 seconds to confirm the script is still running:

while True:
    print("Waiting for the message.")
    time.sleep(10)

Code 20. Code snippet to continuously wait for the message

Copy the following Python script into the subscriber device’s folder where the subscriber keys are stored, and save it as subscription.py:

from __future__ import print_function
import os
import sys
import time
import uuid
import json
import argparse
from AWSIoTPythonSDK.core.greengrass.discovery.providers import DiscoveryInfoProvider
from AWSIoTPythonSDK.core.protocol.connection.cores import ProgressiveBackOffCore
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient
from AWSIoTPythonSDK.exception.AWSIoTExceptions import DiscoveryInvalidRequestException
import signal, atexit


AllowedActions = ['both', 'publish', 'subscribe']

# General message notification callback
def customOnMessage(message):
    print('Received message on topic %s: %s\n' % (message.topic, message.payload))

MAX_DISCOVERY_RETRIES = 10
GROUP_CA_PATH = "./groupCA/"

# Read in command-line parameters
parser = argparse.ArgumentParser()
parser.add_argument("-e", "--endpoint", action="store", required=True, dest="host", help="Your AWS IoT custom endpoint")
parser.add_argument("-r", "--rootCA", action="store", required=True, dest="rootCAPath", help="Root CA file path")
parser.add_argument("-c", "--cert", action="store", dest="certificatePath", help="Certificate file path")
parser.add_argument("-k", "--key", action="store", dest="privateKeyPath", help="Private key file path")
parser.add_argument("-n", "--thingName", action="store", dest="thingName", default="Bot", help="Targeted thing name")
parser.add_argument("-m", "--mode", action="store", dest="mode", default="both",
                    help="Operation modes: %s" % str(AllowedActions))

args = parser.parse_args()
host = args.host
rootCAPath = args.rootCAPath
certificatePath = args.certificatePath
privateKeyPath = args.privateKeyPath
clientId = args.thingName
thingName = args.thingName
topic = "sensors/data/pubsub"

if args.mode not in AllowedActions:
    parser.error("Unknown --mode option %s. Must be one of %s" % (args.mode, str(AllowedActions)))
    exit(2)

if not args.certificatePath or not args.privateKeyPath:
    parser.error("Missing credentials for authentication.")
    exit(2)

# Progressive back off core
backOffCore = ProgressiveBackOffCore()

# Discover GGCs
discoveryInfoProvider = DiscoveryInfoProvider()
discoveryInfoProvider.configureEndpoint(host)
discoveryInfoProvider.configureCredentials(rootCAPath, certificatePath, privateKeyPath)
discoveryInfoProvider.configureTimeout(10)  # 10 sec

retryCount = MAX_DISCOVERY_RETRIES
discovered = False
groupCA = None
coreInfo = None
while retryCount != 0:
    try:
        discoveryInfo = discoveryInfoProvider.discover(thingName)
        caList = discoveryInfo.getAllCas()
        coreList = discoveryInfo.getAllCores()

        # We only pick the first ca and core info
        groupId, ca = caList[0]
        coreInfo = coreList[0]
        print("Discovered GGC: %s from Group: %s" % (coreInfo.coreThingArn, groupId))

        print("Now we persist the connectivity/identity information...")
        groupCA = GROUP_CA_PATH + groupId + "_CA_" + str(uuid.uuid4()) + ".crt"
        if not os.path.exists(GROUP_CA_PATH):
            os.makedirs(GROUP_CA_PATH)
        groupCAFile = open(groupCA, "w")
        groupCAFile.write(ca)
        groupCAFile.close()

        discovered = True
        print("Now proceed to the connecting flow...")
        break
    except DiscoveryInvalidRequestException as e:
        print("Invalid discovery request detected!")
        print("Type: %s" % str(type(e)))
        print("Error message: %s" % e.message)
        print("Stopping...")
        break
    except BaseException as e:
        print("Error in discovery!")
        print("Type: %s" % str(type(e)))
        print("Error message: %s" % e.message)
        retryCount -= 1
        print("\n%d/%d retries left\n" % (retryCount, MAX_DISCOVERY_RETRIES))
        print("Backing off...\n")
        backOffCore.backOff()

if not discovered:
    print("Discovery failed after %d retries. Exiting...\n" % (MAX_DISCOVERY_RETRIES))
    sys.exit(-1)

# Iterate through all connection options for the core and use the first successful one
myAWSIoTMQTTClient = AWSIoTMQTTClient(clientId)
myAWSIoTMQTTClient.configureCredentials(groupCA, privateKeyPath, certificatePath)
myAWSIoTMQTTClient.onMessage = customOnMessage

connected = False
for connectivityInfo in coreInfo.connectivityInfoList:
    currentHost = connectivityInfo.host
    currentPort = connectivityInfo.port
    print("Trying to connect to core at %s:%d" % (currentHost, currentPort))
    myAWSIoTMQTTClient.configureEndpoint(currentHost, currentPort)
    try:
        myAWSIoTMQTTClient.connect()
        connected = True
        break
    except BaseException as e:
        print("Error in connect!")
        print("Type: %s" % str(type(e)))
        print("Error message: %s" % e.message)

if not connected:
    print("Cannot connect to core %s. Exiting..." % coreInfo.coreThingArn)
    sys.exit(-2)

# Successfully connected to the core
if args.mode == 'both' or args.mode == 'subscribe':
    myAWSIoTMQTTClient.subscribe(topic, 0, None)
    print("after subscription")
time.sleep(2)

loopCount = 0
while True:
    print("Waiting for the message.")
    time.sleep(10)

Code 21. subscription.py, Python script to subscribe to MQTT messages

Run the Scripts

In this module, we will run the Python scripts and view the MQTT messages with sensor data.

On the UP Squared board, start the Greengrass service:

cd <path-to-greengrass>/greengrass/ggc/core
sudo ./greengrassd start

Code 22. Commands to start Greengrass service

Go to the publisher folder:

cd <path-to-publisher-folder>

Code 23. Command to navigate to publisher folder

Get your AWS IoT endpoint ID by going to the AWS console and then to the IoT Core page. At the bottom of the left-side menu, select Settings. Copy your endpoint value:

Settings View
Figure 16. Settings view

Substitute your AWS IoT endpoint ID and run the following command:

python greengrassCommunication.py --endpoint <your-aws-iot-endpoint-id>.iot.us-west-2.amazonaws.com --rootCA root-ca-cert.pem --cert pub.cert.pem --key pub.private.key --thingName pub --mode publish

Code 24. Command to run greengrassCommunication.py

You should see the following screen:

Command to Run greengrassCommunication.py
Figure 17. Command to run greengrassCommunication.py and MQTT messages

On the subscriber device, go to the subscriber folder:

cd <path-to-subscriber-folder>

Code 25. Command to navigate to subscriber folder

Substitute your AWS IoT endpoint ID and run the following command:

python subscription.py --endpoint <your-aws-iot-endpoint-id>.iot.us-west-2.amazonaws.com --rootCA root-ca-cert.pem --cert sub.cert.pem --key sub.private.key --thingName sub --mode subscribe 

Code 26. Command to run subscription.py

You should see the following screen:

Command to Run subscription.py
Figure 18. Command to run subscription.py and received MQTT messages

Go to the AWS IoT console. Select Test from the left-side menu. Type sensors/data/alerts in the topic field, change the MQTT payload display option to display payloads as strings, and click Subscribe to topic:

 

MQTT Subscription View
Figure 19. MQTT subscription view

After some time, messages should display on the bottom of the screen:

MQTT Messages View
Figure 20. MQTT messages view

As you can see, the rules were activated for some abnormal sensor readings. You may wish to create some action logic to return the values back to normal. This setup will allow you to monitor your environment and ensure the data readings are in the normal range.

Learn More on UP Squared

About the Author

Rozaliya Everstova is a software engineer at Intel in the Core and Visual Computing Group working on scale enabling projects for Internet of Things.

Intel® Xeon® Processor D-2100 Product Family Technical Overview


The Intel® Xeon® processor D-2100 product family, formerly code named Skylake-D, is Intel's latest generation 64-bit server system-on-chip (SoC). It is manufactured using the Intel low-power SoC 14 nm process, with up to 18 cores, and from 60 to 110 watts of power consumption. It brings the architectural innovations from the Intel® Xeon® Scalable processor platform to an SoC processor.

compare server, microserver product lines
Figure 1. Comparison of server and microserver product lines.

In Intel's product line, the Intel Xeon D processor is positioned between the Intel Xeon Scalable processor, which is focused on server-level performance, and the Intel Atom® processor C3000, which is focused on providing the lowest power of the three product lines. The Intel Xeon processor D-2100 product family provides lower power than the Intel Xeon Scalable processor product family and higher performance than the Intel Atom processor C3000 product family.

Table 1. Summary of segments that can benefit from the Intel® Xeon® processor D-2100 product family.

  • Business Processing: Dynamic Web Serving, File & Print
  • Cloud Services: Dynamic Front End Web, Memory Caching, Dedicated Hosting
  • Visualization & Audio: Media Delivery and Transcode
  • Communication: Wired Networking, Edge Routing, Edge Security/Firewall, Virtual Switching, Wireless Base Station
  • Storage: Scale-Out/Distributed DB, Warm Cloud/Object Storage, Active-Archive, Enterprise SAN/NAS, Cold Storage Backup/Disaster Recovery

The Intel Xeon processor D-2100 product family is optimized for parallel software that benefits mostly from more individual servers with sufficient input/output (I/O) between nodes including dynamic web servers, hot or cold storage, network routing, enterprise storage area network/network attached storage (SAN/NAS), virtual switching, edge firewall security, wireless LAN controllers, distributed memory caching (memcached), distributed database, and any of the aforementioned uses that have an additional need for acceleration of cryptographic communications such as security appliances and switches.

Typically, the Intel Xeon processor D-2100 product family will be found populating a microserver chassis, which is composed of multiple Intel Xeon SoC D-2100 product family nodes sharing a common chassis, fans, and power supplies, interconnected to achieve improved flexibility, higher efficiency, and optimized rack density. Microservers based on Intel Xeon SoC D-2100 product family nodes can meet different usage models, such as combining with lots of disk storage to provide a hot storage solution, or providing a low-power, high-density network solution.

generic overview a microserver chassis composition
Figure 2. Generic, high-level overview of how a microserver chassis is composed of multiple SoC nodes, along with shared components (such as power supply, fans, and chassis).

There are three separate product SKUs for the Intel Xeon processor D-2100 product family. When the processor model number ends in the letter "I", the SKUs are more focused on computation and cloud segments. Model numbers ending in "IT" are associated with network and enterprise storage segments. And model numbers ending in "NT" include Integrated Intel® QuickAssist Technology (Intel® QAT) to help with acceleration of cryptographic workloads. To see a list of the processor models with more specifications, see the Intel Xeon processor D-2100 product family brief.

SoC Architecture Overview

Table 2. High-level summary of the hardware differences between the Intel® Xeon® processor D-2100 product family and the Intel® Xeon® processor D-1500 product family.

Feature | Intel® Xeon® SoC D-1500 | Intel® Xeon® SoC D-2100
Thermal Design Power (TDP) | 20–65 W | 60–110 W
Cores | Up to 16 cores with Intel® Hyper-Threading Technology (Intel® HT Technology) | Up to 18 cores with Intel® HT Technology
Microarchitecture | Broadwell | Skylake
Package Size | 37.5 mm x 37.5 mm | 45 mm x 52.5 mm
Key Target | Network/Storage/Compute | Storage/Compute/Network
Intel® Advanced Vector Extensions (Intel® AVX) | Intel® Advanced Vector Extensions 2 (Intel® AVX2) | Intel® Advanced Vector Extensions 512 (Intel® AVX-512) new instructions
Cache | LLC: 1.5 MB/core; MLC: 256 KB/core | LLC: 1.375 MB/core; MLC: 1 MB/core
Memory | 2 channels DDR4 2400 MHz per SoC, up to 128 GB memory capacity | 4 channels† DDR4 2666 MHz per SoC, up to 512 GB memory capacity
Ethernet | Up to four 10GbE/1GbE ports | Up to four 10GbE/1GbE ports with accelerated Remote Direct Memory Access (RDMA) and native Software Fault Isolation (SFI)
PCIe† | PCIe 3.0 (2.5, 5.0, 8.0 GT/s); 32 Gen3 lanes + up to 20 Gen3 (through Flexible High Speed I/O) | PCIe 3.0 (2.5, 5.0, 8.0 GT/s); 32 Gen3 lanes + up to 20 Gen3 (through Flexible High Speed I/O)
SATA | 6 SATA ports | Up to 14 SATA ports (through Flexible High Speed I/O)
Integrated Crypto/Encrypt/Decrypt Offload Acceleration | Integrated Intel® QuickAssist Technology (Intel® QAT): up to 40G crypto/20G compression, 40 kOps Public Key Encryption (PKE) 2K | Integrated Intel® QAT: up to 100G crypto/compression + 100 kOps PKE 2K

† New capabilities relative to previous generations vary with SKUs.

The Intel Xeon processor D-2100 product family is a new microarchitecture with many additional features compared to the previous-generation Intel Xeon processor D-1500 product family (based on the Broadwell microarchitecture). These features include increased processor cores, increased memory channels, a non-inclusive cache, Intel® AVX-512, Intel® Memory Protection Extensions (Intel® MPX), Intel® Speed Shift Technology, and Internet Wide Area RDMA Protocol (iWARP). A flexible I/O interface provides up to 20 configurable high-speed lanes that allow original equipment manufacturers (OEMs) the ability to make customized I/O choices for the baseboard. The rest of this paper will cover these various technologies in greater detail.

Table 3. Overview of product technologies for the Intel® Xeon® processor D-2100 product family.

  • Intel® Xeon® mesh architecture
  • iWARP RDMA
  • Cache hierarchy changes
  • RAS
  • Intel® MPX
  • Intel® Volume Management Device (Intel® VMD)
  • Mode-based execution control (XU/XS bits)
  • Intel® Platform Storage Extensions
  • Intel® AVX-512
  • Intel® Boot Guard
  • Intel® Speed Shift Technology and PMax
  • Innovation Engine
  • Intel® QuickAssist Technology (Intel® QAT)
  • Intel® Node Manager (Intel® NM)

Intel® Xeon® mesh architecture

On the previous generation of the Intel Xeon processor D family, the cores, last-level cache (LLC), memory controller, I/O controller, and inter-socket pathways are connected using a ring architecture. This ring architecture has been around for many years on the different product lines offered by Intel.

The Intel Xeon SoC D-2100 product family has advanced beyond the ring architecture, introducing a new mesh architecture to mitigate the increased latencies and bandwidth constraints associated with previous ring architecture. The mesh architecture encompasses an array of vertical and horizontal communication paths allowing traversal from one core to another through a shortest path (hop on vertical path to correct row, and hop across horizontal path to correct column). The caching and home agent (CHA) located at each of the LLC slices maps addresses being accessed to a specific LLC bank, memory controller, or I/O subsystem, and provides the routing information required to reach its destination using the mesh interconnect.

In addition to the improvements expected in the overall core-to-cache and core-to-memory latency, we also expect to see improvements in latency for I/O-initiated accesses. In the previous generation of processors, in order to access data in LLC, memory, or I/O, a core or I/O would need to go around the ring. In the Intel Xeon SoC D-2100 product family, a core or I/O can access the data in LLC, memory, or I/O through the shortest path over the mesh.

Cache hierarchy changes

generational cache comparison graph
Figure 3. Generational cache comparison.

In the previous generation, the mid-level cache was 256 KB per core and the Last-Level Cache (LLC) was a shared inclusive cache, with 1.5 MB per core. In the Intel Xeon processor D-2100 product family, the cache hierarchy has changed to provide a larger Mid-Level Cache (MLC) of 1 MB per core and a smaller, shared non-inclusive 1.375 MB LLC per core. A larger MLC increases the hit rate into the MLC resulting in lower effective memory latency and also lowers demand on the mesh interconnect and LLC. The shift to a non-inclusive cache for the LLC allows for more effective utilization of the overall cache on the chip versus an inclusive cache.

If the core on the Intel Xeon processor D-2100 product family has a miss on all the levels of the cache, it fetches the line from memory and puts it directly into MLC of the requesting core, rather than putting a copy into both the MLC and LLC, as was done on the previous generation. When the cache line is evicted from the MLC it is placed into the LLC if it is expected to be reused.

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On previous-generation CPUs, the shared LLC itself took care of this task.

Even with the changed cache hierarchy in the Intel Xeon processor D-2100 product family, the effective cache available per core is roughly the same as the previous generation for a usage scenario where different applications are running on different cores. Because of the non-inclusive nature of LLC, the effective cache capacity for an application running on a single core is a combination of MLC cache size and a portion of LLC cache size. For other usage scenarios, such as multithreaded applications running across multiple cores with some shared code and data, or a scenario where only a subset of the cores on the socket are used, the effective cache capacity seen by the applications may seem different than previous-generation CPUs. In some cases, application developers may need to adapt their code to optimize it with the change in the cache hierarchy.

Intel® Memory Protection Extensions (Intel® MPX)

C/C++ pointer arithmetic is a convenient language construct often used to step through an array of data structures. If an iterative write operation does not take into consideration the bounds of the destination, adjacent memory locations may get corrupted. Such unintended modification of adjacent data is referred to as a buffer overflow. Buffer overflows have been known to be exploited, causing denial-of-service attacks and system crashes. Similarly, uncontrolled reads could reveal cryptographic keys and passwords. More sinister attacks that do not immediately draw the attention of the user or system administrator alter the code execution path, such as modifying the return address in the stack frame to execute malicious code or script.

Intel's Execute Disable Bit and similar hardware features from other vendors have blocked buffer overflow attacks that redirected the execution to malicious code stored as data. Intel MPX technology consists of new Intel® architecture instructions and registers that compilers can use to check the bounds of a pointer at runtime before it is used. This new hardware technology is supported by the compiler.

Bound paging flowchart
Figure 4. New Intel MPX instructions and example of their effect on memory.

New Instruction | Function
BNDMK b, m | Creates LowerBound (LB) and UpperBound (UB) in bounds register b.
BNDCL b, r/m | Checks the address of a memory reference or address in r against the lower bound.
BNDCU b, r/m | Checks the address of a memory reference or address in r against the upper bound.
BNDCN b, r/m | Checks the address of a memory reference or address in r against the upper bound in one's complement.

For additional information see the Intel Memory Protection Extensions Enabling Guide.
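
As a rough sketch of the class of defect these checks target (this example is not from the enabling guide), the off-by-one loop below writes one element past the end of the array; built with an MPX-aware toolchain and runtime (for example, GCC releases that supported -mmpx -fcheck-pointer-bounds), the out-of-bounds store triggers a bound-range (#BR) exception instead of silently corrupting adjacent memory:

#include <cstdio>

// Illustrative only: writes one element past the end of buf.
static void fill(int *buf, int count)
{
    for (int i = 0; i <= count; i++)   // bug: should be i < count
        buf[i] = i;
}

int main()
{
    int guard = 42;   // adjacent data that the overflow may clobber
    int buf[16];
    fill(buf, 16);
    std::printf("guard = %d\n", guard);
    return 0;
}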

Mode-based execute control

Mode-based execute control provides finer-grained control over execute permissions to help protect the integrity of system code from malicious changes. It provides additional refinement within the extended page tables by turning the Execute Enable (X) permission bit into two options:

  • XU for user pages
  • XS for supervisor pages

The CPU selects one or the other based on permission of the guest page and maintains an invariant for every page that does not allow it to be writable and supervisor-executable at the same time. A benefit of this feature is that a hypervisor can more reliably verify and enforce the integrity of kernel-level code. The value of the XU/XS bits is delivered through the hypervisor, so hypervisor support is necessary.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

generational hierarchy of the Intel® AVX technology
Figure 5. Generational overview of Intel® AVX technology.

Intel® AVX-512 was originally introduced with the Intel® Xeon Phi™ processor product line. There are certain Intel AVX-512 instruction groups (AVX512CD and AVX512F) that are common to the Intel Xeon Phi processor product line and the Intel Xeon processor D-2100 product family. However, the Intel Xeon processor D-2100 product family introduces new Intel AVX-512 instruction groups (AVX512BW and AVX512DQ), as well as a new capability (AVX512VL) to expand the benefits of the technology. The AVX512DQ instruction group is focused on new additions for benefiting high-performance computing (HPC) workloads such as oil and gas, seismic modeling, the financial services industry, molecular dynamics, ray tracing, double-precision matrix multiplication, fast Fourier transform and convolutions, and RSA cryptography. The AVX512BW instruction group supports Byte/Word operations, which can benefit some enterprise applications and media applications, as well as HPC. AVX512VL is not an instruction group but a feature that is associated with vector length orthogonality.

Feature list of the Intel® AVX-512 technology.

  • One 512-bit FMA
  • 512-bit FP and Integer
  • 32 registers
  • 8 mask registers
  • 32 SP/16 DP Flops/Cycle
  • Embedded rounding
  • Embedded broadcast
  • Scalar / SSE / Intel AVX "promotions"
  • Native media additions
  • HPC additions
  • Transcendental support
  • Gather/Scatter

Intel AVX-512 instructions offer the highest degree of support to software developers by including an unprecedented level of richness in the design of the instructions. This includes 512-bit operations on packed floating-point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, additional gather/scatter support, high-speed math instructions, and compact representation of large displacement value. The following sections cover some of the details of the new features of Intel AVX-512.

AVX512DQ

The doubleword and quadword instructions, indicated by the AVX512DQ CPUID flag, enhance integer and floating-point operations. They consist of additional instructions that operate on 512-bit vectors of 16 32-bit elements or 8 64-bit elements. Some of these instructions provide new functionality, such as the conversion of floating-point numbers to 64-bit integers. Other instructions promote existing instructions, such as vxorps, to use 512-bit registers.
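
As a small illustration of the new functionality (a sketch that assumes an AVX512DQ-capable compiler and target; the function name is arbitrary), the packed double-to-64-bit-integer conversion maps directly onto a single instruction:

#include <immintrin.h>

// Convert eight doubles to eight signed 64-bit integers (VCVTPD2QQ),
// an operation that previously required scalar code or multiple steps.
__m512i doubles_to_int64(const double *p)
{
    __m512d v = _mm512_loadu_pd(p);    // load 8 doubles
    return _mm512_cvtpd_epi64(v);      // AVX512DQ packed conversion
}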

AVX512BW

The byte and word instructions, indicated by the AVX512BW CPUID flag, enhance integer operations, extending write-masking and zero-masking to support smaller element sizes. The original Intel AVX-512 foundation instructions supported such masking with vector element sizes of 32 or 64 bits because a 512-bit vector register could hold at most 16 32-bit elements, so a write mask size of 16 bits was sufficient.

An instruction indicated by an AVX512BW CPUID flag requires a write mask size of up to 64 bits because a 512-bit vector register can hold 64 8-bit elements or 32 16-bit elements. Two new mask types (__mmask32 and __mmask64) along with additional maskable intrinsics have been introduced to support this operation.
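
A brief sketch of how the wider masks are used at byte granularity (assumes AVX512BW support; the clamping operation itself is just an example):

#include <immintrin.h>

// Replace every byte above a threshold with a clamp value, leaving the rest
// untouched. Sixty-four byte lanes require a 64-bit write mask (__mmask64).
__m512i clamp_bytes(__m512i data, char threshold, char clamp_to)
{
    __m512i limit = _mm512_set1_epi8(threshold);
    __mmask64 over = _mm512_cmpgt_epi8_mask(data, limit);   // one mask bit per byte
    return _mm512_mask_mov_epi8(data, over, _mm512_set1_epi8(clamp_to));
}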

AVX512VL

An additional orthogonal capability known as vector length extensions provide for most Intel AVX-512 instructions to operate on 128 or 256 bits, instead of only 512. Vector length extensions can currently be applied to most foundation instructions and the conflict detection instructions, as well as the new byte, word, doubleword, and quadword instructions. These Intel AVX-512 vector length extensions are indicated by the AVX512VL CPUID flag. The use of vector length extensions extends most Intel AVX-512 operations to also operate on XMM (128-bit, SSE) registers and YMM (256-bit, Intel® AVX) registers. The use of vector length extensions allows the capabilities of EVEX encodings including the use of mask registers and access to registers 16..31 to be applied to XMM and YMM registers, instead of only to ZMM registers.
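
For instance (a sketch assuming AVX512F plus AVX512VL support), the same masked and zero-masked idioms can be applied directly to a 256-bit YMM operand:

#include <immintrin.h>

// Masked add on a 256-bit vector: lanes whose mask bit is 0 are zeroed.
// Without AVX512VL this form of masking is only available on 512-bit registers.
__m256 masked_add(__m256 a, __m256 b, __mmask8 lanes)
{
    return _mm256_maskz_add_ps(lanes, a, b);
}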

Mask registers

In previous generations of Intel AVX and Intel® AVX2 the ability to mask bits was limited to load and store operations. In Intel AVX-512, this feature has been greatly expanded with eight new opmask registers used for conditional execution and efficient merging of destination operands. The width of each opmask register is 64 bits, and they are identified as k0–k7. Seven of the eight opmask registers (k1–k7) can be used in conjunction with EVEX-encoded Intel AVX-512 foundation instructions to provide conditional processing, such as with vectorized remainders that only partially fill the register, while the opmask register k0 is typically treated as a "no mask" when unconditional processing of all data elements is desired. Additionally, the opmask registers are also used as vector flags/element level vector sources to introduce novel SIMD functionality, as seen in new instructions such as VCOMPRESSPS. Support for the 512-bit SIMD registers and the opmask registers is managed by the operating system using XSAVE/XRSTOR/XSAVEOPT instructions (see Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B, and Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A).

map of opmask register k1
Figure 6. Example of opmask register k1.
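
A common use of write masks is to handle a loop remainder with a partial mask instead of a scalar tail loop. A minimal sketch using intrinsics (the function name is illustrative):

#include <immintrin.h>

// a[i] += b[i] for n floats, using a k-register to mask off the tail lanes.
void add_arrays(float* a, const float* b, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(a + i, _mm512_add_ps(va, vb));
    }
    int rem = n - i;
    if (rem > 0) {
        __mmask16 tail = (__mmask16)((1u << rem) - 1);   // low 'rem' bits set
        __m512 va = _mm512_maskz_loadu_ps(tail, a + i);  // masked load suppresses faults
        __m512 vb = _mm512_maskz_loadu_ps(tail, b + i);
        _mm512_mask_storeu_ps(a + i, tail, _mm512_add_ps(va, vb));
    }
}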

Embedded rounding

Embedded rounding provides additional support for math calculations by allowing the floating-point rounding mode to be explicitly specified for an individual operation, without having to modify the rounding controls in the MXCSR control register. In previous SIMD instruction extensions, rounding control is generally specified in the MXCSR control register, with a handful of instructions providing per-instruction rounding override via encoding fields within the imm8 operand. Intel AVX-512 offers a more flexible encoding attribute to override MXCSR-based rounding control for floating-point instructions with rounding semantics. This rounding attribute, embedded in the EVEX prefix, is called Static (per instruction) Rounding Mode or Rounding Mode Override. Static rounding also implies suppress-all-exceptions (SAE) behavior, as if all floating-point exceptions were disabled and no status flags are set. Static rounding enables better accuracy control in intermediate steps for division and square root operations for extra precision, while the default MXCSR rounding mode is used in the last step. It can also help in cases where precision is needed for the least significant bit, such as in range reduction for trigonometric functions.
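
With intrinsics, the per-instruction rounding override (which also implies SAE) is passed as an extra argument. A minimal sketch (the function name is illustrative):

#include <immintrin.h>

// Add with an explicit round-toward-negative-infinity override.
// The _MM_FROUND_NO_EXC flag is required: static rounding implies SAE.
__m512d add_round_down(__m512d a, __m512d b)
{
    return _mm512_add_round_pd(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}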

Embedded broadcast

Embedded broadcast provides a bit field to encode data broadcast for some load-op instructions, such as instructions that load data from memory and perform some computational or data movement operation. A source element from memory can be broadcasted (repeated) across all elements of the effective source operand without requiring an extra instruction. This is useful when we want to reuse the same scalar operand for all operations in a vector instruction. Embedded broadcast is only enabled on instructions with an element size of 32 or 64 bits, and not on byte and word instructions.
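
In source code this typically appears as a scalar splat; a compiler targeting Intel AVX-512 can fold the load-and-broadcast of the scalar into the EVEX-encoded arithmetic instruction, so no separate broadcast instruction is needed. A minimal sketch (the function name is illustrative):

#include <immintrin.h>

// Scale a vector by a scalar held in memory. A compiler targeting AVX-512
// can encode the {1to16} embedded broadcast of *scale directly in the VMULPS.
__m512 scale_by(const float* scale, __m512 v)
{
    return _mm512_mul_ps(v, _mm512_set1_ps(*scale));
}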

Quadword integer arithmetic

Quadword integer arithmetic removes the need for expensive software emulation sequences. These instructions include gather/scatter with D/Qword indices, and instructions that can partially execute, where k-reg mask is used as a completion mask.

Table 4. Quadword integer arithmetic instructions.

Instruction | Description
VPADDQ zmm1 {k1}, zmm2, zmm3 | INT64 addition
VPSUBQ zmm1 {k1}, zmm2, zmm3 | INT64 subtraction
VP{SRA,SRL,SLL}Q zmm1 {k1}, zmm2, imm8 | INT64 shift (imm8)
VP{SRA,SRL,SLL}VQ zmm1 {k1}, zmm2, zmm3 | INT64 shift (variable)
VP{MAX,MIN}Q zmm1 {k1}, zmm2, zmm3 | INT64 max, min
VP{MAX,MIN}UQ zmm1 {k1}, zmm2, zmm3 | INT64 max, min (unsigned)
VPABSQ zmm1 {k1}, zmm2, zmm3 | INT64 absolute value
VPMUL{DQ,UDQ} zmm1 {k1}, zmm2, zmm3 | 32x32 = 64 integer multiply
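
A minimal sketch of masked 64-bit integer arithmetic using intrinsics (the function name is illustrative):

#include <immintrin.h>

// dst = max(a, b) on 8 packed INT64 lanes, updating only the lanes selected
// by the write mask; unselected lanes keep the value from 'src' (VPMAXSQ).
__m512i masked_max_epi64(__m512i src, __mmask8 k, __m512i a, __m512i b)
{
    return _mm512_mask_max_epi64(src, k, a, b);
}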

Math support

Math support is designed to aid math library writing and to benefit financial applications. Available data types include PS, PD, SS, and SD. IEEE division and square root, DP transcendental primitives, and new transcendental support instructions are also included.

Table 5. A portion of the 30 math support instructions.

Instruction | Description
VGETEXP {PS,PD,SS,SD} | Obtain exponent in FP format
VGETMANT {PS,PD,SS,SD} | Obtain normalized mantissa
VRNDSCALE {PS,PD,SS,SD} | Round to scaled integral number
VFIXUPIMM {PS,PD,SS,SD} | Patch output numbers based on inputs
VRCP14 {PS,PD,SS,SD} | Approx. reciprocal with rel. error 2^-14
VRSQRT14 {PS,PD,SS,SD} | Approx. rsqrt with rel. error 2^-14
VDIV {PS,PD,SS,SD} | IEEE division
VSQRT {PS,PD,SS,SD} | IEEE square root
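
These primitives are also exposed as intrinsics. A minimal sketch of an approximate reciprocal refined with one Newton-Raphson step (the function name is illustrative):

#include <immintrin.h>

// Approximate 1/x with VRCP14PS (relative error <= 2^-14), then refine
// with one Newton-Raphson iteration: r = r * (2 - x * r).
__m512 fast_recip(__m512 x)
{
    __m512 r   = _mm512_rcp14_ps(x);
    __m512 two = _mm512_set1_ps(2.0f);
    return _mm512_mul_ps(r, _mm512_fnmadd_ps(x, r, two));
}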

New permutation primitives

Intel AVX-512 introduces new permutation primitives, such as two-source shuffles and 16/32-entry table lookups. These are useful for transcendental function support, matrix transpose, and emulating a variable VALIGN.

Table 6. Two-source shuffles instructions

2-Src Shuffles
VSHUF{PS,PD}
VPUNPCK{H,L}{DQ,QDQ}
VUNPCK{H,L}{PS,PD}
VPERM{I,D}2{D,Q,PS,PD}
VSHUF{F,I}32X4

graph giving an example of a process
Figure 7. Example of a two-source shuffles operation.

Expand and compress

Expand and compress allow vectorization of conditional loops. Similar to the Fortran pack/unpack intrinsics, they provide memory fault suppression and can be faster than gather/scatter; compress performs the opposite operation of expand. The figure below shows an example of an expand operation.

VEXPANDPS zmm0 {k2}, [rax]

Moves compressed (consecutive) elements in register or memory to sparse elements in register (controlled by mask), with merging or zeroing.

a diagram
Figure 8. Expand instruction and diagram.
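
A minimal sketch of the companion compress operation, which is a common way to vectorize a conditional copy (the function name, and the assumption that n is a multiple of 16, are illustrative):

#include <immintrin.h>

// Copy only the positive elements of 'src' to the front of 'dst',
// returning how many were written (VCOMPRESSPS via compress-store).
int copy_positive(const float* src, float* dst, int n)
{
    int written = 0;
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_loadu_ps(src + i);
        __mmask16 pos = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
        _mm512_mask_compressstoreu_ps(dst + written, pos, v);
        written += (int)_mm_popcnt_u32((unsigned)pos);
    }
    return written;
}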

Bit Manipulation

Intel AVX-512 provides support for bit manipulation operations on mask and vector operands including vector rotate. These operations can be used to manipulate mask registers, and they have some application with cryptography algorithms.

Table 7. Bit manipulation instructions.

Instruction

Description

KUNPCKBW k1, k2, k3

Interleave bytes in k2 and k3

KSHIFT{L,R}W k1, k2, imm8

Shift bits left/right using imm8

VPROR{D,Q} zmm1 {k1}, zmm2, imm8

Rotate bits right using imm8

VPROL{D,Q} zmm1 {k1}, zmm2, imm8

Rotate bits left using imm8

VPRORV{D,Q} zmm1 {k1}, zmm2, zmm3/mem

Rotate bits right w/ variable ctrl

VPROLV{D,Q} zmm1 {k1}, zmm2, zmm3/mem

Rotate bits left w/ variable ctrl

Universal ternary logical operation

A universal ternary logical operation is another feature of Intel AVX-512 that provides a way to mimic an FPGA cell. The VPTERNLOGD and VPTERNLOGQ instructions operate on dword and qword elements, taking three bit vectors as sources; the corresponding bits of the three sources form 3-bit values that index an 8-bit lookup table supplied in the imm8 byte of the instruction. The 256 possible values of the imm8 byte can be organized as a 16 x 16 Boolean logic table whose entries correspond to simple or compound Boolean logic expressions of the three inputs.
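
A minimal sketch: the imm8 value is simply the truth table of the desired three-input Boolean function, so 0x96 implements a three-way XOR (the function name is illustrative):

#include <immintrin.h>

// Three-input XOR on 16 packed dwords in a single VPTERNLOGD.
// 0x96 is the 8-bit truth table of f(a,b,c) = a ^ b ^ c.
__m512i xor3(__m512i a, __m512i b, __m512i c)
{
    return _mm512_ternarylogic_epi32(a, b, c, 0x96);
}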

Conflict detection instructions

Intel AVX-512 introduces new conflict detection instructions. This includes the VPCONFLICT instruction along with a subset of supporting instructions. The VPCONFLICT instruction allows for detection of elements with previous conflicts in a vector of indexes. It can generate a mask with a subset of elements that are guaranteed to be conflict free. The computation loop can be re-executed with the remaining elements until all the indexes have been operated on.
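
A minimal sketch of detecting which elements of an index vector are conflict-free (the function name is illustrative):

#include <immintrin.h>

// Returns a mask of index elements that do not collide with any earlier
// element in the vector. VPCONFLICTD reports, per element, which earlier
// elements hold the same value; a zero result means "no conflict".
__mmask16 conflict_free_lanes(__m512i indices)
{
    __m512i conf = _mm512_conflict_epi32(indices);          // AVX512CD
    return _mm512_cmpeq_epi32_mask(conf, _mm512_setzero_si512());
}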

Table 8. A portion of the 8 conflict detection instructions.

CDI Instructions
VPCONFLICT{D,Q} zmm1 {k1}, zmm2/mem
VPBROADCASTM{W2D,B2Q} zmm1, k2
VPTESTNM{D,Q} k2 {k1}, zmm2, zmm3/mem
VPLZCNT{D,Q} zmm1 {k1}, zmm2/mem

For every element in the source vector, VPCONFLICT{D,Q} compares that element against every other element and generates a mask identifying the matches, ignoring elements to the left of the current one (that is, newer elements).

a diagram
Figure 9. Diagram of mask generation for VPCONFLICT.

In order to benefit from CDI, use Intel compilers version 16.0 in Intel® C++ Composer XE 2016, which will recognize potential run-time conflicts and generate VPCONFLICT loops automatically.

Transcendental support

Additional 512-bit instruction extensions have been provided to accelerate certain transcendental mathematic computations and can be found in the instructions VEXP2PD, VEXP2PS, VRCP28xx, and VRSQRT28xx, also known as Intel AVX-512 exponential and reciprocal instructions. These can benefit some finance applications.

Compiler support

Intel AVX-512 optimizations are included in Intel compilers version 16.0 in Intel C++ Composer XE 2016 and the GNU Compiler Collection (GCC) 5.0 (NASM 2.11.08 and binutils 2.25). Table 9 summarizes compiler arguments for optimization on the Intel Xeon processor D-2100 product family microarchitecture with Intel AVX-512.

Table 9. Summary of Intel® Xeon® processor D-2100 product family compiler optimizations.

Compiler Optimizations for Intel® AVX-512 on Intel® Xeon® processor D-2100 product family microarchitecture

Intel® Compilers version 16.0 or greater

GCC 5.0 or greater

General optimizations

-QxCOMMON-AVX512 on Windows* with Intel Compilers
-xCOMMON-AVX512 on Linux* with Intel Compilers

-mavx512f -mavx512cd on Linux with GCC

Intel Xeon processor D-2100 product family specific optimizations

-QxCORE-AVX512 on Windows with Intel Compilers
-xCORE-AVX512 on Linux with Intel Compilers

-mavx512bw -mavx512dq -mavx512vl -mavx512ifma -mavx512vbmi on Linux with GCC

For more information see the Intel® Architecture Instruction Set Extensions Programming Reference.

Intel® Speed Shift Technology

The Intel Xeon processor D-1500 product family introduced hardware power management (HWPM), a new optional processor power management feature in the hardware that liberates the operating system from making decisions about processor frequency. HWPM allows the platform to provide information on all available constraints, allowing the hardware to choose the optimal operating point. Operating independently, the hardware uses information that is not available to software and is able to make a more optimized decision in regard to the p-states and c-states.

The Intel Xeon processor D-2100 product family expands on this feature by providing a broader range of states that it can affect as well as a finer level of granularity and microarchitecture observability via the package control unit (PCU). On the Intel Xeon processor D-1500 product family, the HWPM was autonomous, also known as out-of-band mode, and oblivious to the operating system. The Intel Xeon processor D-2100 product family allows for this as well, but also offers the option for a collaboration between the HWPM and the operating system, known as native mode.

The operating system can directly control the tuning of the performance and power profile when and where it is desired, while elsewhere the PCU can take autonomous control in the absence of constraints placed by the operating system. In native mode, the Intel Xeon processor D-2100 product family is able to optimize frequency control for legacy operating systems, while providing new usage models for modern operating systems. The end user can set these options within the BIOS; see your OEM BIOS guide for more information. Modern operating systems that provide full integration with native mode include Linux*, starting with kernel 4.10, and Windows Server* 2016.

Intel® QuickAssist Technology (Intel® QAT)

Intel® QAT accelerates and compresses cryptographic workloads by offloading the data to hardware capable of optimizing those functions. This makes it easier for developers to integrate built-in cryptographic accelerators into network, storage, and security applications. In the case of the Intel Xeon processor D-2100 product family, the third-generation Intel QAT is integrated into the hardware and offers up to 100 Gb/s crypto, 100 Gb/s compression, and 100K RSA2K operations.

Segments that can benefit from the technology include the following:

  • Server: secure browsing, email, search, big-data analytics (Hadoop*), secure multitenancy, IPsec, SSL/TLS, OpenSSL
  • Networking: firewall, IDS/IPS, VPN, secure routing, web proxy, WAN optimization (IP comp), 3G/4G authentication
  • Storage: real-time data compression, static data compression, secure storage

Supported algorithms include the following:

  • Cipher algorithms: (A)RC, AES, 3DES, Kasumi, Snow3G, and ZUC
  • Hash/authentication algorithms supported: MD5, SHA1, SHA-2, SHA-3, HMAC, AES-XCBC-MAC, Kasumi, Snow 3G, and ZUC
  • Public key cryptography algorithms: RSA, DSA, Diffie-Hellman (DH), ECDSA, ECDH

ZUC and SHA-3 are new algorithms that are included in the third generation of Intel QAT.

Intel® Key Protection Technology (Intel® KPT) is a new supplemental feature of Intel QAT that can be found on the Intel Xeon processor D-2100 product family. Intel KPT was developed to help secure cryptographic keys from platform-level software and hardware attacks when the key is stored and used on the platform. This new feature focuses on protecting keys during runtime usage and is embodied within tools and techniques, and supports both OpenSSL and PKCS#11 cryptographic frameworks.

For a more detailed overview see Intel QuickAssist Technology for Storage, Server, Networking and Cloud-Based Deployments. Programming and optimization guides can be found on the 01 Intel Open Source website.

Internet wide area RDMA protocol (iWARP)

iWARP is a technology that allows network traffic managed by the network interface controller (NIC) to bypass the kernel, reducing the load on the processor because network-related interrupts are avoided. This is accomplished by the NICs communicating with each other via queue pairs to deliver traffic directly into the application user space. Large storage blocks and virtual machine migration tend to place more burden on the CPU due to the network traffic, and this is where iWARP can be of benefit. Because the queue pairs already establish where the data needs to go, the data can be placed directly into the application user space, eliminating the extra copies between kernel space and user space that would normally occur without iWARP.

For more information see the information video on Accelerating Ethernet with iWARP Technology.

comparison diagram
Figure 10. iWARP comparison block diagram.

Select models of the Intel Xeon processor D-2100 product family have integrated Intel® Ethernet connections with up to 4x10 GbE/1 Gb connections that include support for iWARP. This new feature can benefit various segments including network function virtualization and software-defined infrastructure. It can also be combined with the Data Plane Development Kit to provide additional benefits with packet forwarding.

iWARP endpoints communicate using verbs APIs rather than traditional sockets. For Linux, the OFA OpenFabrics Enterprise Distribution (OFED) provides the verbs APIs, while Windows* uses the Network Direct APIs. Check whether your Linux distribution supports OFED verbs; on Windows, support is provided starting with Windows Server 2012 R2.

RAS features

The Intel Xeon processor D-2100 product family includes new RAS (reliability, availability, and serviceability) features. The table below compares its RAS features with those of the previous generation.

Table 10. RAS feature summary table.

Feature | Intel® Xeon® Processor D-1500 Product Family | Intel® Xeon® Processor D-2100 Product Family
MCA and Corrected Machine Check Interrupt (CMCI) | Yes | Yes
MCA Bank Error Control (Cloaking) | Yes | Yes
PCI Express Hot-Plug | Yes (1) | Yes
PCI Advanced Error Reporting | Yes | Yes
PCI Express "Stop and Scream" | Yes (1) | Yes
PCI Express ECRC (End-to-End CRC) | Yes (1) | Yes
Corrupt Data Containment Mode - Uncore (poisoning supported in uncore only / no recovery) | Yes | Yes
Corrupt Data Containment Mode - Core | No | No
x4 Single Device Data Correction (SDDC) | Yes | Yes
Memory Mirroring | No | Yes
Memory Demand/Patrol Scrubbing | Yes | Yes
Data Scrambling with Command and Address | Yes | Yes
Memory Rank Sparing | No | Yes
Enhanced SMM | Yes | Yes

1. Only available on PCIe Gen3 ports.

Intel® Volume Management Device (Intel® VMD)

Intel® VMD is a hardware technology on the Intel Xeon processor D-2100 product family aimed primarily at improving the management of high-speed solid state drives (SSDs). Previously, SSDs were attached to a Serial ATA (SATA) or other interface type, and managing them through software was acceptable. As SSDs move to a direct PCIe* attachment to improve bandwidth, software management of those drives adds more delay. Intel VMD uses hardware to mitigate these management issues rather than relying completely on software. This is accomplished by the Intel-provided NVMe* driver, which works in conjunction with Intel VMD. The NVMe driver allows restrictions that an operating system might otherwise impose to be bypassed. This means that features like hot insert can be available for an SSD even if the operating system doesn't provide them, and the driver can also support third-party (non-Intel) NVMe solid state drives.

Intel® Boot Guard

Intel® Boot Guard adds another level of protection by creating a cryptographic root of trust for measurement: it places a measurement of the early firmware into a platform-protected storage device, such as the trusted platform module (TPM) or Intel® Platform Trust Technology (Intel® PTT). It can also cryptographically verify early firmware using OEM-provided policies. Unlike Intel® Trusted Execution Technology (Intel® TXT), Intel Boot Guard doesn't have any software requirements; it is enabled at the factory and cannot be disabled. Intel Boot Guard operates independently of Intel TXT but is also compatible with it. Intel Boot Guard reduces the chance of malware exploiting the hardware or software components.

diagram
Figure 11. Intel Boot Guard secure boot options.

The figure illustrates three secure boot options:

  1. Measured boot: Intel Boot Guard puts a cryptographic measurement of the early firmware into platform-protected storage, such as the TPM or Intel® Platform Trust Technology (PTT).
  2. Verified boot: Intel Boot Guard cryptographically verifies the early firmware using the OEM-provided policies.
  3. Measured + verified boot: performs both of the above actions.

The early firmware sets up memory, loads the next block into memory, and continues the chain of verification and/or measurement into the platform firmware, which uses UEFI 2.3.1 secure boot for verification and TPM 1.2/2.0/PTT for measurement.

Platform storage extensions

Platform storage extensions provide smarter and more cost-effective storage solutions through integrated technologies that accelerate data movement, protect data, and simplify data management. This is accomplished through different features such as Intel® QuickData Technology, which provides a direct memory access (DMA) engine within the SoC, enabling data copies by dedicated hardware instead of the CPU. Asynchronous DRAM refresh (ADR) helps preserve key data in battery-backed memory in the event of a loss in power. Non-transparent bridging enables redundancy via PCI Express. Lastly, end-to-end CRC protection is provided for the PCIe I/O subsystem.

The Innovation Engine

The Innovation Engine (IE) is an embedded core within the SoC. It is similar to Intel® Management Engine (Intel® ME), with some privilege and I/O differences. The IE is designed to assist OEMs in providing a more secure form of the Intel ME. IE code is cryptographically bound to the OEM, and code that is not authenticated by the OEM will not load. The system can operate normally without having to activate IE because it is an optional feature.

For cloud and embedded segments, basic manageability without the cost, space, or power of a Baseboard Management Controller (BMC) can be attractive. The IE runs simple management applications (for example, the Intelligent Platform Management Interface (IPMI)) and a network stack for out-of-band operations.

diagram
Figure 12. BMC-less manageability for lightweight requirements.

For the enterprise segment, IE can be of value for improving system performance by reducing BMC round trips or System Management Mode (SMM) interrupts on the CPU. IE runs OEM-specific BMC- or BIOS-assist software.

diagram
Figure 13. BMC- or BIOS-assisted configuration.

For more in-depth enterprise-level needs, IE and ME can work together to provide new or enhanced usage models using telemetry and controls provided by Intel. The IE can communicate with the ME to pull in telemetry data and provide additional processing capability.

diagram
Figure 14. IE provides enhancement to Intel® ME firmware.

Intel® Node Manager (Intel® NM)

Intel® NM is a core set of power management features that provide a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel NM reports vital platform information such as power, temperature, and resource utilization using standards-based, out-of-band communications. Second, it provides fine-grained controls such as helping with reduction of overall power consumption or maximizing rack loading, to limit platform power in compliance with IT policies. This feature can be found across Intel's product segments, including the Intel Xeon SoC D-2100 product family, providing consistency within the data center.

The Intel Xeon SoC D-2100 product family includes the fourth generation of Intel NM, which extends control and reporting to a finer level of granularity than on the previous generation. To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Intel NM Programmer's Reference Kit is simple to use and requires no additional external libraries to compile or run. All that is needed is a C/C++ compiler and to then run the configuration and compilation scripts.

Table 11. Intel® NM fourth-generation features.

Capability Area | Intel® Node Manager 4.0
Telemetry & monitoring | Monitor platform power consumption; monitor inlet airflow temperature; support shared power supplies; monitor CPU package and memory power consumption; PMBus support; BMC power reading support; support voltage regulator & current monitor configuration; hot-swap controller support; power component telemetry
Power management during operation | Set platform power limits & policies (16 policies)
API support | ACPI power meter support; DCMI API support; Node Manager IPMI API support over SMBus; ACPI support; Node Manager IPMI API support over IE sideband interface
Power management during boot | Set power-optimized boot mode in BIOS (during next reboot); configure core(s) to be disabled by BIOS (during next reboot); set platform power limit during boot
Performance & characterization | CPU, memory, I/O utilization metrics; Compute Utilization per Second (CUPS)
Hardware protection | SMART/CLST
PSU events | Reduce platform power consumption during power supply event (PSU failover/undercurrent)
Assess platform parameters | Node Manager Power Thermal Utility (determines max, min & efficient power levels)
Platform temp excursions | Reduce platform power consumption during inlet airflow excursion

Author

David Mulnix is a software engineer and has been with Intel Corporation for over 20 years. His areas of focus include software automation, server power, and performance analysis, and he has contributed to the development and support of the Server Efficiency Rating Tool*.

Contributors

Akhilesh Kumar and Elmoustapha Ould-ahmed-vall.

Resources

Intel® 64 and IA-32 Architectures Software Developer's Manual (SDM)

Intel® Architecture Instruction Set Extensions Programming Reference

Intel® Memory Protection Extensions Enabling Guide

Intel Node Manager Website

Intel Node Manager Programmer's Reference Kit

Open Source Reference Kit for Intel® Node Manager

How to set up Intel® Node Manager

Intel® Performance Counter Monitor (Intel® PCM), a better way to measure CPU utilization

Intel® Memory Latency Checker (Intel® MLC), a Linux* tool available for measuring the DRAM latency on your system

Intel® VTune™ Amplifier 2017, a rich set of performance insight into hotspots, threading, locks and waits, OpenCL™ bandwidth and more, with profiler to visualize results

Stellaris 2.0*: Rebuilding the Galaxy

The original article is published by Intel Game Dev on VentureBeat*: Stellaris 2.0: Rebuilding the galaxy. Get more game dev news and related topics from Intel on VentureBeat.

Screen of the Game Stellaris

As Stellaris* approaches its second anniversary, big changes are on the horizon for the space grand strategy/4X hybrid from Paradox Development Studio*. The imminent 2.0* update is one of the most ambitious the studio has worked on, and one that the team has been developing on the side since as far back as Utopia*, which launched in April last year.

"I question the idea that you can't make this kind of large update for the game," says game director Martin Anward. "Some people have asked why we didn't save this for Stellaris 2. Here's the thing: if your house needs a new roof, you renovate a roof, but if we did Stellaris 2, that's not renovating a roof, that's building a new house. The game is more popular than ever, so there's no reason why we shouldn't be able to do this."

The scale of the update and accompanying Apocalypse DLC is the result of the interconnectedness of Stellaris' myriad systems. Paradox* wanted to do a war update, but the studio couldn't do that without changing a lot of other fundamental systems — everything from starbases to fleet movement. "If you fire up Stellaris 2.0," says Anward, "you'll still recognize it as Stellaris, but how you basically play the game, how you build ships, how you expand, how you maneuver, that's all changed. We've not done something like this before."

One of the most dramatic changes is to how starbases and expansion work. Borders used to shift dynamically and could be pushed out by constructing starbases. Though this made expansion seem partially organic, it created problems like empires being forced to declare war to colonize a star system that fell within the borders of an empire that didn't even want it. And it meant that expansion was haphazard, as whole chunks of space were gobbled up by empires just for a single colony.

"The whole colonization and dynamically growing borders that you don't really understand have all been replaced with deliberate choices with trade-offs," Anward explains. Starbases are still involved, though, as they're now used to claim systems, with the borders then shifting to reflect which star systems an empire controls. Anward notes that they've been fleshed out with extra mechanics, too, with trade and shipbuilding now falling under their purview.

Punch It

Screen of the Game Stellaris

"The second pillar is the war and peace system," says Anward. "We've changed how war works entirely. We have additional claims, we have casus belli, we've changed the FTL types — which I know is a controversial decision, but it was required. Otherwise I couldn't see a way to make war all that good."

Stellaris was unusual at launch as players could pick between three different FTL systems, dramatically changing how fleets moved around the galaxy. Warp was slow but ships could travel anywhere, hyperlanes were faster but limited ships to travelling down fixed routes, and wormholes allowed ships to jump between systems instantly, but only ones with the appropriate wormhole gates. 2.0 scraps all but hyperlanes.

"One of the biggest problems with war in the current version is you can't understand, in your head, how fleets will move," explains Anward. "There are too many possible variables with all these different FTL types. You can't look at a war and get an overview of where the enemy could come from. For all its theoretical interesting bits, what it really ends up with is a lot of fleets moving around and you don't understand why."

It was important for Paradox that players didn't feel like they were losing something without getting more features to mitigate the loss. The new FTL system has allowed the team to create 'galactic terrain'. Chokepoints, environmental hazards, and islands of constellations create strategic wrinkles that make space more interesting. Players will be able to create more effective defenses, too, and prepare for invasions without needing to spend hours trying to counter three different FTL types.

Eventually, high-tier techs will let them use jump drives and static wormholes, but there will be restrictions and costs that mean it will still be easier to predict the flow of a war. Ultimately, players still have lots of choices, and you'll even be able to set the hyperlane density at the start of a game, filling space with routes and effectively replicating the warp FTL mechanics. It's an option, though Anward thinks it's better if you exercise some restraint.

Cleaning Up Space

Choice is a big part of any Stellaris update, and this extends to the paid content in the accompanying Apocalypse DLC. As the name suggests, it's full of world-killing engines of destruction and massive capital ships, but it's not just for militant empires. Sure, you can relive the destruction of Alderaan, but if you're playing a spiritualist species, for instance, you can use your planet destroyer to convert the population, while synthetic empires can infect worlds with nanobots, transforming them into cyborgs.

Paradox's approach to DLC is to focus on adding new stuff inspired by other sci-fi universes, like Synthetic Dawn*'s robot empires and now Apocalypse's devastating weapons, leaving the core changes to free updates. Players get big features for free, but it also makes things easier for the developers.

"We typically try to follow the policy of: if this is a core gameplay thing that we want to build on later, we try to keep it free as much as possible," explains Anward. "A good example would be Ascension perks* because that's a thing we made paid but now we're making it free because we want to build on it. Let's say we were to make the new war system paid, that wouldn't work at all. We'd have to support both systems and we could never know which one someone had or built mechanics upon it."

It also means that new players don't have to shell out for DLC immediately just to see how the game has evolved since it launched in 2016. Paradox has also been trying to open the doors to new players through Twitch* and YouTube*. It has a dedicated video team, streaming all the games being published by Paradox Interactive, not just its in-house titles, and recently it aired a YouTube series where YouTube gamers new to Stellaris were flung into a new galaxy. Anward believes it's helped make it easier for newcomers to get into the game.

"It is by far our easiest game to get into in Paradox. If you're starting in Europa Universalis IV*, you just get thrown into it. It's France, it's 1444, now go. Everyone is sending you alliances, wars are going on, history starts unfolding the moment you unpause. In Stellaris, you start with something more manageable. You start with a few spaceships and one system — you can figure it out. But that's relative and there are still people who find it too much for them."

And unlike Europa Universalis*, Crusader Kings*, or Hearts of Iron*, Stellaris isn't limited to the past. "We are not constrained by history. It is also sometimes a drawback because we don't have history to lean on. We have to rely on anchoring what we do in sci-fi tropes, but it does allow for an immense amount of freedom. We can make whatever we want."

Stellaris' 2.0 update and Apocalypse are due out on February 22.

Solving Latency Challenges in End-to-End Deep Learning Applications

Intel® Student Ambassador David Ojika Uses Intel® Movidius™ Myriad™ 2 Technology for Specialized Vision Processing at the Edge

ai banner

Abstract

The Intel® Student Ambassador Program for Artificial Intelligence, part of the Intel® AI Academy, collaborates with universities around the globe. The program offers key resources to artificial intelligence (AI) students, data scientists and developers, including education, access to newly optimized frameworks and technologies, hands-on training, and workshops. This paper details the decoupling of cloud-based deep learning training from accelerated inference at the edge.

While the compute-intensive process of training convolutional neural networks (CNNs) can be greatly enhanced in the cloud, cloud communication introduces the problem of latency which may lead to lagging inference performance in edge devices and mission-critical applications.

Movidius Myriad 2

Intel fellowship recipient David Ojika and graduate research assistant Vahid Daneshmand set out to resolve the problem using specialized vision processors and distributed computing architecture. Their technique, conclusions and future work as they explored end-to-end image analytics with the Intel® Movidius™ Myriad™ 2 vision processing unit (VPU) are examined here.

The compute-intensive process of training machine learning models is being accelerated by cloud computing. Cloud communications, however, introduce the problem of latency during model inference, leading to lagging performance for edge applications.

Solve Deep Learning Challenges with Intel® Technology

Ojika is an Intel fellowship for Code Modernization recipient and a recent doctoral graduate in computer engineering at the University of Florida. He has completed several internships at Intel, where he worked on near-memory accelerators and heterogeneous platforms including Intel® Xeon® processors and FPGAs. Ojika’s research interest spans systems research, focusing on machine learning platforms and architectures for large-scale, distributed data analytics.

Ojika’s Intel internship exposed him to a broad range of hardware and software systems from the company that enabled him to advance his Ph.D. studies. That exposure prompted him to continue his collaboration with Intel as an Intel® Student Ambassador, helping build an AI community at the University of Florida.

ai cpu brain

The training of CNNs is highly computationally expensive, often requiring several hours or days of training with moderate hardware. Deploying the trained model for inference can present unique challenges depending on specific application requirements, for example real-time response, low power utilization, reduced form factor, ease of updating and managing trained models, and so forth. Intel Movidius Myriad 2 technology was chosen as a development platform to address some of these challenges.

Accelerating CNN Architectures with VPUs at the Edge

Much research has gone into utilizing GPUs to train CNNs, which are commonly used in image recognition. But, researchers have dedicated less attention to real-time performance of CNNs in resource-constrained environments where low latency or low power is of utmost importance.

This project leveraged a specialized, low-power VPU at the edge to accelerate the inferencing process of CNNs. The researchers presented a method that simplifies CNN/end-application integration through a microservices approach: a loosely coupled architecture that allows elastic scaling of CNN "services" as requests demand. These services, which process inference requests, feature a lightweight front end (for request admission) and a load-sensitive back end (for request processing), exposing to end applications simplified web interfaces and language-independent APIs that serve CNN models.

Software architecture diagram

Figure 1. Software architecture

Key to the success of their research was the Intel Movidius Myriad 2 VPU, the industry's first always-on vision processor. Offering high performance at low power, this family of vision processors gives developers immediate access to the vision processing core, enabling them to differentiate their development with proprietary capabilities. The Intel Movidius Myriad 2 VPU also offers its dedicated vision processing platform in a small footprint.

system overview diagram

Figure 2. System Overview

Intel Movidius Myriad 2

 Intel Movidius Myriad 2 VPU

The first step in their development was to integrate trained CNN models into the Intel Movidius technology tool chain. For demonstration purposes, the team obtained publicly available, pre-trained models, including GoogLeNet* and ResNet-50, trained on the ImageNet dataset with Caffe* and TensorFlow*. Next, they compiled each of the Caffe and TensorFlow models into Movidius-specific file formats using the provided Intel® Movidius™ Neural Compute Stick (NCS) toolkit. This toolkit also supports other advanced features such as checking and profiling of compiled models.

Next, the team designed and implemented two microservices, a Java*-based front end and a Python*-based back end (Figure 1), which were then deployed on an Intel Atom® processor-based platform, as shown in Figure 2. Requests were received by the Intel Atom processor-based platform on behalf of the Intel Movidius Myriad 2 VPU, which then processed those requests accordingly.

Finding Workarounds for Virtualization Support

A major issue Ojika encountered involved virtualization support for the Intel Movidius NCS. Although his team managed to find a workaround, they have alerted the Intel Movidius NCS team to the challenge and hope to integrate a solution in their future development efforts.

The Intel Movidius NCS toolkit, it should be noted, provides an important tool for dealing with trained CNNs in end-to-end deployment scenarios such as Ojika’s use case. The toolkit is Python based, with intuitive APIs that allowed the team to easily integrate the Intel Movidius NCS tool chain into custom applications.

A Simpler Way to Deploy Deep Neural Networks

Ojika’s solution will significantly reduce the management complexity of deploying CNNs at scale in resource-constrained environments. It will also help maximize resource utilization, including energy and network bandwidth, as well as return on hardware investment. Currently, the solution is useful for real-time video analytics, such as in drones, surveillance, and facial recognition.

At present, the number of clients and back-end components limits performance. In the future, they plan to implement an automated, elastic scaling mechanism for handling requests within a set of defined service-level agreements. And, they will design an efficient resource utilization scheme based on network traffic and power constraints. The researchers also plan to explore the use of overlay networks for a larger-scale deployment of their proposed architecture.

The Intel Movidius Myriad 2 VPU was found to achieve real-time performance for CNN inference on embedded devices. Ojika and Daneshmand proposed a software architecture that presents inference as a web service, enabling a shared platform for image analytics on embedded devices and latency-sensitive applications.

Check out David's Intel® Developer Mesh project for more details and updates.

Join the Intel® AI Academy

Sign up for the Intel® AI Academy and access essential learning materials, community, tools and technology to boost your AI development. Apply to become an Intel AI Student Ambassador and share your expertise with other student data scientists and developers.

References

D. Guo, W. Wang, G. Zeng and Z. Wei, "Microservices Architecture Based Cloudware Deployment Platform for Service Computing," 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), Oxford, 2016

Ganguly, Arijit, et al. “IP over P2P: enabling self-configuring virtual IP networks for grid computing.” Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. IEEE, 2006


Dauntless*: Making a Different Kind of Monster Hunting Game

The original article is published by Intel Game Dev on VentureBeat*: Dauntless: Making a different kind of monster hunting game. Get more game dev news and related topics from Intel on VentureBeat.

Dauntless Monster Hunting Game

Stalking the fiendish beasts in Dauntless* isn't an easy task. And while you can do it alone, developer Phoenix Labs* hopes you'll bring a friend or three along the way. According to CEO and cofounder Jesse Houston, the game is an "unapologetically cooperative" experience.

"Every time I read a forum post where it's like, 'Oh man, I totally met this random person on the Internet last night and now we're gaming friends forever," I swoon!" said Houston, laughing. "I'm just like, 'Yes, our job is done.' I am completing my mission in life of making people happy and making friends."

Despite only being in closed beta on PC for a few months, Dauntless already has a vibrant community filled with people who are willing to help their fellow players. It's the kind of scenario Houston and his colleagues dreamed of when they left their jobs (from places like Riot Games* and BioWare*) to form Phoenix Labs in 2014. They wanted to build a studio that would foster a close relationship with its players and let them have a say in the development process.

And a compelling multiplayer game is a great way of bringing those large communities together. That's how the idea for Dauntless, and its free-to-play business model, originated. It's part of a small but growing number of titles in the niche hunting-action genre. At the top of that list is Capcom*'s popular Monster Hunter franchise, so it's no surprise how influential it was on Phoenix Labs's project.

In Dauntless, you play as a Slayer, a class of warriors who defend the Shattered Isles from massive Behemoths. Killing these dangerous animals will reward you with materials that you can use to craft better weapons and armor, which in turn allow you to tackle bigger and bigger threats.

Dauntless also has an intriguing story about the people in its world and why the Behemoths, if left alive, will destroy it. But instead of front-loading the game with a ton of exposition, Phoenix Labs is splitting the narrative into bite-sized chunks.

"We wouldn't be a bunch of ex-BioWare devs if we weren't trying to tell a story. … We're trying to take a very subtle approach to it. We want it to be a really interesting part of the game, but we don't want to hit you over the head with the 'story club' with big, long cutscenes and loads of dialogue that are just in the way," said Houston.

Over the past few months, the studio has been releasing small fragments of lore through newsletters and the Behemoth bestiary on its website. Non-playable characters in the city of Ramsgate — the central hub where you upgrade your gear — will also offer clues in their conversations.

Houston cited Dark Souls* and Bloodborne* as the type of non-linear storytelling they're aiming for. If you want to know what's really going on in Dauntless, you'll have to work for it by putting all the different pieces together.

A team of Slayers take on the deadly Shrike
Figure 1. A team of Slayers take on the deadly Shrike

The Lifeblood of a Live Game

Aside from making a big splash at The Game Awards 2016 with an evocative announcement trailer, the developers haven't spent too much time or resources on marketing. So they weren't sure how early adopters — the only way to access Dauntless right now is by purchasing one of three founder's packs — would react to the game when the closed beta launched in August 2017.

But players came in droves, with many of them streaming the game on Twitch*.

"The huge response took our service down — we literally had to rebuild the platform in real-time that day. It was a crazy moment for us because we expected a couple of hundred people to show up, and thousands and thousands of people showed up and I was like, 'Oh shit. … People really want this game,'" said Houston.

Since then, the team has been soliciting feedback from the community after each major update. They interact with players through forums and social media, and try to be as open and transparent as possible about their design decisions — which is why the development road-map is publicly available. Opening up those lines of communications is one of the lessons Houston learned at Riot, which operates the successful multiplayer game League of Legends.

"One of the big takeaways that I got from working on League of Legends was the importance of thinking about a game in a live-service mentality, that iteration and improvement are the lifeblood of what makes a really good live game. If you're not listening to the community and you're not working with them, you're not going to iterate in the right places," he said.

These creator-to-player conversations have already led to big adjustments to Dauntless's matchmaking and progression systems, as well as the monetization model. One such change had to do with loot boxes. Originally, players could only acquire cosmetic items by buying Chroma Cores, which spit out a random prize when opened.

But because of industry-wide conversations around loot boxes — mostly due to the controversy surrounding Star Wars: Battlefront II* — and thoughtful feedback from the Dauntless community, Phoenix Labs decided to get rid of the Chroma Cores. Now you can just buy emotes, experience boosts, and other items directly from the in-game store.

"That wasn't really a negative conversation with our community. It was like, 'Here's what we're thinking, here's why we're thinking it,' and they were like, 'It's not working.' And we went, 'OK, let's change it until it works,' and then we did," said Houston.

Customized Slayer with different pieces of armor
Figure 2. You can customize your Slayer with different pieces of armor.

A Peaceful Coexistence

If Phoenix Labs sticks to its plan of launching the open beta later this year, Dauntless will inevitably go head-to-head with the PC release of Monster Hunter: World (set for the fall). When the latter debuted on consoles in January, it became an instant hit. According to industry tracking firm NPD, Capcom's juggernaut took the No. 1 spot that month, beating out perennial bestsellers like Call of Duty* and Grand Theft Auto V*.

But the Dauntless team isn't worried about the competition. In fact, the popularity around Monster Hunter: World has only helped them. Phoenix Labs saw an increase in the number of players in the beta after World's release. It also generated more discussions on Reddit* and forums as players compared their experiences between both games.

"The really interesting thing is that the last couple of weeks, since Monster Hunter has been out, have also been some of our biggest weeks as well. It is really exposing a wider group of folks to a really cool genre that has otherwise been kind of hard to attack. … At the same time, it was really validating for us," said Houston.

Despite what some observers may think, Phoenix Labs doesn't consider Monster Hunter (or Capcom for that matter) as a competitor. Houston likened it to a "rising tide lifts all boats" situation: The awareness that Monster Hunter brings to other hunting-action games can only benefit them.

"I have a ton of respect for Capcom. They created a really awesome genre! And I just want to help make it better, and I want them to help us make our game better," he said.

To play the closed beta yourself, head on over to the Dauntless website.

CPU Capability Detect using Unreal Engine* 4.19

With the release of Unreal Engine* 4.19, many features have been optimized for multicore processors. In the past, game engines traditionally followed console design points, in terms of graphics features and performance. In general, most games weren't optimized for the processor, which can leave a lot of PC performance sitting idle. Intel's work with Unreal Engine 4 is focused on unlocking the potential of games as soon as developers work in the engine, to fully take advantage of all the extra processor computing power that a PC platform provides.

Intel's enabling work for Unreal Engine* 4.19 delivered the following:

  • Increased the number of worker threads to match a user's processor
  • Increased the throughput of the cloth physics system
  • Integrated support for Intel® VTune™ Amplifier

To take advantage of the additional computing power on high-end CPUs, Intel has developed a plugin that gives detailed CPU metrics and SynthBenchmark performance indicators. The metrics from this plugin can be used to differentiate features and content by CPU capability. Binning features and content in this manner will allow your game to run on a range of systems without impacting the overall performance.

Unreal Engine* 4.19 Capability Detect Plugin

Using the Capability Detect Plugin, you can access C++ and Blueprint-compatible helper functions for CPU metrics, render hardware interface (RHI) functions, and the SynthBenchmark performance indexes for the CPU/GPU.

Table 1. CPU detect functions

Third-Party Function | Blueprint Function | Description
Intel_IsIntelCPU() | IsIntelCPU() | Returns TRUE if Intel CPU
Intel_GetNumLogicalCores() | GetNumLogicalCores() | Returns Number of Logical Cores
Intel_GetNumPhysicalCores() | GetNumPhysicalCores() | Returns Number of Physical Cores
Intel_GetCoreFrequency() | GetCoreFrequency() | Returns the current Core Frequency
Intel_GetMaxBaseFrequency() | GetMaxBaseFrequency() | Returns the Maximum Core Frequency
Intel_GetCorePercMaxFrequency() | GetCorePercMaxFrequency() | Returns % of Maximum Core Frequency in use
Intel_GetFullProcessorName() | GetFullProcessorName() | Returns Long Processor Name
Intel_GetProcessorName() | GetProcessorName() | Returns Short Processor Name
Intel_GetSKU() | N/A | Not in Use

Table 2. Cache and memory detect functions

Cache and Memory Functions
Third-Party Function | Blueprint Function | Description
Intel_GetUsablePhysMemoryGB() | GetUsablePhysMemoryGB() | Returns Usable Physical Memory in GB
Intel_GetComittedMemoryMB() | GetComittedMemoryMB() | Returns Committed Memory in MB
Intel_GetAvailableMemoryMB() | GetAvailableMemoryMB() | Returns Available Memory in MB

Table 3. Render hardware interface (RHI) wrapper functions

RHI Wrapper Functions
Third-Party Function | Blueprint Function | Description
N/A | IsRHIIntel() | Returns TRUE if GPU is Intel
N/A | IsRHINVIDIA() | Returns TRUE if GPU is NVIDIA
N/A | IsRHIAMD() | Returns TRUE if GPU is AMD
N/A | RHIVendorName() | Returns Vendor Name of GPU

Table 4. SynthBenchmark wrapper functions

SynthBenchmark Wrapper Functions
Third-Party Function | Blueprint Function | Description
N/A | ComputeCPUPerfIndex() | 100: avg. good CPU, <100: slower, >100: faster
N/A | ComputeGPUPerfIndex() | 100: avg. good GPU, <100: slower, >100: faster

SynthBenchmark

When using the SynthBenchmark wrappers, be aware that the first call to each of ComputeCPUPerfIndex() and ComputeGPUPerfIndex() will incur a slight performance cost while the performance indexes are computed. Performance index values are cached after the first call, and subsequent calls to either ComputeCPUPerfIndex() or ComputeGPUPerfIndex() will not have the additional overhead of running the benchmark. For performance-critical aspects of your game it is recommended to call both of these functions during startup or loading screens, as in the sketch below.
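
A minimal sketch of such a warm-up call (the actor class name is illustrative, and it assumes ComputeGPUPerfIndex() is exposed on the same UCapabilityDetectBPLib class that the later code samples use for ComputeCPUPerfIndex()):

// Warm up the SynthBenchmark caches during startup so that later calls to
// the performance index functions return cached values with no extra cost.
void AMyStartupActor::BeginPlay()
{
    Super::BeginPlay();
    const float CPUIndex = UCapabilityDetectBPLib::ComputeCPUPerfIndex();
    const float GPUIndex = UCapabilityDetectBPLib::ComputeGPUPerfIndex();
    UE_LOG(LogTemp, Log, TEXT("Cached perf indexes: CPU=%.1f GPU=%.1f"), CPUIndex, GPUIndex);
}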

Installing the Capability Detect Plugin

1. Download the Capability Detect Plugin from GitHub* and open the project folder.

Project folder caption

2. If the Plugins folder doesn't exist in the root directory, add it now.

Steps to create the plugin folder

Plugin folder image

3. Extract the CapabilityDetect plugin into the Plugins folder.

CapabilityDetect plugin folder

4. Launch the project using the .uproject file.

Steps to launch the project

5. Go to Edit > Plugins in the main menu. When the Plugin window loads, the Capability Detect Plugin should be installed in the project.

Installed Capability Detect Plugin

Now that the plugin is installed, it can be used to differentiate game content and features. In the next section we'll describe how to use this plugin to bin features by CPU capabilities.

Unreal Engine 4.19 Feature Differentiation

Detecting capabilities

In order to segment features by platform configuration, create a new UDataAsset named UPlatformConfig. UPlatformConfig will store the characteristics of the platform being targeted such as the number of physical cores, logical cores, usable physical memory, processor name, and/or SynthBenchmark performance index.

#include "CoreMinimal.h"
#include "Engine/DataAsset.h"
#include "PlatformConfig.generated.h"
/**
 * Platform Configuration Data Asset
 */
UCLASS(BlueprintType)
class CAPABILITYDETECTDEMO_API UPlatformConfig : public UDataAsset
{
       GENERATED_BODY()
public:
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float CPUPerfIndex;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       FString Name;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       bool IsIntelCPU;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       int NumPhysicalCores;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       int NumLogicalCores;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float UsablePhysMemoryGB;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float ComittedMemoryMB;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float AvailableMemoryMB;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float CacheSizeMB;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float MaxBaseFrequency;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float CoreFrequency;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       float CorePercMaxFrequency;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       FString FullProcessorName;
       UPROPERTY(EditAnywhere, BlueprintReadWrite, Category = "Platform Configuration")
       FString ProcessorName;
};

Next, we can set up a class called UCapabilityTest (declared in PlatformTest.h) with static functions to compare UPlatformConfig properties to the capabilities detected by the plugin.

#include "CoreMinimal.h"
#include "PlatformTest.generated.h"
class UPlatformConfig;
/**
 * Static functions for testing capabilities.
 */
UCLASS(BlueprintType)
class CAPABILITYDETECTDEMO_API UCapabilityTest : public UObject
{
       GENERATED_BODY()
public:
       UFUNCTION(BlueprintCallable, Category = "Capabilities")
       static bool CapabilityTest(UPlatformConfig* config);
       UFUNCTION(BlueprintCallable, Category = "Capabilities")
       static UPlatformConfig* GetCapabilityLevel();
};

The CapabilityTest() function will compare a UPlatformConfig to features detected by the Capability Detect Plugin. In this case, we will check if physical cores, logical cores, and the SynthBenchmark CPU performance index exceed the properties of the UPlatformConfig passed into the function.

bool UCapabilityTest::CapabilityTest(UPlatformConfig* config)
{
    // True if system capabilities exceed platform definitions
    return
        UCapabilityDetectBPLib::GetNumPhysicalCores() >= config->NumPhysicalCores   
        && UCapabilityDetectBPLib::GetNumLogicalCores() >= config->NumLogicalCores
        && UCapabilityDetectBPLib::ComputeCPUPerfIndex() >= config->CPUPerfIndex;
}

Now that we have a way to compare capabilities, we can create another function to set up and test platform configurations. We'll create a function called GetCapabilityLevel() and define four segmentation levels named LOW, MEDIUM, HIGH, and ULTRA. We'll provide a name that corresponds to the feature level and specify the physical/logical cores and SynthBenchmark performance index for each configuration being tested. Finally, since we are using a greater-than-or-equal comparison in CapabilityTest(), we will test from highest to lowest and return the result.

UPlatformConfig* UCapabilityTest::GetCapabilityLevel()
{
       // Create Platform Definitions
       UPlatformConfig *ULTRA, *HIGH, *MEDIUM, *LOW;
       ULTRA = NewObject<UPlatformConfig>();
       HIGH = NewObject<UPlatformConfig>();
       MEDIUM = NewObject<UPlatformConfig>();
       LOW = NewObject<UPlatformConfig>();
       // Assign Properties to platform definitions.
       // LOW - 2 Physical Cores 4 Hyper-threads
       LOW->Name = TEXT("LOW");
       LOW->NumPhysicalCores = 2;
       LOW->NumLogicalCores = 4;
       LOW->CPUPerfIndex = 0.0;
       // MEDIUM - 4 Physical Cores 8 Hyper-threads
       MEDIUM->Name = TEXT("MEDIUM");
       MEDIUM->NumPhysicalCores = 4;
       MEDIUM->NumLogicalCores = 8;
       MEDIUM->CPUPerfIndex = 50.0;
       // HIGH - 6 Physical Cores 12 Hyper-threads
       HIGH->Name = TEXT("HIGH");
       HIGH->NumPhysicalCores = 6;
       HIGH->NumLogicalCores = 12;
       HIGH->CPUPerfIndex = 100.0;
       // ULTRA - 8 Physical Cores 16 Hyper-threads
       ULTRA->Name = TEXT("ULTRA");
       ULTRA->NumPhysicalCores = 8;
       ULTRA->NumLogicalCores = 16;
       ULTRA->CPUPerfIndex = 125.0;
       // Test platforms against detected capabilities.
       if (CapabilityTest(ULTRA)) {
              return ULTRA;
       }
       if (CapabilityTest(HIGH)) {
              return HIGH;
       }
       if (CapabilityTest(MEDIUM)) {
              return MEDIUM;
       }
       return LOW;
}

Detecting capabilities in C++

With the UCapabilityTest class we now have a way to determine CPU feature levels. We can use the results from GetCapabilityLevel() to differentiate content in either C++ or Blueprints. For instance, if we create an actor, we can differentiate features in the Tick function.

// Called every frame
void AMyActor::Tick(float DeltaTime)
{
       Super::Tick(DeltaTime);
       UPlatformConfig* CapabilityLevel = UCapabilityTest::GetCapabilityLevel();
       if (CapabilityLevel->Name == TEXT("LOW"))
       {
              // Use Simple Approximation for LOW end CPU...
              // e.g. Spawn 100 CPU Particles...
       }
       else if (CapabilityLevel->Name == TEXT("MEDIUM"))
       {
              // Use Advanced Approximation for MID range CPU...
              // e.g. Spawn 200 CPU Particles
       }
       else if (CapabilityLevel->Name == TEXT("HIGH"))
       {
              // Use Simple Simulation for HIGH end CPU...
              // e.g. Spawn 300 CPU Particles
       }
        else if (CapabilityLevel->Name == TEXT("ULTRA"))
        {
               // Use Advanced Simulation for ULTRA CPU...
              // e.g. Spawn 400 CPU Particles
       }
}
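
Note that GetCapabilityLevel() allocates four new UPlatformConfig objects on every call, so calling it from Tick repeats that work each frame. A minimal sketch of an alternative, assuming the actor declares a hypothetical UPlatformConfig* member named CachedCapabilityLevel, is to query the level once in BeginPlay and reuse the cached result:

// Hypothetical caching variant: CPU capabilities do not change at runtime,
// so a single query in BeginPlay is sufficient.
void AMyActor::BeginPlay()
{
       Super::BeginPlay();
       CachedCapabilityLevel = UCapabilityTest::GetCapabilityLevel();
}

// Called every frame
void AMyActor::Tick(float DeltaTime)
{
       Super::Tick(DeltaTime);
       if (CachedCapabilityLevel && CachedCapabilityLevel->Name == TEXT("ULTRA"))
       {
              // e.g. Spawn 400 CPU Particles
       }
       // ... handle the remaining levels as in the example above
}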

Detecting capabilities in Blueprints

Alternatively, we can use the same GetCapabilityLevel() function we used in our actor's Tick function in Blueprints, since we decorated it with the UFUNCTION(BlueprintCallable) attribute. In this case, we use the level Blueprint and call the Get Capability Level node after the BeginPlay event. The UPlatformConfig value returned by the Get Capability Level node has a Name property that can be used in a Switch on String node to differentiate features in your level. Finally, we just print the name of the CPU feature level to the screen (Figure 1).


Blueprint Capability Detect
Figure 1. Blueprint capability detect

Lastly, there is a Blueprint function that comes packaged with the Capability Detect Plugin. With this function you can access the detected platform details at finer granularity directly in your Blueprints. Just add the Detect Capabilities node to your Blueprint and use the values you need for your game (Figure 2).


Detect Capabilities Blueprint Node
Figure 2. Detect capabilities blueprint node

Conclusion

With the higher core counts of modern CPUs, we can do much more with our games. However, players with fewer cores may be at a disadvantage compared to players with higher-end systems. To alleviate this disparity, it is possible to bin features using both C++ and Blueprints. Binning features as demonstrated will allow for maximum CPU usage while maintaining a consistent framerate for players with a range of platform configurations.

Accelerating x265 with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)


Introduction

Motivation

Vector units in CPUs have become the de facto standard for acceleration of media and other kernels that exhibit parallelism according to the single instruction, multiple data (SIMD) paradigm.1 These units allow a single register file to be treated as a combination of multiple smaller registers whose cumulative width equals that of the vector register file. A single instruction can therefore operate in parallel on all data in a vector register, resulting in significant speedups for applications whose data access patterns fit this model. Starting from the 64-bit vector register file introduced with MMX™ technology, which can be treated as eight 8-bit lanes, SIMD on Intel® architecture processors has evolved to the 256-bit register files of the Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) generations, which allow for 32 parallel 8-bit operations.

Kernels in media workloads fit this pattern of execution naturally, because the same operation (filtering for example) is uniformly applied across several pixels of a frame. Consequently, several popular open source projects leverage SIMD instructions for code acceleration. The x264 project for Advanced Video Coding (AVC) encoding2 and the x265 project for High Efficiency Video Coding (HEVC) encoding3 are the two widely used media libraries that extensively use multiple generations of SIMD instructions on Intel architecture processors, from MMX technology all the way up to Intel AVX2. As shown in Figure 1, x264 and x265 achieve two times and five times speedup respectively over their corresponding baselines that do not use any SIMD code. The x265 encoder gains more performance from Intel AVX2 when compared to x264, because the quantum of work done per frame is substantially larger for HEVC than for AVC.4

Figure 1. Performance benefit for x264 and x265 from Intel® Advanced Vector Extensions 2 for 1080p encoding with main profile using an Intel® Core™ i7-4500U Processor.

Focus of this whitepaper

The recently released Intel® Xeon® Scalable processors, part of the platform formerly code-named Purley, have introduced the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set.5 Intel AVX-512 instructions are capable of performing two times the number of operations in the same number of cycles as the previous generation Intel AVX2 instruction set. To accommodate this increased throughput, a larger fraction of the die is utilized, which consumes more power than the previous-generation SIMD units. Therefore, certain Intel AVX-512 instructions are expected to cause a higher degradation to CPU clock frequency than others.6 While this reduction in frequency is offset by the increased throughput of the Intel AVX-512 instructions themselves, media applications continue to rely significantly on SIMD instructions from older generations (because not all kernels benefit from the increased width) and on straight-line C code that is not amenable to SIMD conversion; that code may see reduced performance.

This whitepaper presents a case study based on our experience using the Intel AVX-512 SIMD instructions to accelerate the compute intensive kernels of x265. We describe how we offset the reduction in CPU frequency to ensure that the overall encoder achieves positive performance benefits. Through this process, we present recommendations of when we think Intel AVX-512 should be enabled with x265 for HEVC encoding. We also share our experience on when to choose Intel AVX-512 as a vehicle for accelerating media kernels.

Key takeaways

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

  • When choosing specific kernels to accelerate with Intel AVX-512, the compute-to-memory ratio of each kernel should be considered. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
  • For desktop and workstation SKUs (like the Intel® Core™ i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations, because the reduction in CPU clock frequency is rather low.
  • For server SKUs (like the Intel® Xeon® Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.

While the results and recommendations presented in this paper are subject to the limitations of our evaluations and experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

The rest of the paper is organized as follows: The "Background" section presents the background relevant to the technical material presented in the paper. "Acceleration of x265 Kernels with Intel Advanced Vector Extensions 512" discusses the choices we made to accelerate specific kernels of x265 and presents kernel-level results for the main and main10 profiles. "Accelerating x265 Encoding with Intel Advanced Vector Extensions 512" presents the results for the overall encoder for the main and main10 profiles. Finally, Section 5 provides detailed recommendations for when Intel AVX-512 should be enabled when using x265 and generic recommendations for when to choose Intel AVX-512 when accelerating specific kernels. This section also describes future work.

Background

This section presents the relevant background of the concepts presented in this paper. Specifically, section "HEVC Video Encoding" provides the background on HEVC. "x265, an Open Source HEVC Encoder" discusses x265 with specific focus on the existing methods of performance optimizations that it employs. Section "Introduction to the Intel® Xeon® Scalable Processor Platform" presents the relevant background on Intel Xeon Scalable processors, and Section "SIMD Vectorization Using Intel Advanced Vector Extensions 512" discusses in more detail the Intel AVX-512 architecture.

HEVC video encoding

HEVC was ratified as an encoding standard by the JCT-VC (Joint Collaborative Team on Video Coding) in 2013 as a successor to the vastly popular AVC standard.4 The video encoding and decoding processes in HEVC revolve around three units: a coding unit (CU) that represents each block in the picture, a prediction unit (PU) that represents the mode decision, including motion compensated prediction of the CU, and a transform unit (TU) that represents the way in which the generated residual error between the predicted and the actual block is coded.

Initially, a frame is divided into a sequence of its largest non-overlapping coding units, called coding tree units (CTUs). A CTU can then be split into multiple CUs with variable sizes of 64x64, 32x32, 16x16, and 8x8 to form a quad-tree. Each CU is then predicted from a set of candidate blocks, which may be in either the same frame or different frames. If the block used for the prediction is in the same frame, the block is said to be intra-predicted, while if it is in a different frame, it is said to be inter-predicted.

Intra-predicted blocks are represented by a combination of the prediction block and a mode that denotes the angle of the prediction. The allowed modes for intra-prediction are labeled DC, planar, and angular modes representing various angles from the predicted block. Inter-predicted blocks are represented by a combination of the block used for prediction (the reference block) and the motion vector (MV) that represents the delta between the current and the reference block. Blocks that have zero MV are said to use the merge mode, while others use the AMP (Advanced Motion Prediction) mode. The skip mode is a special case of the merge mode in which the predicted block is identical to the source, that is, there is no residual. The AMP modes may use PUs that are the same size as the CU (denoted as 2Nx2N PUs) or may further partition them (denoted as rectangular and asymmetric PUs) to compute the MVs. The residual generated as the difference between the original and the predicted picture is then quantized and coded using TUs that may vary from 32x32 down to 4x4 blocks, depending on the prediction mode.

The entire process of inter, intra, CU, PU, and TU selection is called Rate-Distortion Optimization (RDO). The goal of RDO is to ensure that distortion is minimized at the target bitrate, or that the bitrate is minimized at the target quality level as represented by distortion. Throughout the process of RDO, various combinations of CUs, PUs, and TUs are attempted by an encoder, for which it employs several kernels. In this paper, we chose to vectorize these specific kernels by converting them to use Intel AVX-512 instructions.

HEVC encoding also supports multiple profiles for encoding a video, with each profile representing a different number of bits used to represent each pixel. The main and main10 profiles are popular profiles of HEVC (their AVC counterparts are called the main and high profiles respectively). Each component of a pixel is represented with a minimum of 8 bits in the main profile, resulting in values ranging from 0–255. The main10 profile uses 10 bits per pixel component, allowing for a higher range of 0–1023 and enabling the representation of more detail in the encoded video.

x265, an open source HEVC encoder

The x265 encoder is an open-source encoder that compresses video in compliance with the HEVC standard.7 This encoder has been integrated into several open-source frameworks including VLC*, HandBrake*,8 and FFmpeg9 and is the de facto open-source video encoder for HEVC. The x265 encoder has assembly optimizations for several platforms, including Intel architecture, ARM*, and PowerPC*.

The x265 encoder employs techniques for inter-frame and intra-frame parallelism to deal with the increased complexity of HEVC encoding.10 For inter-frame parallelism, x265 encodes multiple frames in parallel by using system-level software threads. For intra-frame parallelism, x265 relies on the Wavefront Parallel Processing (WPP) tool exposed by the HEVC standard. This feature enables encoding rows of CTUs of a given frame in parallel, while ensuring that the blocks required for intra-prediction from the previous row are completed before the given block starts to encode; as per the standard, this translates to ensuring that the next CTU on the previous row completes before starting the encode of a CTU on the current row. The combination of these features gives a tremendous boost in speed with no loss in efficiency compared to the publicly available reference encoder, HM.

Introduction to the Intel® Xeon® processor Scalable family platform

The Intel® Xeon® processor Scalable family, part of the Intel® platform formerly code-named Purley, is designed to deliver new levels of consistent and breakthrough performance. The platform is based on cutting-edge technology and provides compelling benefits across a broad variety of usage models, including big data, artificial intelligence, high-performance computing, enterprise-class IT, cloud, storage, communication, and Internet of Things. Top enhancements include improved performance for a wide range of workloads, 1.5 times the memory bandwidth, integrated network/fabric, and optional integrated accelerators. Our results with x265 indicate a significant gen-over-gen speedup of 50–67 percent for offline encodes when compared to the previous-generation Intel® Xeon® processor E5-2600. This boost comes primarily from the improved microarchitecture features available on Intel Xeon Scalable processors.

SIMD vectorization using Intel® AVX-512

The Intel AVX-512 vector units present a 512-bit register file, allowing twice the parallel data operations per cycle compared to Intel AVX2. Though the benefits of vectorizing kernels to use the Intel AVX-512 architecture seem obvious, several key questions must be answered specifically for media workloads before embarking on this task. First, is there sufficient parallelism inherently present in media kernels for them to leverage this increased width? Second, is the fraction of the execution that exploits this parallelism sufficiently large that we can expect overall speedups as per Amdahl's law? Third, by enabling such vectorization, is there some effect on the execution of the serial and non-vector code?
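
As a concrete illustration of this width difference (this is not x265 code; the function names are ours), the following intrinsics sketch adds packed 32-bit integers, with the Intel AVX2 version processing 8 lanes per instruction and the Intel AVX-512 version processing 16:

// Illustration only: one Intel AVX-512 instruction covers twice the lanes of Intel AVX2.
#include <immintrin.h>

void add8_avx2(const int* a, const int* b, int* out)      // 8 x 32-bit lanes
{
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), _mm256_add_epi32(va, vb));
}

void add16_avx512(const int* a, const int* b, int* out)   // 16 x 32-bit lanes
{
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    _mm512_storeu_si512(out, _mm512_add_epi32(va, vb));
}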

Acceleration of x265 Kernels with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

As a first step in acceleration, we selected kernels from x265 and accelerated them with handwritten Intel AVX-512 assembly. While automated tools that generate vectorized SIMD code are available, we found that handwritten assembly outperforms auto-vectorizing tools, which convinced us to use this technique. This section details how this was done and the cycle-count gains we observed from these kernels for sample runs in the main and main10 profiles.

Selecting the kernels to accelerate

We selected over 1,000 kernels from the core compute of x265 to optimize with Intel AVX-512 instructions for the main and main10 profiles. These kernels were chosen based on their resource requirements. Some kernels require frequent memory access, like the different block-copy and block-fill kernels, while others involve intense computation, like the DCT, iDCT, and quantization kernels. There is also a third class of kernels that involves a combination of both in varying proportions. We found that ensuring that the buffers accessed by the assembly routines were 64-byte aligned reduces cache misses and in general helps the Intel AVX-512 kernels. A complete list of the kernels optimized with Intel AVX-512 instructions for the main and main10 profiles is given in Appendix A1 and A2 respectively.
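
As a sketch of what such alignment can look like in practice (illustrative only; x265 uses its own allocation helpers), a buffer intended for 512-bit loads can be placed on a 64-byte boundary as follows:

// Sketch: 64-byte-aligned allocation so 512-bit (ZMM) loads do not straddle cache lines.
#include <cstdint>
#include <cstdlib>

constexpr size_t kAlign = 64;  // cache-line size and ZMM register width in bytes

uint8_t* AllocAlignedPlane(size_t bytes)
{
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const size_t padded = (bytes + kAlign - 1) / kAlign * kAlign;
    return static_cast<uint8_t*>(std::aligned_alloc(kAlign, padded));
}

// A fixed-size buffer can instead be aligned statically:
// alignas(64) uint8_t block[64 * 64];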

Framework to evaluate cycle-count improvements

The x265 encoder implements a sample test bench as a correctness and performance measurement tool for assembly kernels. It accepts valid arguments for a given kernel, invokes the C primitive and the corresponding assembly kernel, and compares both output buffers. It verifies all possible corner cases for the given input type by using a randomly distributed set of values. Each assembly kernel is called 100 times and checked against its C primitive output to ensure correctness. To measure performance improvement, the test bench measures the difference in clock ticks (as reported by the rdtsc instruction) between the assembly kernel and the C kernel over 1,000 runs and reports the average.
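
To make the measurement procedure concrete, here is a minimal sketch of such a harness (not x265's actual test bench; the kernel type and names are hypothetical) that averages rdtsc tick counts over repeated calls:

// Minimal cycle-count harness sketch; the kernel signature is illustrative only.
#include <cstdint>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang

using SadFn = int (*)(const uint8_t* pix1, const uint8_t* pix2, intptr_t stride);

uint64_t AverageTicks(SadFn kernel, const uint8_t* a, const uint8_t* b,
                      intptr_t stride, int runs)
{
    const uint64_t start = __rdtsc();
    for (int i = 0; i < runs; ++i)
        kernel(a, b, stride);            // repeatedly invoke the kernel under test
    return (__rdtsc() - start) / runs;   // average ticks per call
}

// Usage sketch: measure the C primitive and the assembly kernel over 1,000 runs,
// after first verifying that both produce identical output buffers.
// uint64_t cTicks   = AverageTicks(sad_8x8_c,      src, ref, 64, 1000);
// uint64_t asmTicks = AverageTicks(sad_8x8_avx512, src, ref, 64, 1000);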

Cycle-Count improvement for kernels in the main and main10 profiles

Figure 2 shows the cycle-count improvements for each of the 500 kernels in the main profile and the 600+ kernels in the main10 profile that were accelerated with Intel AVX-512. In each curve, the kernels are sorted in increasing order of their cycle-count gains over the corresponding Intel AVX2 implementation. Appendix A details the per-kernel gains over Intel AVX2 in cycle counts.

On average, we saw a 33 percent and 40 percent gain in cycle count over the Intel AVX2 kernels for kernels in the main and main10 profiles respectively. The reason for the higher gains is as follows. In the main10 profile, x265 uses 16 bits to represent each pixel, as opposed to the main profile, which uses 8 bits; although main10 technically only needs 10 bits, using 16 bits simplifies all data structures in the software. Therefore, the amount of work that has to be done for the same number of pixels is doubled. Due to the higher quantum of compute, kernels in the main10 profile gain more from Intel AVX-512 over Intel AVX2 than kernels in the main profile do. These results from cycle counts indicate that at the kernel level, there is much benefit in using Intel AVX-512 to accelerate x265. However, this does not account for the reduction in clock frequency incurred when using Intel AVX-512 instructions compared to using Intel AVX2 instructions. In the next section, we look at the effect on overall encoding time, which also accounts for this effect.

Accelerating x265 Encoding with Intel Advanced Vector Extensions 512

In this section, we look at the impact of using Intel AVX-512 kernels for real encoding use cases with x265. Section "Test Setup" describes our test setup, including the videos chosen, the x265 presets used, and the system configurations of the test machines. Section "Encoding on Intel® Core™ Processors" presents results on a workstation machine with an Intel Core i9-7900X processor, while section "Encoding on Intel Xeon Scalable Processors" presents results on a typical high-end server with two Intel Xeon Platinum 8180 processors.

Test setup

Our tests mainly focused on encoding 1080p videos with the main profile and 4K videos with the main10 profile. We used four typical 1080p clips (crowdrun, ducks_take_off, park_joy, and old_town_cross) and three 4K clips (Netflix_Boat, Netflix_FoodMarket, and Netflix_Tango) for our tests.10 Appendix B gives a little more detail, along with screenshots of the videos used. We encode the 1080p clips to the main profile at the following bitrates (in Kbps): 1000, 3000, 5000, and 7000. For the 4K clips, the main10 profile encodes target the following bitrates (in Kbps): 8000, 10000, 12000, and 14000.

We encode these videos with a version of x265 that has all the kernels described in Section 3; these kernels were contributed as part of the default branch of x265. The kernels are disabled by default and may be enabled with the --asm avx512 option in the x265 command-line interface.
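
For example, a command along the following lines (file names and bitrate are placeholders, not from the paper) enables the Intel AVX-512 kernels for a 4K main10 encode; the --pools option referenced later limits the number of worker threads per instance:

x265 --input input_4k.y4m --profile main10 --preset veryslow --bitrate 8000 --pools 16 --asm avx512 --output out.hevc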

Figure 2. Cycle-count gains of the main and main10 profile Intel® Advanced Vector Extensions 512 kernels over the corresponding Intel® Advanced Vector Extensions 2 kernels.

We focused our experiments on four presets of x265 to represent the wide set of use cases that x265 presents: ultrafast, veryfast, medium, and veryslow. These presets represent a wide variety of trade-offs between encode efficiency and frames per second (FPS). The veryslow preset generates the most efficient encode but is the slowest; this preset is also the preferred choice for any offline encoding use cases such as OTT. The ultrafast preset is the quickest setting of x265 but generates the encode with the lowest efficiency. The veryfast and medium presets represent intermediate points in the trade-off between performance and encoder efficiency. Typically, the more efficient presets employ more tools of HEVC, resulting in more compute per pixel than the less efficient presets. This is important to call out, as Intel AVX-512 kernels tend to give better speedup when the compute per pixel is higher, as shown by the results in the previous section.

Encoding on Intel® Core™ Processors

Figure 3 shows the performance of encoding 1080p and 4K video in main and main10 profile with Intel AVX-512 kernels relative to using Intel AVX2 kernels on a workstation-like configuration with an Intel Core i9-7900X processor using a single instance of x265. The full details of the system configuration are described in Appendix C. The single instance results in high utilization of the CPU across all configurations, representing a typical use case for this system when performing HEVC encoding.

Intel® Core™ i9-7900X Processor
Figure 3. Encoder performance from using Intel® Advanced Vector Extensions 512 kernels on a single instance of x265, as measured on a workstation-like system with an Intel® Core™ i9-7900X processor.

From the results, we see that for all profiles and presets, enabling Intel AVX-512 kernels results in positive performance gains. On the Intel Core i9-7900X processor system, our measurements did not indicate any significant reduction in clock frequency. The cycle-count improvements from the kernels therefore translate directly into increased encoder performance. When we examined the relative encoder performance per encode, we found no command lines that demonstrated lower performance with Intel AVX-512 than with Intel AVX2.

We therefore recommend that for the Intel Core i9-7900X processor, and similar systems where the frequency reduction is minimal, Intel AVX-512 kernels be enabled for all encoding profiles and resolutions when using x265.

Encoding on Intel Xeon Scalable Processors

In this section, we present results from using x265 accelerated by Intel AVX-512 on a high-end server configuration with two Intel Xeon Platinum 8180 processors arranged in a dual-socket configuration with 28 hyperthreaded cores per CPU. For full details of the system configuration, refer to Appendix C.

x265 single instance performance using 8 threads and 16 threads

Figure 4 shows the performance of a single instance of x265 with kernels that use Intel AVX-512 for encoding 1080p videos in the main profile and 4K videos in the main10 profile relative to using kernels that only use Intel AVX2 instructions. Two configurations, one with 8 threads per instance and another with 16 threads per instance, are shown in the graph to understand the impact of increasing the number of active cores on the CPU; limiting the number of threads for each instance is done using the --pools option of the x265 library.

The figure shows that for a given thread configuration, the gains when encoding 4K content in the main10 profile are higher than for 1080p content in the main profile. Also, for a given resolution and profile, the gains from the presets with more work per pixel (the more efficient presets, like veryslow) are higher than from the faster presets; in fact, for 1080p content in the main profile, we see an average performance loss. These results are consistent with the earlier observation that the more work per pixel a configuration requires, the more it benefits from Intel AVX-512. Additionally, when we investigated the S-curves of these profiles (not shown here for brevity), we saw that several encoder command lines outside the 4K main10 veryslow setting lost performance over Intel AVX2.

We therefore recommend using Intel AVX-512-enabled kernels only when doing 4K encodes in the main10 profile with the veryslow preset. For other presets and encoder settings, the amount of work per pixel is insufficient for the cycle-count gains to offset the reduction in clock frequency.

One additional observation we can make from Figure 4 is that the performance gains are in general higher across the board when using 8 threads for the single instance of x265 compared to 16 threads. Upon further analysis, we observe that when more cores are activated with Intel AVX-512 instructions on the Intel Xeon Platinum 8180 processor, the frequency reduces further, resulting in lower gains from using Intel AVX-512 instructions. In a typical server, however, encoder vendors attempt to maximize all available CPU cores to get the maximum throughput out of the given server. This use case is explored in Section 4.3.2, where we attempt to saturate the server with 4K main10 encodes to see whether the lower frequency when more cores are active mutes the gains.

Intel® Xeon® Platinum 8180 Processor
Figure 4. Relative performance of a single instance of x265 when using Intel® Advanced Vector Extensions 512 kernels with 8 or 16 threads over Intel® Advanced Vector Extensions 2 kernels on a server configuration with two Intel® Xeon® Platinum 8180 processors.

Saturating Intel® Xeon® Platinum 8180 processors using multiple instances of x265

To study whether activating more cores results in performance loss for 4K encodes in the main10 profile, we saturated one and both CPUs of a dual-socket Intel Xeon Platinum 8180 processor-based server with four and eight instances of x265, respectively, with each instance using 16 threads. We measured the total FPS achieved by all x265 instances to encode the same clip at different bitrates when using kernels that use Intel AVX-512 and reported the number relative to when the Intel AVX2-enabled kernels were used. Figure 5 shows these results.

Intel® Xeon® Platinum 8180 processor - Single and Dual Socket Saturation
Figure 5. Single-socket and dual-socket saturation of the Intel® Xeon® Platinum 8180 processor with x265 instances.

Figure 5 shows that even when saturating one or both CPUs, encoding 4K videos with main10 shows positive performance gains over using the Intel AVX2 counterparts. However, the gains are lower than the corresponding gains achieved by a single instance of x265 using fewer cores. Additionally, we observe that for lower-efficiency presets such as veryfast and medium, the gains are muted due to the higher frequency drop with more active cores.

These results reiterate our recommendation that Intel AVX-512 kernels should only be enabled when encoding 4K content for the main10 profile for the veryslow preset. For other presets that have lower compute per pixel, enabling Intel AVX-512 kernels may result in a performance loss over using Intel AVX2 kernels.

Conclusions and Future Work

In this paper, we presented our experience with using the Intel AVX-512 instructions available in the newly introduced Intel Xeon Scalable processors to accelerate the open-source HEVC encoder x265. The specific challenges that we had to overcome included selecting the right kernels to accelerate with Intel AVX-512 such that the reduction in CPU frequency was offset by the benefits in cycle count, and choosing the right encoder configurations that provide enough compute per pixel to achieve positive gains in encoder performance.

Recommendations

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

  • When choosing specific kernels to accelerate with Intel AVX-512, the compute-to-memory ratio of each kernel should be considered. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64 bytes in order to avoid loads that cross cache-line boundaries.
  • For desktop and workstation SKUs (like the Intel Core i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations because the reduction in CPU clock frequency is rather low.
  • For server SKUs (like the Intel Xeon Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.

While the results and recommendations presented in this paper are subject to the limitations of our evaluations and experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

Future work

The task of accelerating x265 with Intel AVX-512 has opened several avenues for future work. The accelerated kernels are available through the public mailing list. Future extensions of this work to enable further acceleration from Intel AVX-512 include (1) performing a thorough analysis of the use of Intel AVX-512 for videos at other resolutions and presets available in x265, (2) enabling schemes to dynamically enable and disable Intel AVX-512 kernels by monitoring the CPU frequency, and (3) a fundamental re-architecting of the encoder to segregate the worker threads into different types, only some of which may run Intel AVX-512, limiting the number of cores where the CPU frequency drop is observed. We will continue to develop and contribute these solutions to open source, and we encourage the reader to also contribute to the project at http://x265.org.
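
As a hint of what item (2) could build on (purely illustrative and not part of x265), the sustained core frequency is exposed on Linux through cpufreq sysfs and could be polled to decide when to fall back to the Intel AVX2 kernels:

// Sketch: read the current frequency (in kHz) of one core from Linux cpufreq sysfs.
#include <fstream>
#include <string>

long ReadCurrentFreqKHz(int cpu)
{
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_cur_freq");
    long khz = -1;
    f >> khz;                 // remains -1 if the file is unavailable
    return khz;
}

// A dispatcher could compare this value against a threshold and disable the
// Intel AVX-512 code paths when the sustained frequency drops below it.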

Acknowledgements

This work was funded in part by a non-recurring engineering grant from Intel to MulticoreWare. We would like to thank the various developers and engineers at MulticoreWare for their extensive support throughout this work. In particular, we would like to thank Thomas A. Vaughan for his guidance and Min Chen for his expert comments on the assembly patches.

Appendix A

A1 – Main profile instructions per cycle (IPC) gains

Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain
sad0.16%i422 chroma_vss32.70%i420 chroma_vpp23.19%luma_vss43.18%
pixelavg _pp0.87%luma_vss32.89%addAvg23.37%luma_vss43.35%
i444 chroma_vps1.14%sad_x333.01%addAvg23.38%i444 chroma_hpp43.43%
i444 chroma_vps1.18%luma_vps33.05%i444 chroma_hps23.53%ssd_s43.57%
pixelavg _pp1.41%i420 chroma_hpp33.08%i420 chroma_hps23.77%luma_hps43.68%
convert_p2s1.95%i444 chroma_hpp33.14%var23.95%luma_vss43.75%
i420 chroma_vps2.45%sad_x433.14%i420 chroma_hpp24.03%luma_hps43.84%
i420 chroma_vps2.72%i444 chroma_vss33.16%i422 chroma_vpp24.11%luma_hps43.94%
i422 chroma_hps2.83%i420 chroma_vss33.16%i444 chroma_vss24.15%luma_vsp44.06%
i420 p2s3.21%copy _ps33.33%i422 chroma_vss24.15%luma_vsp44.11%
i444 p2s3.21%i420 copy _ps33.33%i420 chroma_vss24.15%sub_ps44.11%
sad_x33.29%i444 chroma_vss33.34%i420 chroma_vps24.20%i444 chroma_hpp44.15%
i420 chroma_vps3.62%i422 chroma_vss33.34%i444 chroma_vpp24.20%convert_p2s44.33%
sad_x44.50%i420 chroma_vss33.34%i420 chroma_vpp24.20%i444 chroma_hpp44.35%
sad4.62%i422 copy _ps33.43%sad24.21%luma_vss44.42%
i420 chroma_hps4.90%i444 chroma_vss33.43%i444 chroma_vps24.22%luma_hps44.43%
i420 chroma_hps5.19%i422 chroma_vss33.43%i420 chroma_vps24.22%luma_hpp44.48%
pixel_satd5.42%i420 chroma_hpp33.55%i444 chroma_hps24.25%luma_vpp44.54%
i444 chroma_vps5.43%i422 chroma_hpp33.57%i420 chroma_hpp24.42%luma_vss44.61%
i422 chroma_hps5.82%dequant_normal33.60%sad_x424.53%cpy1Dto2D_shl44.61%
i444 chroma_vps6.78%sad_x433.62%i444 chroma_hps24.57%luma_vsp44.62%
dct7.06%i444 chroma_vss33.89%i422 chroma_hps24.65%luma_vsp44.66%
i444 chroma_hps7.08%i420 chroma_vss33.89%psyCost_pp24.89%luma_vss44.70%
i444 chroma_hps7.26%sad_x333.92%i422 chroma_vps25.00%luma_vpp44.74%
i422 chroma_vss8.85%i420 pixel_satd34.01%i444 chroma_vss25.17%luma_vsp44.85%
luma_vss9.76%i444 chroma_hps34.02%i422 chroma_vss25.17%i422 copy _sp45.20%
i422 chroma_hps10.27%luma_vps34.04%i420 chroma_vss25.17%getResidual3245.24%
i444 chroma_hps11.00%i444 chroma_hpp34.20%i422 chroma_vps25.66%luma_vpp45.30%
i444 chroma_hps11.14%i420 pixel_satd34.20%luma_vps25.82%luma_hps45.35%
sad11.26%i420 chroma_hpp34.23%i444 chroma_vps25.89%i444 chroma_hpp45.41%
i420 chroma_hps11.38%i444 chroma_vss34.43%i444 chroma_vps25.92%luma_hpp45.49%
pixel_sa8d11.55%i422 chroma_vss34.43%i420 chroma_hps25.95%convert_p2s45.52%
i444 chroma_hps11.91%i420 chroma_vss34.43%i420 chroma_vps26.07%luma_hps45.58%
luma_vpp11.96%i422 chroma_vsp34.59%convert_p2s26.25%luma_vpp45.62%
i422 chroma_hps12.10%i444 chroma_vss34.71%i422 chroma_vps26.42%convert_p2s45.62%
copy _pp12.54%i444 chroma_vss34.76%i444 chroma_vps26.56%luma_vpp45.69%
ssd_s12.58%addAvg34.88%i444 chroma_vss26.71%cpy2Dto1D_shl45.75%
i420 chroma_vps12.58%addAvg35.14%i422 chroma_vss26.71%i422 addAvg45.76%
i444 chroma_hps12.79%sad35.43%i420 chroma_vss26.71%convert_p2s46.00%
idct13.32%ssd_ss35.45%sad_x426.80%i420 add_ps46.09%
luma_vps13.78%i444 chroma_vss35.51%i422 chroma_hpp27.06%add_ps46.10%
i444 chroma_hps13.87%i420 pixel_satd35.55%i422 chroma_hps27.13%luma_vsp46.14%
sad13.88%pixelavg _pp35.56%luma_hpp27.15%luma_hps46.29%
copy _cnt14.25%luma_vpp35.62%i420 pixel_satd27.23%luma_vss46.31%
luma_vpp14.28%luma_vpp36.21%i444 chroma_vss27.24%i444 chroma_vsp46.52%
pixel_satd14.45%i420 chroma_hpp36.45%i422 chroma_vss27.24%i422 chroma_vsp46.52%
idct14.49%i422 chroma_hpp36.65%luma_hpp27.29%i420 chroma_vsp46.52%
pixel_satd14.92%i422 chroma_hpp36.76%luma_vps27.45%luma_hps46.65%
pixel_satd14.99%sad36.76%psyCost_pp27.62%pixelavg _pp46.67%
sad15.21%i422 chroma_hpp36.81%luma_vsp27.72%luma_vss46.88%
idct15.23%copy _pp36.82%i422 chroma_hps28.00%i422 addAvg46.88%
sad_x315.32%pixelavg _pp36.84%pixel_satd28.50%luma_hps46.90%
i444 chroma_vpp15.47%convert_p2s36.87%cpy2Dto1D_shl28.69%luma_vsp46.97%
i422 chroma_vpp15.47%i420 p2s36.87%luma_vps28.71%i422 p2s47.10%
i420 chroma_vpp15.47%i444 p2s36.87%i444 chroma_hpp28.78%copy _pp47.11%
pixel_satd15.52%i444 chroma_hpp37.07%i420 pixel_satd28.80%luma_vss47.64%
pixel_satd15.62%luma_vpp37.11%i422 pixel_satd28.81%i444 chroma_hpp47.83%
pixel_satd15.66%luma_vss37.49%i422 pixel_satd28.95%i422 addAvg47.85%
sad_x315.70%addAvg37.76%luma_vss29.26%luma_hps48.46%
pixel_satd15.75%i444 chroma_vps37.90%i444 chroma_vss29.29%copy _ps48.57%
i420 chroma_hps15.83%i444 chroma_vss38.04%i420 chroma_hps29.42%sub_ps48.83%
copy _pp15.93%i444 chroma_vps38.05%luma_vpp29.43%luma_hpp48.97%
luma_vpp16.10%i444 chroma_vps38.23%scale1D_128to6429.50%i422 add_ps49.02%
nquant16.33%sad38.42%luma_vss29.59%i444 chroma_vsp49.43%
sad16.35%i444 chroma_hpp38.45%i444 chroma_vpp29.69%i420 sub_ps49.46%
i444 chroma_vpp16.39%Weight_sp38.48%i422 chroma_vpp29.69%add_ps49.50%
i420 chroma_hps16.60%i444 chroma_hpp38.55%i420 chroma_vpp29.69%i422 sub_ps49.52%
i444 chroma_vpp17.02%sad38.56%i422 chroma_hps29.71%i420 addAvg49.74%
i422 chroma_vpp17.02%luma_hpp38.79%i422 pixel_satd29.75%convert_p2s49.75%
i420 chroma_vpp17.02%pixel_satd39.15%i444 chroma_vpp29.82%i422 p2s49.75%
pixel_satd17.08%luma_hpp39.21%i422 chroma_vpp29.82%i444 p2s49.75%
luma_vps17.10%i444 chroma_hpp39.30%luma_vss29.91%luma_vss49.84%
luma_vps17.36%i444 chroma_vps39.39%i444 chroma_vss29.92%luma_hpp50.00%
i444 chroma_vss17.55%addAvg39.51%i422 chroma_vss29.92%copy _sp50.11%
i420 chroma_vss17.55%i420 chroma_hpp39.55%i420 chroma_vss29.92%luma_vss50.22%
pixel_satd17.59%i422 pixel_satd39.57%luma_vps30.19%luma_hpp50.61%
pixel_satd17.66%i422 chroma_hpp39.61%sad_x430.24%luma_hpp51.19%
i444 chroma_vss18.42%convert_p2s39.78%sad30.30%i444 chroma_vsp51.23%
i422 chroma_vss18.42%i420 p2s39.78%luma_vps30.37%luma_hpp51.70%
i420 chroma_vss18.42%i422 p2s39.78%luma_vps30.39%nonPsyRdoQuant51.74%
i444 chroma_vpp18.49%i444 p2s39.78%i444 chroma_vpp30.39%i444 chroma_vsp52.08%
i420 chroma_vpp18.49%copy _sp39.93%i422 chroma_vpp30.39%copy _pp52.17%
luma_vps18.50%i420 addAvg40.02%i420 chroma_vpp30.39%i444 chroma_vsp52.22%
luma_vpp18.51%luma_hps40.04%ssd_ss30.44%i444 chroma_vsp52.28%
sad_x318.99%i444 chroma_hpp40.07%i422 chroma_hpp30.45%nonPsyRdoQuant52.32%
copy _pp19.76%addAvg40.64%i420 pixel_satd30.53%i422 copy _ss52.45%
luma_vss19.80%luma_vsp40.87%i422 chroma_vpp30.54%nonPsyRdoQuant52.56%
pixel_satd19.89%i444 chroma_vsp40.96%i444 chroma_hpp30.54%i444 chroma_vsp52.77%
sad20.09%i420 chroma_vsp40.96%i422 chroma_hpp30.56%i422 chroma_vsp52.77%
sad_x320.26%luma_vss41.01%i444 chroma_hpp30.63%blockfill_s52.93%
i444 chroma_hps20.52%i420 copy _sp41.12%i420 chroma_hpp30.85%i444 chroma_vsp53.30%
i420 chroma_hps20.80%copy _cnt41.14%luma_vsp30.95%i422 chroma_vsp53.30%
psyCost_pp21.15%luma_vsp41.16%sad_x430.95%i420 chroma_vsp53.30%
i444 chroma_hps21.17%Weight_pp41.23%i422 chroma_vss30.99%i422 chroma_vsp53.36%
pixel_satd21.19%luma_hps41.42%i444 chroma_hps31.12%i444 chroma_vsp54.34%
pixel_satd21.21%addAvg41.84%i444 chroma_vpp31.17%i422 chroma_vsp54.34%
quant21.23%i420 addAvg41.87%i444 chroma_vpp31.20%i420 chroma_vsp54.34%
sad_x321.29%luma_vsp41.99%sad31.29%psyRdoQuant54.44%
i444 chroma_vpp21.42%luma_hps42.05%luma_vsp31.33%luma_hpp54.62%
i422 chroma_vpp21.42%convert_p2s42.13%sad_x331.34%i444 chroma_vsp54.64%
i420 chroma_vpp21.42%i420 p2s42.13%i422 pixel_satd31.46%i420 chroma_vsp54.64%
i420 chroma_vps21.60%i422 p2s42.13%luma_hps31.52%luma_hpp54.78%
pixel_satd21.61%i444 p2s42.13%i444 chroma_vpp31.57%luma_hpp55.06%
i444 chroma_vps21.69%i444 chroma_vsp42.31%pixelavg _pp31.62%luma_hpp55.40%
i422 chroma_hps21.99%i422 chroma_vsp42.31%luma_vps31.76%copy _pp55.41%
i420 addAvg22.01%i420 chroma_vsp42.31%i444 chroma_hps31.78%psyRdoQuant55.70%
luma_vsp22.09%luma_vsp42.35%sad_x331.95%psyRdoQuant55.72%
i444 chroma_vps22.27%i420 chroma_hpp42.43%i444 chroma_vss31.96%var55.75%
i422 chroma_vps22.41%nonPsyRdoQuant42.51%i420 chroma_vss31.96%copy _ss56.00%
sad_x422.44%luma_hps42.54%i422 chroma_vss32.01%i444 chroma_vsp56.36%
var22.51%addAvg42.56%i444 chroma_hpp32.12%i422 chroma_vsp56.36%
i444 chroma_vpp22.64%luma_hps42.58%var32.17%i420 chroma_vsp56.36%
i420 chroma_vpp22.64%luma_vss42.82%i420 chroma_hpp32.32%i420 copy _ss56.63%
sad_x422.84%i422 addAvg42.93%i444 chroma_hps32.44%i444 chroma_vsp57.60%
i444 chroma_vpp22.87%luma_vpp42.97%luma_vsp32.61%i420 chroma_vsp57.60%
i422 chroma_vpp22.87%dequant_scaling42.98%i444 chroma_vss32.67%copy _pp58.33%
i422 chroma_hpp22.92%luma_hpp42.99%i420 chroma_vss32.67%copy _ss60.09%
sad_x423.09%i444 chroma_vsp43.05%i444 chroma_vss32.69%psyRdoQuant62.80%
i444 chroma_vpp23.19%i422 chroma_vsp43.05%i422 chroma_vss32.69%i444 chroma_vsp62.98%
i420 chroma_vss32.69%i420 chroma_vsp62.98%

A2 – Main10 profile IPC gains

Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain
convert_p2s1.26%i422 chroma_hps39.92%i422 chroma_vpp29.64%i444 chroma_hpp49.20%
i420 p2s1.26%i422 p2s40.30%i420 chroma_vpp29.64%i444 chroma_hps49.45%
i444 p2s1.26%luma_hpp40.35%i444 chroma_vsp29.82%cpy2Dto1D_shl49.70%
addAvg1.86%i422 chroma_hpp40.52%i422 chroma_vsp29.82%luma_hvpp49.80%
addAvg6.88%copy _cnt40.55%i420 chroma_vsp29.82%luma_vss49.84%
dct7.06%luma_vpp40.58%luma_vss29.91%i420 chroma_hps49.85%
sad_x37.65%luma_vsp40.59%i444 chroma_vss29.92%convert_p2s49.87%
sad7.74%i444 chroma_vps40.60%i422 chroma_vss29.92%i420 p2s49.87%
sad8.29%i422 chroma_vps40.60%i420 chroma_vss29.92%i422 p2s49.87%
i420 addAvg8.36%i420 chroma_vps40.60%i444 chroma_vps29.93%i422 p2s49.87%
sad_x38.77%sad_x340.64%i422 chroma_vps29.93%i444 p2s49.87%
luma_vss9.76%nonPsyRdoQuant40.70%i420 chroma_vps29.93%luma_hps49.94%
intra_pred_ang279.79%add_ps40.71%luma_vsp30.06%i422 chroma_hps50.07%
cpy2Dto1D_shl10.13%sad_x440.73%i444 chroma_vsp30.11%i444 chroma_hpp50.13%
sad_x310.81%luma_vpp40.73%i422 chroma_vsp30.11%luma_vss50.22%
sad_x410.96%copy _pp40.81%i420 chroma_vsp30.11%luma_hpp50.25%
i420 addAvg11.05%i422 chroma_vps40.88%pixel_satd30.30%i420 chroma_vpp50.28%
pixel_satd11.05%luma_vss41.01%i422 pixel_satd30.30%luma_hps50.67%
i420 pixel_satd11.05%i444 chroma_vsp41.02%i422 pixel_satd30.35%addAvg50.67%
i422 pixel_satd11.05%i420 chroma_vsp41.02%add_ps30.69%i422 addAvg50.67%
luma_vsp12.64%i444 chroma_vsp41.05%sad30.94%luma_hpp50.75%
copy _cnt13.29%i420 chroma_vsp41.05%dequant_normal31.10%i420 chroma_hpp50.82%
idct13.32%sad41.06%sad31.37%copy _pp50.95%
i444 chroma_vps14.44%intra_pred_ang3441.06%pixel_satd31.43%i422 addAvg50.99%
i422 chroma_vps14.44%convert_p2s41.09%i420 pixel_satd31.43%luma_hps51.17%
i420 chroma_vps14.44%i444 p2s41.09%i422 pixel_satd31.43%i422 chroma_hpp51.22%
idct14.49%nonPsyRdoQuant41.21%i444 chroma_vpp31.60%i444 chroma_hpp51.37%
i444 chroma_vpp14.84%sad_x441.22%i422 chroma_vss31.76%luma_hpp51.48%
idct15.23%i422 chroma_vpp41.25%i444 chroma_vss31.96%luma_hps51.57%
luma_vsp15.24%i420 chroma_vpp41.25%i420 chroma_vss31.96%copy _ss51.58%
sad_x315.53%i420 chroma_vpp41.36%sad31.99%luma_hpp51.63%
addAvg15.60%i444 chroma_vsp41.40%psyCost_pp32.12%luma_hps51.64%
i422 chroma_vpp15.71%luma_vpp41.43%i420 chroma_hps32.32%luma_hps51.65%
i420 chroma_vpp15.71%luma_hvpp41.46%i422 addAvg32.46%luma_hps51.70%
addAvg15.90%luma_vpp41.48%i422 chroma_vss32.62%luma_hps51.81%
i422 chroma_vpp16.07%i444 chroma_vsp41.51%i444 chroma_vss32.67%i422 chroma_hpp51.86%
intra_pred_ang2516.22%luma_hvpp41.54%i420 chroma_vss32.67%luma_hps51.89%
nquant16.33%intra_pred_ang1141.55%i444 chroma_vss32.69%addAvg51.89%
sad_x416.42%convert_p2s41.58%i422 chroma_vss32.69%i420 addAvg51.89%
luma_vsp16.55%sad_x441.71%i420 chroma_vss32.69%i422 addAvg51.89%
i420 addAvg17.12%sad_x441.71%luma_vss32.89%luma_hps51.93%
sad_x417.33%luma_vsp41.78%i444 chroma_vsp33.14%luma_hps51.99%
i444 chroma_vss17.55%sad_x441.83%i422 chroma_vsp33.14%i444 chroma_hpp52.16%
i420 chroma_vss17.55%i444 chroma_vsp42.01%i444 chroma_vss33.16%i422 copy _sp52.45%
i444 chroma_vps17.88%i444 chroma_vsp42.08%i420 chroma_vss33.16%i422 copy _ps52.45%
i422 chroma_vps17.88%i422 chroma_vsp42.08%convert_p2s33.27%i422 copy _ss52.45%
i420 chroma_vps17.88%nonPsyRdoQuant42.13%i444 chroma_vss33.34%i444 chroma_hps52.94%
pixel_satd18.02%pixelavg _pp42.17%i422 chroma_vss33.34%copy _ss53.20%
i422 addAvg18.13%i422 chroma_vpp42.20%i420 chroma_vss33.34%i420 chroma_hps53.22%
i444 chroma_vss18.42%i420 chroma_vpp42.20%i444 chroma_vss33.43%i422 chroma_hps53.27%
i422 chroma_vss18.42%luma_vps42.30%i422 chroma_vss33.43%i420 chroma_hpp53.48%
i420 chroma_vss18.42%sub_ps42.52%pixelavg _pp33.45%copy _pp53.53%
addAvg19.50%luma_vsp42.55%pixel_satd33.45%i422 chroma_hpp53.81%
i444 chroma_vps19.54%luma_hvpp42.65%i420 pixel_satd33.45%i422 chroma_hpp53.89%
i422 chroma_vps19.54%pixelavg _pp42.65%addAvg33.46%i444 chroma_hpp54.31%
i420 chroma_vps19.54%luma_vps42.72%luma_vsp33.47%ssd_ss54.69%
sad_x319.75%convert_p2s42.77%sad_x433.51%i422 chroma_hpp54.77%
luma_vss19.80%luma_vss42.82%i444 chroma_vsp33.79%i420 chroma_hpp55.18%
i422 pixel_satd19.95%luma_vsp43.05%i422 chroma_vsp33.79%luma_hpp55.53%
pixel_satd20.02%convert_p2s43.11%i420 chroma_vsp33.79%i444 chroma_hpp55.56%
i420 pixel_satd20.02%i444 chroma_hpp43.15%i444 chroma_vss33.89%i444 chroma_hpp55.78%
i422 pixel_satd20.02%luma_vsp43.17%i420 chroma_vss33.89%i444 chroma_hpp55.94%
i444 chroma_vps20.09%luma_vss43.18%luma_vsp34.08%luma_hpp55.96%
i420 chroma_vps20.09%luma_vsp43.22%sub_ps34.13%copy _sp56.00%
i422 chroma_vss20.53%luma_hvpp43.24%i444 chroma_vsp34.18%copy _ps56.00%
sad_x420.69%luma_vss43.35%i420 chroma_vsp34.18%i444 chroma_hpp56.07%
i444 chroma_vps20.86%luma_vsp43.36%i444 chroma_vsp34.22%luma_hpp56.16%
i422 chroma_vps20.86%i420 chroma_hpp43.38%i422 chroma_vsp34.22%i420 copy _sp56.63%
i444 chroma_vpp20.98%cpy1Dto2D_shl43.50%i420 chroma_vsp34.22%i420 copy _ps56.63%
quant21.23%luma_vsp43.50%i444 chroma_vss34.43%i420 copy _ss56.63%
i422 chroma_vpp21.45%luma_vpp43.51%i422 chroma_vss34.43%i422 chroma_hpp57.32%
sad21.61%copy _pp43.54%i420 chroma_vss34.43%i444 chroma_hps57.33%
i444 chroma_vpp21.78%luma_hvpp43.57%pixel_satd34.59%luma_hpp57.40%
i444 chroma_vps22.06%luma_vpp43.58%i444 chroma_vss34.71%i420 chroma_hps57.97%
i420 chroma_vps22.06%luma_hvpp43.60%i444 chroma_vss34.76%luma_hpp58.55%
i444 chroma_vsp22.12%luma_vss43.75%intra_pred_ang1034.76%i444 chroma_hps59.21%
i422 chroma_vsp22.12%luma_vps43.77%i444 chroma_vps34.80%i420 chroma_hps59.46%
i420 chroma_vsp22.12%i444 chroma_vsp43.80%i444 chroma_vps34.98%blockfill_s59.53%
i444 chroma_vsp22.14%i420 chroma_vsp43.80%luma_vps35.07%luma_hpp59.56%
i422 chroma_vsp22.14%pixelavg _pp43.94%i444 chroma_vps35.34%i422 chroma_hps59.75%
i420 chroma_vsp22.14%psyRdoQuant44.02%Weight_pp35.37%copy _sp60.09%
i422 chroma_vpp22.28%sad_x344.17%i444 chroma_vss35.51%copy _ps60.09%
i420 chroma_vpp22.28%pixelavg _pp44.23%luma_vps35.63%luma_hps60.23%
i444 chroma_vpp22.28%luma_hvpp44.24%i422 chroma_hps35.68%psyRdoQuant60.25%
i422 chroma_vpp22.35%luma_hvpp44.28%i444 chroma_vps36.38%luma_hpp60.26%
ssd_ss22.60%luma_vsp44.31%i422 chroma_vss36.56%i444 chroma_hps60.28%
i444 chroma_vpp23.06%dequant_scaling44.37%sad36.66%i420 chroma_hps60.48%
sad_x423.09%convert_p2s44.40%luma_vpp36.68%luma_hps60.76%
luma_vpp23.67%luma_vpp44.41%i444 chroma_vpp36.70%copy _pp60.87%
luma_vpp23.82%luma_vss44.42%luma_vsp36.71%i444 chroma_hps60.92%
i444 chroma_vpp23.84%sad_x444.42%sad_x336.75%i422 chroma_hps61.09%
i444 chroma_vss24.15%luma_vpp44.60%sad_x436.78%luma_hpp61.28%
i422 chroma_vss24.15%luma_vss44.61%pixel_satd36.88%i444 chroma_hpp61.38%
i420 chroma_vss24.15%luma_hvpp44.61%i422 chroma_vpp36.91%luma_hpp61.43%
intra_pred_ang924.37%getResidual3244.64%copy _pp36.96%luma_hpp61.44%
i444 chroma_vpp24.41%luma_hpp44.68%addAvg37.08%i422 chroma_hps61.55%
luma_vpp24.48%luma_vss44.70%sad_x437.09%luma_hpp61.58%
i422 addAvg24.62%luma_hvpp44.73%i420 chroma_vpp37.29%luma_hpp62.26%
psyCost_pp24.88%i444 chroma_vsp44.76%i422 chroma_vpp37.36%i422 chroma_hps62.31%
i420 chroma_vpp24.90%i422 chroma_vsp44.76%i420 chroma_vpp37.36%luma_hpp62.35%
i422 chroma_vpp25.11%i420 chroma_vsp44.76%luma_vss37.49%i420 chroma_hpp62.39%
i420 chroma_vpp25.11%sad_x444.85%luma_vpp37.53%i420 chroma_hps62.39%
i444 chroma_vps25.17%luma_hvpp45.15%i444 chroma_vps37.54%i444 chroma_hpp62.46%
i422 chroma_vps25.17%luma_vps45.19%i422 chroma_vps37.54%luma_hpp62.63%
i420 chroma_vps25.17%i422 chroma_hpp45.23%i420 chroma_vps37.54%i444 chroma_hps62.88%
i444 chroma_vss25.17%intra_pred_dc45.26%i444 chroma_vpp37.59%i420 chroma_hps62.95%
i422 chroma_vss25.17%sad45.31%i420 chroma_vpp37.59%luma_hpp63.07%
i420 chroma_vss25.17%luma_vps45.36%i444 chroma_vps37.59%i444 chroma_hps63.15%
i422 chroma_vps25.28%psyRdoQuant45.40%i422 chroma_vps37.59%luma_hps63.16%
i444 chroma_vps25.97%i420 add_ps45.40%pixel_satd37.60%i420 chroma_hpp63.34%
i422 chroma_vps25.97%pixelavg _pp45.52%i444 chroma_vps37.60%luma_hpp63.61%
i420 chroma_vps25.97%addAvg45.54%i420 chroma_vps37.60%i420 chroma_hps63.85%
luma_vpp26.22%i420 addAvg45.54%i444 chroma_vsp37.66%luma_hpp63.91%
sad26.25%i422 addAvg45.54%i422 chroma_vps37.68%i420 chroma_hpp64.12%
psyCost_pp26.30%i444 chroma_vsp45.57%i444 chroma_vpp37.69%i444 chroma_hps64.15%
i444 chroma_vsp26.38%i422 chroma_vsp45.57%i444 chroma_vps37.71%i444 chroma_hpp64.23%
i420 chroma_vsp26.38%i420 chroma_vsp45.57%i420 chroma_vps37.71%i422 chroma_hpp64.39%
i420 addAvg26.39%luma_vps45.58%convert_p2s37.73%i422 chroma_hpp64.56%
i422 addAvg26.39%pixelavg _pp45.61%i420 p2s37.73%i444 chroma_hps64.84%
pixel_satd26.62%luma_vps45.62%i422 p2s37.73%i422 chroma_hps64.87%
i444 chroma_vss26.71%luma_vps45.64%i444 p2s37.73%i444 chroma_hpp64.92%
i422 chroma_vss26.71%sad_x345.65%i444 chroma_vpp37.74%i420 chroma_hps64.93%
i420 chroma_vss26.71%i422 add_ps45.68%i444 chroma_vpp37.76%i422 chroma_hpp65.05%
luma_vsp26.77%addAvg45.72%addAvg37.80%i444 chroma_hps65.06%
luma_vps27.04%i420 addAvg45.72%i422 chroma_vpp37.99%i420 chroma_hpp65.14%
luma_vpp27.10%pixelavg _pp45.80%i444 chroma_vss38.04%i422 chroma_hps65.35%
i444 chroma_vss27.24%i444 chroma_hpp45.95%i420 chroma_hpp38.04%i422 chroma_hps65.63%
i422 chroma_vss27.24%psyRdoQuant45.96%luma_vps38.08%i444 chroma_hps65.72%
i422 chroma_vps27.26%luma_vsp45.97%i444 chroma_vpp38.09%i422 chroma_hpp65.80%
i420 addAvg27.28%sad46.04%i444 chroma_vpp38.27%i444 chroma_hpp65.88%
i422 addAvg27.28%luma_hvpp46.17%i422 chroma_vpp38.27%i420 chroma_hpp65.92%
addAvg27.55%luma_vss46.31%i444 chroma_hps38.30%i420 chroma_hpp65.94%
i422 chroma_vpp27.71%sad_x346.36%intra_pred_ang238.34%i444 chroma_hps66.03%
i420 chroma_vpp27.71%sad_x346.42%i444 chroma_hps38.37%i422 chroma_hps66.03%
pixel_satd27.93%luma_vps46.44%i444 chroma_vpp38.48%i420 chroma_hps66.15%
ssd_s28.04%luma_hpp46.46%copy _pp38.51%i422 chroma_hpp66.20%
pixel_satd28.10%i444 chroma_vsp46.66%addAvg38.54%i422 chroma_hps66.20%
pixelavg _pp28.47%sad_x346.71%nonPsyRdoQuant38.57%i420 chroma_hps66.29%
i420 pixel_satd28.54%luma_hpp46.82%sad_x338.74%i422 chroma_hpp66.32%
i422 pixel_satd28.54%luma_vss46.88%sad_x338.80%i444 chroma_hpp66.38%
pixel_satd28.56%i422 chroma_hps46.99%sad38.84%i444 chroma_vpp66.41%
i420 pixel_satd28.56%intra_pred_ang2647.26%Weight_sp38.86%i444 chroma_hps66.50%
i422 pixel_satd28.56%luma_vps47.31%pixel_satd38.88%i444 chroma_vpp66.61%
i444 chroma_vps28.75%luma_hvpp47.44%i420 pixel_satd38.88%i444 chroma_vpp66.63%
luma_vps28.78%pixelavg _pp47.50%copy _pp38.96%i444 chroma_hps66.64%
luma_vps28.82%luma_vss47.64%i422 sub_ps39.19%i444 chroma_hpp66.64%
i422 chroma_hps28.86%luma_vps47.69%i420 sub_ps39.34%i420 chroma_hpp66.64%
i420 chroma_hps29.02%i420 chroma_hpp47.78%i420 chroma_hps39.47%i420 chroma_hpp66.65%
sad_x329.04%i422 chroma_hps47.82%luma_vpp39.54%i444 chroma_hps66.71%
i444 chroma_hps29.11%luma_vsp47.93%luma_hvpp39.63%i422 chroma_hpp66.71%
luma_vsp29.13%luma_hvpp48.30%i444 chroma_vps39.68%i444 chroma_hps66.75%
luma_vss29.26%addAvg48.40%i420 chroma_vps39.68%i444 chroma_hps66.91%
i444 chroma_vss29.29%i420 addAvg48.40%luma_hpp39.72%i422 chroma_hpp66.92%
luma_vpp29.39%luma_hps48.96%addAvg39.77%i444 chroma_hpp67.59%
luma_vss29.59%luma_hps49.05%convert_p2s39.79%luma_hpp67.78%
i420 p2s39.79%i420 chroma_hpp69.14%
i444 p2s39.79%i444 chroma_hpp69.23%

Appendix B

1080p Test Clips and Bitrates Used

The following 1080p clips were used for generating test results.

passerby in a verdant sunny park
park_joy_1080p.y4m

large crowd of joggers in a park
crowd_run_1080p50.y4m

ducks lollygagging in a blue pond
ducks_take_off_1080p50.y4m

Urban landscape of old European city
old_town_cross_1080p50.y4m

4K Test Clips and Bitrates Used

The following 4K clips were used for generating test results.

vacation panorama
Netflix_Boat_4096x2160_60fps_10bit_420.y4m

Tango aficionados
Netflix_Tango_4096x2160_60fps_10bit_420.y4m

a rural open market
Netflix_FoodMarket_4096x2160_60fps_10bit_420.y4m

 

Appendix C

Configurations for Testing on Intel® Core™ i7-4500U Processor

OS Name: Windows 10 Professional
Version: 10.0.16299 Build 16299
System Model: MS-7A93
System Type: x64-based PC
Processor: Intel® Core™ i7-4500U CPU @ 3.30GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket: 2
Thread(s) per core: 2
Socket(s): 1
NUMA node(s): 1

BIOS
BIOS Version/Date: American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version: 3
BIOS Mode: UEFI

Graphic Interface
Version: PCI-Express
Link Width: x16
Max. Supported: x16

Memory
Type: DDR3
Channel: 1
Size: 8 GB
DRAM Frequency: 800 MHz
Command Rate (CR): 2T

Configurations for Testing on Intel® Core™ i9-7900X Processor

OS Name: Microsoft Windows 10 Enterprise
Version: 10.0.16299 Build 16299
System Model: MS-7A93
System Type: x64-based PC
Processor: Intel® Core™ i9-7900X CPU @ 3.30GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket: 10
Thread(s) per core: 2
Socket(s): 1
NUMA node(s): 1

BIOS
BIOS Version/Date: American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version: 3
BIOS Mode: UEFI

Graphic Interface
Version: PCI-Express
Link Width: x16
Max. Supported: x16

Memory
Type: DDR4
Channel: 2
Size: 32 GB
DRAM Frequency: 1066.8 MHz
Command Rate (CR): 2T

Configurations for Testing on Intel® Xeon® Platinum 8180 Processor

OS Name: CentOS
Version: 7.2
System Model: Intel S4PR1SY2B
System Type: x86_64
Processor: Intel® Xeon® Platinum 8180 CPU @ 2.50 GHz
Core(s) per socket: 28
Thread(s) per core: 2
Socket(s): 2
NUMA node(s): 2

BIOS
BIOS Version/Date: SE5C620.86B.0X.01.0007.062120172125 / 06/21/2017
SMBIOS Version: 2.8
BIOS Mode: UEFI

Graphic Interface
Version: PCI-Express
Link Width: x16
Max. Supported: x16

Memory
Type: DDR4
Channel: 2
Size: 192 GB
DRAM Frequency: 1333 MHz
Command Rate (CR): 2T

References

  1. David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998, p. 751.
  2. VideoLAN Organization, x264, The best H.264/AVC encoder. https://www.videolan.org/developers/x264.html
  3. MulticoreWare Inc., x265 HEVC Encoder/H.265 Video Codec. http://x265.org/
  4. G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
  5. Intel Corporation, Intel Advanced Vector Extensions 512. https://www.intel.in/content/www/in/en/architecture-and-technology/avx-512-overview.html
  6. Intel Corporation, "Intel® Xeon® Processor Scalable Family Specification Update," February 2018. https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
  7. x265.org
  8. HandBrake, An OpenSource Video Transcoder. https://handbrake.fr/
  9. FFMPEG, A complete, cross-platform solution to record, convert and stream audio and video.
  10. MulticoreWare Inc., "x265 Receives Significant Boost from Intel Xeon Scalable Processor Family." http://x265.org/x265-receives-significant-boost-intel-xeon-scalable-processor-family/

GDC 2018 Tech Sessions



Parallelizing Conqueror's Blade*

Optimizing Total War*: WARHAMMER II

Getting Space Pirate Trainer to Perform on Intel® Graphics

Masked Occlusion Culling

Forts and Fights Scaling Performance on Unreal Engine*

World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Physics and Graphics

Accelerate Game Development and Enhance Game Experience with Intel® Optane™ Technology

Scale CPU Experiences: Maximize Unity* Performance Using the Entity Components System and C# Job System

OpenVINO™ Toolkit Release Notes


Introduction

NOTE: The OpenVINO™ toolkit was formerly known as the Intel® Computer Vision SDK

The OpenVINO™ toolkit is a comprehensive toolkit for quickly developing applications and solutions that emulate human vision. Based on Convolutional Neural Networks (CNNs), the toolkit extends CV workloads across Intel® hardware, maximizing performance.

The OpenVINO™ toolkit:

  • Enables CNN-based deep learning inference on the edge.
  • Supports heterogeneous execution across Intel CV accelerators, using a common API for the CPU, Intel® Integrated Graphics, Intel® Movidius™ Neural Compute Stick, and FPGA.
  • Speeds time-to-market through an easy-to-use library of CV functions and pre-optimized kernels.
  • Includes optimized calls for CV standards, including OpenCV*, OpenCL™, and OpenVX*.

New and Changed in This Release

Model Optimizer Changes

The Model Optimizer component has been replaced by a Python*-based application, with a consistent design across the supported frameworks. Key features are listed below, and a short command-line sketch follows the list. See the Model Optimizer Developer Guide for more information.

  • General changes:
    • Several CLI options have been deprecated since the last release. See the Model Optimizer Developer Guide for more information.
    • More optimization techniques were added.
    • Usability, stability, and diagnostics capabilities were improved.
    • Microsoft* Windows* 10 support was added.
    • A total of more than 100 public models are now supported for Caffe*, MXNet*, and TensorFlow* frameworks. 
    • For unsupported layers, the Model Optimizer can fall back to the original framework; that framework must be installed for the fallback to work.
  • Caffe* changes:
    • The workflow was simplified, and you are no longer required to install Caffe.
    • Caffe is no longer required to generate the Intermediate Representation for models that consist of standard layers and/or user-provided custom layers. User-provided custom layers must be properly registered for the Model Optimizer and the Inference Engine. See Using the Model Optimizer to Convert Caffe* Models for details and a list of standard layers.
    • Caffe is now only required for unsupported layers that are not registered as extensions in the Model Optimizer.
  • TensorFlow* support is significantly improved, and now offers a preview of the Object Detection API support for SSD*-based topologies.
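
As an illustration of the new Python*-based workflow, a conversion from a framework model to the Intermediate Representation might look like the sketch below. This is a minimal, hedged example: the script location, default options, and the model/proto file names are placeholders and depend on your installation; see the Model Optimizer Developer Guide for the authoritative option list.

  $ # Convert a Caffe model to IR without a Caffe installation (standard layers only).
  $ # File names are placeholders.
  $ python3 mo.py --input_model squeezenet.caffemodel --input_proto deploy.prototxt \
        --data_type FP32 --output_dir ./ir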

Inference Engine

  • Added Heterogeneity support:
    • Device affinities via API are now available for fine-grained, per-layer control.
    • You can now specify a CPU fallback for layers that the FPGA does not support. For example, you can specify HETERO:FPGA,CPU as a device option for Inference Engine samples (a minimal C++ sketch follows this list).
    • You can use the fallback for CPU + Intel® Integrated Graphics if you have custom layers implemented only on the CPU, and you want to execute the rest of the topology on the Intel® Integrated Graphics without rewriting the custom layer for the Intel® Integrated Graphics.
  • Asynchronous execution: The Asynchronous API improves the overall application frame rate, allowing you to perform secondary tasks, like next frame decoding, while the accelerator is busy with current frame inference.
  • New customization features include easy-to-create Inference Engine operations. You can:
    • Express the new operation as a composition of existing Inference Engine operations or register the operation in the Model Optimizer.
    • Connect the operation to the new Inference Engine layer in C++ or OpenCL™. The existing layers are reorganized into “core” (general primitives) and “extensions” (topology-specific, such as DetectionOutput for SSD). These extensions now come as source code that you must build and load into your application. After the Inference Engine samples are compiled, this library is built automatically, and every sample explicitly loads the library upon execution. The extensions are also required for inference with the pre-trained models.
  • Plugin support added for the Intel® Movidius™ Neural Compute Stick hardware (Myriad2).
  • Samples are provided for an increased understanding of the Inference Engine, APIs, and features:
    • All samples automatically support heterogeneous execution.
    • The Async API is showcased in the Object Detection SSD sample.
    • A minimalistic Hello Classification sample demonstrates Inference Engine API usage.
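
To make the heterogeneous execution and asynchronous API points above concrete, here is a minimal C++ sketch. It assumes the 2018-era Inference Engine API surface (CNNNetReader, PluginDispatcher, InferRequest::StartAsync/Wait); blob setup and error handling are omitted, and the model file names are placeholders, so treat this as an outline rather than a definitive implementation and consult the shipped samples for exact signatures.

  #include <inference_engine.hpp>
  using namespace InferenceEngine;

  int main() {
      // Read an Intermediate Representation produced by the Model Optimizer
      // (file names are placeholders).
      CNNNetReader reader;
      reader.ReadNetwork("model.xml");
      reader.ReadWeights("model.bin");

      // Heterogeneous execution: try the FPGA first, fall back to the CPU for
      // layers the FPGA does not support.
      PluginDispatcher dispatcher({""});
      InferencePlugin plugin(dispatcher.getPluginByDevice("HETERO:FPGA,CPU"));
      ExecutableNetwork executable = plugin.LoadNetwork(reader.getNetwork(), {});

      // Asynchronous API: start inference, do secondary work (e.g. decode the
      // next frame), then wait for the result. Input blob setup is omitted.
      InferRequest request = executable.CreateInferRequest();
      request.StartAsync();
      // ... decode the next frame here ...
      request.Wait(IInferRequest::WaitMode::RESULT_READY);
      return 0;
  }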

OpenCV*

  • Updated to version 3.4.1 with minor patches. See the change log for details. Notable changes:
    • Implementation of on-disk caching of precompiled OpenCL kernels. This feature reduces initialization time for applications that use several kernels.
    • Improved C++11 compatibility at the source and binary levels.
  • Added a subset of OpenCV samples from the community version to showcase the toolkit capabilities (a minimal T-API sketch follows this list):
    • bgfg_segm.cpp - background segmentation
    • colorization.cpp - performs image colorization using DNN module (download the network from a third-party site)
    • dense_optical_flow.cpp - dense optical flow using T-API (Farneback, TVL1)
    • opencl_custom_kernel.cpp - running custom OpenCL™ kernel via T-API
    • opencv_version.cpp - the simplest OpenCV* application - prints library version and build configuration
    • peopledetect.cpp - pedestrian detector using built-in HOGDescriptor
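
The T-API samples above go through OpenCV's transparent OpenCL™ path, which is also what benefits from the new on-disk kernel cache: compiled kernels are reused on later runs, cutting initialization time. A minimal sketch (the input file name is a placeholder):

  #include <opencv2/opencv.hpp>
  #include <opencv2/core/ocl.hpp>

  int main() {
      // Route work through the OpenCL path where a device is available (T-API).
      cv::ocl::setUseOpenCL(true);

      // cv::UMat lets the same calls dispatch OpenCL kernels transparently.
      cv::UMat src, blurred;
      cv::imread("input.png").copyTo(src);               // placeholder file name
      cv::GaussianBlur(src, blurred, cv::Size(5, 5), 1.5);

      // Pull the result back to the host when needed.
      cv::Mat result;
      blurred.copyTo(result);
      return static_cast<int>(result.empty());
  }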

OpenVX*

  • A new memory management scheme with the Imaging and Analytics Pipeline (IAP) framework drastically reduces memory consumption.
    • Introduces intermediate image buffers that result in a significant memory footprint reduction for complex Printing and Imaging (PI) pipelines operating with extremely large images.
    • Deprecated tile pool memory consumption reduction feature. Removed from the Copy Pipeline sample.
  • The OpenVX* CNN path is not recommended for CNN-based applications and is partially deprecated:
    • CNN AlexNet* sample is removed.
    • CNN Custom Layer (FCN8) and Custom Layers library are removed.
    • The OpenVX* SSD-based Object Detection web article is removed.
    • OpenVX* FPGA plugin is deprecated. This is part of the CNN OVX deprecation.
  • The VAD tool for creating OpenVX* applications is deprecated and removed.
  • New recommendation: Use Deep Learning Inference Engine capabilities for CNN-based applications.

Examples and Tutorials

  • Model downloader for the OpenVINO™ toolkit public models in Caffe format.
  • Cross-check tool: debugs model inference as a whole and layer by layer, comparing accuracy and performance between the CPU, Intel® Integrated Graphics, and the Intel® Movidius™ Neural Compute Stick.
  • CNN pre-trained models (prototxt) + pre-generated Intermediate Representations (.xml + .bin); a hedged sample invocation follows this list:
    • age-gender-recognition-retail: Age and gender classification.
    • face-detection-retail: Face detection.
    • person-detection-retail: Person detection.
    • license-plate-recognition-barrier: Chinese license plate recognition.
    • face-detection-adas: Face detection.
    • head-pose-estimation-adas: Head pose estimation (yaw + pitch + roll).
    • vehicle-attributes-recognition-barrier: Vehicle attributes (type/color) recognition.
    • person-vehicle-bike-detection-crossroad: Multiclass (person, vehicle, non-vehicle) detector.
    • vehicle-license-plate-detection-barrier: Multiclass (vehicle, license plate) detector.
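
As a hedged illustration of how these pre-generated IRs can be exercised, a detection model can be passed to the SSD object detection sample that ships with the toolkit. The binary location, IR file name, input image, and device below are placeholders for your installation:

  $ ./object_detection_sample_ssd -m face-detection-retail.xml -i input.bmp -d CPU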

Known Issues

ID | Description | Component | Workaround
1 | Releasing a non-virtual vx_array object after it has been used as a parameter in a graph and before graph execution may result in a slow vxProcessGraph and in data corruption. | OpenVX* | N/A
2 | When a graph is abandoned due to a failure in a user node, the callbacks attached to skipped nodes are still called. | OpenVX* | N/A
3 | The OpenVX* volatile kernels extensions API is subject to change. | OpenVX* | N/A
4 | Multiple user nodes accessing the same array cause an application crash. | OpenVX* | N/A
5 | The Intel® Integrated Graphics equalize histogram node partially runs on the CPU. | OpenVX* | N/A
6 | A user node hangs when calling Intel® Integrated Performance Primitives if the node is linked to IAP.so. | OpenVX* | N/A
7 | The edge tracing part of IPU Canny edge detection runs on the CPU. | OpenVX* | N/A
8 | The Harris Corners kernel extension produces inaccurate results when the sensitivity parameter is set outside the range [0.04; 0.15]. | OpenVX* | N/A
9 | The API vxQueryNode() returns zero for custom Intel® Integrated Graphics nodes when queried for the attribute VX_NODE_ATTRIBUTE_PERFORMANCE. | OpenVX* | N/A
10 | Node creation methods do not allow using the NULL pointer for non-optional parameters. | OpenVX* | N/A
11 | The vx_delay object does not support the vx_tensor and vx_object_array types. | OpenVX* | N/A
12 | The vx_delay object is not supported as a user node input parameter. | OpenVX* | N/A
13 | Scalar arguments do not change dynamically at runtime in several nodes (such as ColorConvert) on Intel® Integrated Graphics. | OpenVX* | N/A
14 | The OpenCL™ out-of-order queue feature might slow down a single-node graph. | OpenVX* | N/A
15 | On the CPU, the vxConvolutionLayer rounding_policy parameter is ignored; TO_ZERO rounding is used in all cases. | OpenVX* | N/A
16 | On the CPU, the vxFullyConnectedLayer rounding_policy parameter is ignored; TO_ZERO rounding is used in all cases. | OpenVX* | N/A
17 | On the CPU, the vxTensorMultiplyNode rounding_policy parameter is ignored; the TO_ZERO policy is used in all cases. | OpenVX* | N/A
18 | Dynamic shapes are not supported for Caffe* layers. | Model Optimizer | N/A
19 | Some TensorFlow operations are not supported; only a limited set of operations can be successfully converted. | Model Optimizer | Enable unsupported ops through Model Optimizer extensions and Inference Engine custom layers.
20 | Only TensorFlow models with FP32 Placeholders are supported. If there is a non-FP32 Placeholder, the next immediate operation after it should be a Cast operation that converts to FP32. | Model Optimizer | Rebuild your model to include only FP32 placeholders, or add cast operations.
21 | Only TensorFlow models with FP32 weights are supported. | Model Optimizer | Rebuild your model to have FP32 weights only.
22 | The most recent version of the TensorFlow Detection API is not supported. Only SSD models frozen with detection API versions prior to r1.6.0 can be converted. | Model Optimizer | N/A
23 | The pre-built protobuf binary distributed as an egg file with the Model Optimizer breaks the Python 3.5.2 installation and should not be used with Python 3.5.2. | Model Optimizer | Build the protobuf binary yourself (recommended), or use the Python version of protobuf (slow).
24 | TensorFlow models with trainable layers such as Conv2D or MatMul that re-use weights from the same Const operations cannot be successfully converted. | Model Optimizer | Rebuild the model with duplicated Const operations to avoid weight sharing.
25 | Embedded preprocessing in Caffe models is not supported and is ignored. | Model Optimizer | Pass preprocessing parameters through Model Optimizer CLI parameters.
26 | Offloading computation to TensorFlow using the command-line parameters --tensorflow_operation_patterns, --tensorflow_subgraph_patterns, and --offload_unsupported_operations_to_tf does not work with TensorFlow 1.8. | Model Optimizer | N/A
27 | Releasing the plugin's pointer before inference completion might cause a crash. | Inference Engine | Release the plugin pointer at the end of the application, when inference is done.
28 | Altera* OpenCL™ 17.1 might not be installed properly. | Inference Engine | Use the instructions in the FPGA installation guide.
29 | FP11 bitstreams can be programmed to the boards using the flash approach only. | Inference Engine | Use the instructions in the FPGA installation guide.
30 | If Intel OpenMP is initialized before OpenCL, OpenCL hangs, which means initializing or executing on the FPGA hangs too. | Inference Engine | Initialize the FPGA or Heterogeneous plugin with the FPGA plugin priority before the CPU plugin.
31 | The performance of the first iteration of the samples for networks executing on FPGA is much lower than the performance of subsequent iterations. | Inference Engine | Use the -ni <number> and -pc options to get the real performance of inference on FPGA.
32 | To select the best bitstream for a custom network, evaluate all available bitstreams and choose the one with the best performance and accuracy. Use validation_app to collect accuracy and performance data for the validation dataset. | Inference Engine |
33 | The Intel® Movidius™ Myriad™ Vision Processing Unit plugin supports batch=1 only. | Inference Engine | Infer the batch of images one at a time, or use multiple Intel® Movidius™ Myriad™ Vision Processing Units.
34 | The Myriad plugin may fail to open a device when several processes try to run inference at the same time and several NCS devices are available. | Inference Engine | Use threads within the same process to utilize multiple devices.
35 | The setBatch method works only for topologies that have batch as the first dimension of all tensors. | Inference Engine |
36 | Multiple OpenMP runtime initializations are possible if you use Intel® MKL and the Inference Engine simultaneously. | Inference Engine | Use a preloaded iomp5 dynamic library (see the note after this table).
37 | The completion callback is called only on successful execution of an infer request. | Inference Engine | Use Wait to get notified about errors in an infer request.
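
For issue 36, "use a preloaded iomp5 dynamic library" usually means forcing the dynamic loader to bind the OpenMP runtime before anything else. A minimal Linux sketch; the library path and application name are placeholders:

  $ LD_PRELOAD=/path/to/libiomp5.so ./your_application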

Included in This Release

The OpenVINO™ toolkit is available in three versions:

  • OpenVINO™ toolkit for Windows
  • OpenVINO™ toolkit for Linux
  • OpenVINO™ toolkit for Linux with FPGA Beta Support

Install Location/File Name | Description
Deep Learning Model Optimizer | Model optimization tool for your trained models.
Deep Learning Inference Engine | Unified API to integrate inference with application logic.
OpenCV* 3.4.1 library | OpenCV community version compiled for Intel hardware. Includes PVL libraries for computer vision.
Intel® Media SDK libraries (open source version) | Eases the integration between the OpenVINO™ toolkit and the Intel® Media SDK.
Intel OpenVX* 1.1 runtime | Directory containing the bundled OpenVX* runtime that supports CPU, Intel® Integrated Graphics, and IPU devices.
OpenCL™ NEO driver | Improves usability.
Intel® FPGA Deep Learning Acceleration Suite, including pre-compiled bitstreams | Implementations of the most common CNN topologies to enable image classification and ease the adoption of FPGAs for AI developers. Includes pre-compiled bitstream samples for the Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA and the Intel® Arria® 10 GX FPGA Development Kit.
Intel® FPGA SDK for OpenCL™ software technology | The Intel® FPGA RTE for OpenCL™ provides utilities, host runtime libraries, drivers, and RTE-specific libraries and files.
OpenVINO™ toolkit documentation | Developer guides and other documentation. Available from the OpenVINO™ toolkit product site.
Pre-trained Deep Learning Models | Ten pre-trained models with prototxt files and generated Intermediate Representation files. You can use these for demonstrations, to help you learn the product, or for product development.
Computer Vision Samples | Samples that illustrate computer vision application creation with the Inference Engine, OpenCV, and OpenVX.

Where to Download This Release

https://software.intel.com/en-us/OpenVINO-toolkit/choose-download

System Requirements

Development Platform

Processors

6th-8th Generation Intel® Core™ and Intel® Xeon® processors

Operating Systems:

  • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
  • CentOS* 7.4, 64-bit
  • Windows* 10, 64-bit

Target Platform (choose one processor with one corresponding operating system)

Your requirements may vary, depending on which product version you use.

CPU processors with corresponding operating systems

  • 6th-8th Generation Intel® Core™ and Intel® Xeon® processors with operating system options:
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit
    • Windows* 10, 64-bit
  • Intel® Pentium® processor N4200/5, N3350/5, N3450/5 with Intel® HD Graphics
    • Yocto Project* Poky Jethro* v2.0.3, 64-bit

Intel® Integrated Graphics processors with corresponding operating systems (GEN Graphics)

NOTE: This installation requires drivers that are not included in the OpenVINO™ toolkit package.

  • 6th Generation Intel® Core™ processor with Intel® Iris® Pro graphics and Intel® HD Graphics
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit
  • 6th Generation Intel® Xeon® processor with Intel® Iris® Pro graphics and Intel® HD Graphics (excluding the E5 family)
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit

FPGA processors with corresponding operating systems

NOTES:
  • Only for the OpenVINO™ toolkit for Linux with FPGA Beta Support download.
  • OpenCV* and OpenVX* functions must be run on the CPU or Intel® Integrated Graphics to get all required drivers and tools.

  • Intel® Arria® 10 GX FPGA Development Kit
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit

Intel® Movidius™ Neural Compute Stick processor with corresponding operating systems

  • Intel® Movidius™ Neural Compute Stick
    • Ubuntu* 16.04.3 long-term support (LTS), 64-bit
    • CentOS* 7.4, 64-bit

Helpful Links


OpenVINO™ toolkit Home Page: https://software.intel.com/en-us/OpenVINO-toolkit

Intel® OpenVINO™ toolkit Documentation: https://software.intel.com/en-us/OpenVINO-toolkit/documentation/featured

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission of Khronos.

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.
