
How Netrolix AI-WAN* Makes Wide-Area Networking More Secure


Introduction – Wide-Area Networks (WAN) Contribution to the Growing Attack Surface

It is often said that complexity is the enemy of security. The reason for this is very simple.

Attackers work by continuously probing attack surfaces in search of "seams," or weaknesses, in the IT infrastructure. These could be places where there is human interaction, where there is control hand-off and opportunity for error, where invisible infrastructure touches an open network, or where operational information inadvertently reveals potential exploits. It's simple math. Larger and more complex attack surfaces offer more opportunities for attackers to get inside.

There was a time when minimizing the size and complexity of networks made good management and security sense. Then came the cloud, low-cost internet connectivity, mobility, and the Internet of Things (IoT); suddenly, networks became anything but simple.

For example, take an ordinary database application that at one time would have run securely inside a corporate data center. Today, pieces of that solution might be scattered around the globe. You might have data storage from one cloud service provider, an analytics engine from another, access governance from another, and business logic provided by another. You might even have widely distributed connected devices feeding data directly into the database.

To further complicate this security picture, you also have WAN services that connect you to all your distributed assets and that connect the assets to each other. The combination of cloud services, mobility, and low-cost internet WAN has produced explosive growth in the size and complexity of networks and their attack surfaces. From the attackers' perspective, it's open season, and the conditions are perfect!

Protecting digital assets in this environment involves applying traditional layered strategies to new network architectures, building security into applications, and using new tools to secure endpoints. There is another part of the infrastructure that does not get the security attention it deserves. That is WAN connectivity, which is increasingly becoming the fabric that holds together the entire distributed enterprise.

As the one piece of infrastructure that touches all those distributed digital assets, the WAN is an important part of the attack surface, but it may also be the foundation for a more comprehensive approach to securing complex IT infrastructures. Yet people making decisions about WAN services typically focus on cost and performance. Security of the WAN itself is too often a secondary consideration.

In an earlier article How Netrolix Broke the SD-WAN Barrier with AI-WAN*, we showed how Netrolix has developed a new approach to wide-area networking. It is an artificial intelligence (AI)-driven internet network, an AI-WAN, that has the cost and agility advantages of software-defined wide-area networking (SD-WAN), the quality of service guarantees of private line solutions, and security that is superior to both.

In this article, we dig deeper into the Netrolix AI-WAN* security advantage. But before doing that, let's look closely at the security strengths and weaknesses of the two most common WAN solutions: Multiprotocol Label Switching (MPLS) private lines and SD-WAN.


Private Infrastructure and MPLS – Private, but Secure?

From a performance, reliability, and security perspective, private connections are typically considered the ideal WAN solution. MPLS has become one of the most widely used data-carrying methods in private infrastructure because of its ability to carry a variety of network protocols and access technologies, simplifying the configuration of multipurpose private connections.

However, these connections are expensive and inflexible. They need to be configured by service providers, which sometimes takes months, and their cost is too high to support all the demand for wide-area networking. Many organizations reserve private connections for their most critical operations and use public infrastructure to fill less sensitive WAN needs.

This decision is based in part on the assumption that private connections are more secure. But are they really?

The idea is that because private lines are private, they provide no visibility to potential attackers, and are therefore proof against outside attack. Because they were assumed to be invulnerable, they were never designed to natively support data-protection methods like encryption. Of course, if a service provider's core network were ever breached, all that unencrypted data would be exposed.

But that's not the only way MPLS connections can be compromised. Private MPLS circuits often use a layer-2 connection from a local incumbent service provider to send unprotected data across many miles and multiple facilities until it is handed off to the MPLS provider. This is done because, in the case of a customer with widely distributed locations, no single provider has the extended geographical footprint to directly connect all the locations using just its own assets. For example, when service provider A delivers MPLS services in service provider B's territory, provider A purchases backhaul connectivity from provider B back to the agreed upon colocation point. This way of delivering private MPLS connections creates a very easy and well-known man-in-the-middle attack surface.

The only way to protect against these kinds of MPLS attacks is to encrypt data, but alas, MPLS does not natively support encryption. To encrypt data passing through MPLS private connections, every single application, wherever it is located in the complex, highly distributed infrastructure, must encrypt its data. This can be a big management task with lots of room for error, especially in a complex network infrastructure.


Public Infrastructure and SD-WAN – The Devil Is In the Details

WAN connections over public infrastructure rely on the internet to transport data, and it's this internet connection that worries security managers. There's too much visibility and complexity to assure data security in the vastness of the global internet. That's why WAN solutions using public infrastructure often include data encryption.

For example, SD-WAN is one form of internet WAN that many organizations are adopting to fill their growing need for more WAN connectivity. Although SD-WAN doesn't deliver the quality of service of a private line, it is easier to set up and costs much less. Many contend that SD-WAN is more secure than MPLS because it natively supports data encryption, which makes it easy to encrypt all data moving in the network, regardless of where the data is coming from.

This sounds great, but as is often the case, it's not the whole story. SD-WAN has real and potential vulnerabilities that need to be considered, including:

  • SD-WAN's low cost and ease of deployment make it possible to expand your WAN quickly, which means that more assets can be moved into the cloud and new services can be easily added for partners and customers. All of this creates rapid growth in network size, complexity, and attack surface.
  • Not all data encryption provides the same level of protection. Some providers use less challenging encryption algorithms to save on computational cost. The types of encryption keys and re-keying practices can also affect the strength of encryption.
  • SD-WAN appliances used by many service providers contain known vulnerabilities, and they do not have adequate protection from administrative or physical access. Because of the way SD-WANs operate, compromising one SD-WAN appliance can provide access to an entire network. As SD-WANs grow, the risk from vulnerable appliances also grows. Since there is no private core network connectivity in most SD-WANs, the individual and unique peer-to-peer connections required to make them work offer no way of seeing or detecting abnormal behavior.


Netrolix AI-WAN – More Secure Than SD-WAN and MPLS

As detailed in an earlier article about Netrolix AI-WAN (How Netrolix Broke the SD-WAN Barrier with AI-WAN*), the Netrolix AI-WAN consists of the AI-WAN fabric, which is a vast network of ISPs and host data centers around the globe whose traffic is continuously analyzed and monitored by a proprietary deep-learning analytical engine.

Netrolix accomplishes this by having hardware and software installed in 70 data centers globally and collecting internet data from 20,000 nodes. This enables continuous analysis of multiple performance factors on all the ISPs on the planet to determine optimal data paths to any endpoint and across the Netrolix AI-WAN core.

To connect to this AI-WAN fabric, Netrolix has developed a suite of low-cost endpoint devices, which are software-defined gateways (SDGs) that run on either their own Intel® architecture-based bare-metal platforms, or appropriate client-owned equipment. The AI engine monitors the global internet while monitoring and communicating with every endpoint device connected to the AI-WAN fabric. All of Netrolix's services, including MPLS, Virtual Private LAN Service (VPLS), Ethernet private line, SD-WAN, global Virtual Private Network (VPN), cloud services, and other offerings are layered over the AI-WAN fabric.

This enables Netrolix to deliver WAN performance that is on par with or better than traditional private networks from global service providers, with guaranteed throughput at wire speed and much lower cost, along with the flexibility and ease of setup that an SD-WAN offers.

There is another big advantage to Netrolix AI-WAN. It provides a level of WAN security that is unmatched by SD-WAN or MPLS services. Let's see why that is so.


Netrolix AI-WAN Defense in Depth

The Netrolix AI-WAN security posture addresses three aspects of network security:

  • Securing data on the network
  • Securing the AI-WAN fabric
  • Integrating with or augmenting existing enterprise security

Netrolix AI-WAN* infographic

Netrolix uses defense in depth to secure the AI-WAN fabric while integrating with existing enterprise and cloud defenses.

Five factors that secure data on the AI-WAN network

Netrolix applies a multifactor security strategy for protecting data on the network that includes the following elements:

  • Data encryption – All data passing through the Netrolix AI-WAN is automatically encrypted using IKEv2 with elliptic curve cryptography, one of the strongest encryption approaches in common use.
  • Key management – The Netrolix AI-WAN uses a robust Key Management System (KMS) to generate encryption keys for every device, every element of the AI-WAN network, every storage instance, and every network configuration. Many SD-WAN providers use one encryption key across a network, and key swapping or re-keying is done manually. In that case, if a key is compromised in one location, the entire network is compromised. In the Netrolix AI-WAN, every network element has its own key, and every key in the global AI-WAN is automatically re-keyed every 30 minutes.
  • Hardware Security Module (HSM) authentication – In the Netrolix AI-WAN, every Netrolix SDG uses HSM authentication, which is the same hardware-based authentication used in credit and debit card chips. It prevents access to the encryption keys of any Netrolix SDG unless the device is connected over the AI-WAN to a Netrolix management console, which prohibits unauthorized access.
  • RADIUS attributes – These are used to authenticate any device that connects to the AI-WAN.
  • The AI analytics engine – The Netrolix AI-WAN uses a proprietary deep-learning analytical engine that does several things. It analyzes global internet traffic and optimizes end-to-end data paths from any device connected to the AI-WAN, across the AI-WAN core, to any endpoint (for details about this process, see How Netrolix Broke the SD-WAN Barrier with AI-WAN*). Every device connected to the AI-WAN gets data path re-optimization every five minutes.

The analytics engine also performs another important security function. It continuously monitors every device connected to the AI-WAN and identifies anomalous data patterns. It monitors not only the AI-WAN fabric itself, but also data coming from or going to devices connected to the AI-WAN, such as IoT devices, industrial control systems, and autonomous devices such as drones or robots. The ability to detect unusual network activity related to specific devices like these is an important capability. When these kinds of devices are added to an environment, they enlarge the attack surface, yet many are being built with little understanding of or regard for IT security.

Securing the AI-WAN fabric

In addition to protecting network data, several of the features described in the previous section also protect the AI-WAN fabric. For instance, because the analytics engine analyzes the traffic associated with every device connected to the AI-WAN, unplugging a device from the network and moving it to a new location would be detected immediately and cause the device to be quarantined.

RADIUS functionality and IPSec prevent unauthorized devices from connecting to the network, and HSM prevents the compromise of the encryption keys. Beyond that, however, there are additional architectural features that harden the Netrolix AI-WAN.

For instance, it is not possible to locally manage a Netrolix SDG or gain visibility into a device by accessing the underlying operating system. Access can only happen through a management console, which is a containerized application that runs on a hypervisor in redundant centralized locations. All management functions executed by this console happen over IPSec using HSM authentication. With no access to the underlying architecture and no direct access to the hypervisor, Netrolix's SDGs are highly resistant to unauthorized tampering.

The Netrolix SDGs are also protected against physical tampering. They were designed to be rigid boxes with no moving parts and no easy way to open. If a Netrolix SDG is forced open, its data is wiped with no possibility of recovery.

Total integration with enterprise layered security

The third key part of the Netrolix AI-WAN security strategy is the way it is architected to integrate easily with an existing enterprise security stack. For instance, a Netrolix SDG can be configured as a simple network interface device (NID). If a Netrolix AI-WAN user wants to keep their existing Fortinet, Juniper, Cisco, or other network devices in place, those devices can connect to the AI-WAN through the Netrolix SDG configured as an NID.

But a Netrolix SDG can do more. It can work as a network access point plus a router, a switch, and a firewall, all in one solution. And it can be further configured with edge compute capabilities so that it combines network access, router, switch, firewall, and multi-access edge computing (MEC) in one solution.

Netrolix makes it very easy to configure their SDGs through the management console. When a new Netrolix SDG is plugged into your internet service, the AI-WAN immediately discovers it, optimizes it, begins encryption and key management, and enables its functions to be configured through the highly secure management console. This makes integration with existing security stacks an easy process.


How Intel Enables Netrolix AI-WAN Security

Netrolix considered several factors when choosing technology from Intel for the bare-metal platform that is the basis of all their SDGs. These considerations are detailed in an earlier article (How Netrolix Broke the SD-WAN Barrier with AI-WAN*).

Ultimately, the deciding factors were the flexibility of Intel chipsets in supporting Netrolix's architectural needs and the supporting software. From the very beginning, designing an internet WAN that was more secure than any currently available public or private WAN option was central to those architectural needs.

The earlier article details the chipsets used in different Netrolix SDG platforms, but several Intel technologies play an important role in supporting the AI-WAN's underlying security, providing virtualization, secure hardware sharing, and hardware-based encryption:

  • Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
  • Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) – secure hardware sharing
  • Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) – CPU-based encryption


Netrolix AI-WAN Delivers a New Level of Internet WAN Security

One big challenge facing IT security managers today is that networks are growing so fast that traditional security practices are unable to keep up. In a world in which connected devices, distributed processing, and extensive internetworking all happen beyond the direct control of those responsible for securing digital assets, WAN security is becoming fundamentally important.

Netrolix has created a new approach to internet WAN, an AI-WAN that is optimized and secured by a proprietary, deep-learning analytics engine. Netrolix's multifactor security strategy has effectively created a "defense in depth" approach to WAN security that does more than provide new levels of protection. It also extends WAN security beyond the wires to integrate with IT systems and existing enterprise security stacks. These are all the ways Netrolix AI-WAN makes wide-area networking more secure.

To learn more about the Netrolix AI-WAN and Netrolix's many networking services built on the AI-WAN fabric, see the article How Netrolix Broke the SD-WAN Barrier with AI-WAN and visit the Netrolix website.

Also, visit the Intel® Network Builders website and explore a vibrant resource center that provides support to anyone interested in software-defined networking and network function virtualization.


Keras* Implementation of Siamese-like Networks


Abstract

Deep learning has revolutionized the field of machine learning. Convolutional Neural Networks (CNNs) have become very popular for solving problems related to image recognition, image reconstruction, and various other computer vision problems. Libraries such as TensorFlow* and Keras* make the programmer's job easier, but they do not directly provide support for complex networks and uncommonly used layers. This guide will help you write complex neural networks such as Siamese networks in Keras. It also explains the procedure for writing your own custom layers in Keras.


Introduction

Person re-identification is defined as identifying whether the same person appears in a given pair of images. Some of the challenges in this problem come from pictures being taken from different viewpoints and under varying light intensity, which can make pictures of different people look similar and produce false positives. The Normalized X-Corr model1 is used to solve the problem of person re-identification. This guide demonstrates a step-by-step implementation of a Normalized X-Corr model using Keras; the model is a modification of a Siamese network2.

Normalized X-Corr model
Figure 1. Architectural overview of a Normalized X-Corr model.


Overview of the Normalized X-Corr Model

Arulkumar Subramaniam and his colleagues1 propose a deep neural network that treats person re-identification as binary classification. Figure 1 gives an overview of the Normalized X-Corr (normxcorr) model. First, the images are passed through conv-pool-conv-pool layers to extract features from the two images. Because the idea behind these layers is simply to extract features from each image, the weights of the conv layers are shared (i.e., both images are passed through the same layers). After extracting the features, a similarity between the features must be established. This is done by the normalized correlation layer, a custom layer that will be discussed later in this guide. This layer takes a small 5×5 patch from one feature map, convolves it around the other feature map, and calculates the normalized correlation given by:

[Equation image: normalized correlation]
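In text form (the equation above appears only as an image), the normalized correlation between two flattened patches E1 and E2 of length N is the standard normalized cross-correlation; implementations typically add a small constant to the denominator to avoid division by zero:

\mathrm{normxcorr}(E_1, E_2) = \frac{1}{N}\sum_{i=1}^{N}\frac{\left(E_1(i)-\mu_{E_1}\right)\left(E_2(i)-\mu_{E_2}\right)}{\sigma_{E_1}\,\sigma_{E_2}}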

We denote the feature maps of the two images as X and Y. Considering the sizes in Figure 1, we take a patch from feature map X centered at (x, y) at a given depth, and normxcorr is calculated with patches of Y centered at (a, b), where 1 <= a <= 12 and y − 2 <= b <= y + 2. Thus, for every X(x, y), 5×12 = 60 values are generated and stored along the depth of the output feature map. This is done at all 25 depths; therefore, the output dimensions are 37×12×1500 (i.e., a depth of 60×25).

In Figure 2, the size of the image is assumed to be 8×8 for the purpose of demonstration. If we consider the 5×5 patch centered at the block marked by the red square in image 1, we calculate the Normalized-X-Corr of this patch with the patches marked by the green squares in image 2 (i.e., across the entire width of the image), with height within [3 − 2, 3 + 2], which is [1, 5]. Thus, the total number of values generated by a single patch in image 1 is the allowed width × allowed height (i.e., 8×5 = 40). These values are stored along the depth of the output feature map. Thus, for one patch, we generate an output of 1×1×40. Considering the entire image, we would have a feature map of size 8×8×40. If the input has more than one channel, the calculated feature maps are stacked one behind the other. The height and width of the output feature map remain the same, but the depth gets multiplied by the depth of the input images. Hence, an input image of 8×8×5 would generate an output feature map of 8×8×(40×5) (i.e., 8×8×200). For the patch centered at the block marked in blue, we see that to satisfy the criteria we need to add padding; in such cases, the image is padded with zeros.
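As a quick arithmetic check of the example above (a sketch with the assumed 8×8×5 input, not code from the paper):

import numpy as np  # only used to keep the sketch consistent with the rest of the guide

# Output depth = (allowed width positions) x (allowed height positions) x channels.
height, width, channels = 8, 8, 5
height_allowance = 5        # the patch centre may shift up to +/-2 rows
width_allowance = width     # the patch may sit anywhere along the width
output_depth = width_allowance * height_allowance * channels
print(height, width, output_depth)   # -> 8 8 200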

After the Normalized-X-Corr layer, two conv layers and a pooling layer are added to concisely incorporate greater context information. On top of these, two fully connected layers are added and a softmax activation function is applied.

More information about the architecture is available in the paper “Deep Neural Networks with Inexact Matching for Person Re-Identification.”

Demonstrating normalization
Figure 2. Demonstrating the normalized correlation layer's operation.


Diving into the Code

The code below was tested on Intel® AI DevCloud. The following libraries and frameworks were also used: Python* 3 (February 2018 version), Keras* (version 2.1.2), Intel® Optimization for TensorFlow* (version 1.3.0), NumPy (version 1.14.0).

import keras 
import sys 
from keras import backend as K 
from keras.layers import Conv2D, MaxPooling2D, Dense,Input, Flatten 
from keras.models import Model, Sequential 
from keras.engine import InputSpec, Layer 
from keras import regularizers 
from keras.optimizers import SGD, Adam 
from keras.utils.conv_utils import conv_output_length 
from keras import activations 
import numpy as np

These are the imports from Keras and other libraries needed to implement this model.

a = Input((160,60,3)) 
b = Input((160,60,3))

These create placeholders for the input images.

model = Sequential() 
model.add(Conv2D(kernel_size = (5,5), filters = 20,input_shape = (160,60,3), activation = 'relu')) 
model.add(MaxPooling2D((2,2))) 
model.add(Conv2D(kernel_size = (5,5), filters = 25, activation = 'relu')) 
model.add(MaxPooling2D((2,2)))

These are the layers that need to be shared between the images. Therefore, we create a model of these layers.

feat_map1 = model(b) 
feat_map2 = model(a)

model(a) passes its input through the model and returns the output tensor. This is done for both inputs so that they share the same layers (and weights) and output the two feature maps feat_map1 and feat_map2.

normalized_layer = Normalized_Correlation_Layer(stride = (1,1), patch_size = (5, 5))([feat_map1, feat_map2])

This is the custom layer that establishes a similarity between the feature maps extracted from the images. We pass the feature maps as a list input. Its implementation is mentioned later in this guide.

final_layer = Conv2D(kernel_size=(1,1), filters=25, activation='relu')(normalized_layer)
final_layer = Conv2D(kernel_size=(3,3), filters=25, activation = None)(final_layer)
final_layer = MaxPooling2D((2,2))(final_layer)
final_layer = Flatten()(final_layer)
final_layer = Dense(500)(final_layer)
final_layer = Dense(2, activation = "softmax")(final_layer)

These are the layers added on top of the normalized correlation layer; the output is flattened before the two fully connected layers.

x_corr_mod = Model(inputs=[a,b], outputs = final_layer)

Finally, a new model is created that takes the two images as a list of inputs and gives a binary output.
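As a usage sketch, the model can then be compiled and trained on image pairs with one-hot same/different labels. The optimizer, loss, and data shapes below are assumptions for illustration, not settings taken from the paper:

x_corr_mod.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
# imgs_a, imgs_b: arrays of shape (N, 160, 60, 3); labels: one-hot array of shape (N, 2)
# x_corr_mod.fit([imgs_a, imgs_b], labels, batch_size=32, epochs=10)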

Visualizations of the layers of this model are available in the paper "Supplementary Material for the Paper: Deep Neural Networks with Inexact Matching for Person Re-Identification."


Normalized Correlation Layer

This is not a layer provided by Keras, so we have to write our own layer using the support provided by the Keras backend.

class Normalized_Correlation_Layer(Layer):

We create a class that inherits from keras.engine.Layer.

def __init__(self, patch_size=(5, 5),
             dim_ordering='tf',
             border_mode='same',
             stride=(1, 1),
             activation=None,
             **kwargs):
    if border_mode != 'same':
        raise ValueError('Invalid border mode for Correlation Layer '
                         '(only "same" is supported as of now):', border_mode)
    self.kernel_size = patch_size
    self.subsample = stride
    self.dim_ordering = dim_ordering
    self.border_mode = border_mode
    self.activation = activations.get(activation)
    super(Normalized_Correlation_Layer, self).__init__(**kwargs)

This constructor just sets the values passed as parameters as class variables and also initializes the parent class by calling its constructor.

def compute_output_shape(self, input_shape):
    return (input_shape[0][0], input_shape[0][1], input_shape[0][2],
            self.kernel_size[0] * input_shape[0][2] * input_shape[0][-1])

This returns, as a tuple, the shape of the feature map output by this layer. The first element is the number of images, the second is the number of rows, the third is the number of columns, and the last is the depth, which is (allowed movement in height) × (allowed movement in width) × (input depth). In our case that is 5×12×25 = 1500.
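As a quick sanity check (assuming the layer class defined in this guide), applying compute_output_shape to the feature-map shape from Figure 1 gives the expected 1500-deep output:

layer = Normalized_Correlation_Layer(patch_size=(5, 5))
feature_map_shape = (None, 37, 12, 25)
print(layer.compute_output_shape([feature_map_shape, feature_map_shape]))
# -> (None, 37, 12, 1500), since 5 * 12 * 25 = 1500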

def get_config(self):
    config = {'patch_size': self.kernel_size,
              'activation': self.activation.__name__,
              'border_mode': self.border_mode,
              'stride': self.subsample,
              'dim_ordering': self.dim_ordering}
    base_config = super(Normalized_Correlation_Layer, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))

This collects the configuration passed as arguments to the constructor, appends it to the configuration of the parent class, and returns it. Keras calls this function to get the layer's configuration.

def call(self, x, mask=None):

This function is called on every forward pass. It takes as input the list of feature maps produced by the shared model.

input_1, input_2 = x
stride_row, stride_col = self.subsample
inp_shape = input_1._keras_shape

Separate the inputs from the list and copy some values into local variables to make them easier to refer to later on.

output_shape = self.compute_output_shape([inp_shape, inp_shape])

This uses the function written earlier to get the desired output shape and store it in the variable.

padding_row = (int(self.kernel_size[0] / 2), int(self.kernel_size[0] / 2))
padding_col = (int(self.kernel_size[1] / 2), int(self.kernel_size[1] / 2))
input_1 = K.spatial_2d_padding(input_1, padding=(padding_row, padding_col))
input_2 = K.spatial_2d_padding(input_2, padding=((padding_row[0]*2, padding_row[1]*2), padding_col))

This block of code adds padding to the feature maps. This is required because we also take patches centered at (0,0) and at the other edges; therefore, we need to add a padding of 2 in our case. For the feature map of the second input, however, we need to take patches with an offset of up to 2 from the center of the corresponding patch of the first feature map. Thus, for the patch at (0,0) we need to consider patches of the second feature map centered at (0,0), (0,1), (0,2), (0,-1), and (0,-2) with the same value of x, so the second feature map needs a padding of 4 rows.
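For concreteness, plugging the 5×5 patch size into the padding code above gives the following values (shown only as an illustration):

kernel_size = (5, 5)
padding_row = (kernel_size[0] // 2, kernel_size[0] // 2)      # (2, 2) for input_1
padding_col = (kernel_size[1] // 2, kernel_size[1] // 2)      # (2, 2) on the columns
padding_row_2 = (padding_row[0] * 2, padding_row[1] * 2)      # (4, 4) for input_2
print(padding_row, padding_col, padding_row_2)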

output_row = output_shape[1] 
output_col = output_shape[2] 

and store them into the variables.

output = [] 
for k in range(inp_shape[-1]):

Loop for all the depths.

xc_1 = []
xc_2 = []
for i in range(padding_row[0]):
    for j in range(output_col):
        xc_2.append(K.reshape(input_2[:, i:i+self.kernel_size[0], j:j+self.kernel_size[1], k],
                              (-1, 1, self.kernel_size[0]*self.kernel_size[1])))

This is done for the patches of feature map 2 that fall in the extra padding (i.e., the patches that are not centered on the feature map and lie above its first rows).

for i in range(output_row):
    slice_row = slice(i, i + self.kernel_size[0])
    slice_row2 = slice(i + padding_row[0], i + self.kernel_size[0] + padding_row[0])
    for j in range(output_col):
        slice_col = slice(j, j + self.kernel_size[1])
        xc_2.append(K.reshape(input_2[:, slice_row2, slice_col, k],
                              (-1, 1, self.kernel_size[0]*self.kernel_size[1])))
        xc_1.append(K.reshape(input_1[:, slice_row, slice_col, k],
                              (-1, 1, self.kernel_size[0]*self.kernel_size[1])))

Extract patches of size 5×5 from both feature maps and store them in xc_1 and xc_2, respectively. The patches are flattened and reshaped to the form (-1, 1, 25).

for i in range(output_row, output_row + padding_row[1]):
    for j in range(output_col):
        xc_2.append(K.reshape(input_2[:, i:i+self.kernel_size[0], j:j+self.kernel_size[1], k],
                              (-1, 1, self.kernel_size[0]*self.kernel_size[1])))

This extracts the patches of feature map 2 whose centers lie below the bottom of the feature map.

xc_1_aggregate = K.concatenate(xc_1, axis=1)

These patches are joined along axis=1 so that, for the given depth, xc_1_aggregate has one row per patch location (37×12 = 444 of them), each row holding the 25 flattened values of a 5×5 patch.

xc_1_mean = K.mean(xc_1_aggregate, axis=-1, keepdims=True) 
xc_1_std = K.std(xc_1_aggregate, axis=-1, keepdims=True) 
xc_1_aggregate = (xc_1_aggregate - xc_1_mean) / xc_1_std

This is just the implementation of normalization of the features of the first feature map.

xc_2_aggregate = K.concatenate(xc_2, axis=1)
xc_2_mean = K.mean(xc_2_aggregate, axis=-1, keepdims=True) 
xc_2_std = K.std(xc_2_aggregate, axis=-1, keepdims=True) 
xc_2_aggregate = (xc_2_aggregate - xc_2_mean) / xc_2_std

Similarly, for the feature maps of image 2.

xc_1_aggregate = K.permute_dimensions(xc_1_aggregate, (0, 2, 1))
block = []
len_xc_1 = len(xc_1)
for i in range(len_xc_1):
    # Compute the product of a given patch of feature map 1 with all the
    # patches of feature map 2 it is supposed to be correlated with.
    sl1 = slice(int(i/inp_shape[2])*inp_shape[2],
                int(i/inp_shape[2])*inp_shape[2] + inp_shape[2]*self.kernel_size[0])
    # sl1 selects which patches of feature map 2 are considered for the
    # given patch of the first feature map.
    block.append(K.reshape(K.batch_dot(xc_2_aggregate[:, sl1, :],
                                       xc_1_aggregate[:, :, i]),
                           (-1, 1, 1, inp_shape[2]*self.kernel_size[0])))

Calculate the dot product (i.e., the normalized correlation) and store it in "block".

block = K.concatenate(block, axis=1)
block = K.reshape(block, (-1, output_row, output_col, inp_shape[2]*self.kernel_size[0]))
output.append(block)

Join the calculated normalized correlation values, reshape them (they are calculated sequentially so that reshaping is easier), and append the result to "output."

output = K.concatenate(output, axis=-1)

Join the output feature map calculated at each depth, along the depth of “output.”

output = self.activation(output) 
return output

Apply the activation, if one was passed as an argument, and return the generated output.


Applications

Such a network has various applications, such as matching a person's identity across images from crime scenes. It can also be generalized to find the similarity between any two images (e.g., to determine whether the same fruit appears in both images).


Further Scope

The code runs sequentially and is devoid of parallelism. The matrix multiplication of the patches could be parallelized across multiple cores using libraries such as multiprocessing, which would help speed up training. The accuracy of the model could be increased by finding a more suitable similarity measure between the image patches.
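For example, here is a minimal sketch of how the per-patch dot products could be farmed out with the standard multiprocessing module; the patch counts and shapes are illustrative and not taken from the model:

import numpy as np
from multiprocessing import Pool

def normxcorr_pair(pair):
    # Both patches are assumed to be flattened and already mean/std normalized,
    # so their scaled dot product is the normalized correlation.
    p1, p2 = pair
    return float(np.dot(p1, p2)) / p1.size

if __name__ == "__main__":
    np.random.seed(0)
    patches_1 = np.random.randn(1000, 25)   # 1000 flattened 5x5 patches
    patches_2 = np.random.randn(1000, 25)
    with Pool() as pool:
        scores = pool.map(normxcorr_pair, list(zip(patches_1, patches_2)))
    print(len(scores), scores[0])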


Acknowledgement

I would like to thank the Intel® Student Ambassador Program for AI, which provided me with the necessary training resources on the Intel® AI DevCloud and the technical support that helped me to use DevCloud.


References

  1. A. Subramaniam, M. Chatterjee, and A. Mittal. "Deep Neural Networks with Inexact Matching for Person Re-Identification." In NIPS, 2016.
  2. D. Yi, Z. Lei, S. Liao, and S. Z. Li. "Deep Metric Learning for Person Re-Identification." In ICPR, 2014.
  3. Code on GitHub*

Performance, Methods, and Practices of DirectX* 11 Multithreaded Rendering


Abstract

Rendering is usually the main CPU performance bottleneck of PC games; multithreaded rendering is an effective way to eliminate it. This article investigates the performance scalability of DirectX* 11 multithreaded rendering, discusses two basic methods for multithreaded rendering, and presents a case study of multithreading a traditional deferred shading pipeline in a large-scale online game, Conqueror's Blade*.

Background

Over the past 10 years, CPUs in the PC market have shown great improvements. According to a software and hardware survey by Steam*2, 4-core processors (usually 8 logical cores) have become mainstream in the current PC game market. The 6-core processor (usually 12 logical cores) is already on its way to becoming the mainstream next-generation CPU; for example, the 6-core Intel® Core™ i7-8700K processor has been available since late 2017. We expect this trend to continue. In the next few years, 6-core and 8-core CPUs will become the most popular processors for gamers.

In many PC games, rendering is usually single-threaded and easily becomes the biggest performance bottleneck. This makes it difficult for games to use the extra idle cores of a multicore processor to improve performance or enrich game content. Although DirectX 12* has been around for a few years, most games currently under development, especially the most popular online games, still use DirectX 11. DirectX 11 was designed from the beginning to support multithreading1. Therefore, investigating the performance scalability of DirectX 11 multithreaded rendering on current mainstream multicore platforms, and studying how to make full use of this feature, has important reference value for the development and optimization of most games.

DirectX* 11 Multithreaded Rendering Model

First, let's briefly review the DirectX 11 multithreaded rendering model (see Figure 1). DirectX 11 supports two types of rendering, immediate and deferred, based on two kinds of Direct3D* 11 device contexts: the immediate context and the deferred context. Immediate rendering calls draw APIs through the immediate context, and the generated commands are immediately sent to the graphics processing unit (GPU). Deferred rendering calls draw APIs through a deferred context, but only records the draw commands in a command list that the immediate context submits to the GPU at another point in time. DirectX 11 supports the simultaneous use of multiple deferred contexts on different threads. This allows the rendering of complex scenes to be divided into multiple concurrent tasks; that is, multithreaded rendering.

DirectX 11 multithreaded rendering model
Figure 1. DirectX* 11 multithreaded rendering model.

Evaluate DirectX 11 Multithreading Performance Scalability

Based on the hardware and software configuration of Table 1, we evaluate the performance scalability of DirectX 11 multithreaded rendering on multicore CPUs.

Table 1. Hardware and software configurations for performance scalability evaluation.

CPU: Intel® Core™ i7-6950X processor @ 3.00 GHz (10 cores)
Memory: 2 x 16 GB RAM
GPU: NVIDIA GeForce* GTX 1080; AMD Radeon* RX Vega 64
Driver Version: 22.21.13.8494 (NVIDIA); 22.19.677.257 (AMD)
Operating System: Windows® 10 Professional 64-bit
Test Program: Microsoft DirectX* SDK (June 2010) sample: MultithreadedRendering11.exe

The evaluation uses the Intel Core i7-6950X processor (10 physical cores; that is, 20 logical cores) to simulate CPUs with different numbers of cores. To ensure that the GPU does not become a performance bottleneck for the test program, the test uses two high-performance discrete GPUs: the NVIDIA GeForce* GTX 1080 and the AMD Radeon* RX Vega 64. The test program is the MultithreadedRendering11 sample from the Microsoft DirectX SDK4, chosen mainly for the following reasons. First, the program's performance is CPU-bound, and it was developed to demonstrate the DirectX 11 multithreaded rendering feature, which helps maximize the potential for performance scalability. Second, the main function of the program is rendering (each frame contains more than 4,000 draw calls), with no animation, physics, or similar load, so the measured scalability reflects DirectX 11 multithreaded rendering as much as possible. In addition, the program's scene complexity and rendering technology are quite common in games, so the test results are representative. Last, but not least, the program's source code is open, making it easy to analyze and understand the DirectX 11 multithreaded rendering methods and their impact on performance scalability.

Test program
Figure 2. Test program

When running the test program, we chose the MT Def/Chunk mode, because scalability in this mode is not limited by the number of rendering passes (or scenes), but only by the number of CPU cores. The workload of each thread is relatively balanced, which makes full use of the computing power of the multicore CPU. During the test, we adjusted the CPU's active core count through the BIOS and measured the program's frame rate at each core count. To compare the effects of different GPUs on DirectX 11 multithreaded rendering scalability, we divided the multithreaded frame rate on each GPU by the single-threaded frame rate (immediate mode) under the same configuration, to obtain a normalized relative performance metric. The test results are shown in Figure 3.

Scalability of DirectX 11 multithreaded rendering
Figure 3. Multicore performance scalability of DirectX* 11 multithreaded rendering.

As we can see from Figure 3, with two CPU cores, no matter which GPU we use, multithreaded rendering (MT Def/Chunk mode) performance is lower than single-threaded rendering (immediate mode). What leads to this result? According to the source code of the test program, the number of working threads is the number of CPU physical cores minus one. In other words, on a two-core CPU in multithreaded rendering mode, only one working thread processes all scene draw calls through deferred rendering, while the main thread does not handle any scene draw calls. In single-threaded rendering mode, all draw calls are processed by the main thread through immediate rendering. This means that, for an equal number of draw calls, the overhead of deferred rendering is slightly larger than that of immediate rendering.

However, when the number of CPU cores is greater than two, the DirectX 11 multithreaded rendering performance is significantly better than that of single-threaded rendering, regardless of which GPU is used, and the performance increases as the number of cores increases. When paired with the NVIDIA GeForce GTX 1080, multicore performance scales very well; performance increase is almost linear from 2 to 6 cores. Even from 6 to 10 cores, the performance increase is significant. When paired with AMD Radeon RX Vega 64, the scalability is worse than that; especially when the number of CPU cores exceeds 4, the performance increase is almost negligible.

Why does the test program show such a large difference in multicore performance scalability on different GPUs? We used Microsoft GPUView* to capture the multithreaded activity of the test program (see Figure 4) and found that the bottleneck is on the CPU with either the NVIDIA GeForce GTX 1080 or the AMD Radeon RX Vega 64. However, multithreaded concurrency is better with the NVIDIA GPU, and the periods during which the main thread blocks the working threads are significantly longer with the AMD graphics card.

Rendering parallelism with different GPUs
Figure 4. DirectX* 11 multithreaded rendering parallelism with different GPUs.

From the source code, we know that each working thread has a deferred context, and all draw calls for scene rendering are issued through deferred contexts. The main thread holds the immediate context, which is responsible for submitting the command lists generated in the deferred contexts to the GPU. Using Windows* Performance Analyzer to further analyze the modules called by the working threads, we find that, on the NVIDIA GPU, all the working threads call the graphics driver module (see Figure 5). This means the deferred context operations share some of the driver load and leave less driver load on the immediate context, thereby shortening the periods during which the main thread blocks the working threads. On the AMD GPU, the graphics driver module does not appear in the working threads but is concentrated in the main thread (see Figure 6), which means the single immediate context bears a large amount of driver load, increasing the time the working threads spend waiting for the main thread.

Some of the NVIDIA driver load
Figure 5. Working thread (deferred context) represents some of the NVIDIA driver load.

A large amount of the driver load
Figure 6. The main thread (immediate context) represents a large amount of the driver load.

By checking the GPU driver support for DirectX 11 multithreaded rendering features3 (see Figure 7) through the DirectX Caps Viewer, we learn that the NVIDIA GPU driver supports driver command lists, while the AMD GPU driver does not support them. This explains why the driver modules on different GPUs appear in different contexts. When paired with the NVIDIA GPU, working threads can build driver commands in parallel in a deferred context; while when paired with the AMD GPU, the driver commands are all built in serial in the immediate context of the main thread.

Rendering by different GPU drivers
Figure 7. Support for DirectX* 11 multithreaded rendering by different GPU drivers.

Based on the above tests and analysis, we can draw the following conclusions:

  • Although the indirect overhead of deferred rendering is larger than that of immediate rendering, the performance of DirectX 11 multithreaded rendering can be significantly higher than that of single-threaded rendering, especially on current mainstream CPUs with four or more cores, when an appropriate rendering task division method is used: evenly distributing draw calls across more than two Direct3D* device contexts.
  • The performance scalability of DirectX 11 multithreaded rendering is GPU-related. When the GPU driver supports driver command lists, DirectX 11 multithreaded rendering can achieve good performance scalability; otherwise, performance scalability is easily constrained by a driver bottleneck. Fortunately, NVIDIA GPUs2, which hold the largest share of the current game market, support driver command lists.

Multithreaded Rendering Method

The performance scalability evaluation above shows that, on current mainstream multicore CPUs and GPUs, DirectX 11 games may achieve significant performance improvements from multithreaded rendering. So, how do you effectively exploit this performance potential? The MultithreadedRendering11 sample demonstrates two basic methods for dividing a rendering task across multiple threads:

1) Assign each thread a rendering Pass.

2) Assign each thread an equal number of Chunks.

It should be noted that the multithreaded rendering method described here is not only suitable for DirectX 11 but also for DirectX 12. In fact, we can take the DirectX 11 deferred context as a DirectX 12 command list, and the DirectX 11 immediate context as a combination of the DirectX 12 command list and the command queue.

Figure 8 shows a multithreaded rendering method that divides the rendering task by Pass. A Pass is a relatively independent rendering task; typical Passes include the generation of pre-Z buffers, shadow maps, reflection maps, G-buffers, the UI, and the main Pass generating the final frame buffer. With this method, each Pass is assigned to a working thread, and the command list for that Pass is built on that working thread. The main thread is responsible for distributing the Passes and submitting the completed command lists in order. In the MultithreadedRendering11 sample, the main thread submits the command lists in order only after all the working threads have completed them. Figure 8 shows a better way: when a command list is completed, it should be submitted to the GPU immediately, as long as the rendering order permits. Since command list submission is serial and carries some overhead, the earlier the submission, the more of this serial time can be hidden, which also lets the GPU start processing earlier.

Divide the rendering task by Pass
Figure 8. Divide the rendering task by Pass.

Dividing rendering tasks by Pass is easy to apply to the multi-pass rendering techniques commonly used in modern games. As long as each Pass contains a relatively large rendering load (number of draw calls), this method is usually effective in improving performance. The shortcoming is that performance scalability is limited by the number of Passes, and it is not easy to balance the load between Passes.

Figure 9 shows a multithreaded rendering method that divides rendering tasks by Chunk. A Chunk is a rendering task of finer granularity than a Pass; a typical Chunk can be a set of draw calls, a mesh, or a larger rendering unit such as a separate rendering object containing multiple meshes. In this method, each Pass is divided into Chunks, which the main thread distributes evenly to multiple working threads, each of which builds a command list. After each command list is completed, the main thread submits them in order. The number of working threads is determined by the number of physical cores, rather than logical cores, to avoid the excessive overhead of too many command list submissions. Using the Pass as the unit for submitting command lists helps unify the rendering state within a command list, lets the GPU start processing earlier, and allows command lists to be reused between Passes. A sketch of this scheduling pattern follows.
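The scheduling idea can be sketched in a few lines of Python; this is a language-agnostic illustration of the Chunk-scheduling pattern, not DirectX API code, and all names are hypothetical:

from concurrent.futures import ThreadPoolExecutor
import os

def build_command_list(chunk):
    # Stand-in for recording a Chunk's draw calls into a deferred context.
    return ["draw(%s)" % obj for obj in chunk]

def render_pass(chunks, submit):
    # Roughly one worker per physical core, approximated as half the logical cores.
    workers = max(1, (os.cpu_count() or 2) // 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(build_command_list, c) for c in chunks]
        # Submit each command list as soon as it is ready, but in the original order.
        for f in futures:
            submit(f.result())

render_pass([["terrain"], ["house", "tree"], ["npc_0", "npc_1"]],
            submit=lambda cl: print("execute", cl))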

Dividing rendering tasks by Chunk
Figure 9. Dividing rendering tasks by Chunk.

The multithreaded rendering method that divides rendering tasks by Chunk can achieve a significant performance improvement; the performance is not affected by the number of Passes and increases with the number of CPU cores. The shortcoming is that in situations that require ordered rendering (such as rendering semi-transparent objects), the options for distributing Chunks are limited, and it is easy to lose load balance among the threads, which hurts performance scalability.

No matter which of the above multithreaded rendering methods is used, the following points should be noted:

  • Since command list submission is serial and carries a certain amount of overhead, each command list should be submitted as soon as it is completed and the rendering order allows, rather than waiting for the other command lists. This helps hide the serial time and relieves the GPU from bursts of load.
  • To hide the overhead of using deferred contexts, each deferred context, whether for a Pass or a Chunk, should contain enough draw calls. If the number of draw calls processed by a deferred context is too small, consider handling those draw calls in the immediate context or combining them.
  • Try to balance the load between different contexts to maximize the advantages of multithreaded rendering.

Case Study

Here we introduce the multithreaded rendering methods and the results achieved in a real DirectX 11 game. Conqueror's Blade5 is a large-scale online game developed by NetEase*. The game has large-scale outdoor battle scenes, a large number of characters on screen at once, and rich visual effects. These characteristics make the game demand more CPU resources. To give players on low-end CPU platforms a smooth gaming experience, the developers continue to apply multithreading optimizations to the game engine in order to improve performance by fully utilizing CPU resources, or to improve game detail.

Single-threaded rendering causes CPU performance bottlenecks
Figure 10. Single-threaded rendering causes CPU performance bottlenecks.

Before this performance optimization, the engine had already achieved a certain amount of multithreading: some CPU-intensive tasks, such as game logic, physics, particles, and animation, run on separate threads. The rendering thread is mainly responsible for visibility detection and running the entire rendering pipeline. Nevertheless, the rendering thread was still a performance bottleneck for the game (see Figure 10). A typical combat scene with more than 5,000 draw calls per frame also results in considerable Direct3D runtime and driver overhead. The game uses the DirectX 11 API and a typical deferred shading pipeline. The task pipeline of the rendering thread is shown in Figure 11.

The game’s task pipeline of rendering thread
Figure 11. The game's task pipeline of rendering thread.

Based on considerations such as legacy code constraints and implementation time, the developers chose a multithreading optimization method that divides rendering tasks by Pass, which is the easier choice to implement in a limited time. The specific implementation scheme is shown in Figure 12.

In the optimization scheme, Visibility is removed from the rendering thread and divided into two jobs: eye visibility and light visibility. The GBuffer generation, the shadow map generation, and the forward and transparent Passes, which have (or may dynamically have) a large number of draw calls, are also moved out of the rendering thread and encapsulated as jobs dispatched to working threads. GBuffer generation is further divided into three jobs, GBuffer Terrain, GBuffer Static, and GBuffer Dynamic, because it contains too many draw calls. The rendering thread retains only the Scaleform UI, Deferred Shading, and Post Process Passes, which must use immediate context rendering or contain only a few draw calls.

Multithreaded rendering flowchart after game optimization
Figure 12. Multithreaded rendering flowchart after game optimization.

At run time, the working threads first process the two visibility-check jobs in parallel; these two jobs then spawn six render-pass jobs, and the working threads build the DirectX 11 command list for each related Pass in deferred contexts. The rendering thread executes, in order, the Passes that remain on the rendering thread and the command lists completed by the working threads in the immediate context.

After the multithreading optimization, the bottleneck on the rendering thread is eliminated, multicore utilization is significantly improved (see Figure 13), and the frame rate increased to an average of 1.7 times that before the optimization, achieving the set performance target.

Eliminate bottlenecks after multithreaded rendering
Figure 13. Eliminating bottlenecks with multithreaded rendering improves multicore utilization.

Although the current solution significantly improves performance, there is still a lot of room for improvement in CPU utilization due to the uneven load between Passes. Therefore, to use the idle CPU capacity to further enhance performance or game detail, the developers plan to rebuild the engine's rendering code and try dividing the rendering tasks by Chunk.

Summary

On the multicore CPUs and GPUs with the largest share of the current game market, DirectX 11 games whose CPU performance bottleneck is rendering may realize significant performance improvements from multithreaded rendering. Although the multicore performance scalability of DirectX 11 multithreaded rendering is limited on some GPUs with limited driver support, with a reasonable implementation the performance of multithreaded rendering will still be better than that of single-threaded rendering. The key to exploiting multithreaded rendering is the division and scheduling of rendering tasks; for this reason, this article introduced methods based on Pass and on Chunk. These two methods apply not only to DirectX 11 but also to DirectX 12, so multithreaded rendering optimizations of DirectX 11 games can be easily ported to future DirectX 12 games. In the game Conqueror's Blade, a Pass-based multithreaded method was successfully applied to a traditional deferred shading pipeline, proving the effectiveness of DirectX 11 multithreaded rendering.

Footnotes

1. Introduction to Multithreading in Direct3D 11

2. Steam Hardware and Software Survey

3. How To: Check for Driver Support

4. Microsoft DirectX SDK (June 2010)

5. Conqueror's Blade official website

Intel® System Studio - Download and Install Intel® C++ Compiler


 

Intel® C++ Compiler, known as icc, is a high-performance compiler that can be used to build and optimize your C/C++ project. icc is distributed as part of the Intel® System Studio product, so Intel® System Studio must be installed on your build system in order to build your project with icc. If you are new to Intel System Studio, go to the Choose & Download page on the Intel System Studio website to acquire a free renewable commercial license for 90-day use. If you need a long-term license with priority support, please contact your Intel representative or send an email to intelsystemstudio@intel.com for more information.

This document describes the steps for a new user to register, download, and install Intel® System Studio. By following the steps, you can install Intel® System Studio in command-line mode on your host machine without GUI support. The Intel® C++ Compiler is installed as part of Intel® System Studio, along with all the other components. For a detailed list of the components in Intel® System Studio, please refer to the Intel System Studio website.

       Click Here to Download the Document

Understanding Capsule Network Architecture


Capsule networks (CapsNet) are a new neural network architecture, an advance over previous designs, particularly for computer vision tasks. To date, convolutional neural networks (CNNs) have been the standard approach for computer vision tasks. Although CNNs have achieved high accuracy, they still have some shortcomings.

 
Drawback of Pooling Layers

CNNs were originally built to classify images; they do so by using successive layers of convolutions and pooling. The pooling layer in a convolutional block is used to reduce the data dimension and to achieve something called spatial invariance, which means that regardless of where the object is placed in the image, the network identifies and classifies it. While this is a powerful concept, it has some drawbacks. One is that pooling tends to discard a lot of information, information that is particularly useful for tasks such as image segmentation and object detection. When the pooling layers lose the spatial information about the rotation, location, scale, and other positional attributes of the object, object detection and segmentation become very difficult. Modern CNN architectures manage to reconstruct the positional information using various advanced techniques, but they are not fully accurate, and the reconstruction itself is a tedious process. Another drawback of pooling is that if the position of the object is slightly changed, the activation does not change in proportion, which leads to good accuracy for image classification but poor performance when you want to locate exactly where the object is in the image.
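A tiny NumPy example (illustrative only) shows the information loss: two inputs whose only difference is the position of a feature inside a pooling window produce identical pooled outputs, so the exact location is gone after pooling.

import numpy as np

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling over a 2D array with even sides.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0   # feature in the top-left corner
b = np.zeros((4, 4)); b[1, 1] = 1.0   # same feature shifted by one pixel

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))   # True: position is lost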

 
Capsule Networks

To overcome such difficulties, a new approach called the capsule network1 was proposed by Geoffrey Hinton and his colleagues. A capsule is a group of neurons that stores different information about the object it is trying to identify in a given image, mostly information about its position, rotation, scale, and so on, in a high-dimensional vector space (8 or 16 dimensions), with each dimension representing something particular about the object that can be understood intuitively (see Figure 3).

In computer graphics there is a concept called rendering, which means taking various internal representations of an object, such as its position, rotation, and scale, and converting them into an image on screen. Our brain works in the opposite way, a process called inverse graphics: when we look at an object, we internally deconstruct it into hierarchical subparts and develop relationships between these parts and the whole object. This is how we recognize objects, and because of this our recognition does not depend on a particular view or orientation of the object. This concept is the building block of capsule networks.

To understand how this works in a capsule network let’s look at its architectural design. The architecture of a capsule network is divided into three main parts and each part has sub operations in it. They are:

  • Primary capsules
    • Convolution
    • Reshape
    • Squash
  • Higher layer capsules
    • Routing by agreement
  • Loss calculation
    • Margin loss
    • Reconstruction loss

1. Primary capsules

This is the first layer of the capsule network and this is where the process of inverse graphics takes place. Suppose you are feeding the network with the image of a boat or a house, like in the following images:

primary capsules process of inverse graphics

Now, these images are broken down into their sub hierarchical parts in this layer. Let’s assume for the sake of simplicity that these images are constructed out of two distinct sub parts; that is, one rectangle and one triangle.

primary capsules one rectangle and one triangle

In this layer, capsules representing the triangle and rectangle will be constructed. Let us suppose we initialize this layer with 100 capsules, 50 representing the rectangle and 50 representing the triangle. The output of these capsules is represented with the help of arrows in the image below; the black arrows representing the rectangle’s output, and the blue arrows representing the triangle’s. These capsules are placed in every location of the image, and the output of these capsules denotes whether or not that object is located in that position. In the picture below you can see that in the location where the object is not placed, the length of the arrow is shorter and where the object is placed, the arrow is longer. The length represents whether the object is present, and the pose of the arrow represents the orientation of that particular object (position, scale, rotation, and so on) in the given image.

primary capsules object position, scale, rotation

An interesting thing about this representation is that, if we slightly rotate the object in our input image, the arrows representing these objects will also slightly rotate with proportion to its input counterpart. This slight change in input resulting in a slight change in the corresponding capsule’s output is known as equivariance. This enables the capsule networks to locate the object in a given image with its precise location, scale, rotation, and other positional attributes associated with it.

primary capsules equivariance

This is achieved using three distinct processes:

  1. Convolution
  2. Reshape function
  3. Squash function

In this layer the input image is fed into a couple of convolution layers, which output an array of feature maps; let's say 18 feature maps. Now, we apply the Reshape function to these feature maps, reshaping them into two vectors of nine dimensions each (18 = 2 x 9) for every location in the image, similar to the rectangle and triangle capsules pictured above. The last step is to make sure the length of each vector is not greater than 1, because the length of each vector represents the probability that the object is located at that position in the image, so it must lie between 0 and 1. To achieve this we apply the Squash function, which ensures that the length of each vector lies between 0 and 1 without destroying the positional information held in the vector's dimensions.
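To make the Squash operation concrete, here is a minimal NumPy sketch of the formula described above (variable names are illustrative; the Keras/TensorFlow version used by the full implementation appears later in this article):

import numpy as np

def squash(vectors, axis=-1, eps=1e-10):
    # Squared length of each capsule vector
    squared_norm = np.sum(np.square(vectors), axis=axis, keepdims=True)
    # Scale the unit vector by ||s||^2 / (1 + ||s||^2) so its length stays below 1
    return (squared_norm / (1.0 + squared_norm)) * vectors / np.sqrt(squared_norm + eps)

# Example: two 9-dimensional capsule vectors for one image location
capsules = np.random.randn(2, 9)
print(np.linalg.norm(squash(capsules), axis=-1))  # both lengths are now below 1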

primary capsules squash function

Now we need to figure out what these sub parts or capsules are related to. That is, if we consider our example of boat and house, we need to figure out which triangle and rectangle is part of the house and which is part of the boat. So far, we know where in the image these rectangles and triangles are located, using these convolutions and squashing functions. Now we need to figure out whether a boat or a house is located there and how these triangles and rectangles are related to the boat and the house.

2. Higher layer capsules

Before we get to the higher layer capsules, there is still one major function performed by the primary capsule layer. Right after the Squash function, every capsule in the primary layer tries to predict the output of every capsule in the higher layer of the network. For example, we have 100 capsules: 50 rectangles and 50 triangles. Now, suppose we have two types of capsules in the higher layer, one for the house and another for the boat. Depending on the orientation of the triangle and rectangle capsules, they will make the following predictions for the higher layer capsules, giving rise to the following scenario:

higher layer capsules triangle and rectangle predictions

As you can see, with respect to their original orientations, the rectangle capsule and the triangle capsule both predicted the boat present in the picture in one of their predictions. They both agree that it is the boat capsule that should be activated in the higher layer. This means that the rectangle and triangle are more a part of a boat than of a house, and that the rectangle and triangle capsules consider that selecting the boat capsule best explains their own orientations in the primary layer. In this way, both primary layer capsules agree to select the boat capsule in the next layer as the likely object located in the image. This is called routing by agreement.

higher layer capsules routing by agreement

This particular technique has several benefits. Once the primary capsules agree to select a certain higher-level capsule, there is no need to send a signal to another capsule in another higher layer, and the signal in the agreed-on capsule can be made stronger and cleaner and can help in accurately predicting the pose of the object. Another benefit is that if we trace the path of the activation, from the triangle and rectangle to the boat capsule in a higher layer, we can easily sort out the hierarchy of the parts and understand which part belongs to which object; in this particular case, rectangle and triangle belong to the boat object.

So far we have dealt with the primary layer; now the actual work of the higher capsule layer comes in. Even though the primary layer predicted some output for the higher layer, it still needs to calculate its own output and cross-check which prediction matches with its own calculation.

The first step the higher capsule layer takes to calculate its own output is to set up routing weights. We have the predictions given by our primary layer; in the first iteration, the routing weight for every prediction is initialized to zero. These initial routing weights are fed into a Softmax function and the resulting values are assigned to the predictions.

higher layer capsules routing weights softmax function

Now, after assigning the Softmax output to the predictions, it calculates the weighted sum to each capsule in this higher layer. This gives us two capsules from a bunch of predictions. This is the actual output of the higher layer for the first round or first iteration.

higher layer capsules weighted sum

Now we can find which prediction is the most accurate compared to the actual output of the layer.

higher layer capsules prediction most accurate compared to layer

After selecting the most accurate prediction, we calculate a new routing weight for the next round by taking the scalar product of the prediction and the actual output of the layer and adding it to the existing routing weight, given by the equation:

Ûij (prediction by the primary layer)

Vj (actual output of the higher layer)

Bij += Ûij · Vj

Now, if a prediction and the output match, the new routing weight will be large, and if not, the weight will be low. Again, the routing weights are fed into the Softmax function and the values are assigned to the predictions. You can see that the strongly agreeing predictions have large weights associated with them, whereas the others have low weights.

higher layer capsules routing weight values assigned to predictions

Again we compute the weighted sum of these predictions using the new weights. But now we find that the boat capsule has a longer vector associated with it than the house capsule, as the weights were in favor of the boat capsule, so the layer chooses the boat capsule over the house capsule in just two iterations. In this way we compute the output of this higher layer and choose which capsule to select for the next steps in the capsule network.

I have described only two iterations or rounds for the sake of simplicity, but in practice more routing iterations can be used, depending on the task you are performing.
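To summarize the procedure, below is a simplified NumPy sketch of routing by agreement (batch dimension omitted and names illustrative; the Keras CapsuleLayer later in this article is the actual implementation used):

import numpy as np

def softmax(x, axis):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-10):
    sq = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def routing_by_agreement(u_hat, n_iterations=2):
    # u_hat: predictions made by each primary capsule for each higher capsule,
    # shape (n_primary, n_higher, dim)
    n_primary, n_higher, _ = u_hat.shape
    b = np.zeros((n_primary, n_higher))               # routing weights start at zero
    for _ in range(n_iterations):
        c = softmax(b, axis=1)                        # agreement coefficients
        s = np.sum(c[..., None] * u_hat, axis=0)      # weighted sum per higher capsule
        v = squash(s)                                 # actual output of the higher layer
        b += np.sum(u_hat * v[None, ...], axis=-1)    # scalar product updates the weights
    return v

# 100 primary capsules (rectangles and triangles) predicting 2 higher capsules (house, boat)
print(routing_by_agreement(np.random.randn(100, 2, 16)).shape)  # (2, 16)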

higher layer capsules compute output in higher layer

3. Loss calculation

Now that we have decided which object is in the image using the routing-by-agreement method, we can perform classification. Since our higher layer has one capsule per class (one for the boat and one for the house), we can easily add a layer on top of this higher layer, compute the length of each activation vector and, depending on the length, assign a class probability to make an image classifier.

In the original paper, margin loss was used to calculate the class probabilities for multiple classes to create such an image classifier. Margin loss simply means that if an object of a certain class is present in the image, then the length of the corresponding capsule's output vector must not be less than 0.9. Similarly, if no object of that class is present in the image, then the length of that vector should not be more than 0.1.

Suppose |Vk| is the length of the output vector for the class K capsule. If an object of class K is present, then |Vk| >= 0.9. Similarly, if no class K object is present, then |Vk| <= 0.1.

In addition to the margin loss, there is an additional unit called the decoder network connected to the higher capsule layer. This decoder network consists of three fully connected layers, the first two activated by rectified linear units (ReLU) and the last by a sigmoid; it is used to reconstruct the input image.


This decoder network learns to reconstruct the input image by minimizing the squared difference between the reconstructed image and the input image:

Reconstruction Loss = (Reconstructed Image – Input Image)²

Now, we have total loss as:

Total Loss = Margin Loss + alpha * Reconstruction Loss

Here, the value of alpha (a constant that scales down the reconstruction loss) is 0.0005 in the paper1 (no extra information is given on why this particular value was chosen). The reconstruction loss is scaled down considerably to give more importance to the margin loss, so that the margin loss dominates the training process. The importance of the reconstruction unit and the reconstruction loss is that they force the network to preserve, up to the highest capsule layer, the information required to reconstruct the image. This also acts as a regularizer to avoid over-fitting during training.
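Putting the two terms together, a minimal NumPy sketch of the total loss for a single example could look like the following (names are illustrative and this is not the Keras code used later):

import numpy as np

def total_loss(v_lengths, y_true, reconstructed, original,
               alpha=0.0005, m_plus=0.9, m_minus=0.1, lam=0.5):
    # v_lengths: lengths of the higher-layer capsule vectors, shape (n_classes,)
    # y_true: one-hot ground-truth labels, shape (n_classes,)
    margin = np.sum(y_true * np.maximum(0.0, m_plus - v_lengths) ** 2
                    + lam * (1.0 - y_true) * np.maximum(0.0, v_lengths - m_minus) ** 2)
    reconstruction = np.sum((reconstructed - original) ** 2)
    return margin + alpha * reconstruction

# Ground truth is "boat": the boat capsule is long (0.95) and the house capsule short (0.2)
image = np.random.rand(28, 28)
print(total_loss(np.array([0.2, 0.95]), np.array([0.0, 1.0]), image, image))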

In the paper1, the capsule network is used to classify MNIST* digits. As shown below (Figure 1), the paper presents the different units of CapsNet for MNIST classification. Here, the input, after being fed through two convolutional layers, is reshaped and squashed to form the primary capsule layer: 32 channels of 6 x 6 capsules, each an 8-dimensional vector. These primary capsules are fed into the higher layer capsules, a total of 10 capsules with 16 dimensions each, and finally margin loss is calculated on these higher layer capsules to give the class probabilities.


Figure 1: Dynamic routing between capsules1

Figure 2 shows the decoder network used to calculate reconstruction loss. A higher layer capsule is connected to three fully connected layers with the last layer being a sigmoid activated layer, which will output 784-pixel intensity values (28 x 28 reconstructed image).


Figure 2: Decoder structure to reconstruct a digit1

An interesting thing about this higher capsule layer is that each dimension of its output is interpretable. Taking the example from the paper on the MNIST dataset, each dimension of the 16-dimensional activation vector can be interpreted and signifies certain characteristics of the object. If we modify one of the 16 dimensions we can change the scale and thickness of the reconstructed digit; similarly, another dimension can represent stroke thickness, another width and translation, and so on.


Figure 3: Dimension perturbations1

Let’s look at how we can implement3 it using Keras* with TensorFlow* backend. You start by importing all the required libraries:

from keras import layers, models, optimizers
from keras.layers import Input, Conv2D, Dense
from keras.layers import Reshape, Layer, Lambda
from keras.models import Model
from keras.utils import to_categorical
from keras import initializers
from keras.optimizers import Adam
from keras.datasets import mnist
from keras import backend as K

import numpy as np
import tensorflow as tf

First, let’s define the Squash function:

def squash(output_vector, axis=-1):
    norm = tf.reduce_sum(tf.square(output_vector), axis, keep_dims=True)
    return output_vector * norm / ((1 + norm) * tf.sqrt(norm + 1.0e-10))

After defining the Squash function, we can define the masking layer:

class MaskingLayer(Layer):
    def call(self, inputs, **kwargs):
        input, mask = inputs
        return K.batch_dot(input, mask, 1)

    def compute_output_shape(self, input_shape):
        *_, output_shape = input_shape[0]
        return (None, output_shape)

Now, let’s define the primary Capsule function:

def PrimaryCapsule(n_vector, n_channel, n_kernel_size, n_stride, padding='valid'):
    def builder(inputs):
        output = Conv2D(filters=n_vector * n_channel, kernel_size=n_kernel_size, strides=n_stride, padding=padding)(inputs)
        output = Reshape( target_shape=[-1, n_vector], name='primary_capsule_reshape')(output)
        return Lambda(squash, name='primary_capsule_squash')(output)
    return builder

After that, let’s write the capsule layer class:

class CapsuleLayer(Layer):
    def __init__(self, n_capsule, n_vec, n_routing, **kwargs):
        super(CapsuleLayer, self).__init__(**kwargs)
        self.n_capsule = n_capsule
        self.n_vector = n_vec
        self.n_routing = n_routing
        self.kernel_initializer = initializers.get('he_normal')
        self.bias_initializer = initializers.get('zeros')

    def build(self, input_shape): # input_shape is a 4D tensor
        _, self.input_n_capsule, self.input_n_vector, *_ = input_shape
        self.W = self.add_weight(shape=[self.input_n_capsule, self.n_capsule, self.input_n_vector, self.n_vector], initializer=self.kernel_initializer, name='W')
        self.bias = self.add_weight(shape=[1, self.input_n_capsule, self.n_capsule, 1, 1], initializer=self.bias_initializer, name='bias', trainable=False)
        self.built = True

    def call(self, inputs, training=None):
        input_expand = tf.expand_dims(tf.expand_dims(inputs, 2), 2)
        input_tiled = tf.tile(input_expand, [1, 1, self.n_capsule, 1, 1])
        input_hat = tf.scan(lambda ac, x: K.batch_dot(x, self.W, [3, 2]), elems=input_tiled, initializer=K.zeros( [self.input_n_capsule, self.n_capsule, 1, self.n_vector]))
        for i in range(self.n_routing): # routing
            c = tf.nn.softmax(self.bias, dim=2)
            outputs = squash(tf.reduce_sum( c * input_hat, axis=1, keep_dims=True))
            if i != self.n_routing - 1:
                self.bias += tf.reduce_sum(input_hat * outputs, axis=-1, keep_dims=True)
        return tf.reshape(outputs, [-1, self.n_capsule, self.n_vector])

    def compute_output_shape(self, input_shape):
        # output current layer capsules
        return (None, self.n_capsule, self.n_vector)

The class below will compute the length of the capsule:

class LengthLayer(Layer):
    def call(self, inputs, **kwargs):
        return tf.sqrt(tf.reduce_sum(tf.square(inputs), axis=-1, keep_dims=False))

    def compute_output_shape(self, input_shape):
        *output_shape, _ = input_shape
        return tuple(output_shape)

The function below will compute the margin loss:

def margin_loss(y_ground_truth, y_prediction):
    _m_plus = 0.9
    _m_minus = 0.1
    _lambda = 0.5
    L = y_ground_truth * tf.square(tf.maximum(0., _m_plus - y_prediction)) + _lambda * ( 1 - y_ground_truth) * tf.square(tf.maximum(0., y_prediction - _m_minus))
    return tf.reduce_mean(tf.reduce_sum(L, axis=1))

After defining the different necessary building blocks of the network we can now preprocess the MNIST dataset input for the network:

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train.astype('float32'))
y_test = to_categorical(y_test.astype('float32'))
X = np.concatenate((x_train, x_test), axis=0)
Y = np.concatenate((y_train, y_test), axis=0)

Below are some variables that will represent the shape of the input, number of output classes, and number of routings:

input_shape = [28, 28, 1]
n_class = 10
n_routing = 3

Now, let’s create the encoder part of the network:

x = Input(shape=input_shape)
conv1 = Conv2D(filters=256, kernel_size=9, strides=1, padding='valid', activation='relu', name='conv1')(x)
primary_capsule = PrimaryCapsule( n_vector=8, n_channel=32, n_kernel_size=9, n_stride=2)(conv1)
digit_capsule = CapsuleLayer( n_capsule=n_class, n_vec=16, n_routing=n_routing, name='digit_capsule')(primary_capsule)
output_capsule = LengthLayer(name='output_capsule')(digit_capsule)

Then let’s create the decoder part of the network:

mask_input = Input(shape=(n_class, ))
mask = MaskingLayer()([digit_capsule, mask_input])  # two inputs
dec = Dense(512, activation='relu')(mask)
dec = Dense(1024, activation='relu')(dec)
dec = Dense(784, activation='sigmoid')(dec)
dec = Reshape(input_shape)(dec)

Now let’s create the entire model and compile it:

model = Model([x, mask_input], [output_capsule, dec])
model.compile(optimizer='adam', loss=[ margin_loss, 'mae' ], metrics=[ margin_loss, 'mae', 'accuracy'])

To view the layers and overall architecture of the entire model, we can use this command: model.summary()

Finally, we can train the model for three epochs and find out how it will perform:

model.fit([X, Y], [Y, X], batch_size=128, epochs=3, validation_split=0.2)

After training the model for only three epochs, the accuracy on the MNIST training set was 0.9914 and 0.9919 on the validation set; that is, about 99 percent for both the training and validation sets.
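As a quick sanity check on the trained model, you can run prediction on a few test images and compare the capsule lengths against the labels. A minimal sketch, assuming the variables defined above (note that, as during training, the decoder mask here is built from the ground-truth labels):

lengths, reconstructions = model.predict([x_test[:10], y_test[:10]])
predicted_classes = np.argmax(lengths, axis=1)   # the longest capsule vector wins
true_classes = np.argmax(y_test[:10], axis=1)
print(predicted_classes)
print(true_classes)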

For the above implementation, Intel® AI DevCloud was used to train the network. Intel AI DevCloud is available free of charge for academic and personal research purposes; access can be requested here: https://software.intel.com/en-us/ai-academy/tools/devcloud.

In this way, you can implement the capsule network using Keras and TensorFlow backend.

Now let’s look at some of the pros and cons of a capsule network.


Pros

  1. Requires less training data
  2. Equivariance preserves positional information of the input object
  3. Routing by agreement is great for overlapping objects
  4. Automatically calculates hierarchy of parts in a given object
  5. Activation vectors are interpretable
  6. Reached high accuracy in MNIST

Cons

  1. Results are not state of the art on difficult datasets like CIFAR10*
  2. Not tested on larger datasets like ImageNet*
  3. Slow training process due to the inner routing loop
  4. Problem of crowding—not being able to distinguish between two identical objects of the same type placed close to one another


References

  1. Dynamic Routing Between Capsules by Sara Sabour, Nicholas Frosst and Geoffrey E Hinton: https://arxiv.org/pdf/1710.09829.pdf
  2. Capsule Networks (CapsNets) – Tutorial created by Aurélien Géron: https://www.youtube.com/watch?v=pPN8d0E3900
  3. The code used above is adapted from the GitHub* repository fengwang/minimal-capsule: https://github.com/fengwang/minimal-capsule

More implementations of CapsNet in popular frameworks:

Intel® Math Kernel Library (Intel® MKL) 2019 System Requirements


Operating System Requirements

The Intel MKL 2019 release supports the IA-32 and Intel® 64 architectures. For a complete explanation of these architecture names please read the following article:

Intel Architecture Platform Terminology for Development Tools

The lists below pertain only to the system requirements necessary to support developing applications with Intel MKL. Please review the hardware and software system requirements of your compiler (gcc*, Microsoft* Visual Studio*, or Intel® Compiler Pro), in the documentation provided with that product, to determine the minimum development system requirements necessary to support your compiler product.

Supported operating systems: 

  • Windows 10 (IA-32 / Intel® 64)
  • Windows 8.1* (IA-32 / Intel® 64)
  • Windows 7* SP1 (IA-32 / Intel® 64)
  • Windows HPC Server 2016 (Intel® 64)
  • Windows HPC Server 2012 (Intel® 64)
  • Windows HPC Server 2008 R2 (Intel® 64) 
  • Red Hat* Enterprise Linux* 6 (IA-32 / Intel® 64)
  • Red Hat* Enterprise Linux* 7 (IA-32 / Intel® 64)
  • Red Hat* Enterprise Linux* 7.5 (IA-32 / Intel® 64)
  • Red Hat Fedora* core 28 (IA-32 / Intel® 64)
  • Red Hat Fedora* core 27 (IA-32 / Intel® 64)
  • SUSE Linux Enterprise Server* 11 
  • SUSE Linux Enterprise Server* 12
  • SUSE Linux Enterprise Server* 15
  • openSUSE* 13.2
  • CentOS 7.1
  • CentOS 7.2
  • Debian* 8 (IA-32 / Intel® 64)
  • Debian* 9 (IA-32 / Intel® 64)
  • Ubuntu* 16.04 LTS (IA-32/Intel® 64)
  • Ubuntu* 17.10 (IA-32/Intel® 64)
  • Ubuntu* 18.04 LTS (IA-32/Intel® 64)
  • WindRiver Linux 8
  • WindRiver Linux 9
  • WindRiver Linux 10
  • Yocto 2.3
  • Yocto 2.4
  • Yocto 2.5
  • Yocto 2.6
  • macOS* 10.13 (Xcode 6.x) and macOS* 10.14 (Xcode 6.x) (Intel® 64)

         Note: Intel® MKL is expected to work on many more Linux distributions as well. Let us know if you have trouble with the distribution you use.

Supported C/C++ and Fortran compilers for Windows*:

  • Intel® Fortran Composer XE 2019 for Windows* OS
  • Intel® Fortran Composer XE 2018 for Windows* OS
  • Intel® Fortran Composer XE 2017 for Windows* OS
  • Intel® Visual Fortran Compiler 19.0 for Windows* OS
  • Intel® Visual Fortran Compiler 18.0 for Windows* OS
  • Intel® Visual Fortran Compiler 17.0 for Windows* OS
  • Intel® C++ Composer XE 2019 for Windows* OS
  • Intel® C++ Composer XE 2018 for Windows* OS
  • Intel® C++ Composer XE 2017 for Windows* OS
  • Intel® C++ Compiler 19.0 for Windows* OS
  • Intel® C++ Compiler 18.0 for Windows* OS
  • Intel® C++ Compiler 17.0 for Windows* OS
  • Microsoft Visual Studio* 2017 - help file and environment integration
  • Microsoft Visual Studio* 2015 - help file and environment integration
  • Microsoft Visual Studio* 2013 - help file and environment integration

Supported C/C++ and Fortran compilers for Linux*:

  • Intel® Fortran Composer XE 2019 for Linux* OS
  • Intel® Fortran Composer XE 2018 for Linux* OS
  • Intel® Fortran Composer XE 2017 for Linux* OS
  • Intel® Fortran Compiler 19.0 for Linux* OS
  • Intel® Fortran Compiler 18.0 for Linux* OS
  • Intel® Fortran Compiler 17.0 for Linux* OS
  • Intel® C++ Composer XE 2019 for Linux* OS
  • Intel® C++ Composer XE 2018 for Linux* OS
  • Intel® C++ Composer XE 2017 for Linux* OS
  • Intel® C++ Compiler 19.0 for Linux* OS
  • Intel® C++ Compiler 18.0 for Linux* OS
  • Intel® C++ Compiler 17.0 for Linux* OS
  • GNU Compiler Collection 4.4 and later
  • PGI* Compiler version 2018
  • PGI* Compiler version 2017

Note: Using the latest version of the Intel® Manycore Platform Software Stack (Intel® MPSS) is recommended on the Intel® MIC Architecture. It is available from the Intel® Software Development Products Registration Center at http://registrationcenter.intel.com as part of your Intel® Parallel Studio XE for Linux* registration.

Supported C/C++ and Fortran compilers for OS X*:

  • Intel® Fortran Compiler 19.0 for macOS *
  • Intel® Fortran Compiler 18.0 for macOS *
  • Intel® Fortran Compiler 17.0 for macOS *
  • Intel® C++ Compiler 19.0 for macOS *
  • Intel® C++ Compiler 18.0 for macOS *
  • Intel® C++ Compiler 17.0 for macOS *
  • CLANG/LLVM Compiler 9.0
  • CLANG/LLVM Compiler 10.0

MPI implementations that Intel® MKL for Windows* OS has been validated against:

  • Intel® MPI Library Version 2019 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2018 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2017 (Intel® 64) (http://www.intel.com/go/mpi)
  • MPICH version 3.3  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MPICH version 2.14  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MS MPI, CCE or HPC 2012 on Intel® 64 (http://www.microsoft.com/downloads)

MPI implementations that Intel® MKL for Linux* OS has been validated against:

  • Intel® MPI Library Version 2019 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2018 (Intel® 64) (http://www.intel.com/go/mpi)
  • Intel® MPI Library Version 2017 (Intel® 64) (http://www.intel.com/go/mpi)
  • MPICH version 3.3  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MPICH version 3.1  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • MPICH version 2.14  (http://www-unix.mcs.anl.gov/mpi/mpich)
  • Open MPI 1.8.x (Intel® 64) (http://www.open-mpi.org)

Note: Usage of MPI and linking instructions can be found in the Intel Math Kernel Library Developer Guide

Other tools supported for use with example source code:

  • uBLAS examples: Boost C++ library, version 1.x.x
  • JAVA examples: J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc.

Note: Parts of Intel® MKL have FORTRAN interfaces and data structures, while other parts have C interfaces and C data structures. The Intel Math Kernel Library Developer Guide  contains advice on how to link to Intel® MKL with different compilers and from different programming languages.

Deprecation Notices:

  • Dropped support for all IA-32 MPI implementations
  • Red Hat Enterprise Linux* 5.0 support has been dropped
  • Windows XP* is no longer supported
  • Windows Server 2003* and Windows Vista* are not supported

 

Reducing False Negatives in the Invasive Ductal Carcinoma Classifier


Intel AI DevJam Demo GUI

Abstract

The Intel AI DevJam Demo project, Reducing False Negatives in the Invasive Ductal Carcinoma (IDC) Classifier, provides the source codes and tutorials for setting up the project that will be demonstrated at Intel AI DevJam at ICML (International Conference on Machine Learning) in Sweden, July 2018.

The Intel® AI DevJam Demo GUI uses a Windows* application to communicate with a facial recognition classifier and an option of two classifiers trained to detect Invasive Ductal Carcinoma (Breast cancer) in histology images. The project combines the Invasive Ductal Carcinoma (IDC) Classification Using Computer Vision & IoT and TASS Movidius Facenet Classifier projects, along with some new improvements.

The goal of this project is to intentionally try to trick the model by using very similar, but opposite class, images from a small set of testing data that I believe humans may have difficulty telling apart. A larger set of testing data is provided to compare how the model works on larger datasets.

If we find false negatives we will attempt to find a way to reduce them, providing a safety net for incorrect classifications that could mean the difference between life and death.

IoT Connectivity

IoT connectivity for the project is provided by the IoT JumpWay. The IoT JumpWay is an IoT communication platform as a service (PaaS) with a social network frontend. IoT JumpWay developers will soon be able to share projects, photos, videos, and events. Use of the IoT JumpWay is completely free; you can find out more on the Developer Program page.

Checklist

Make sure you have completed the following steps before continuing to configure the Universal Windows Application, as the server components need to be running and waiting for queries or commands before you can complete this tutorial.

Software Requirements

Setting Up The Universal Windows Application

IDC Classifier Universal Windows Application

You should have already downloaded the repository source code when you completed the IDC classification server/API setup. Navigate to IoT-JumpWay-Microsoft-Examples/Intel-AI-DevJam-IDC and double click the IDC-Classifier-GUI.sln file to open the solution in Visual Studio 2017.

You need the application to connect to the server you set up while following the IDC Classifier tutorial. Inside the IDC classification GUI Classes folder you will find a file called GlobalData.cs; in it you will find settings that you can use to connect to your IDC Classifier Server. When you start your IDC Classifier Server, the output will show you the IP/FQDN and port number.

class GlobalData
{
    public string protocol = "http://";
    public string ip = "YOUR SERVER IP";
    public int port = 8080;
    public string endpoint = "/api/TASS/infer";
    public string endpointIDC = "/api/IDC/infer";
    public string dataFolder = "Data\\1";
    //public string dataFolder = "Data\\2";

    public double threshold = 0.80;
    public int expectedCount = 6;
}

Testing Data

Inside the GUI project folder you will find a folder called Data and, inside that, two folders of data: 1 and 2. Currently the first folder holds 12 specifically chosen unseen histology images. The images chosen were examples that I believed to be very similar to examples in the opposite class. The purpose of choosing these images was to see how the network reacts to very similar but opposite-class images. You can flip between the two different-size datasets, 1 and 2, or point to your own, via the dataFolder setting in Classes/GlobalData.cs.

To add your own data, remove the images in the Data folder and add your own dataset to the folder. Once you have added them, remove any unused images from the directory inside Visual Studio, then add the new images to the project by right-clicking the Data folder, clicking Add, and selecting your new dataset.

IDC Classifier Evaluation Results

The results from the IDC Classifier Evaluation were as shown below:

INFO:tensorflow:Global Step 73: Streaming Accuracy: 0.8935 (0.61 sec/step)
INFO:tensorflow:Global Step 74: Streaming Accuracy: 0.8942 (0.67 sec/step)
INFO:tensorflow:Final Streaming Accuracy: 0.8941

Training Accuracy


Testing The Universal Windows Application


For this to work it is necessary that you have added your photo to the Known Data folder of the IDC Classifier.

Run the app; as it starts up it will ask you for camera and microphone permissions (the microphone is currently unused at this stage of development). Once you accept the permissions, the camera should start up and display on the screen.


There is a known bug related to this part of the application which uses code from the Windows Universal Samples: Basic camera app sample. You may need to restart the application a number of times before your camera loads.

Click the camera button on the right-hand side to authenticate yourself. The application will take a photo of you and send it to the server for classification. You should now be authenticated on the system. To add other people with permission to use the system, simply add their photos to the Known Data folder in the IDC Classifier.


Click the Classify All Images button to begin the classification process. The application will loop through the data and send it to the server for classification. As each image is processed the application will notify you by voice; once the application finishes, it will notify you of any positive identifications.

Inception V3 Results

These results are from using the AI DevJam Inception V3 IDC Classifier. As mentioned, the images were purposely chosen to challenge the model with false negatives and false positives. Ideally there would be zero of either, but the less harmful misclassification is a false positive, as it is better to incorrectly predict non-cancerous tissue as cancerous than to predict cancerous tissue as non-cancerous.

The application has been set up to determine whether a test classification is correct by checking for a string in the file name and comparing it against the prediction. In this application's case, it checks negative predictions for the string class0 in the file name, and positive predictions for class1; this helps to determine whether they are false negatives or false positives.
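To illustrate that bookkeeping, here is a small Python sketch of the idea (the application itself implements this in C# inside the GUI project; the function and names below are hypothetical and only mirror the logic described above):

def categorize(filename, prediction):
    # prediction: 1 = IDC detected, 0 = IDC not detected
    actually_positive = "class1" in filename
    if prediction == 1:
        return "TRUE POSITIVE" if actually_positive else "FALSE POSITIVE"
    return "FALSE NEGATIVE" if actually_positive else "TRUE NEGATIVE"

print(categorize("8975_idx5_x1001_y1451_class1.png", prediction=0))  # FALSE NEGATIVE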


The logs can be viewed in the Output window of Visual Studio. It displays the info for each image processed, the prediction, whether it is a false positive/false negative or correct/incorrect, and whether the classifier is unsure due to low confidence. What I hoped not to see, but expected to see, were false negatives, as I had chosen a testing dataset that I believed could trick the classification model.

The console logs and info from my testing, shown below, indicate that the IDC Classifier identified one of the positive examples as negative, and that the application caught three more would-be false negatives as unsure, flagging them for further inspection due to low confidence.

2 true positives, 0 false positives, 1 false negatives, 3 unsure, 6 true negatives, 1 incorrect examples classified, 0.44 accuracy, 1 precision, 0.67 recall, 0.8 fscore
 
- 2 true positives, 0 false positives, 1 false negatives, 6 true negatives
- 3 unsure
- 1 incorrect examples classified
- 0.44 accuracy
- 1 precision
- 0.67 recall
- 0.8 fscore

The application will classify the image as unsure if it is a positive or negative classification but has a confidence lower than the threshold set in Classes/GlobalData.cs.

False Negatives

8975_idx5_x1001_y1451_class1.png
{"Confidence": "0.9526", "ResponseMessage": "IDC Not Detected With Confidence 0.9526", "Response": "OK", "Results": 0}
FALSE NEGATIVE: IDC incorrectly not detected in image 4 8975_idx5_x1001_y1451_class1.png with 0.9526 confidence.
Processed image 4

Unsure

By detecting whether the classifier is unsure, we can set aside data for further investigation. In this example, if a classification is positive or negative but its confidence is lower than the threshold, it is removed from the calculations and identified as unsure for further investigation. This allowed the application to catch three would-be false negatives and mark them as unsure.

In a real-world example the use of a threshold would make the application safer. By catching three of the would-be false negatives, we have helped the classifier separate out the unsure data.

The application is able to recognize that it is not very confident about some of the images it classified. This is important because the incorrectly classified images were false negatives; if we had not caught these, three of the classifications would have shown no cancer when there actually was cancer.

The application allows a doctor, for example, to manually check images that the classifier has classified but is unsure about.

Below are the unsure classifications made by the application using the Data\1 dataFolder setting in Classes/GlobalData.cs:

8975_idx5_x1051_y1251_class1.png
{"Confidence": "0.807", "ResponseMessage": "IDC Not Detected With Confidence 0.807", "Response": "OK", "Results": 0}
UNSURE: IDC detected in image 5 8975_idx5_x1051_y1251_class1.png with 0.807 confidence.
Processed image 5


In the image above you can see the images that were incorrectly classified, along with images from the opposing class that I believed might be able to trick the IDC Classifier. I was able to find similar-looking images from the negative class, which suggests the classifier may have confused two similar images from two separate classes.


This was also tested using the IDC Classifier Test Program with the same outcome. It seems that, similar to facial recognition, Inception V3 gets confused by similar images; this can be confirmed or ruled out by testing on larger datasets.


Things To Try

  • Test on a larger dataset
  • Train more similar examples of the misidentified images
  • Increase the size of the training images from the dataset to 200px x 200px
  • Use a different model

Testing On A Larger DataSet

The second folder located in the Data folder can be used to test the classifier on 100 images: 50 negative and 50 positive. These images have been randomly selected and may or may not contain confusingly similar images from separate classes.


Data Folder

You need the application to use the larger IDC testing data folder. You can achieve this by editing the Classes/GlobalData.cs file, uncommenting the Data\2 folder, and commenting out the Data\1 folder. Then change expectedCount to 50.

class GlobalData
{
    public string protocol = "http://";
    public string ip = "YOUR SERVER IP";
    public int port = 8080;
    public string endpoint = "/api/TASS/infer";
    public string endpointIDC = "/api/IDC/infer";
    //public string dataFolder = "Data\\1";
    public string dataFolder = "Data\\2";

    public double threshold = 0.80;
    public int expectedCount = 50;
}

This will start the application using the larger dataset the next time you run the application. The process is the same as when we tested the smaller dataset. Click on the Classify All Images button and the program will start to process the images.

Inception V3 Results

Below you can see the end of the console output from testing using folder Data\2 with 50 IDC positive and 50 IDC negative images.


27 true positives, 0 false positives, 7 false negatives, 16 unsure, 50 true negatives, 7 incorrect examples classified, 0.03 accuracy, 1 precision, 0.79 recall, 0.89 fscore
 
- 27 true positives, 0 false positives, 7 false negatives, 50 true negatives
- 16 unsure
- 7 incorrect examples classified
- 0.03 accuracy
- 1 precision
- 0.79 recall
- 0.89 fscore

The above shows that on a dataset of 100 images there were 7 incorrect classifications all of which are false negatives with a confidence of 0.90 or higher. The application marked 16 images as unsure and correctly identified 27 of the 50 IDC positive images.

False Negatives


It appears that all of the false negatives have at least two things in common.

  • They all have distinctive areas of white
  • They all have at least one very similar training example in the opposite class
8975_idx5_x1001_y1451_class1.png
{"Confidence": "0.9526", "ResponseMessage": "IDC Not Detected With Confidence 0.9526", "Response": "OK", "Results": 0}
FALSE NEGATIVE: IDC incorrectly not detected in image 20 8975_idx5_x1001_y1451_class1.png with 0.9526 confidence.
Processed image 20

Unsure


Our unsure classifications allow us to catch classifications that the model did not have high confidence in; this could help save lives by catching false negatives.

It appears that all of the unsure classifications have at least two things in common.

  • They mostly all have distinctive areas of white
  • The small amount of purple is not detected in images with pink backgrounds
8975_idx5_x1001_y1301_class1.png
{"Confidence": "0.8223", "ResponseMessage": "IDC Detected With Confidence 0.8223", "Response": "OK", "Results": 1}
UNSURE: IDC detected in image 17 8975_idx5_x1001_y1301_class1.png with 0.8223 confidence.
Processed image 17

Things To Try

We can try a couple of things to help enhance the application's capabilities.

  • Pre detect and remove images with large amounts of white for manual examination
  • Check negative classifications to see if they do actually include purple
  • Train more similar examples of the misidentified images
  • Increase the size of the training images from the dataset to 299px x 299px 
  • Use a different model

Get Involved

This project is open sourced under the MIT license. All contributions are welcome; you can choose from any of the features listed below or submit your own features for review via a pull request.

Features List

Below you will find any features that will be implemented. Pull requests are welcome.

Bugs/Issues

Please feel free to create issues for bugs and general problems you come across whilst using this or any other IoT JumpWay Microsoft repository: IoT-JumpWay-Microsoft-Examples GitHub Issues

Known Bugs

Below you will find all known bugs in the application. Each bug has a corresponding issue in the repo issues area. Pull requests are welcome.

Contributors

Adam Milton-Barker, Intel® Software Innovator

Improving Cycle-GAN using Intel® AI DevCloud


Introduction

In this article, we will identify some scope for optimization in Cycle-GAN for unpaired image-to-image translation and come up with a new architecture. We will also dive deeper into using Intel® AI DevCloud to further speed up the training process by using the power of multiple cluster nodes.

Image-to-image translation involves transferring the characteristics of an image from one domain to another. For learning such mapping, the training dataset of images can be either paired or unpaired. Paired images imply that each example is in the form of a pair, having an image from both source and target domain; the dataset is said to be unpaired when there is no one-to-one correspondence between training images from input domain X and target domain Y.

Figure 1. Paired versus unpaired Image dataset. The paired image dataset contains examples such that for every ith example, there is an image pair xi and yi. Here, xi and yi are a sketch and its corresponding actual photograph, respectively. The unpaired image dataset contains a set of images separately for actual photographs (X) and paintings (Y). Source: Cycle-GAN Paper

Previous works such as pix2pix* have offered image-to-image translation using paired training data; for example, converting a photograph from daytime to nighttime. For this, paired data is obtained by taking pictures of the same location in the daytime as well as at night.

Figure 2. Some applications of pix2pix* trained on a paired image dataset; that is, when it is possible to obtain images of the same subject under different domains. Source: pix2pix paper

However, obtaining paired training data can be difficult and sometimes impossible. For example, for converting a horse into a zebra in an image, it is impossible to obtain a pair of images of a horse and a zebra in exactly the same location and the same posture. This is where unpaired image-to-image translation is desired. Still, it is challenging to convert an image from one domain to another when no paired examples are available. For example, such a system would have to convert the part of the image where the horse is detected but not alter its background, so that one-to-one correspondence exists between the source image and the target image.

Cycle-GAN provides an effective technique for learning mappings from unpaired image data. Some of the applications of using Cycle-GAN are shown below:

Figure 3. Applications of Cycle-GAN. This technique uses an unpaired dataset for training and is still able to effectively learn to translate images from one domain to another. Source: Cycle-GAN Paper

Cycle-GAN has applications in domains where a paired image dataset is not available. Even when paired images can be obtained, it is easier to collect images from each domain separately than to selectively obtain paired images. Also, a dataset of unpaired images can be built much larger and faster. Cycle-GAN is discussed further in the next section.

Background

Generative adversarial network

A generative adversarial network (GAN) is a framework for estimating generative models. As an example, a generative model can generate the next likely video frame based on previous frames.

Figure 4. Generative adversarial networks for image generation—the generator draws samples from latent random variables and the discriminator tells whether the sample came from the generator or the real world. Source: Deep Learning for Computer Vision: Generative models and adversarial training (UPC 2016)

A generative adversarial network not only involves a neural network for generating content (the generator), but also a neural network for determining whether the content is real or fake, called the adversarial (discriminator) network. The generator and discriminator are trained simultaneously, each optimizing against the other in a two-player zero-sum game, until the two networks reach an equilibrium point of that game (a Nash equilibrium).
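As a rough illustration of this two-player setup, here is a minimal TensorFlow 1.x-style sketch of a GAN's losses (the tiny dense networks and names below are placeholders, not the networks used in this article):

import tensorflow as tf

def discriminator(x, reuse=False):
    # A stand-in discriminator; a real one would be a convolutional network
    with tf.variable_scope("disc", reuse=reuse):
        return tf.layers.dense(x, 1, activation=tf.sigmoid)

def generator(z):
    with tf.variable_scope("gen"):
        return tf.layers.dense(z, 28 * 28, activation=tf.tanh)

real = tf.placeholder(tf.float32, [None, 28 * 28])
noise = tf.placeholder(tf.float32, [None, 100])
fake = generator(noise)

d_real = discriminator(real)
d_fake = discriminator(fake, reuse=True)

eps = 1e-8
d_loss = -tf.reduce_mean(tf.log(d_real + eps) + tf.log(1.0 - d_fake + eps))  # spot the fakes
g_loss = -tf.reduce_mean(tf.log(d_fake + eps))                               # fool the discriminator

# Each player only updates its own variables
d_vars = [v for v in tf.trainable_variables() if v.name.startswith("disc")]
g_vars = [v for v in tf.trainable_variables() if v.name.startswith("gen")]
d_step = tf.train.AdamOptimizer(2e-4).minimize(d_loss, var_list=d_vars)
g_step = tf.train.AdamOptimizer(2e-4).minimize(g_loss, var_list=g_vars)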

Through the combination of the generator network and the discriminator (adversarial) network, tremendous possibilities emerged for creative tasks that no other method had made possible. Facebook*'s AI research director Yann LeCun referred to the adversarial training of GANs as "the most interesting idea in the last 10 years in ML." However, despite the plethora of creative possibilities GANs bring to AI, one of the weaknesses of early GANs was limited training stability.

Cycle-GAN

The Cycle-GAN architecture was proposed in the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. On why a simple GAN is not sufficient for this problem, Jun-Yan Zhu and his colleagues (2017) suggested:

"With large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, an adversarial loss alone cannot guarantee that the learned function can map an individual input to a desired output."

In other words, vanilla GAN would not have any sense of direction to maintain the correspondence between the source and the target image. In order to provide this sense of direction to the network, the authors introduced the cycle-consistency loss.

Figure 5. Cycle-consistency loss in Cycle-GAN. If an input image A from domain X is transformed into a target image B from domain Y via some generator G, then when image B is translated back to domain X via some generator F, this obtained image should match the input image A. The difference between these two images is defined as the cycle-consistency loss. This loss is similarly applicable to the image from domain Y. Source: Cycle-GAN Paper

This approach requires creating two pairs of generators and discriminators: one for A2B (source to target conversion) and another for B2A (target to source conversion).

Figure 6. A simplified architecture of Cycle-GAN. Considering an example for converting an image of a horse into a zebra, Cycle-GAN requires two generators. The generator A2B converts a horse into a zebra and B2A converts a zebra into a horse. Both train together to ensure one-to-one correspondence between the input horse image and the generated image of the zebra. The two discriminators determine real or fake images for horse and zebra, respectively. Source: Understanding and Implementing CycleGAN in TensorFlow
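In code, the cycle-consistency term amounts to reconstructing each image through the opposite generator and penalizing the difference. A hedged TensorFlow-style sketch (generator_g and generator_f stand for the two mapping networks G: X→Y and F: Y→X, assumed to be defined elsewhere):

import tensorflow as tf

def cycle_consistency_loss(real_x, real_y, generator_g, generator_f, lam=10.0):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x
    cycled_x = generator_f(generator_g(real_x))
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y
    cycled_y = generator_g(generator_f(real_y))
    # L1 difference between the inputs and their reconstructions
    return lam * (tf.reduce_mean(tf.abs(cycled_x - real_x)) +
                  tf.reduce_mean(tf.abs(cycled_y - real_y)))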

Coming with a Better Architecture

I took a special interest in Cycle-GAN due to its impressive results. Initially, my goal was to implement the approach from the paper in the TensorFlow* framework and study the technical aspects in detail.

While implementing it, I noticed that the training process was time consuming and that there was scope for optimization in Cycle-GAN:

  • In order to create a system that learns the mapping from domain A to domain B, Cycle-GAN involves an extra module that learns the mapping from domain B back to domain A. This extra module takes an equal amount of computational resources, so about half of the resources are spent creating a system that has no utility after the training process.

Let us reconsider the purpose for introducing cyclic-loss in Cycle-GAN:

Problem with Vanilla GANs

While generative adversarial networks are found to be performing very well at generating samples or images, in the case of the unpaired image-to-image translation problem, the correspondence between the input image and the target output image is desired.

Figure 7. When converting an image of a horse into a zebra, if the generator creates an image of a zebra, which has no relation with the input horse image, the discriminator will be okay with such an image too.

It turns out that GANs do not force the output to correspond to its input. To address this issue, that is, to preserve the true content/characteristics of the input image in the translated image, the authors of the Cycle-GAN paper introduced cycle-consistency loss.

Optimization in generator network

I questioned whether there was any other way through which this goal could be achieved without having to create a second generator-discriminator pair.

The idea of optimizing an already well-performing architecture came to me from Gandhian Engineering, which talks about reducing the cost of a product through innovation. The core of this approach is to create a product that has more value and yet is accessible to more people; that is, more for less for more. The key principle is that nothing goes unquestioned.

For this, I specifically targeted the problem of converting an image of an apple into an image of an orange. Thus, in this case the goal would be to modify the portion of the input image where the apple is detected, but keep the background intact.

This is a different perspective from that taken in Cycle-GAN, which tries to modify the complete image and makes no assumption that the background will remain intact. Instead, the second discriminator has to learn and enforce this assumption, which costs extra training time.

I figured this goal could be achieved using a single neural network. For this, we essentially need a system that takes images from both domains, A (source) and B (target), and outputs only images from domain B.

Figure 8. Cycle-consistency loss versus deviation loss. By applying deviation loss, only one generator can ensure one-to-one correspondence between source and target image. This eliminates the need for two generator networks.

To ensure that images from domain B do not change, I introduced deviation loss, defined as the difference between the encodings of an image and the output of the generator network. This loss replaces the cycle-consistency loss of the Cycle-GAN architecture. The deviation loss regularizes the training of the translator by directing it to translate only the bare minimum of the encoded domain-A image needed to make it appear like a real domain-B encoding. It also enforces that the spatial features are kept intact throughout the translation.
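One way to read this definition in code is the sketch below (a hedged illustration, not the final implementation; translator is the generator network operating on encodings):

import tensorflow as tf

def deviation_loss(encoding, translator):
    # Difference between an encoding and the translator's output on it.
    # This pushes the translator to change the encoding as little as possible
    # while still making it look like a real domain-B encoding.
    return tf.reduce_mean(tf.abs(translator(encoding) - encoding))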

Optimization in discriminator and encoder-decoder pair

I found another opportunity for optimization in the discriminator of Cycle-GAN, or of convolutional GANs in general.

I started rethinking generative adversarial networks. As mentioned earlier, a GAN is essentially an optimization problem for a two-player zero-sum game, in which the generator tries to fool the discriminator and the discriminator tries to detect fake objects. There is an entire field of research around game theory that has not been applied to GANs, even though a GAN is a game-theoretic problem at its core, and most game-playing strategies involve acquiring as much information as possible about the opponent's thinking. Thus, it makes sense that both competing players share some of their perspective on the game.

So, the generator and discriminator should share as much information as possible while maintaining enough exclusiveness to keep the competition alive.

This led to another modification in the way the generator and discriminator receive their inputs. In the case of Cycle-GAN, the discriminator takes the whole image and predicts whether the image looks real or fake. We can consider this discriminator to be working in two parts: the first part encodes the input image and the second part predicts from the encoding.

Figure 9. The discriminator in Cycle-GAN. The second half of discriminator (Conv-2) needs feature encodings of the image, which was already available from the output of the translator network; thus, unnecessary upsampling of this encoding from the decoder and again encoding it from the first part of the discriminator (Conv-1) will not just induce an error into the encodings but will also take more training resources.

So, the latter half of the discriminator need not take input from the output of the generator's decoder (a whole image), but it can directly take the translated feature encodings from the translator network as an input.

Figure 10. The discriminator in Cycle-GAN versus the discriminator in Proposed Architecture. Note that the need for the first part of the discriminator is completely eliminated. The generator can provide feature encodings directly to the discriminator.

Due to this change, the decoder part of the generator will be unable to take part in the generator-discriminator optimization. However, it can be optimized separately, along with the encoder, in the same way as an AutoEncoder.

Also, because of this major change of using only one generator-discriminator pair instead of two, some more optimization is possible. In the Cycle-GAN architecture there were two separate encoder-decoder pairs: one for encoding and decoding images from the source domain and the other for images from the target domain.

Since there is only one generator now, a single encoder-decoder pair can encode and decode the images from both domains. Therefore, this pair can even be trained separately, which has its own advantages.

This separate training step can be governed by the cyclic loss, or reconstruction loss, defined as the difference between the original image and the image obtained when it is encoded and then decoded. This is similar to an AutoEncoder, except that the translator (generator) network is sandwiched between the encoder-decoder pair.

For training the discriminator network, the conventional loss for a GAN's discriminator is used. If the discriminator correctly classifies the encodings of the fake image, the translator network is penalized. If the discriminator incorrectly classifies either the real or the fake encodings, the discriminator network is penalized. This loss is kept similar to that of Cycle-GAN, but the structure of the discriminator has changed in the proposed architecture.
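
To make these two terms concrete, one way to write them is sketched below, with E and D the shared encoder and decoder, T the translator, C the discriminator, and x_A, x_B images from the two domains; the notation and the exact form of the norms and expectations are my assumptions and may differ from the actual implementation:

L_{rec} = \lVert D(E(x_A)) - x_A \rVert_1 + \lVert D(E(x_B)) - x_B \rVert_1

L_{disc} = -\,\mathbb{E}\big[\log C(E(x_B))\big] - \mathbb{E}\big[\log\big(1 - C(T(E(x_A)))\big)\big]

The translator is then penalized through the usual generator term, L_{gen} = -\,\mathbb{E}\big[\log C(T(E(x_A)))\big], combined with the deviation loss described earlier.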

Optimization for Intel® AI DevCloud

When I started working on implementing Cycle-GAN, I soon realized that I lacked the computational resources to do so, as generative adversarial networks, and Cycle-GAN in particular, are very sensitive to initialization and to choosing just the right hyperparameters. Training such a big network on a local system using only a CPU is not practical.

Intel AI DevCloud works especially well for testing research ideas. This is because it can independently perform computations on multiple nodes of the cluster. Thus, several ideas can be tested simultaneously without waiting for others to complete execution.

To utilize multiple nodes of the cluster for higher performance, I created several versions of the implementation to find the right set of hyperparameters. For example, I created one job with a learning rate of 0.005, another with a learning rate of 0.01, another with 0.02, and so on. If three jobs are submitted simultaneously, this effectively speeds up the search by 3x compared to running each version sequentially. This technique is very general and can be used for training any model on Intel AI DevCloud.

For this optimized architecture specifically, further possibilities emerge to speed up the training process. The architecture consists of three main modules:

  1. Encoder-decoder pair
  2. Translator network (generator)
  3. Discriminator network

I observed that each of these modules can be trained on a separate compute node. The only catch is that the inputs of the translator and discriminator networks depend on the output of the encoder, which itself needs to be trained. In addition, the discriminator network's input depends on the translator's output, and the translator's loss depends on the discriminator's output. Thus, if these three networks train on separate compute nodes, they must periodically share their updated weights with the other networks. Since all the submitted jobs use the same storage area, I chose to exchange the weights at the end of each epoch. Three separate checkpoints are created, one by each job; the translator and discriminator jobs reload the encoder-decoder pair's weights at the end of each epoch and train only their own network's weights. That is, the translator job trains only the translator network but refreshes the encoder-decoder pair and discriminator network every epoch, using them only for inference. Similarly, the discriminator job uses the other two networks, with periodically updated weights, for inference, while only the discriminator network is trained.

Therefore, this technique can further speed up the training of a single implementation by up to 3x. Combined with submitting multiple jobs for different implementations, three different implementations can yield up to a 9x speed-up.

Proposed Approach

The final proposed architecture for unpaired image-to-image translation:

Proposed architecture
Figure 11. Proposed architecture. The aim is to obtain image Fake B from image Input A. Neural networks are represented by solid boundaries and those having the same color represent the same network. The orange-colored circles indicate loss terms.

Explanation: Consider an example for converting the image of an apple to an orange. The goal is to perform this task while keeping the background intact. Forward pass involves downsampling of the input image of an apple, translating it to the encoding of an orange and upsampling it, to produce the image of an orange. Deviation loss ensures that the output of the translator is always the feature encodings of the orange. Thus, the image of an orange is unchanged (including the background), whereas the image of an apple changes in such a way that apples are converted into oranges (since the discriminator network is forcing this conversion), while everything in the background is unchanged (since deviation-loss is resisting this change). The key idea is that the translator network learns to not alter the background and the orange but to convert the apple to an orange.

Experimental Evaluation

The performance of this architecture is compared with the Cycle-GAN implementation on the TensorFlow Framework on Intel AI DevCloud using Intel® Xeon® Gold 6128 processors.

Table 1. Comparison of time taken by Cycle-GAN and proposed architecture.

No. of Epoch(s) | Time by Cycle-GAN | Time by Proposed Architecture | Speed-up
1               | 66.27 minutes     | 32.92 minutes                 | 2.0128x
2               | 132.54 minutes    | 65.84 minutes                 | 2.0130x
3               | 198.81 minutes    | 98.76 minutes                 | 2.0138x
15              | 994.09 minutes    | 493.80 minutes                | 2.0131x

Furthermore, this speed-up is achieved by using only a single compute node. By using multiple nodes on Intel AI DevCloud, the speed-up can be as high as 18x. Also, it is observed that due to using the same neural network for encoding and decoding, and also using a less-complex decoder, the proposed system converges nearly twice as fast; that is, it needs nearly half the number of epochs required by Cycle-GAN to produce the same result.

Results

The neural networks were trained on images of apples and oranges collected from ImageNet* and were directly available from Taesung Park's Cycle-GAN Datasets. The images were 256 x 256 pixels. The training set consisted of 1177 images of class apple and 996 images of class orange.

Input apples converted into oranges in the output
Figure 12. Results. The input images of apples are converted into oranges in the output. Note that the image background has not changed in the process.

Conclusion

Summary of the proposed architecture:

  1. Elimination of a second translator (to translate B to A).
  2. Using the same neural network to encode images from both domains (A or B), and the same neural network to decode images from both domains (A or B).
  3. The discriminator takes downsampled image encoding as input, as opposed to taking the whole image, which was the case with the discriminator in Cycle-GAN.
  4. Use of deviation loss instead of the cycle-consistency loss from Cycle-GAN.

This optimized architecture speeds up the training process by at least 2x; it is also observed that convergence is achieved in fewer epochs than with Cycle-GAN. Also, by using optimization techniques specific to Intel AI DevCloud, up to 18x speed-up can be achieved.


Using Modern C++ Techniques to Enhance Multi-core Optimizations

$
0
0

butterfly enhanced with modern cpp

With multi-core processors now common place in PCs, and core counts continually climbing, software developers must adapt. By learning to tackle potential performance bottlenecks and issues with concurrency, engineers can future-proof their code to seamlessly handle additional cores as they are added to consumer systems.

To help with this effort, Intel software teams have created a graphics toolkit that shows how parallel processing techniques are best applied to eight different graphics filters. The entire source code package contains C++ files, header files, project files, filters, and database files. A DLL overlay with a simple user interface shows the speed at which each filter can be applied, both in a single-core system and when using parallel-processing techniques.

In this white paper, readers learn to use modern C++ techniques to process data in parallel, across cores. By studying the sample code, downloading the application, and learning the techniques, developers will better understand Intel® architecture and multi-core technology.


Getting Started with Parallel Processing

There are countless articles and books written on parallel processing techniques. Ian Foster has a good recap, and multiple papers have been presented at SIGGRAPH, including one by John Owens. A good reference is the 2015 book Programming Models for Parallel Computing, edited by Pavan Balaji. It covers a wide range of parallel programming models, starting with a description of the Message Passing Interface (MPI), the most common parallel programming model for distributed memory computing.

With applications across the computing spectrum, from database processing to image rendering, parallel processing is a key concept for developers to understand. Readers are assumed to have some experience and background in computer science to benefit from the concepts described here. The source code was written for C++, but the concepts extend to other languages, and will be of interest to anyone looking to better understand how to optimize their code for multi-core processors.

An in-depth discussion of the Intel architecture is beyond the scope of this article. Software developers should register at the Intel® Developer Zone and check out the documentation download page for Intel architecture to read some of the following manuals:


Getting Started

The system requirements to explore the Image Filters project solution are minimal. Any multi-core system with Windows® 10 is sufficient.

This project assumes that you have a C++ toolkit, such as Microsoft Visual Studio* with the .NET framework. Freebyte* has a full set of links here if you want to explore different C++ tools. To simply look through code, you may want to use a free code viewer such as NotePad++* or a similar product.

To begin exploring the Image Filters project, follow these steps:

  1. Create an empty directory on your computer with a unique title such as "Image Filters Project" or "Intel C++ DLL Project".
    Use whatever naming strategy you are comfortable with; you can include the year, for example.
  2. Download the .zip file to your new directory.
    The file is not large—about 40 KB. After extracting the files and building the project, you will consume about 65 MB of disk space.
  3. Extract the files in the new directory.
    For example, if using 7-Zip*, right-click on the .zip file and select 7-Zip > Extract here. You can use any file compression software, such as 7-Zip, WinZip*, or WinRAR*.
  4. Open Microsoft Visual Studio or similar C++ tool. These instructions assume that you loaded the most current version of Microsoft Visual Studio.
  5. Open the project by using File > Open > Project/Solution and locating the ImageFilters.sln file.
    The ImageFilters.sln file should appear in the Solution Explorer on the left.
    The ImageFilters solution has two projects:
    a) ImageFiltersWPF — The client application that utilizes the ImageProcessing DLL and shows how to interact with it using C#.
    b) ImageProcessing — The C++ DLL that contains the multi-core image processing algorithms.

    build both projects inside the Solution Explorer
    Figure 1. You must build both projects inside the Solution Explorer.

  6. From the Solution Explorer, select the ImageFiltersWPF project; then hold down the CTRL key and select the ImageProcessing project.
  7. Right-click on one of the highlighted files to pull up an Actions menu and select Build Selection. This starts the compiler for both.
  8. Wait while the system quickly compiles the existing source files into the binary, .DLL, and .EXE files.

The following files display in the project solutions bin directory:

Compiled files in the bin
Figure 2. Compiled files in the bin > debug folder, including the ImageFiltersWPF executable.


Multithreading Technique

By default, applications run on a single processing core of a system. Because all new computing systems feature a CPU with multiple cores and threads, complex calculations can be distributed intelligently across those cores, greatly reducing computation times.

OpenMP* (Open Multi-Processing) is an API, first published in 1997 for Fortran 1.0, that supports multiplatform shared-memory multiprocessing programming in C, C++, and Fortran on most platforms, instruction set architectures, and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

In the case of the C++ DLL, which is where the execution is actually happening, OpenMP involves using compiler directives to execute the filtering routines in parallel. For picture processing, each pixel of the input image has to be processed in order to apply the routine to that image. Parallelism offers an interesting way of optimizing the execution time by spreading out the work across multiple threads, which can work on different areas of the image.

#pragma omp parallel

In each filtering routine in the application, processing is implemented as a loop; the goal is to examine every pixel in the image, one by one. The "#pragma omp parallel for" compiler directive causes the loop iterations to be divided and distributed across the cores.

#pragma omp parallel for if(openMP)			
	for (int i = 0; i < height; ++i) {
		auto offset = i * stride;
		BGRA* p = reinterpret_cast<BGRA*>(inBGR + offset);
		BGRA* q = reinterpret_cast<BGRA*>(outBGR + offset);
		for (int j = 0; j < width; ++j) {
			if (i == 0 || j == 0 || i == height - 1 || j == width - 1)
				q[j] = p[j];	// if conv not possible (near the edges)
			else {
				BYTE R, G, B;
				double red(0), blue(0), green(0);
				// Apply the conv kernel to every applicable 
				// pixel of the image
				for (int jj = 0, dY = -radius; jj < size; jj++, dY++) {
					for (int ii = 0, dX = -radius; ii < size; ii++, dX++) {
						int index = j + dX + dY * width;
						// Multiply each element in the local 
						// neighborhood 
						// of the center pixel by the corresponding
						// element in the convolution kernel
						// For the three colors
						blue += p[index].B * matrix[ii][jj];
						red += p[index].R * matrix[ii][jj];
						green += p[index].G * matrix[ii][jj];
					}
				}
				// Writing the results to the output image
				B = blue;
				R = red;
				G = green;
				q[j] = BGRA{ B,G,R,255 };
			}
		}
	}

Sample code for setting up parallel processing for BoxBlur.cpp

If you follow the comments in the code, "BoxBlur.cpp" is setting up offsets, handling calculations when edge conditions make convolution impossible, and applying the convolution kernel to each element for red, blue, and green colors.

#pragma omp parallel for if(openMP)
		for (int i = 0; i < height; ++i) {
			auto offset = i * stride;
			BGRA* p = reinterpret_cast<BGRA*>(tmpBGR + offset);
			BGRA* q = reinterpret_cast<BGRA*>(outBGR + offset);
			for (int j = 0; j < width; ++j) {
				if (i == 0 || j == 0 || i == height - 1 || j == width - 1)
					q[j] = p[j];	// if conv not possible (near the edges)
				else {
					double _T[2];
					_T[0] = 0; _T[1] = 0;
					// Applying the two Sobel operators (dX dY) to 
					// every available pixel
					for (int jj = 0, dY = -radius; jj < size; jj++, dY++) {
						for (int ii = 0, dX = -radius; ii < size; ii++, dX++) {
							int index = j + dX + dY * width;
							// Multiplicating each pixel in the 
							// neighborhood by the two Sobel 
							// Operators
							// It calculates the vertical and 
							// horizontal derivatives of the image 
							// at a point.
							_T[1] += p[index].G * M[1][ii][jj];
							_T[0] += p[index].G * M[0][ii][jj];
						}
					}
					// Then is calculated the magnitude of the 
					// derivatives
					BYTE a = sqrt((_T[0] * _T[0]) + (_T[1] * _T[1]));
					// Condition for edge detection
					q[j] = a > 0.20 * 255 ? BGRA{ a,a,a,255 } : BGRA{ 0,0,0,255 };
				}
			}
		}

		// Delete the allocated memory for the temporary grayscale image
		delete[] tmpBGR;
	}
	return 0;
}

Parallelism structure for SobelEdgeDetector.cpp

In the second example of "omp parallel for" taken from "SobelEdgeDetector.cpp", similar filtering operations take place, with the edge detector working with grayscale pictures.


Memory Management

In software development, developers must be careful about memory management to avoid serious impacts on application performance. In the case of the Harris corner detector and the Shi-Tomasi corner detector, memory management is crucial for creating the three matrices that store the results Sx, Sy, and Sxy.

// Creating a temporary memory to keep the Grayscale picture
	BYTE* tmpBGR = new BYTE[stride*height * 4];
	if (tmpBGR) {
		// Creating the 3 matrices to store the Sobel results, for each thread
		int max_threads = omp_get_max_threads();
		double *** Ix = new double**[max_threads];
		double *** Iy = new double**[max_threads];
		double *** Ixy = new double**[max_threads];
		for (int i = 0; i < max_threads; i++) {
			Ix[i] = new double*[size_kernel];
			Iy[i] = new double*[size_kernel];
			Ixy[i] = new double*[size_kernel];
			for (int j = 0;j < size_kernel;j++) {
				Ix[i][j] = new double[size_kernel];
				Iy[i][j] = new double[size_kernel];
				Ixy[i][j] = new double[size_kernel];
			}
		}

Setting up temporary memory for the Shi-Tomasi corner detector filter

Allocating such matrices for every pixel of the source image would require considerable memory, and multithreading would probably provide little benefit. In fact, it could even result in slower calculations, due to the overhead of working with a large memory space.

To avoid memory issues and set up a scenario where multithreading makes the calculations faster, these three matrices can be treated as a set, with each available thread owning its own set of matrices. The application can then allocate, outside of the parallel section, as many sets of matrices as there are threads available to the application. To get the maximum number of threads, the function "omp_get_max_threads()" from the "omp.h" header is used. This file is found with the rest of the header files in the ImageProcessing > External Dependencies directory.

As described at the Oracle* support site, this function should not be confused with the similarly named "omp_get_num_threads()". The "max" call returns the maximum number of threads that can be put to use in a parallel region. The "num" call returns the number of threads that are currently active and performing work. Obviously, those are different numbers: in a serial region "omp_get_num_threads" returns 1; in a parallel region it returns the number of threads that are being used.

The call "omp_set_num_threads" sets the maximum number of threads that can be used (equivalent to setting the OMP_NUM_THREADS environment variable).

In the processing loop, each thread accesses its own set of matrices. These sets are stored in a dynamically allocated array of sets. To access the correct set, the index of the current thread is obtained with the function "omp_get_thread_num()". Once the routine is executed, the three matrices are reset to their initial values, so that the next time the same thread executes the routine for another pixel, the matrices are ready for use.
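
The pattern described above can be sketched as follows. This is a simplified illustration rather than the project's actual code; the function and variable names (filterImage, size_kernel) are placeholders:

#include <omp.h>
#include <vector>

void filterImage(int width, int height, int size_kernel)
{
    // Allocate one scratch matrix set per thread, outside the parallel region.
    int max_threads = omp_get_max_threads();
    std::vector<std::vector<double>> Ix(max_threads), Iy(max_threads), Ixy(max_threads);
    for (int t = 0; t < max_threads; ++t) {
        Ix[t].assign(size_kernel * size_kernel, 0.0);
        Iy[t].assign(size_kernel * size_kernel, 0.0);
        Ixy[t].assign(size_kernel * size_kernel, 0.0);
    }

    #pragma omp parallel for
    for (int i = 0; i < height; ++i) {
        int t = omp_get_thread_num();   // index of this thread's scratch set
        for (int j = 0; j < width; ++j) {
            // ... fill Ix[t], Iy[t], and Ixy[t] for the window around (i, j),
            // compute the corner score, and then reset the three matrices to
            // zero so the next pixel handled by this thread starts clean.
        }
    }
}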


Principle of Convolution

Image filtering is a good showcase for multi-core processing because it involves the principle of convolution. Convolution is a mathematical operation that accumulates effects; think of it as starting with two functions, such as f and g, and producing a third function that represents the amount of overlap as one function is shifted over the other.

In image processing, convolution is the process of adding each element of the image to its local neighbors, weighted by a kernel. This operation can be used for blurring, sharpening, embossing, edge detection, and more. There are numerous resources available with a quick search, such as detailed discussions of kernels and image processing.

In the case of image filtering, convolution works with a kernel, which is a matrix of numbers. This kernel is then applied to every pixel of the image, with the center element of the convolution kernel placed over the source pixel (Px). The source pixel is then replaced by the weighted sum of itself and nearby pixels. Multithreading helps filter the image faster by breaking the process into pieces.

Convolution using weighting in 3 x 3 kernel
Figure 3. Convolution using weighting in a 3 x 3 kernel (source: GNU Image Manipulation Program).

In this example, the convolution kernel is a 3 x 3 matrix represented by the green box. The source pixel is 50, shown in red at the center of the 3 x 3 matrix. All local neighbors, or nearby pixels, are the pixels directly within the green square. The larger the kernel, the larger the number of neighbors.

In this case, the only weight different from zero is the second element of the first row, represented by a 1. All other elements in the 3 x 3 matrix are 0. The operation multiplies each neighboring pixel by its corresponding weight, so every term vanishes except the single pixel with value 42 that sits under the weight 1. The new source pixel value is therefore 42 x 1 = 42; the pixel just above the source pixel is the one selected by the weight 1 of the convolution kernel.

If you imagine each weighting as a fraction rather than zero, you can picture how images could be blurred by analyzing and processing each surrounding pixel.
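
The weighted-sum idea can be shown with a small, self-contained snippet. This is an illustrative sketch, not code from the project; only the center value 50 and the value 42 above it come from the example, and the remaining neighborhood values are arbitrary placeholders:

#include <cstdio>

int main()
{
    // 3 x 3 neighborhood around the source pixel (value 50 at the center).
    // The pixel directly above the source pixel is 42; the other values are made up.
    double neighborhood[3][3] = { { 40, 42, 46 },
                                  { 46, 50, 55 },
                                  { 52, 56, 58 } };
    // Convolution kernel from the example: only the second element of the first row is 1.
    double kernel[3][3] = { { 0, 1, 0 },
                            { 0, 0, 0 },
                            { 0, 0, 0 } };
    double result = 0.0;
    for (int y = 0; y < 3; ++y)
        for (int x = 0; x < 3; ++x)
            result += neighborhood[y][x] * kernel[y][x];   // weighted sum
    std::printf("New source pixel value: %.0f\n", result); // prints 42
    return 0;
}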

Filtering Techniques

To see the result of a filtering technique, you'll have to build the project as described in the "Getting Started" section. Then double-click on the Image Filters > ImageFiltersWPF > Bin > Debug > ImageFiltersWPF.exe file.

ImageFilters W P F executable main window
Figure 4. ImageFiltersWPF executable main window. Use the small gray box at the top-right corner of the screen to locate directories with images you want to experiment with.

The interface is very simple. You can select images on your system using the directory search feature in the gray box in the upper-right corner. Use the "Stretch" button to make sure an image completely fills the graphical user interface (GUI). Select an image, then apply a filter. Watch the "Time in seconds" calculations at the bottom-left of the interface to see how long a filter would take to apply in a multi-core system versus a system with a single core.

There are eight filters in all; each alters the original image, but some filters create more dramatic changes.

Box blur filter

A box blur is a spatial-domain linear filter in which each pixel in the resulting image has a value equal to the average value of its neighboring pixels in the input image. Because it uses equal weights, it can be implemented with a simple accumulation algorithm and is generally a fast way to achieve a blur. The name refers to the box-shaped kernel of equal weights.

The weights of the convolution kernel for the box blur are all the same. For a 3 x 3 kernel, there are nine elements in total.

Formula to calculate the elements

The weight of every element is calculated so that the sum of all the elements is 1.
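
Spelled out, for an N x N box kernel this normalization means each weight is:

w_{i,j} = \frac{1}{N^2}

so every element of a 3 x 3 box kernel is 1/9. This is the standard uniform normalization and is a reconstruction of the formula shown in the missing image, not a value taken from the sample code.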

First image is vivid, the other is blurred
Figure 5. The original image on the left is vivid, with good detail, while the image on the right has had the Box Blur effect applied.

Using the app, a single-core run took 0.1375 seconds to apply the box blur, while a multi-core run, in this case on a system with an Intel® Core™ i7-4220 processor, took 0.004 seconds.

Let's look in depth at what is going on in BoxBlur.cpp to understand the multithreading principles.

include "stdafx.h"
#include <fstream>
#include "omp.h"

using namespace std;

extern "C" __declspec(dllexport) int __stdcall BoxBlur(BYTE* inBGR, BYTE* outBGR, int stride, int width, int height, KVP* arr, int nArr)
{
	// Pack the following structure on one-byte boundaries: 
	// smallest possible alignment
	// This allows us to use the minimal memory space for this type: 
	// exact fit - no padding 
#pragma pack(push, 1)
	struct BGRA {
		BYTE B, G, R, A;
	};

Beginning of the BoxBlur.cpp file

First, the file includes the "stdafx.h" precompiled header and <fstream>, followed by "omp.h", the header file that brings in the OpenMP declarations.

The BoxBlur function is then declared with extern "C" and __declspec(dllexport) so that it is exported from the DLL with C linkage. The rest of the C++ file is devoted to the multi-core functionality. First, using "#pragma pack(push, 1)", the file defines a BGRA (blue green red alpha) color structure packed on one-byte boundaries, the smallest possible alignment.

Next, the file uses "#pragma pack(pop)" to restore the default packing mode, defines the Boolean flag indicating whether multiple cores have been detected, sets up the convolution kernel, and allocates memory.

Finally, if there are multiple cores (openMP = true), the file uses "#pragma omp parallel for if (openMP)". The code determines offsets and casts, and handles situations at the edges where convolution is not possible. Results are written to the output image, and the memory allocated for the convolution kernel is freed. There are similar sections of code in each of the filters.

Gaussian blur filter

Gaussian blur is the result of blurring an image with a Gaussian kernel to reduce image noise and detail. It is similar to box blur filtering: the center element of the Gaussian kernel is placed over each image pixel, the values in the original image are multiplied by the overlapping kernel weights, and the resulting products are added up; that sum becomes the value of the destination pixel.

The weights of the elements of a Gaussian matrix N x N are calculated by the following:

Gaussian matrix

Here x and y are the coordinates of the element in the convolution kernel. The top-left corner element is at the coordinates (0, 0), and the bottom-right at the coordinates (N-1, N-1).

For the same reason as the box blur, the sum of all the elements has to be 1. Thus, at the end, each element of the kernel is divided by the total sum of the kernel's weights.
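
A common way to write this, reconstructing the missing formula from the description above (c = (N-1)/2 is the center coordinate and sigma the standard deviation; the sample's exact constants may differ):

w(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(x - c)^2 + (y - c)^2}{2\sigma^2}\right)

followed by normalizing each weight by the sum of all the weights so that the kernel sums to 1.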

Threshold filter

The threshold routine is the only technique that does not use the convolution principle. Threshold filters examine each value of the input dataset and change all values that do not meet the boundary condition. The result is that, if the luminance of a pixel is smaller than the threshold, the pixel is turned black; otherwise, it remains unchanged.
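
A minimal sketch of such a per-pixel test is shown below. It is illustrative only; the luminance formula and the way the threshold is supplied are assumptions, not the sample's exact code:

// Turn a pixel black if its luminance falls below the threshold; otherwise keep it.
struct BGRA { unsigned char B, G, R, A; };

static BGRA ApplyThreshold(BGRA p, double threshold)
{
    // Rec. 601 luma approximation computed from the R, G, B components.
    double luminance = 0.299 * p.R + 0.587 * p.G + 0.114 * p.B;
    if (luminance < threshold)
        return BGRA{ 0, 0, 0, 255 };   // below threshold: black
    return p;                          // at or above threshold: unchanged
}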

Threshold filtering example
Figure 6. Threshold filtering example.

Sobel edge detection

The Sobel operator is an edge detector that relies on two convolutions using two different kernels. These two kernels calculate the horizontal and vertical derivatives of the image at a point, which are used to detect the edges inside a picture. Applying these kernels to a pixel provides a score indicating whether or not the pixel can be considered part of an edge: if the score is greater than a given threshold, the pixel is treated as part of an edge.

Edge detection relies on two different kernels
Figure 7. Sobel edge detection relies on two different kernels.

This means that for each pixel there are two convolution results, Gx and Gy. Treating them as the components of a 2D vector, the magnitude G is calculated as follows:

formula to calculate the magnitude G
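
Reconstructing the missing formula from the description above:

G = \sqrt{G_x^2 + G_y^2}

A pixel is then marked as an edge when G exceeds the chosen threshold, as in the SobelEdgeDetector.cpp snippet shown earlier.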

Laplacian edge detector

This filtering technique is quite similar to the box blur and the Gaussian blur. It relies on a single convolution, using a kernel such as the one below, to detect edges within a picture. A threshold can then be applied to the result so that the visual output is smoother and more accurate.

Laplacian Edge D
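
The kernel image is not reproduced here; one commonly used 3 x 3 Laplacian kernel, which may differ from the exact kernel used in the sample, is:

\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}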

Laplacian of Gaussian

The Laplacian edge detector is particularly sensitive to noise, so to get better results, we can apply a Gaussian blur to the whole image before applying the Laplacian filter. This technique is named "Laplacian of Gaussian".

Harris corner detector

Convolutions are also used for the Harris corner detector and the Shi-Tomasi corner detector, and the calculations are more complex than earlier filter techniques. The vertical and horizontal derivatives (detailed in the Sobel operator) are calculated for every local neighbor of the source pixel Px (including itself). The size of this area (window) has to be an odd number so that it has a center pixel, which is called the source pixel.

Thus, Sx and Sy are calculated for every pixel in the window. Sxy is calculated as Sxy = Sx * Sy.

These results are stored in three different matrices. These matrices respectively represent the values Sx, Sy, and Sxy for every pixel around the source pixel (and also itself).

A Gaussian matrix (the one used for the Gaussian blur) is then applied to these three matrices, which results in three weighted values of Sx, Sy, and Sxy. We will name them Ix, Iy, and Ixy.

These three values are stored in a 2 x 2 matrix A:

 A 2 x 2 matrix A

A score k is then calculated, representing whether the source pixel can be considered part of a corner.

Then a score k is calculated

Then, if k is greater than a given threshold, the source pixel is turned white, marking it as part of a corner. If not, it is set to black.
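
In the standard Harris formulation, which this description follows, the matrix and score can be reconstructed as shown below; the sensitivity constant kappa and its typical value of roughly 0.04 to 0.06 are conventions, not values confirmed by the sample:

A = \begin{pmatrix} I_x & I_{xy} \\ I_{xy} & I_y \end{pmatrix}, \qquad k = \det(A) - \kappa \, \mathrm{trace}(A)^2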

Shi-Tomasi corner detector

This detector is based on the Harris corner detector; however, the condition used for corner detection is changed. This detector performs better than the previous one.

Once the matrix A is calculated, in the same way as above, the eigenvalues λ1 and λ2 of the matrix A are computed. The eigenvalues of a matrix A are the solutions λ of the equation det(A - λI) = 0.

 the eigen values of the matrix A

As with the Harris corner detector, the resulting score determines whether or not the source pixel is considered part of a corner.
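
In the usual Shi-Tomasi formulation, which appears to be what is described here, the score is the smaller eigenvalue:

\det(A - \lambda I) = 0 \;\Rightarrow\; \lambda_1, \lambda_2, \qquad R = \min(\lambda_1, \lambda_2)

and the source pixel is treated as part of a corner when R exceeds the chosen threshold.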

Example of Shi-Tomasi corner detector filter
Figure 8. Example of Shi-Tomasi corner detector filter, resulting in almost total conversion to black pixels. The filtering took 61 seconds using a single core, versus 14 seconds using multiple cores.


Conclusion

This C++ DLL application is a good example of how important it is to apply multithreading techniques to software development projects. In almost every scenario on a four-core system, applying the various filters took about three times as long using a single core as it did using multi-core techniques.

Developers should not expect to get an N-times speedup when running a program parallelized using OpenMP on an N-processor platform. There are several reasons why this is true:

  • When a dependency exists, a process must wait until the data it depends on has been computed.
  • When multiple processes share a resource that cannot be used in parallel (like a file to write to), their requests are executed sequentially; therefore, each thread must wait until the other thread releases the resource.
  • A large part of the program may not be parallelized by OpenMP, which means that the theoretical upper limit of speedup is bounded, according to Amdahl's law.
  • N processors in symmetric multiprocessing (SMP) may have N times the computation power, but the memory bandwidth usually does not scale up N times. Quite often, the original memory path is shared by multiple processors, and performance degradation may be observed when they compete for the shared memory bandwidth.
  • Many other common problems affecting the final speedup in parallel computing also apply to OpenMP, like load balancing and synchronization overhead.

With current systems powered by processors such as the Intel® Core™ i9-7980XE Extreme Edition processor, which has 18 cores and 36 threads, the advantages of developing code optimized to handle multithreading are obvious. To learn more, download the app, analyze it with an integrated development environment such as Microsoft Visual Studio, and get started with your own project.


Appendix A. About ImageFiltersWPF

ImageFiltersWPF is a Windows* WPF client application that uses Extensible Application Markup Language (XAML) to display its GUI. The entry point into this app is MainWindow.xaml/MainWindow.xaml.cs. Along with the main window are several supporting classes to help keep functionality clean.

ImageFileList.cs

This class's primary function is to generate a list of image files that can be selected in order to apply filters.

ImageFilterList.cs

This is a very simple list that encapsulates a list of the filters that the C++ DLL provides. Once created, it is used by the GUI element lbImageFilters.

BitmapWrapper.cs

This class accepts a source bitmap image and turns it into a byte array that can be consumed by the C++ DLL.

ImageProcessWrapper.cs

This class loads up the DLL and supplies the function to call, which passes the byte arrays to the DLL.

MainWindow.xaml.cs

This is the MainWindow GUI code. It gets the current filter name and sends the BitmapWrapper instance to the DLL. It does this twice, once for single-core and again for multi-core. After each run, it updates the labels that show the number of seconds it took to process the image. Once the processing is complete, the new image is displayed.


Resources

OpenMP

Intel® Many Integrated Core (Intel® MIC) Architecture

Code Sample: Multicore Photo Editing

$
0
0
File(s): Download
License: Intel Sample Source Code License Agreement
Optimized for...
OS: Windows® 10
Hardware: N/A
Software (Programming Language, tool, IDE, Framework): C++, C#, Microsoft Visual Studio*
Prerequisites: Familiarity with Microsoft Visual Studio, C++ and C#, and multi-core software development


Introduction

This software example demonstrates how to use multi-core technologies to edit images. There are two parts to this project: a .NET Windows application front end written using C# and Windows Presentation Foundation (WPF), and a C++ DLL that is responsible for the actual manipulation of the image.

The image editing is done by applying filters to images, where each filter is a different function in the C++ DLL. The C# front end passes the image bitmap data to the C++ DLL, the DLL processes the image by applying the chosen filter, and the DLL then passes the newly created image back to the C# GUI. This app also lets the user see the performance difference between running single-core and multi-core.


Get Started

Download the code from GitHub* and read the article Using Modern C++ Techniques to Enhance Multicore Optimizations for a better understanding of how to perform multicore development.


Updated Log

Created June 19, 2018

Developer Success Stories Library

$
0
0

Intel® Parallel Studio XE | Intel® System Studio | Intel® Media Server Studio

Intel® Advisor | OpenVINO™ Toolkit | Intel® Data Analytics Acceleration Library

Intel® Distribution for Python* | Intel® Inspector XE | Intel® Integrated Performance Primitives

Intel® Math Kernel Library | Intel® Media SDK | Intel® MPI Library | Intel® Threading Building Blocks

Intel® VTune™ Amplifier

 


Intel® Parallel Studio XE


Altair Creates a New Standard in Virtual Crash Testing

Altair advances frontal crash simulation with help from Intel® Software Development products.


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Envivio Helps Ensure the Best Video Quality and Performance

Intel® Parallel Studio XE helps Envivio create safe and secured code.


ESI Group Designs Quiet Products Faster

ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


F5 Networks Profiles for Success

F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


Golaem Drives Virtual Population Growth

Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


Lab7 Systems Helps Manage an Ocean of Information

Lab7 Systems optimizes BioBuilds™ tools for superior performance using Intel® Parallel Studio XE and Intel® C++ Compiler.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


Massachusetts General Hospital Achieves 20X Faster Colonoscopy Screening

Intel® Parallel Studio helps optimize key image processing libraries, reducing compute-intensive colon screening processing time from 60 minutes to 3 minutes.


Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


NERSC Optimizes Application Performance with Roofline Analysis

NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


Nik Software Increases Rendering Speed of HDR by 1.3x

By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


Novosibirsk State University Gets More Efficient Numerical Simulation

Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Ural Federal University Boosts High-Performance Computing Education and Research

Intel® Developer Tools and online courseware enrich the high-performance computing curriculum at Ural Federal University.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® System Studio


CID Wireless Shanghai Boosts Long-Term Evolution (LTE) Application Performance

CID Wireless boosts performance for its LTE reference design code by 6x compared to the plain C code implementation.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.


NERSC Optimizes Application Performance with Roofline Analysis

NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


Daresbury Laboratory Speeds Computational Chemistry Software 

Scientists get a speedup to their computational chemistry algorithm from Intel® Advisor’s vectorization advisor.


Novosibirsk State University Gets More Efficient Numerical Simulation

Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


OpenVINO™ Toolkit


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.


Intel® Data Analytics Acceleration Library


MeritData Speeds Up a Big Data Platform

MeritData Inc. improves performance—and the potential for big data algorithms and visualization.


Intel® Distribution for Python*


DATADVANCE Gets Optimal Design with 5x Performance Boost

DATADVANCE discovers that Intel® Distribution for Python* outpaces standard Python.
 


Intel® Inspector XE


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Envivio Helps Ensure the Best Video Quality and Performance

Intel® Parallel Studio XE helps Envivio create safe and secured code.


ESI Group Designs Quiet Products Faster

ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


Golaem Drives Virtual Population Growth

Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


Schlumberger Parallelizes Oil and Gas Software

Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


Intel® Integrated Performance Primitives


JD.com Optimizes Image Processing

JD.com Speeds Image Processing 17x, handling 300,000 images in 162 seconds instead of 2,800 seconds, with Intel® C++ Compiler and Intel® Integrated Performance Primitives.


Tencent Optimizes an Illegal Image Filtering System

Tencent doubles the speed of its illegal image filtering system using SIMD Instruction Set and Intel® Integrated Performance Primitives.


Tencent Speeds MD5 Image Identification by 2x

Intel worked with Tencent engineers to optimize the way the company processes millions of images each day, using Intel® Integrated Performance Primitives to achieve a 2x performance improvement.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® Math Kernel Library


DreamWorks Puts the Special in Special Effects

DreamWorks Animation’s Puss in Boots uses Intel® Math Kernel Library to help create dazzling special effects.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.

 


MeritData Speeds Up a Big Data Platform

MeritData Inc. improves performance―and the potential for big data algorithms and visualization.


Qihoo360 Technology Co. Ltd. Optimizes Speech Recognition

Qihoo360 optimizes the speech recognition module of the Euler platform using Intel® Math Kernel Library (Intel® MKL), speeding up performance by 5x.


Intel® Media SDK


NetUP Gets Blazing Fast Media Transcoding

NetUP uses Intel® Media SDK to help bring the Rio Olympic Games to a worldwide audience of millions.


Intel® Media Server Studio


ActiveVideo Enhances Efficiency

ActiveVideo boosts the scalability and efficiency of its cloud-based virtual set-top box solutions for TV guides, online video, and interactive TV advertising using Intel® Media Server Studio.


Kraftway: Video Analytics at the Edge of the Network

Today’s sensing, processing, storage, and connectivity technologies enable the next step in distributed video analytics, where each camera itself is a server. With Kraftway*, video software platforms can encode up to three 1080p60 streams at different bit rates with close to zero CPU load.


Slomo.tv Delivers Game-Changing Video

Slomo.tv's new video replay solutions, built with the latest Intel® technologies, can help resolve challenging game calls.


SoftLab-NSK Builds a Universal, Ultra HD Broadcast Solution

SoftLab-NSK combines the functionality of a 4K HEVC video encoder and a playout server in one box using technologies from Intel.


Vantrix Delivers on Media Transcoding Performance

HP Moonshot* with HP ProLiant* m710p server cartridges and Vantrix Media Platform software, with help from Intel® Media Server Studio, deliver a cost-effective solution that delivers more streams per rack unit while consuming less power and space.


Intel® MPI Library


Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® Threading Building Blocks


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


Johns Hopkins University Prepares for a Many-Core Future

Johns Hopkins University increases the performance of its open-source Bowtie 2* application by adding multi-core parallelism.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.

 


Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Quasardb Streamlines Development for a Real-Time Analytics Database

To deliver first-class performance for its distributed, transactional database, Quasardb uses Intel® Threading Building Blocks (Intel® TBB), Intel’s C++ threading library for creating high-performance, scalable parallel applications.


University of Bristol Accelerates Rational Drug Design

Using Intel® Threading Building Blocks, the University of Bristol helps slash calculation time for drug development—enabling a calculation that once took 25 days to complete to run in just one day.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


Intel® VTune™ Amplifier


CADEX Resolves the Challenges of CAD Format Conversion

Parallelism brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


F5 Networks Profiles for Success

F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.


Mentor Graphics Speeds Design Cycles

Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.

 


Nik Software Increases Rendering Speed of HDR by 1.3x

By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


 

Intel® Graphics Performance Analyzers (Intel® GPA) 2018 R2 Release Notes

$
0
0

Thank you for choosing the Intel® Graphics Performance Analyzers (Intel® GPA), available as a standalone product and as part of Intel® System Studio.

Contents

Introduction
What's New
System Requirements and Supported Platforms
Installation Notes
Technical Support and Troubleshooting
Known Issues and Limitations
Legal Information

Introduction

Intel® GPA provides tools for graphics analysis and optimizations for making games and other graphics-intensive applications run even faster. The tools support the platforms based on the latest generations of Intel® Core™ and Intel Atom™ processor families, for applications developed for Windows*, Android*, Ubuntu*, or macOS*.

Intel® GPA provides a common and integrated user interface for collecting performance data. Using it, you can quickly see performance opportunities in your application, saving time and getting products to market faster.

For detailed information and assistance in using the product, refer to the following online resources:

  • Home Page - view detailed information about the tool, including links to training and support resources, as well as videos on the product to help you get started quickly.
  • Getting Started - get the main features overview and learn how to start using the tools on different host systems.
  • Training and Documentation - learn at your level with Getting Started guides, videos and tutorials.
  • Online Help for Windows* Host - get details on how to analyze Windows* and Android* applications from a Windows* system.
  • Online Help for macOS* Host - get details on how to analyze Android* or macOS* applications from a macOS* system.
  • Online Help for Ubuntu* Host - get details on how to analyze Android* or Ubuntu* applications from an Ubuntu* system.
  • Support Forum - report issues and get help with using Intel® GPA.

What's New

Intel® GPA 2018 R2 offers the following new features:

New Features for Analyzing All Graphics APIs

System Analyzer

  • View all available Intel® GPU metrics in the System View on Windows* platforms, with the ability to switch between these counter sets using the Ctrl+M hotkey

Graphics Frame Analyzer

  • Search for and pin interesting metrics to the top of the metrics table
  • Copy resource names in the Resource Viewer using CTRL+C 

All Tools

  • Modified the Dark Mode color scheme for improved usability

New Features for analyzing OpenGL* applications

New Platforms

  • Support for macOS High Sierra (10.13.4) has been added for this release including support for:
    • Real-time metrics in System Analyzer
    • Per-region/per-event metrics in Graphics Frame Analyzer

Graphics Monitor

  • OpenGL applications downloaded from Apple AppStore can be launched through Graphics Monitor or Graphics Frame Analyzer launch dialog without Sandbox removal
  • User-configurable frame delimiters have been added. These delimiters (SwapBuffer, MakeCurrent context, Clear, Flush, Finish, or BlitFramebuffer) can be used individually or in combination

System Analyzer HUD

  • Updated Heads-up display for OpenGL applications on Windows, Ubuntu, and macOS platforms

New Features for analyzing Microsoft DirectX* applications

New Platforms 

  • Metrics for AMD* Radeon RX Vega M (in the new Intel® NUC KIT NUC8I7HVK) are available in System Analyzer and Graphics Frame Analyzer for DirectX 11 and DirectX 12 applications

Graphics Monitor

  • Graphics applications launched in “Auto-detect launched applications” mode are automatically added to recent applications list

Graphics Frame Analyzer

  • Any DirectX* 11 shader resource view (SRV) can now be replaced with a simple 2x2 texture or clamped to a selected MIP map level independently from other input textures
  • Shader DXBC and ISA code update whenever a shader is modified 
  • Support for DirectX 12 Unreal Engine 4.19 applications running on multi-GPU systems has been added
  • Multi-sampled render targets (including depth and stencil textures) are now viewable in DirectX 12 frames
  • Pixel History for DirectX 11 supports rendering to layers and mip levels, and respects applied experiments
  • View the per-target, post-transformation geometry for a range of selected draw calls in DirectX 11 frames

New Features for analyzing Android* Open GLES applications

System Analyzer

  • An ability to view and profile any Android process has been added to the System Analyzer settings

New Features for analyzing macOS* Metal applications

Graphics Frame Analyzer for Metal

  • Additional title support
  • Modified the Stream file format to improve performance and stability
  • Stream files play back instantly within Graphics Frame Analyzer

System Requirements and Supported Platforms

The minimum system requirements are: 

  • Host Processor: Intel® Core™ Processor
  • Target Processor: See the list of supported Windows* and Android* devices below
  • System Memory: 8GB RAM
  • Video Memory: 512MB RAM
  • Minimum display resolution for client system: 1280x1024
  • Disk Space: 300MB for minimal product installation

The table below shows platforms and applications supported by Intel® GPA 2018 R2.

Target System (the system where your game runs) | Host System (your development system where you run the analysis) | Target Application (types of supported applications running on the target system)
Windows* 7 SP1/8.1/10 | Windows* 7 SP1/8.1/10 | Microsoft* DirectX* 9/9Ex, 10.0/10.1, 11.0/11.1/11.2/11.3
Windows* 10 | Windows* 10 | Microsoft* DirectX* 12, 12.1
Google* Android* 4.1, 4.2, 4.3, 4.4, 5.x, 6.0 | Windows* 7 SP1/8.1/10, or macOS* 10.11, 10.12, or Ubuntu* 16.04 | OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, 3.2
Ubuntu* 16.04 | Ubuntu* 16.04 | OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile
macOS* 10.12 and 10.13 | macOS* 10.12 and 10.13 | OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile and Metal* 1 and 2

Intel® GPA does not support the following Windows* configurations: All server editions, Windows* 8 RT, or Windows* 7 starter kit.

Supported Windows* Graphics Devices

Intel® GPA supports the following graphics devices as targets for analyzing Windows* workloads. All these targets have enhanced metric support:

Target | Processor
Intel® UHD Graphics 630 | 8th generation Intel® Core™ processor
Intel® UHD Graphics 630 | 7th generation Intel® Core™ processor
Intel® UHD Graphics 620 | 7th generation Intel® Core™ processor
Intel® HD Graphics 620 | 7th generation Intel® Core™ processor
Intel® HD Graphics 615 | 7th generation Intel® Core™ m processor
Intel® HD Graphics 530 | 6th generation Intel® Core™ processor
Intel® HD Graphics 515 | 6th generation Intel® Core™ m processor
Iris® graphics 6100 | 5th generation Intel® Core™ processor
Intel® HD Graphics 5500 and 6000 | 5th generation Intel® Core™ processor
Intel® HD Graphics 5300 | 5th generation Intel® Core™ m processor family
Iris® Pro graphics 5200 | 4th generation Intel® Core™ processor
Iris® graphics 5100 | 4th generation Intel® Core™ processor
Intel® HD Graphics 4200, 4400, 4600, and 5000 | 4th generation Intel® Core™ processor

Although the tools may appear to work with other graphics devices, those devices are unsupported. Some features and metrics may not be available on unsupported platforms. If you run into an issue when using the tools with any supported configuration, please report it through the Support Forum.

Driver Requirements for Intel® HD Graphics

When running Intel® GPA on platforms with supported Intel® HD Graphics, the tools require the latest graphics drivers for proper operation. You may download and install the latest graphics drivers from http://downloadcenter.intel.com/.

Intel® GPA inspects your current driver version and notifies you if your driver is out-of-date.

Supported Android* Devices

Intel® GPA supports both Intel® and ARM*-based Android devices, with known limitations; for further information, see this article.

Installation Notes

Installing Intel® GPA 

Download the Intel® GPA installer from the Intel® GPA Home Page.

Installing Intel® GPA on Windows* Target and Host Systems

To install the tools on Windows*, download the *.msi package from the Intel® GPA Home Page and run the installer file.

The following prerequisites should be installed before you run the installer:

  • Microsoft .NET 4.0 (via redirection to an external web site for download and installation)

If you use the product in a host/target configuration, install Intel® GPA on both systems. For more information on the host/target configuration, refer to Best Practices.

For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on Ubuntu* Host System

To install Intel® GPA on Ubuntu*, download the .sh file from the Intel® GPA Home Page and run the installer script.

It is not necessary to explicitly install Intel® GPA on the Android* target device since the tools automatically install the necessary files on the target device when you run System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on macOS* Host System

To install the tools on macOS*, download from the Intel® GPA Home Page and run the .pkg installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device because the tools automatically install the necessary files on the target device when you run the System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Technical Support and Troubleshooting

For technical support, including answers to questions not addressed in the installed product, visit the Support Forum.

Troubleshooting Android* Connection Problems

If the target device does not appear when the adb devices command is executed on the client system, do the following:

  1. Disconnect the device
  2. Execute $ adb kill-server
  3. Reconnect the device
  4. Run $ adb devices

If these steps do not work, try restarting the system and running $ adb devices again. Consult the product documentation for your device to see if a custom USB driver needs to be installed.

Known Issues and Limitations

  • Full Intel GPA metrics are not supported on macOS* 10.13.4 for Skylake-based and Kaby Lake-based Mac Pro systems.  For full metric support, please do not upgrade to macOS* 10.13.4.
  • Metrics in the System Analyzer's system view are inaccurate for Intel® Graphics Driver for Windows* Version 15.65.4.4944. You can use Intel® Graphics Driver for Windows* Version 15.60.2.4901 instead.
  • Playback of the Metal stream files captured with earlier Intel® GPA versions is not supported. Old Metal stream files can be converted to the new stream format using the following steps:
    1. Open Terminal and change the directory to /Applications/Intel/FrameAnalyzer.app/Contents/Resources/metal.
    2. Capture a new stream of the old player running the .gpa_stream file that you want to convert, using the following command:
      ./gen2/gpa-capture ./gpa-playback --layer capture -- <path-to-old-.gpa_stream-file>
    3. The newly converted stream is automatically added to ~/Documents/GPA/ and is displayed in the Graphics Frame Analyzer open file dialog.
  • Intel® Graphics Performance Analyzers Sample (gpasample.exe) cannot be launched with Global Injection Mode enabled on Windows* 7 platforms.
  • macOS users who are running OS X El Capitan or newer must disable System Integrity Protection (SIP) in order to profile Steam applications. If SIP is enabled on your machine, a message will appear at the top of Graphics Monitor directing you to disable it. If you would prefer not to disable SIP but need to launch into a Steam application, use the following process: 
    1. Launch and sign into Steam 
    2. Locate the executable of the desired application and copy the location; it typically looks something like this: 
      /Users/YOUR_USER_NAME/Library/Application\ Support/Steam/steamapps/common/YOUR_APPLICATION_BINARY 
    3. Launch Graphics Monitor
    4. Paste the location of the desired application in the first input box and hit Start
    5. GPA will now be injected into the executable, allowing for live profiling and Trace/Frame Capture

*Other names and brands may be claimed as the property of others.

** Disclaimer: Intel disclaims all liability regarding rooting of devices. Users should consult the applicable laws and regulations and proceed with caution. Rooting may or may not void any warranty applicable to your devices.

Explore the Possibilities of Generative Modeling


conception of neural activity

Investigation into the capabilities of GANs, conducted by Intel Student Ambassador Prajjwal Bhargava, provides insights into using Intel® architecture-based frameworks to understand and create practical applications using this technology.

"GANs' (generative adversarial networks) potential is huge, because they can learn to mimic any distribution of data. That is, GANs can be taught to create worlds eerily similar to our own in any domain: images, music, speech, prose. They are robot artists in a sense, and their output is impressive—poignant even."1

  Excerpt from "GAN: A Beginner's Guide to Generative Adversarial Networks"


Challenge

Past efforts at building unsupervised learning capabilities into deep neural networks have been largely unsuccessful. A new modeling approach that uses opposing neural networks, one functioning as a generator and the other as a discriminator, has opened innovative avenues for research and practical applications.


Solution

The possibilities of using GANs to accelerate deep learning in an unsupervised training environment are progressively being revealed through ongoing exploration and experimentation. Prajjwal's work in this area promises to uncover paths likely to yield positive results as applications move from speculative to real-world implementations.


Background and Project History

An increasingly important area of generative modeling, known as generative adversarial networks (GANs), offers a means to endow computers with a better understanding of the surrounding world through unsupervised learning techniques. This field of inquiry has been the focus of Prajjwal Bhargava in his work for the Intel® AI Academy.

Prior to becoming a Student Ambassador for the Intel AI Academy, Prajjwal sharpened his expertise in convolutional neural networks for image recognition, data structures and algorithms, deep-learning coding techniques, and machine learning. These topics have been useful in his research on GANs. "Initially, I started off with computer vision," Prajjwal said. "Back then I was learning how convolutional neural networks worked and how they do what they do. That required going deeper into the architectures." After getting into them, he started working with Recurrent Neural Networks (RNNs) and complex architectures like Long Short-Term Memory (LSTM).

"I later learned more about GANs," he continued, "and it was quite fascinating to me. I knew there were some significant challenges. For example, training a GAN— with the generator and discriminator getting updated independently—can have a serious impact on reaching convergence."

Prajjwal observed that the original GAN paper didn't fully address this issue. It became clear that a different mechanism was needed for effectively resolving this problem. He looked into the issue further and found the paper describing this approach, "Wasserstein GAN", to be very influential and revolutionary.

"The theory was explained brilliantly and it supported their experiment well," Prajjwal said. From this perspective, he started working on implementations using a variety of architectures to see which approaches could yield the greatest results.

"Since Ian Goodfellow presented his landmark paper at the NIPS [Neural Information Processing Systems] conference in 2014, I've always felt that this architecture [GANs] is quite revolutionary by itself. I feel that these networks have changed the way we look at deep learning compared to a few years back. It has enabled us to visualize data in ways that couldn't have been accomplished through other techniques."

  Prajjwal Bhargava, Student Ambassador for Artificial Intelligence, Intel AI Academy

Prajjwal has been working on GANs for over a year, and he doesn't see an end to his research. Each new research paper that is published offers fresh ideas and different perspectives. His own paper, "Better Generative Modeling through Wasserstein GANs," provides the insights he has gained over the course of his work with Intel AI Academy.

"I want to try all possible variants of GANs," Prajjwal said. "There are so many, each one performing a new task in the best possible manner. However, I think the future calls for something universal and I think this applies to GANs as well. The more we are able to let our network generalize, the better it is. There's so much more to do and hopefully I will continue to contribute towards this research."

"Training and sampling from generative models is an excellent test of our ability to represent and manipulate high-dimensional probability distributions. High-dimensional probability distributions are important objects in a wide variety of applied math and engineering domains."2

  Ian Goodfellow, Staff Research Scientist, Google Brain


Key Findings of the Experimentation

As Prajjwal continues to research GAN variants, the work that he has accomplished so far has led him to a key conclusion. In summary, he noted, "GANs are essentially models that try to learn the distribution of real data by minimizing divergence (difference in probability distribution) through generation of adversarial data. In the original [Goodfellow] paper, convergence in the min-max objective is interpreted as minimizing Jensen-Shannon divergence. The Wasserstein distance is a better alternative to the Jensen-Shannon divergence; it gives a smooth representation in between."

"If we have two probability distributions—P and Q—there is no overlap when they are not equal, but when they are equal, the two distributions just overlap," Prajjwal continued. "If we calculate D(kl), we get infinity if two distributions are disjoint. So, the value of D(js) jumps off and the curve isn't differentiable: Ɵ is 0."

"The Wasserstein metric provides a smooth measure. This helps ensure a stable learning process using gradient descents," he added.

flowchart of double feedback loop
Figure 1. Double feedback loop used for a generative adversarial network (GAN).

The research being done on GANs suggests a wide variety of use cases across multiple industries, Prajjwal believes. Some of the promising possibilities include the following:

  • Accelerating drug discovery and finding cures for previously incurable diseases. The Generator could propose a drug for treatment and the Discriminator could determine whether the drug would be likely to produce a positive outcome.
  • Advancing molecule development in oncology, generating new anti-cancer molecules within a defined set of parameters.
  • Performing text translation to describe the content of images accurately.
  • Generating super-resolved images from downsampled original images to improve the perceptual qualities.
  • Boosting creativity in fields where variety and innovation are important, such as fashion or design.

"Unsupervised learning is the next frontier in artificial intelligence," Prajjwal said, "and we are moving rapidly in that direction, even though we still have a long way to go."


Enabling Technologies

The primary enabling technologies that were used for research during this project include:

  • PyTorch*, a Python*-based library that makes use of the Intel® Math Kernel Library (Intel® MKL), was used to build the architecture for GAN research.
  • Intel® AI DevCloud powered by Intel® Xeon Phi™ processors (current versions of the Intel AI DevCloud use Intel® Xeon® Scalable processors).

"Intel MKL was really useful for optimizing matrix calculations and vector operations on my platform," Prajjwal commented, "and I have gone through technical articles on the Intel® Developer Zone (Intel® DZ) to better understand how to improve optimization on the architecture that I was using. A number of tutorials targeting Intel architecture were also quite useful."

One of the key challenges that Prajjwal encountered was training GANs efficiently on Intel architecture-based systems. The difficulties included managing updates for the Generator and Discriminator concurrently, rather than independently. As it stands, reaching convergence can be a challenge. Part of the solution will require optimizing the training models so that the workflow proceeds more efficiently, taking better advantage of Intel architecture capabilities and built-in features.

"It's been a year since I started working with Intel in the Intel AI Academy," Prajjwal noted. "And over this time, I've learned a lot. I've received much help and gained expertise working with Intel architecture-based hardware. It's great to see so many other Student Ambassadors working across the world in the same field. I've gotten to know so many people through conferences and online communities. Intel goes a long way to share the projects that we've done so that Student Ambassadors get recognition. Also, Intel provides a really good platform to publish our research and findings. I am really grateful that I got to become part of this AI Academy program and hope to do some more great work in the future."


AI is Expanding the Boundaries of Generative Modeling

Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of artificial intelligence (AI) to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, educational institutions, and corporations to advance solutions that address major challenges in the sciences.

In terms of real-world applications of GAN techniques, the collaborative work accomplished by the NASA Frontier Development Lab (FDL) offers a striking example. FDL brings together companies, Intel being one, to share resources and expertise in a cooperative effort to solve space exploration challenges.

During the Planetary Defense segment of the 2016 session, a GAN was developed to help detect potentially hazardous asteroids and determine the shape and the spin axis of the asteroid.

One of the participants on this project, Adam Cobb, described the challenge of handling the input data: "Our predominant form of input data consisted of a series of delay-Doppler images. These are radar images that are defined in both time delay and frequency. Although to the untrained eye...these images might look like they are optical images, they actually have a non-unique mapping to the true asteroid shape. This many-to-one relationship added an extra level of complexity to the already difficult challenge of going from 2D to 3D representations. In order to go about solving this task we applied deep-learning architectures such as autoencoders, variational autoencoders, and generative adversarial networks to generate asteroid shapes and achieved promising results."

Beyond the challenge of asteroid shape modeling, another challenge in the Planetary Defense area, Asteroid "Deflector Selector" Decision Support, used machine learning to determine the most effective deflection strategies to prevent an asteroid from colliding with Earth (see Figure 2 for an artist's rendering of this scenario).

space scene, asteroid and planet
Figure 2. NASA rendering of an asteroid in proximity to Earth.

The NASA FDL is hosted by the SETI Institute in Mountain View, California, with support from the NASA Ames Research Center. Intel provided hardware and technology, software and training, as well as expertise to the endeavor. Other corporate participants included NVIDIA Corporation, IBM*, Lockheed Martin, ESA, SpaceResources Luxembourg, USC MASCLE, Kx Systems*, and Miso Technologies.

In these early stages of AI, at a time when commercial GAN implementations haven’t been widely released to the field, some of the best examples of the potential of this technique come from research papers and student implementations exploring the mechanisms to discover how GANs can be applied to real-world scenarios.

One of the more interesting examples along this line is image-to-image translation with CycleGANs. A collection of resources on this topic, including code, interactive demos, videos, and a research paper, have been compiled by members of the University of California, Berkeley research team and can be found here: https://phillipi.github.io/pix2pix/.

In image-to-image translation, the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. Practically speaking, paired training data is not usually available, so the network must learn the mapping from domain X to domain Y without aligned pairs. The objective in this approach is to learn a mapping G: X → Y, such that the distribution of images from G(X) is indistinguishable from the distribution Y, using an adversarial loss.

Paired images that maintain a strong correlation between both domains are normally required, but getting such data can be time consuming and impractical. CycleGANs build on the pix2pix architecture to support modeling of unpaired collections of images; in the process, the network can learn to translate between two aesthetics without requiring tightly matched X/Y training image pairs.
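To make the unpaired objective concrete, here is a minimal PyTorch sketch of the two loss terms a CycleGAN combines: a cycle-consistency term and an adversarial term for each mapping. The generator and discriminator modules, the least-squares form of the adversarial loss, and the weighting factor are illustrative assumptions rather than the Berkeley team's implementation.

    import torch.nn.functional as F

    def cycle_consistency_loss(g_xy, f_yx, real_x, real_y, lam=10.0):
        # g_xy maps domain X -> Y and f_yx maps Y -> X. Translating an image and
        # then translating it back should reproduce the original image.
        loss_x = F.l1_loss(f_yx(g_xy(real_x)), real_x)
        loss_y = F.l1_loss(g_xy(f_yx(real_y)), real_y)
        return lam * (loss_x + loss_y)

    def generator_adversarial_loss(d_y, fake_y):
        # Least-squares adversarial term: the generator wants the discriminator
        # d_y to score its translated images as real (close to 1).
        return ((d_y(fake_y) - 1.0) ** 2).mean()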

Figure 3 shows some specific image-to-image translation processes that highlight the capabilities of a CycleGAN.4

The Intel® AI technologies used in this implementation included:

Intel xeon inside badge

Intel Xeon Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.

 

Logos

Framework optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.

 

Intel AI Dev Cloud banner

For Intel AI Academy members, the Intel AI DevCloud provides a cloud platform and framework for machine-learning and deep-learning training. Powered by Intel Xeon Scalable processors, the Intel AI DevCloud is available for up to 30 days of free remote access to support projects by academy members.

Join today at: https://software.intel.com/ai/sign-up

For a complete look at our AI portfolio, visit https://ai.intel.com/technology.

examples of image to image translations
Figure 3. Image-to-image translation examples (courtesy of Berkeley AI researchers).

Naveen Rao

"At Intel, we're encouraged by the impact that AI is having, driven by its rich community of developers. AI is mapping the brain in real time, discovering ancient lost cities, identifying resources for lunar exploration, helping to protect  Earth's oceans, and fighting fraud that costs the world billions of dollars per year, to name just a few projects. It is our privilege to support this community as it delivers world-changing AI across verticals, use cases, and geographies."5

  Naveen Rao, Vice President and General Manager, Artificial Intelligence Products Group, Intel


Resources

 

  1. "GAN: A Beginner's Guide to Generative Adversarial Networks." DL4J. 2017.
  2. Goodfellow, Ian. "NIPS 2016 Tutorial." 2016.
  3. Cobb, Adam. "3D Shape Modelling of Asteroids." 2017.
  4. Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, Alexei Efros. "Image-to-Image Translation with Conditional Adversarial Networks." Berkeley AI Research Laboratory. 2017.
  5. Rao, Naveen. "Helping Developers Make AI Real." Intel. May 16, 2018.

Part 1: Using Transfer Learning to Introduce Generalization in Models



Source: distill.pub  

Author’s note: The research was conducted using Intel® AI DevCloud, a cloud-hosted hardware and software platform available for developers, researchers and startups to learn, sandbox and start their Artificial Intelligence projects. This free cloud compute is available for Intel® AI Academy members.

Abstract

Researchers often try to capture as much information as they can, either by using existing architectures, creating new ones, going deeper, or employing different training methods. This paper compares different ideas and methods that are used heavily in Machine Learning to determine what works best. These methods are prevalent in various domains of Machine Learning, such as Computer Vision and Natural Language Processing (NLP).

Transfer Learning is the Key

Throughout our work, we have tried to bring generalization into context, because that’s what matters in the end. Any model should be robust and able to work outside your research environment. When a model lacks generalization, very often we try to train the model on datasets it has never encountered … and that’s when things start to get much more complex. Each dataset comes with its own added features which we have to adjust to accommodate our model.

One common way to do so is to transfer learning from one domain to another.

Given a specific task in a particular domain for which we need labelled images, we train our model on a dataset for that task and domain. In practice, the dataset is usually the largest available in that domain, so that we can leverage the extracted features effectively. In computer vision, it's mostly ImageNet, which has 1,000 classes and more than 1 million images. When you train your network on it, the network is bound to extract features2 that are difficult to obtain otherwise. Initial layers usually capture small, fine details, and as we go deeper, ConvNets try to capture task-specific details; this makes ConvNets fantastic feature extractors.

Normally we let the ConvNet capture features by training it on a larger dataset and then modify it. The fully connected layers at the end can do whatever we require for carrying out classification, and we can add whatever combination of linear layers the task needs. This makes it easy to transfer the knowledge of our network to carry out another task.

Transfer Learning in Natural Language Processing

A recent paper, Universal LM for Text Classification,3 showed how to apply transfer learning to Natural Language Processing, a field where this method has not been applied widely. Instead of just embeddings, we can use whole pretrained language models trained on WikiText 103. Embeddings are word representations that allow words with similar meanings to have similar representations; if you visualize them, similar words appear close to one another. But an embedding is basically a fixed representation, so its scope is limited in some ways. Creating a language model that has learned to capture semantic relationships within a language is bound to work better on new datasets, as evidenced by results from the paper. So far, it has been tested on language modeling tasks and the results are impressive. The approach applies to Seq2Seq learning as well, in instances where the length of inputs and outputs is variable, and it can be expanded further to many other tasks in NLP. Read more: Introducing state of the art text classification with universal language models.

diagrams of LM training and tuning
Figure 1

Learning Without Forgetting

Another paper, Learning without Forgetting,4 provides context for what has been done previously to make a network remember what it was originally trained on, and how it can be made to learn new data without forgetting that earlier learning. The paper compares the researchers' method with other prevalent, widely used methods such as transfer learning, joint training, feature extraction, and fine tuning, and it tries to capture the differences in how learning is carried out in each.

For example, fine tuning is an effective way to extend the learning of neural networks. Using fine tuning, we usually start from a model trained on a larger dataset – let's say ResNet50 trained on ImageNet. A pretrained ResNet50 has 25.6 million parameters.5 ResNets let you go deeper without increasing the number of parameters over their counterparts. The features learned are rich enough that you can expect to fit the model to almost any other dataset in a very efficient manner: you simply load the model, remove the fully connected layers, which are task specific, freeze the remaining layers, add linear layers as per your own needs, and train it on your own dataset. It's that simple and very effective. The pretrained model has so many capabilities and reduces our workload by a huge factor; we recommend using fine tuning wherever you can.
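As a concrete illustration of that recipe, here is a minimal PyTorch/torchvision sketch. The size of the added head and the ten-class output are placeholders for whatever your own dataset requires.

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models

    # Load a ResNet50 pretrained on ImageNet.
    model = models.resnet50(pretrained=True)

    # Freeze the convolutional backbone so only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the task-specific fully connected layer with our own head;
    # the layer sizes below are placeholders for the new dataset's needs.
    num_features = model.fc.in_features          # 2048 for ResNet50
    model.fc = nn.Sequential(
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(256, 10),
    )

    # Only the parameters of the new head are passed to the optimizer.
    optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)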

What We’ve Actually Been Doing: Curve Fitting

Judea Pearl recently published a paper6 in which he states that although we have gained a strong grasp of probability, we still can't do cause-and-effect reasoning. Instead, what we have basically been doing is curve fitting. Many different domains could be unlocked with do-calculus and causal modelling.

The Three Layer Causal Hierarchy
Level (Symbol) | Typical Activity | Typical Questions | Examples
1. Association P(y|x) | Seeing | What is? How would seeing X change my belief in Y? | What does a symptom tell me about a disease? What does a survey tell us about the election results?
2. Intervention P(y|do(x), z) | Doing, Intervening | What if? What if I do X? | What if I take aspirin, will my headache be cured? What if we ban cigarettes?
3. Counterfactuals P(y_x|x′, y′) | Imagining, Retrospection | Why? Was it X that caused Y? What if I had acted differently? | Was it the aspirin that stopped my headache? Would Kennedy be alive had Oswald not shot him? What if I had not been smoking the past 2 years?

Returning to where we were, we implemented learning without forgetting to measure how well the model does compared to the other discussed methods on some computer vision tasks. The authors define three sets of parameters: θs, θo, and θn. θs is the shared set of parameters, θo are the parameters learned on previous tasks (with a different dataset), and θn are the parameters the model will have when trained on the new dataset.

How to Perform Training

First, we used ResNet50 with pretrained weights instead of the stated architecture (the authors used 5 conv layers + 2 FC layers of AlexNet). The purpose behind pretrained weights is that our model will be used for domain adaptation and will rely heavily on fine tuning, so it's necessary that the convolutional layers have already extracted rich features that help in many computer vision tasks, preferably by pretraining on ImageNet (a pretrained ResNet50 has about 25.6 million parameters). If you want to go deeper, consider using other ResNet variants like ResNet101. After that, our model must be trained using the architecture as prescribed in the paper:

ResNet50 model
Figure 2.

The model in between is ResNet50 as per our implementation. We removed the last two layers and added two FC (fully connected) layers. We dealt with FC layers in a different manner appropriate to our task, but it can be modified for each use case. Add multiple FC layers depending on how many tasks you plan to perform.

After creating the architecture, it's necessary to freeze the second FC layer. This is done to ensure that the first FC layer can perform better on this task when the model is later trained on another task with a significantly lower learning rate.

This method solves a big challenge: after training, the older dataset is no longer required, whereas other methods of training do still require it.
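The following PyTorch sketch shows one way to express the shared backbone and per-task heads (θs, θo, θn) described above, along with the initial stage in which only the new-task parameters are updated. The class counts and the use of a single linear layer per head are illustrative assumptions, not the exact configuration used in these experiments.

    import torch.nn as nn
    from torchvision import models

    class LwFNet(nn.Module):
        """Shared ResNet50 backbone (θs) with one head per task (θo, θn)."""
        def __init__(self, num_old_classes=1000, num_new_classes=10):  # class counts are placeholders
            super().__init__()
            backbone = models.resnet50(pretrained=True)
            self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the original FC layer
            self.old_head = nn.Linear(2048, num_old_classes)   # θo: head for the previous task
            self.new_head = nn.Linear(2048, num_new_classes)   # θn: head for the new task

        def forward(self, x):
            feats = self.features(x).flatten(1)                # shared features (θs)
            return self.old_head(feats), self.new_head(feats)

    # Warm-up stage: freeze the shared backbone and the old head so that only
    # the new head's parameters are updated at first.
    net = LwFNet()
    for p in net.features.parameters():
        p.requires_grad = False
    for p in net.old_head.parameters():
        p.requires_grad = False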

Features of Learning Without Forgetting Vs. Commonly Used Deep Learning Training Models
Criterion | Fine Tuning | Duplicate and Fine Tuning | Feature Extraction | Joint Training | Learning Without Forgetting
New Task Performance | good | good | X medium | best | ✔ best
Original Task Performance | X bad | good | good | good | ✔ good
Training Efficiency | fast | fast | fast | X slow | ✔ fast
Testing Efficiency | fast | X slow | fast | fast | ✔ fast
Storage Requirement | medium | X large | medium | X large | ✔ medium
Requires Previous Task Data | no | no | no | X yes | ✔ no

This is a big challenge: to make incremental learning more natural, dependence on older datasets must be removed. After training the model, we are required to freeze the base architecture (in our case, ResNet50) and the first FC layer, with only the second FC layer turned on. We have to train the model with this arrangement.

The rationale for this training approach

The base model (ResNet in our case) already had fine-tuned weights, and convolutional layers do an excellent job of feature extraction. As we fine-tune the base model, we are updating the weights to suit the dataset we're using. When we freeze the base model and train with another FC layer turned on, it implies that we have gone task specific, but we don't want to go much deeper into that task. By training the base model on a particular task and re-training it, the model will capture the weights required to perform well on the default dataset. If we want to perform domain adaptation, the early and middle layers should be very good at feature extraction and bring generalization into context rather than being task-specific.

Learning without forgetting

training formula

After performing this training, we must jointly train all the layers. This implies turning on both FC layers of the base model and training them to convergence.

Use any loss function your task requires. The authors used modified cross entropy (knowledge distillation loss), which proved to work well for encouraging the outputs of one network to approximate the outputs of another.

training formula

In our work, we tried the Triplet Loss and Cross Entropy loss functions.
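For reference, here is a minimal sketch of a knowledge-distillation-style loss of the kind described above, combined with cross entropy on the new task. The temperature and weighting values are assumptions for illustration, not the exact settings used in the paper or in our experiments.

    import torch.nn.functional as F

    def distillation_loss(current_old_logits, recorded_old_logits, T=2.0):
        # Knowledge distillation (modified cross entropy): keep the updated
        # network's old-task outputs close to the outputs recorded before
        # training on the new task. T is a softening temperature.
        p = F.softmax(recorded_old_logits / T, dim=1)
        log_q = F.log_softmax(current_old_logits / T, dim=1)
        return -(p * log_q).sum(dim=1).mean()

    def lwf_loss(new_logits, new_labels, current_old_logits, recorded_old_logits, lam=1.0):
        # Standard cross entropy on the new task plus the distillation term
        # that preserves behavior on the old task.
        return F.cross_entropy(new_logits, new_labels) + lam * distillation_loss(
            current_old_logits, recorded_old_logits)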

Observations

This method seems to work well when the number of tasks is kept to a minimum (in our case, two). It may outperform fine-tuning for new tasks because the base model is not being retrained repeatedly, only the FC layers. Performance is similar to joint training when new tasks are being added. But, this method is bound to work poorly on older tasks as new tasks are added.

This is because the same convolutional layers are being used when we freeze them, which means every task shares the same feature extractor. We can't expect the model to outperform on all the above-mentioned training tasks just by adjusting the FC layers.

 more task-specific layers, network expansion
Figure 3.

You can add more task-specific layers to introduce more generalization. But, as you go deeper, you will make the model more task-specific. This method addresses the problem of adapting to different domains of computer vision without relying on the older datasets that were used in earlier training. It can be regarded as a hybrid of knowledge distillation and fine-tuning training methods.

This is an incremental step toward bringing generalization to neural networks, but we still lack ways to achieve full generalization, wherein we can expect to make our networks learn just like we do. We still have a long way to go, but research is in progress.

Technologies Involved

Since we were dealing with image-related datasets, we wanted the transfer of images to and from the CPU to be fast, and the DevCloud hastened the process. We performed all preprocessing on the DevCloud; in addition, we trained our model incrementally. We also used an NVIDIA* GTX 1080 for some parts of the training.

Intel Development Tools Used

The project made use of Jupyter notebook on the Intel AI DevCloud (using Intel® Xeon® Scalable processors) to write the code and for visualization purposes. We also used information from the Intel AI Academy forum. The code we used can be found in this GitHub* repository.

Join the Intel® AI Academy

Sign up for the Intel AI Academy and access essential learning materials, community, tools and technology to boost your AI development. Apply to become an AI Student Ambassador and share your expertise with other student data scientists and developers.

Contact Intel AI Student Ambassador Prajjwal Bhargava on Twitter or GitHub.

References:

  1. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
  2. Visualizing and Understanding Convolutional Networks
  3. Universal Language Model Fine-tuning for Text Classification
  4. Learning Without Forgetting
  5. Deep Residual Learning for Image Recognition
  6. Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution

Expanding the Possibilities of Computer Vision with AI


abstract blue eye

The benefits and possibilities of computer vision are amplified and expanded through a generous influx of AI and IoT technologies.

“The computer vision market will include a mix of consumer-facing applications like augmented reality and virtual reality, robots, drones, and self-driving cars, along with business applications like medical image analysis, video surveillance, real estate development optimization, ad insertions, and converting paperwork into digital data. While the consumer-facing applications often generate more buzz, the big shift is that enterprises are moving beyond data analytics to embrace AI-based business applications that utilize computer vision capabilities.”1

Aditya Kaul, Research Director, Tractica  

Challenge

Computer vision has unlocked a multitude of possibilities for both consumer and enterprise business applications, but traditional technologies have been hampered by numerous vexing implementation obstacles. 

Solution

With advanced computer vision technologies that embed intelligence at the network edge and solutions enhanced by artificial intelligence (AI), innovative new use cases are being developed that are generating increasing real-world value. 

Background and Project History

From an early career as a DJ, Adam Milton-Barker gained an interest in coding while building websites to promote his business, which spiraled into deeper interests, including AI and the Internet of Things (IoT). Over the course of several years, numerous jobs, and time spent managing his own company, Adam continued to accumulate AI expertise; at one stage leading a medical neural network project for a team based in Sao Paulo. In 2014, an ad caught his attention: a challenge from the Microsoft Development Program offering an Intel® Galileo development board to each of the winners. "At that point," Adam said, "I was primarily involved in web, business administration applications, and natural linguistics programs, using AIML (Artificial Intelligence Markup Language). I also had a stint building games and apps as a member of the Facebook developer forum, as well as teaching coding. I had never come across IoT. I really liked the idea of the Internet of Things. And, because I had a lot of equipment in my home, an obvious project for me would be a security system."

Adam developed a facial recognition solution that he dubbed TASS (TechBubble Assisted Security System) and was awarded the Intel Galileo from Microsoft* for the project idea. He then bought a standard Intel Galileo board to be able to include Linux* in his development efforts. TASS debuted at the Intel booth at Codemotion Amsterdam as part of the Intel Innovator program and the solution became the focus for a number of conference presentations and demos at worldwide venues. After considering launching TASS as a full-fledged product, Adam decided to release the specifications and project details to the open source community. “TASS,” he said, “is now open source, IoT-connected computer vision software that runs on the edge. There are several versions of TASS that have been created over the last few years, each using different techniques and technologies to provide facial recognition applications without the need for cloud services.”

The initial TASS project expanded in several productive directions. "During the next few years," Adam said, "I was a semifinalist in the IBM Global Mobile Innovators Tournament with a project that included TASS and the IoT JumpWay*, which was then built on a Raspberry Pi*, but is now a free cloud service for people that want to learn about the IoT and AI. The project was one of the top five in Europe. I was also a first phase winner in the World's Largest Arduino* Maker Challenge and I worked with a team at the AT&T Foundry Hackathon where we won first place for using a brain-computer interface to detect seizures; although, as a proof of concept the project never went beyond the demo stage. After a version of TASS won the Intel Experts Award at the Intel and Microsoft IoT Solutions World Congress Hackathon, I was invited to join the Intel Innovators program. This had been a goal of mine since the early days of my involvement in IoT. I joined the program in the areas of AI and IoT and have since added a new area—VR."

The landmark accomplishments that Adam has achieved over several years were attained without his earning a technology degree or taking any formal college courses. “My work, project experiences, and self-learning path led me to the Intel® Software Innovators Program, which opened global opportunities. I’ve demonstrated my projects at some of the biggest tech conferences in the world. Ultimately, this led me to my dream job at Bigfinite as an IoT network engineer.”

“Moving to Barcelona and working at Bigfinite gave me a totally new life; I now work in an environment where I am not only surrounded by like-minded people, but people that know a lot more than me. It is an amazing place for me to continue learning, something that I have never experienced at other companies where I have worked. Bigfinite is also fully supportive of my personal projects and role in the Intel® Software Innovator program and promote my projects on our social media. We also have an initiative called beTED where I can continue helping people learn about AI and IoT at work.”

Adam Milton-Barker, Intel Software Innovator and Bigfinite IoT Engineer

“The project is ongoing,” Adam said. “I originally began developing it in 2014 and since then there have been many different versions. All of these versions are available on the IoT JumpWay GitHub* repos. As new technologies emerge, I create new versions of TASS.” 

Refining Facial Recognition Technology

Through his development experience and ongoing research, Adam has identified key areas that could guide developers in productive directions when implementing facial recognition capabilities into their apps. Foremost among the concerns is the open set recognition issue. "The open set recognition issue is something that not many people talk about when they promote their computer vision projects or products," Adam commented, "as it is an unmistakable flaw in almost all computer vision projects. What happens is this: Say that you had trained a network to know two people and then introduce an unknown person for classification. By default, the AI application will predict that it is one of the people it has been previously trained on. Because of this, any application that relies on detecting who is actually unknown will fail."

Facial recognition
Figure 1. Facial recognition is implemented through a polygonal grid linked to features.

Overcoming the issue, according to Adam, can be done in two different ways. First, you can introduce an unknown class composed of, for example, 500 images. This approach works well in small environments, but within a larger environment you have a greater likelihood of seeing someone that looks like someone from the unknown dataset. This implementation, however, doesn’t work in TensorFlow* Inception v3, but it does work within an OpenFace* implementation (which is available in the GitHub repository).

Another way to contend with the issue involves using FaceNet, which calculates the distance between faces in frames and a known database of images. On its own, this approach will typically not work well in the real world. If your application relies on thousands of known people, the program must loop through every single person in the known dataset and compare it to the person or persons in a frame. If you have a very powerful server and abundant resources, this may not be a serious issue. But, on the network edge where compute resources are limited, it becomes more of a challenge.

Along this line, Adam continued, “My next step will be to combine my earlier models with FaceNet and use FaceNet as a backup to check known faces, eliminating the need to loop through all of the known people. Because we know exactly what image to compare against—due to using the result from the first classification—if the second classification confirms, then it is more than likely that it is that person and not a false positive. The only requirement is to retrieve the classification from model 1 and use the ID to directly reference the corresponding image in the known database. Currently, I believe that this is the best way to solve the issue, but it is kind of a hacky way of doing things. This approach was suggested to me by a colleague, Stuart Gray, a fellow member of the Artificial Intelligence and Deep Learning group on Facebook.”
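A simplified sketch of that two-stage check is shown below. The classifier, embedder, and known_embeddings helpers and the distance threshold are hypothetical stand-ins for a trained face classifier, a FaceNet-style embedding network, and a store of reference embeddings; this is not Adam's actual TASS code.

    import numpy as np

    def verify_identity(face_image, classifier, embedder, known_embeddings, threshold=1.1):
        """Two-stage check: use the classifier's prediction to pick one candidate,
        then confirm with an embedding distance instead of looping over every
        known person. `classifier`, `embedder`, and `known_embeddings` are
        hypothetical stand-ins for a trained face classifier, a FaceNet-style
        embedding network, and a dict mapping person IDs to stored embeddings."""
        person_id = classifier(face_image)                 # stage 1: fast classification
        if person_id not in known_embeddings:
            return None                                    # treat as unknown

        # Stage 2: compare only against the predicted person's stored embedding.
        probe = np.asarray(embedder(face_image))
        distance = np.linalg.norm(probe - known_embeddings[person_id])
        return person_id if distance < threshold else None  # None = likely false positive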

Two other issues that bear consideration:

Lighting, whether too dark or too bright, presents a challenge. Intel® RealSense™ technology minimizes lighting issues significantly, but developers need to be aware of scenarios where a poor lighting situation completely shuts down the recognition process.

Photos designed to fool a computer vision system and foil either security protections or the facial recognition accuracy represent a current challenge that requires more attention as facial recognition moves into mainstream applications. 

Enabling Technologies

Adam uses a range of different Intel solutions in his projects, building new iterations of TASS to take advantage of emerging technologies. “Different versions have used different technologies,” Adam said. “Initially it was built on a Raspberry Pi. At the IoT World Congress Hackathon, we built it on an Intel® Joule™ development kit (now discontinued). The server version was built on an Intel® NUC DE3815TYKE and also an Intel NUC I7 using the OpenVINO™ toolkit. I have used Intel® RealSense™ cameras in some versions that helped with issues such as lighting. The more current versions use the UP Squared* Grove* IoT Development Kit and Intel® Movidius™ technology and they are trainable using the Intel® AI DevCloud. I will soon be working on a version that uses virtual reality using the hardware provided by Intel.”

Among the specific benefits that Adam gained from the use of Intel technology:

  • Intel RealSense technology helped improve management of lighting issues.
  • Intel AI DevCloud was effective for training small models.
  • Intel Movidius technology has enhanced the capabilities of running AI on the edge.
  • Sample code and other resources available through Intel helped gain a better understanding of the hardware used in the solutions.
  • OpenVINO substantially improved the project results, adding speed and accuracy to the solution.

“Each time I have implemented Intel technologies it has drastically increased the functionality of the project. In addition to increasing the capabilities of the project, the support I have received from the Intel Innovators in the Intel® AI Academy program has been amazing. The hardware and opportunities to demonstrate at various events that were provided through the program have helped the project reach new heights.”

Adam Milton-Barker, Intel Software Innovator and IoT Engineer at Bigfinite, Inc. 

Bringing Vision to the Edge: The OpenVINO™ Toolkit

The release of the Open Visual Inference and Neural Network Optimization (OpenVINO) toolkit by Intel gives developers a rapid way to implement deep learning inference solutions using computer vision at the network edge. This addition to the current slate of Intel® Vision Products is based on convolutional neural network (CNN) principles, making it easier to design, develop, and deploy effective computer vision solutions that leverage IoT to support business operations.

The components in the toolkit include three APIs:

  • A deep learning inference toolkit supporting the full range of Intel Vision Products.
  • A deep learning deployment toolkit for streamlining distribution and use of AI-based computer vision solutions.
  • A set of optimized functions for OpenCV and OpenVX*.

Currently supported frameworks include TensorFlow, Caffe*, and MXNet. The toolkit helps boost solution performance with numerous Intel-based accelerators, including CPUs and integrated graphics processing units (GPUs), field-programmable gate arrays, video processing units, and image processing units.

“Processing high-quality video requires the ability to rapidly analyze vast streams of data near the edge and respond in real time, moving only relevant insights to the cloud asynchronously. The OpenVINO toolkit is designed to fast-track development of high-performance computer vision and deep learning inference applications at the edge.”2

Tom Lantzsch, Senior Vice President and General Manager, IoT Group, Intel

Substantial performance improvements are available through the OpenVINO toolkit (click here and zoom in on the Increase Deep Learning Performance chart for full details). The components also enable a single-source approach to creating solutions, allowing developers to develop once and deploy anywhere, taking any model and optimizing for a large number of Intel hardware platforms.

A free download of the OpenVINO toolkit is available today, putting developers on a path to produce optimized computer vision solutions that maximize performance on Intel acceleration platforms. Ecosystem partners in the Intel® IoT Solutions Alliance offer additional tools and technologies to help build innovative computer vision and IoT solutions. 

Forward-Looking Development Perspectives

Opportunities in the burgeoning fields of AI and IoT are abundant, and numerous resource and learning tools are available to anyone with the initiative to explore the various applications. International Data Corporation (IDC) projects that worldwide spending on IoT will reach USD 772 billion in 2018, up from USD 674 billion in 2017. IoT hardware represents the largest technology category in 2018; sales of modules, sensors, infrastructure, and security are projected to total USD 239 billion, with services listed as the next largest category.3

Aerial drone technology
Figure 2. Aerial drone technology opens up numerous vision computing opportunities.

Industry reports project that strong growth will continue in the computer vision market:

  • By 2022, the video analytics market is projected to reach USD 11.17 billion.4
  • By 2023, the overall computer vision market should reach USD 17.38 billion.5
  • Deep learning revenue is projected to increase from USD 655 million in 2016 to USD 35 billion by 2025.6

Developers interested in taking advantage of these technology opportunities have a number of different channels for gaining knowledge and expertise in AI and deep learning.

“I would recommend the Coursera Deep Learning Specialization and Stanford Engineering Everywhere Machine Learning course for people wanting to know more about the inner workings of modern AI,” Adam said. “For those who just want to dive straight in head first as I did (and do), I have created a number of complete walkthroughs and provided source code that is freely available through the IoT JumpWay Developer Program.” 

AI is Expanding the Boundaries of Computer Vision

Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, educational institutions, and corporations to uncover and advance solutions that address major challenges in the sciences.

For example, working with the engineering team at Honeywell, Intel is combining IoT technology and computer vision tools to help ensure safe and secure buildings.

“The Internet of Things is creating huge advancements in the way we use video to ensure safe and secure buildings. With new emerging technology like analytics, facial recognition, and deep learning, Honeywell and Intel are connecting buildings like never before. Intel is an important partner in establishing the vision of smarter video solutions for the industry, and we look forward to continued collaboration that benefits customers.”7

Jeremy Kimber, Marketing Director, Video Solutions, Honeywell

The Intel® AI technologies used in this implementation included:

Intel® Xeon® Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.


Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.


Intel® Movidius™ Myriad™ Vision Processing Unit (VPU): Create and deploy on-device neural networks and computer vision applications.


Intel® AI DevCloud: Free cloud compute for machine learning and deep learning training, powered by Intel® Xeon® Scalable processors.


Internet of Things: IoT consists of a network of devices that exchange and analyze data across linked and wireless interconnections.


For a complete look at the Intel® AI portfolio, visit https://ai.intel.com/technology.

“With the OpenVINO toolkit, we are now able to optimize inferencing across silicon from Intel, exceeding our throughput goals by almost six times. We want to not only keep deployment costs down for our customers, but also offer a flexible, high-performance solution for a new era of smarter medical imaging. Our partnership with Intel allows us to bring the power of AI to clinical diagnostic scanning and other healthcare workflows in a cost-effective manner.”8

David Chevalier, Principal Engineer, General Electric (GE) Healthcare* 

Resources

References

  1. Computer Vision Hardware, Software, and Services Market to Reach $26.2 Billion by 2025, According to Tractica. Business Wire 2018.
  2. Wheatley, Mike. Intel’s OpenVINO toolkit enables computer vision at the network edge. SiliconANGLE 2018.
  3. Worldwide Semiannual Internet of Things Spending Guide. International Data Corporation (IDC) 2017.
  4. Marketsandmarkets, Video Analytics Market 2017.
  5. Marketsandmarkets, Computer Vision Market 2017.
  6. Tractica, 2Q, 2017
  7. Intel Customer Quote Sheet. Intel Newsroom 2018.
  8. OpenVINO Toolkit. What Customers Are Saying. Intel 2018.

Visionary AI Explorations: From Finding Moon Resources to Design to Tracking Whales


moon and earth

"[Scientists] gathering of data far outpaces their ability to make sense of it. The data NASA collects far exceeds its ability to understand. The research world usually has less access to the latest and greatest compute tools than a lot of the companies out there. But as a scientist, I fundamentally believe that we need to make sure we support those efforts.”1

Naveen Rao, General Manager of Artificial Intelligence Products, Intel

Research insights gained through artificial intelligence (AI) techniques deepen our understanding of the world around us, as well as delivering discoveries about off-world environments. For example, NASA’s Frontier Development Lab (FDL)—hosted by the SETI Institute in partnership with the NASA Ames Research Center—provides a platform for applying AI solutions to the challenges of space exploration. A recent project sponsored by Intel focused on using AI to identify useful resources on the moon. Ongoing research through FDL is revealing new ways in which AI can be used in space exploration, as well as charting paths for future scientific research across diverse fields of inquiry.  

Challenge

Applications grounded in AI technologies continue to gain traction and demonstrate efficacy in science, medicine, finance, agriculture, and other sectors. At the same time, prospective early adopters seek tangible examples to guide project development and serve as proofs of concept for AI techniques.  

Solution

Practical examples of the ways in which AI can address real-world challenges are appearing with increasing frequency. This, in turn, is encouraging wider acceptance of AI technology, with successful projects ranging from space exploration breakthroughs for NASA to 3D-printed orthopedic braces that add intelligence and personalization to medical devices. These achievements are helping demonstrate applied innovation techniques and establishing a foundation for new use cases.  

Discoveries Based on Landmark AI Projects

Pioneering projects in AI are reshaping the nature of scientific inquiries and providing a richer, full-spectrum view of our surroundings, our bodies, and our human potential. Innovation is fueled by new technologies, with intelligent agents springing up everywhere from the core of data centers to the furthest reaches of the network edge. Advances driving these capabilities include improvements in processor capabilities, specialized integrated circuits (ICs) that are optimized for AI operations, computer vision advances, and software enhancement tailored to deep learning and machine learning. Innovators are imaginatively applying AI tools and techniques to further our knowledge about the natural world and extend research into space, as well as solving problems that benefit individuals at a one-to-one level.

To illustrate the ways in which AI is being practically applied, Shashi Jain, Innovation Manager in the Developer Relations Division of the Intel Visual Computing Group, has led several projects that ventured into applied innovation techniques, combining diverse technologies into fresh solutions. These have encompassed pathfinding in the Internet of Things (IoT), machine learning, and virtual reality (VR), as well as exploration into 3D-printing technology.

“We do experiments to find out what problems we can solve with our technology,” Shashi said. “As an example, a few years ago we developed an industrial wearable device built around the Intel® Edison module to help reduce back injuries in the workplace.”

“We also put microcontrollers on wine bottles,” Shashi continued, “and did some really interesting things identifying the right wine pairing for a meal. Or, finding all the bottles in your collection that meet certain criteria. Using this technique, you could track wine from the moment of bottling at the winery through distribution to a retailer to an individual wine cellar.”

“Another thing that we did,” Shashi said, “was to put a microcontroller with a sensor on a scoliosis brace. We captured pressure data to determine how well it was fitting and how long it was worn and presented this data to the user in an app, hoping it would improve compliance. We achieved that, but the real magic is what the designer of the brace did with the sensor data. The brace is fully 3D printed and it starts with a body scan. The designer used the sensor to incrementally improve the design of the brace, based on the individual patient’s own sensor data.”

“We are looking for all of these interesting use cases and fits for our technology and generating the insights that we can’t get any other way,” Shashi commented.

The following sections offer more insights into applied innovation techniques.  

Identifying Moon Resources

The NASA FDL is ushering in a new age of discovery by hosting collaborative interdisciplinary projects that address specific scientific challenges using AI-based research. Intel, along with other key private-sector partners, contributes hardware and technology to advance this endeavor.

Shashi recently led Intel's sponsorship of the FDL and brought together Intel engineers to collaborate with researchers, applying AI to identify potential resources on the moon. The research relied on a massive dataset of images from the lunar polar regions and AI-guided crater detection. "FDL," Shashi explained, "is part of a public-private partnership between the government, the SETI Institute, and industry partners to apply commercial AI to space exploration challenges. The program focuses on accelerating the pace of modern machine-learning and AI techniques at NASA. NASA has some 50 years' worth of data on lunar missions and space missions out to the planets—and everything in between—and now we have a chance to do space exploration using that data without leaving Earth."

Shashi continued, “We bring together teams of experts in their areas: machine learning, generally for a post-doctorate program, a doctorate program, or anything in-between. They can either be university researchers, industry researchers, or people who are published in the background. We bring them together for eight weeks to focus on the challenges of space exploration that are relevant to NASA or to the commercial space industry and spend a good amount of time defining the problems in advance.”

“Beyond Space, AI is proving a vital tool in identifying gene activations, diagnosing tumors, managing power and heat, developing new molecules and even teaching robots to walk in a constantly moving environment.”2

James Parr, Director, NASA Frontier Development Lab

Lunar water and volatiles project

In 2017, Intel sponsored the Lunar Water and Volatiles challenge, assembling a team to focus on recovering water and volatile chemicals from the moon. As this is an applied research accelerator, Intel guided the team to identify and define challenges relevant to actual users, who turned out to be engineers at the NASA Jet Propulsion Laboratory (JPL) and NASA’s Ames Research Center, as well as companies focused on lunar missions. What’s remarkable is that missions to recover moon and planetary resources are being planned and may be launched within five years.

“There are 10,000 decisions that need to be made for any of these missions to happen,” Shashi said. “And we are right at the front of that process. So, the engineers we talked with articulated that their missions required topographical maps of the moon. We said, ‘OK, maybe we can help you with that.’ The objective was to identify craters using the imagery these agencies had obtained from the Lunar Reconnaissance Orbiter and LCROSS missions. If you can identify craters, you can identify orientation and shadowing and create a better topographical map of the moon by ordering and combining images. The output is very precise. Right now, there are only a few areas around the equator that are very well mapped for upcoming lunar missions. The machine-learning operations will open up the other regions of the moon for a deeper analysis, including the permanently shadowed regions, which are at the poles. This is where NASA believes most of the water is.”

Accelerated identification with machine learning

The team used a methodology to combine the optical imagery with an overlay of depth sensor imagery. “We can’t fully analyze a flat image to identify a potential water source,” Shashi explained. “However, once we overlay the depth sensor data with the optical imagery and run the computer vision algorithms, we can get a positive identification of those craters likely to have water. Using this approach, we can look all over the moon for the right kind of craters even if they are in shadowed regions.”

As shown in Figure 1, two different datasets were used, relying on the craters themselves to register the optical images from the Lunar Reconnaissance Orbiter (LRO) Narrow Angle Camera (NAC) with elevation data captured using the Lunar Orbiter Laser Altimeter Digital Elevation Model (LOLA DEM). The computer vision algorithm developed by the team relied on a convolutional neural network (CNN) to analyze the optical images and elevation data using an adaptive convolution filter.

Lunar Reconnaissance Orbiter
Figure 1. Elevation measures (left) were aligned with optical images (right) to create precise lunar maps.
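The team’s exact network is not reproduced in this article. As a rough, hypothetical sketch of the approach (a two-channel input stacking an optical patch with its co-registered elevation patch; the 64 × 64 patch size and layer sizes are assumptions), a small crater classifier could be built with TensorFlow’s Keras API as follows:

import tensorflow as tf

# Hypothetical two-channel input: an optical image patch stacked with the
# co-registered LOLA DEM elevation patch (64 x 64 pixels is an assumption).
inputs = tf.keras.Input(shape=(64, 64, 2))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # crater / non-crater score

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(patches, labels, epochs=10)   # patches: (N, 64, 64, 2), labels: 0/1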

Using technology provided by Intel, the team sped up the crater identification process, requiring only 1 minute to classify 1,000 images. Results from this project showed that AI-based crater detection identified craters 100 times faster than human analysis, with a success rate of 98.4 percent.3 The Intel research team is planning to improve the algorithms that were developed so that NASA can use the technology in a potential future moon mission to harvest resources. Complete validation of the machine-learning techniques could follow when manned missions to the moon resume. At that time, maps could be adjusted and refined, and resource accessibility could be reassessed.

“We have 50 years’ worth of NASA imagery from all sides of the moon. We’ve only recently begun to combine them and make one big, awesome map.”4

Shashi Jain, Innovation Manager, Intel Visual Computing Group

NASA lunar vehicle
Figure 2. Eugene Cernan at the controls of the Apollo 17 lunar rover (courtesy of NASA).  

Converging Technologies Lead to a Personalized 3D-Printed Back Brace

Another area rich with possibilities is using data and generative design processes to inform customizations of personalized medical devices. Intel contributed to a project that uses data captured by a microcontroller to construct a 3D-printed orthopedic brace for people with scoliosis. Scoliosis—an abnormal curvature of the spine—affects millions of people and is common in children. Past-generation scoliosis braces have typically been heavy, uncomfortable, and burdensome, often causing wearers to remove them to gain some relief.

In comparison, a 3D-printed brace, designed by Studio Bitonti and commercialized by UNYQ, incorporates built-in sensors to monitor the wearer’s personal data. The sensor data contributes to the generative design process by using AI techniques to introduce incremental improvements, an area in which Intel provided expertise and design assistance. As a result, the customized, lightweight, comfortable braces can be worn for many hours during the day and are stylish enough to be worn as a fashion item outside of clothing.

The design prototype incorporated a compact Intel microcontroller that included an accelerometer and gyro, pressure sensors, and Bluetooth® technology capabilities. An app developed by Intel monitored and logged activity, pressure points, temperature, and other data. The designer, Francis Bitonti, recognized the potential in linking data to design and optimized the design to remove plastic that wasn’t therapeutic. Using a generative design technique, through multiple iterations, he enhanced the structure and design of the brace (Figure 3). Feeding data into a machine-learning or deep-learning system provides a mechanism for shaping a design for aesthetics, functionality, and materiality.

U N Y Q Align design
Figure 3. The UNYQ Align* brace detects stress and weight points during a generative design process.

The capabilities of generative design extend to fashion as well, and collaborations with Bitonti Technology, Chromat, and Intel have produced landmark designs such as the Adrenaline Dress (see Figure 4), which employs fabrics that respond to changes in the wearer’s breathing, temperature, and heart rate, expanding and contracting dynamically. As Bitonti states on their website, “Our design process is a collaboration with artificial intelligence.”

In a presentation given at a TCT Inside 3D Printing conference, Shashi talked about 3D printing, smart devices, and new manufacturing methods: “Where I think it gets much more interesting is when we start taking third-party datasets: electronic medical records, sports data, FitBit data. Every one of us has a phone, every one of us has step counting and tracking sensors. We need to take this data and funnel it into these systems to generate these same design hints. We need to apply those third-party datasets to find optimizations that make a medical product fit a user’s lifestyle better.”5

Intel’s Adrenaline dress
Figure 4. Intel’s Adrenaline dress uses smart fabric to map to wearer’s emotions.

“We believe the next generation of material innovation will be both digital and physical. In other words, designers can work with a synthesis of information and design parameters and turn it into design.”6

Francis Bitonti, Studio Bitonti  

Conducting Scientific Research with Drones and AI

Tracking whales and identifying individuals over hundreds of miles of ocean is a significant challenge that is made easier through a combination of Intel machine learning technology and unmanned aerial drones. A collaborative effort (dubbed Project Parley SnotBot) involving Parley for the Oceans, the Ocean Alliance, and Intel used drones to harvest whale spout water, emitted through the whale’s blowhole, to evaluate the biological data contained within it. Machine-learning algorithms devised by Intel can identify individual whales and perform real-time assessment of a number of factors, including the overall health of the whale. Despite limited visibility in the ocean and the unpredictable movements of the whales, drone tracking and analyzed data give researchers a means to make decisions in the field and rapidly gain access to factors such as DNA readings, presence of harmful viruses and bacteria, exposure to toxins, hormones associated with stress or pregnancy, and other conditions.

The founder of Parley for the Oceans, Cyrill Gutsch, commented, “Our vision is to create a global network of digital exploration tools that generate the big data we need to identify threats with new speed and precision, so we can act on them instantly.”7

Novel forms of data collection are one of the earmarks of AI-based solutions. Drones can collect data in difficult environments under challenging conditions. As in the previous example of tracking whales, AI techniques could be employed to use the thermal image data collected by drones to automate the identification of individual polar bears located in different environments.

The polar bear is one of the most elusive, wide-ranging animals on the planet. Polar bears are especially difficult to observe and track because their white fur provides little contrast against the snowpack. With their habitat threatened by the impact of global climate disruption, polar bears are struggling to adapt and survive. As part of a research project to learn more about polar bear migration and behavior in the Arctic, Intel teamed with Parley for the Oceans and noted wildlife photographer Ole Jørgen Liodden. Using an Intel® Falcon™ 8+ drone equipped with a thermal camera, the team was able to get close to the bears (within 50 to 100 meters) without disturbing them and collect data to better understand the bears’ habits and health status. The data helps inform wildlife researchers as well as climate change scientists to determine the effects of changing weather patterns on the animals living in this region as well as the environmental impacts.

Data tracking whales
Figure 5. Data captured tracking whales can help ensure their survival.

Traditional methods of observing polar bears include helicopter exploration, which is invasive and dangerous, and observation from a vessel, which is typically difficult because of the harsh arctic conditions. These methods can be easily retired in favor of using aerial drones equipped with cameras (Figure 7). Research projects, such as this one, provide opportunities for taking advantage both of unmanned aerial drones and AI-based data collection.

drone camera sleeping bear
Figure 6. Sleeping bear observed by the drone camera.

aerial drones unlock
Figure 7. Unmanned aerial drones unlock opportunities for new and exciting scientific research.

“Polar bears are a symbol of the Arctic. They are strong, intelligent animals. If they become extinct, there will be challenges with our entire ecosystem. Drone technology can hopefully help us get ahead of these challenges to better understand our world and preserve the Earth’s environment.”8

Ole J. Liodden  

Enabling Technologies from Intel

Hardware compute resources

Intel® AI DevCloud, powered by Intel® Xeon® Scalable processors, provides an ideal platform for machine-learning and deep-learning training and inference computing. Developers in the Intel® AI Academy like the easy access and the pre-configured environment of the Intel AI DevCloud. Portions of the projects discussed in this success story were hosted at various stages on the Intel® Deep Learning Cloud (Intel® DL Cloud) & System, which is tailored for enterprise developers.

Optimized frameworks

The Intel® Optimization for Caffe* framework, available through the Intel AI Academy, contains many optimization features tuned for CPU-based models. Intel’s contributions to Caffe*, a community-based framework developed by Berkeley AI research, improved performance when running algorithms on Intel® Xeon® processors.

Additionally, a customized deep-learning framework, Extended-Caffe*, provided an addition to the software stack so that CPU architecture can efficiently support 3D CNN computations. This makes it possible for researchers, data scientists, and developers to effectively implement projects using the CPU for 3D CNN model development, similar to the CNN techniques that proved successful for the Intel team working on the NASA FDL project.

“[People] think we are recreating a brain. But we want to go beyond that, we want to create a new kind of AI that can understand the statistics of data used in business, in medicine, in all sorts of areas, and that data is very different in nature than the actual world.”9

Amir Khosrowshahi, Chief Technology Officer, Artificial Intelligence Products Group, Intel Corporation  

AI is Expanding the Boundaries of Scientific Exploration

Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of artificial intelligence (AI) to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, educational institutions, and corporations to discover and advance solutions that address major challenges across diverse sectors.

To bring a new generation of AI-savvy developers into the fold, Intel sponsors challenges and events designed to encourage imaginative solutions to difficult problems. For example, the Intel® AI Interplanetary Challenge, launched on May 21, 2018, brings together the Planetary Society and Intel AI experts with others interested in crafting solutions to real-world space exploration challenges.

“Intel’s AI portfolio of products, tools, and optimized frameworks is uniquely designed to enable researchers and data scientists to use AI to solve some of the world’s biggest challenges, and it’s ideal for a problem such as accelerating space travel. From the moment we heard about this challenge, we were committed to applying our expertise and technology solutions to the groundbreaking work being done on applications of AI for space research. Congratulations to the research teams, and to the Intel mentors, who are advancing technology that could take us to Mars and beyond.”10

Naveen Rao, Corporate Vice President and General Manager, Artificial Intelligence Products Group, Intel

The Intel® AI technologies used in this implementation included:

Intel® Xeon® Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.


Framework optimization: Achieves faster training of deep neural networks on a robust scalable infrastructure.


Intel® AI DevCloud: Offers a free cloud compute platform for machine-learning and deep-learning training and inference.


Join today at: https://software.intel.com/ai/sign-up

For a complete look at our AI portfolio, visit https://ai.intel.com/technology.

“Scientists need to partner with AI. They can greatly benefit from mastering the tools of AI, such as deep learning and others, in order to explore phenomena that are less defined, or when they need faster performance by orders of magnitude to address a large space. Scientists can partner with machine learning to explore and investigate which new possibilities have the best likelihood of breakthroughs and new solutions.”11

Gadi Singer, Vice President and Architecture General Manager of Intel’s Artificial Intelligence Products Group  

Resources

References

  1. Heater, Brian. “NASA is using Intel’s deep learning to build better moon maps.” Techcrunch 2017.
  2. "Artificial Intelligence Accelerator 2018.” NASA Frontier Development Lab 2018.
  3. Backes, D., Bohacek, E., Dobrovolskis, A., Seabrook, T. “Automated Crater Detection Using Deep Learning.” NASA FDL 2017.
  4. Gilbert, Elissa. “Using AI to Discover the Moon’s Hidden Treasures.” iq@Intel 2018.
  5. Jain, Shashi. “Robotic Design: How to Achieve Customisation at Scale.” YouTube 2017.
  6. Bonime, Western. “Get Personal, The Future of Artificial Intelligence Design.” Forbes 2017.
  7. "From Polar Bears to Whales: Intel Pushes the Boundaries of Wildlife Research with Drone and Artificial Intelligence.” Intel Newsroom 2018.
  8. Miller Landau, Deb. “Researchers Deploy Test Drones to Track Arctic Polar Bears.” IQ by Intel October 2018.
  9. "The Many Ways to Define Artificial Intelligence.” Intel Newsroom 2018.
  10. "Intel Showcases Application of AI for Space Research at NASA FDL Event."
  11. "How is Artificial Intelligence Changing Science?” Intel Newsroom 2018.

Get Started With Unity* Machine Learning Using Intel® Distribution for Python* (Part 1)


3d objects moving across plane

Abstract

This article will show game developers how to use reinforcement learning to create better artificial intelligence (AI) behavior. Using Intel® Distribution for Python—a performance-optimized distribution of the popular object-oriented, high-level programming language—readers will glean how to train pre-existing machine-learning (ML) agents to learn and adapt. In this scenario, we will use Intel® Optimization for TensorFlow* to run Unity* ML-Agents in localized environments.


Introduction

Unity ML-Agents are a good way for game developers to learn how to apply concepts of reinforcement learning while creating a scene in the popular Unity engine. We use the ML-Agents plugin to create a simulated environment, then configure training to generate an output file from TensorFlow that the Unity scene can consume to improve the simulation.

The basic steps are as follows:

  1. Start with an introduction to reinforcement learning.
  2. Perform the setup from the "requirements.txt" file, which installs TensorFlow 1.4 and other dependencies.
  3. Train a pre-existing ML-Agent.


System Configuration

The following configuration was used:

  • Standard ASUS laptop
  • 4th Generation Intel® Core™ i7 processor
  • 8 GB RAM
  • Windows® 10 Enterprise Edition


What is Reinforcement Learning?

Reinforcement learning is a method for "training" intelligent programs—known as agents—to constantly adapt and learn in a known or unknown environment. The system advances based on receiving points that might be positive (rewards) or negative (punishments). Based on the interaction between the agents and their environment, the system then infers which action should be taken.

Some important points about reinforcement learning:

  • It differs from normal machine learning, as we don't look at a training dataset.
  • It works not with data, but with environments, through which we depict real-world scenarios.
  • It is based upon environments, so many parameters come into play as "RL" takes in lots of information to learn and act accordingly.
  • It uses potentially large-scale environments that are real-world scenarios; they might be 2D or 3D environments, simulated worlds, or a game-based scenario.
  • It relies on learning objectives to reach a goal.
  • It obtains rewards from the available environment.

The reinforcement learning cycle is depicted below.

reinforcment loop example
Figure 1. Reinforcement learning cycle.
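As a minimal sketch of this cycle (the GridEnv environment and the simple Q-table update below are hypothetical stand-ins, not part of the Unity ML-Agents API), an agent repeatedly observes a state, takes an action, receives a reward, and updates its estimate of each action's value:

import random
from collections import defaultdict

# Hypothetical environment: reset() -> state, step(action) -> (state, reward, done)
class GridEnv:
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):             # action: -1 (move left) or +1 (move right)
        self.pos = max(0, min(5, self.pos + action))
        done = self.pos == 5            # reaching cell 5 ends the episode
        reward = 1.0 if done else -0.1  # small penalty per step, reward at the goal
        return self.pos, reward, done

q = defaultdict(float)                  # Q-table: (state, action) -> estimated value
env, actions, alpha, gamma, eps = GridEnv(), [-1, 1], 0.5, 0.9, 0.1

for episode in range(200):
    state, done = env.reset(), False
    while not done:
        if random.random() < eps:
            action = random.choice(actions)                       # explore
        else:
            action = max(actions, key=lambda a: q[(state, a)])    # exploit best known action
        next_state, reward, done = env.step(action)
        # Q-learning update driven by the observed reward
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state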


How the Reward System Works

Rewards work by offering points when single or multiple agents transition from one state to another during interaction with their environment. These points are known as rewards. The more we train, the more rewards we receive, and thus the more accurate the system becomes. Environments can have many different features, as explained below.

Agents

Agents are software routines that make intelligent decisions. Agents should be able to perceive what is happening around them in the environment. The agent perceives the environment and makes decisions that result in an action, and the action the agent performs should be the optimal one. Software agents might be autonomous, or might work together with other agents or people.

flowchart
Figure 2. Flow chart showing the environment workflow.

Environments

Environments determine the parameters within which the agent interacts with its world. The agent must adapt to the environmental factors in order to learn. Environments may be a 2D or 3D world or grid.

Some important features of environments:

a) Deterministic

b) Observable

c) Discrete or continuous

d) Single or multiagent

Each of these features is explained below.

Deterministic

If we can logically infer and predict what will happen in an environment based on inputs and actions, the case is deterministic. Being deterministic, the changes that happen are very predictable for the AI, and the reinforcement learning problem becomes easier because everything is known.

Deterministic Finite Automata (DFA)

In automata theory, a system is described as "DFA" if each of its transitions is uniquely determined by its source state and input symbol. Reading an input symbol is required for each state-transition. Such systems work through a finite number of steps and can only perform one action for a state.

Non-Deterministic Finite Automata (NDFA)

If we are working in a scenario where it cannot be guaranteed which exact state the machine will move into, then it is described as "NDFA." There is still a finite number of steps, but the transitions are not unique.

Observable

If we can say the environment around us is fully observable, that environment is suitable for implementing reinforcement learning. If you consider a chess game, the environment is predictable, with a finite number of potential moves. In contrast, a poker game is not fully observable, because the next card is unknown.

Discrete or continuous

Continuing with the chess/poker scenarios, when the set of possible next moves or plays is finite, the environment is in a discrete state. If the possible states form a continuous range, we call it continuous.

Single or multiagent

Solutions in reinforcement learning can use single agents or multiple agents. When we are dealing with non-deterministic problems, we use multiagent reinforcement learning. The key to understanding reinforcement learning is in how we use the learning techniques. In multiagent solutions, the number of interactions between agents and their environments is enormous, so the key is understanding what kind of information is generally available.

Single agents struggle to converge on their own, so when convergence is required in reinforcement learning it is often handled by multiple agents in dynamic environments. In multiagent models, each agent's goals and actions impact the environment.

The following figures depict the differences between single-agent and multiagent models.

diagram
Figure 3. Single-agent system.

diagram
Figure 4. Multiagent system.


Getting Started

We will be using the Unity integrated development environment (IDE) to demonstrate reinforcement learning in game-based simulations. After creating the simulation from scratch, we will use Unity ML-Agents to showcase how reinforcement learning is implemented in the created project and observe how accurate the results are.

Step 1: Create the environment

To start, we will create an environment for the Intel Distribution for Python.

Prerequisites

Make sure you have the Anaconda* IDE installed. Anaconda is a free and open-source distribution of the Python programming language for data science and machine learning-related applications. Through Anaconda, we can install different Python libraries which are useful for machine learning.

The download link is here: https://www.anaconda.com/download/.

The command to create a new environment with an Intel build is shown below.

conda create -n idp intelpython3_core python=3

After all dependencies are installed we proceed to step 2.

Step 2: Activate the environment

Now we will activate the environment. The command is shown below.

source activate idp

Step 3: Inspect the environment

As we have activated the environment, let us check the Python version. (It should reflect the Intel one.)

(idp) C:\Users\abhic>python

Python 3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)] on win32

Type "help", "copyright", "credits" or "license" for more information.

Intel® Distribution for Python is brought to you by Intel Corporation.

Please see: https://software.intel.com/en-us/python-distribution

Step 4: Clone the GitHub* repository

We need to clone or copy the Unity ML repo from the GitHub* link while inside the activated Intel-optimized environment (i.e., named idp). To clone the repo, we use the following command:

(idp) C:\Users\abhic\Desktop>git clone https://github.com/Unity-Technologies/ml-agents.git

Step 5: Install requirements

After the cloning completes, we need to install certain requirements. The requirements.txt file is found in the python subdirectory.

(idp) C:\Users\abhic\Desktop\ml-agents\python>pip install -r requirements.txt

This will install the mandatory dependencies.

Step 6: Create the build

The build is created inside the Unity IDE and the executable is generated. The crawler executable is shown below.

3d objects moving across plane
Figure 5. Crawler executable before training.

Step 7: Optimize the build

To make the training go faster with Intel Distribution for Python, issue the following command from the Python subdirectory:

(idp) C:\Users\abhic\Desktop\ml-agents\python>python learn.py manisha.exe --run-id=manisha --train

Once the training has completed a full run, we get the byte file needed to use inside the brain, within the child object of Academy:

INFO: unityagents: Saved Model
INFO: unityagents: Ball3DBrain: Step: 50000. Mean Reward: 100.000. STD of Reward: 0.000.
INFO: unityagents: Saved Model
INFO: unityagents: Saved Model
INFO: unityagents: List of nodes to export:
INFO: unityagents:       action
INFO: unityagents:       value_estimate
INFO: unityagents:       action_probs
INFO: tensorflow: Restoring parameters from ./models/manisha\model-50000.cptk
INFO: tensorflow: Restoring parameters from ./models/manisha\model-50000.cptk
INFO: tensorflow: Froze 15 variables.
INFO: tensorflow:Froze 15 variables.
Converted 15 variables to const ops.

The byte file we generated is used for making the simulation work with machine learning.


Advantages of Using Intel® Distribution for Python*

Python was not designed for multithreading. The CPython interpreter's global interpreter lock means that only one thread executes Python bytecode at a time, and while developers can run code on other threads, those threads cannot easily run Python code in parallel. Intel Distribution for Python features thread-enabled libraries, so consider Intel® Threading Building Blocks (Intel® TBB) as a potential tool for multithreading.
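Assuming the TBB module for Python that ships with the distribution is installed (package availability varies by release, so treat this as an assumption), a training script can be launched under the TBB thread scheduler with a command along these lines, reusing the same learn.py invocation shown earlier:

python -m tbb learn.py manisha1.exe --run-id=manisha1 --train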

The advantages of using Intel Distribution for Python with Unity ML-Agents are as follows:

  • The training process is much faster.
  • The CPU version of TensorFlow involves less overhead.
  • Handling multiple agents using the Intel-optimized pipeline is easier and faster.


Unity* ML-Agents v 0.3

Unity ML-Agents are constantly evolving, with updates responding to community feedback. Version 0.3 adds support for imitation learning, which is different from reinforcement learning. The most common imitation learning method is "behavioral cloning," a method applied to neural networks (specifically Convolutional Neural Networks, or CNNs) to replicate a behavior, such as a self-driving car environment where the goal is for the system to drive the car as humans do.


Imitation Learning

Generally, when we are talking about "imitation learning," we refer to learning by demonstration. The demonstration provides behavior patterns that are analyzed and used to generate learning signals for the agent. In the table below, you can see the differences between imitation learning and reinforcement learning.

Imitation learning versus reinforcement learning.

Imitation learning | Reinforcement learning
The process of learning happens through demonstration. | Involves learning from rewards and punishments.
No reward or punishment mechanism is required. | Based on trial-and-error methods.
Generally evolved for real-time interaction. | Specifically meant for speedy simulation methods.
After training, the agent becomes “human-like” at performing a task. | After training, the agent becomes “optimal” at performing a task.


TensorFlowSharp

In this section, we will cover some basics of TensorFlowSharp. First released in 2015, TensorFlow is Google's open-source library for dataflow programming and framework for deep learning. TensorFlow does not provide a native C# API, so the internal brain, which is written in C#, is not natively supported. To enable the internal brain for machine learning, we need to utilize TensorFlowSharp, a third-party library whose specific purpose is binding the .NET framework to TensorFlow.


Running the Examples

We will now go through an example of a Unity ML-Agents project to implement imitation learning. The process will involve the following steps.

  1. Include the TensorFlowSharp Unity Plugin.
  2. Launch Unity IDE.
  3. Find the example folder, which is inside Assets. There is a subfolder within the ML-Agents project folder named "Examples." We will work with the example named Crawler. Every change will occur inside this folder.

As we are working to create an environment for training, we will have to set the brain used by the agents to External. Doing this will allow the agents to communicate with the external training process when they are trying to make decisions.

We are exploring the example project Crawler. The setup is a creature with four limbs, from each of which extends a shorter limb, or forearm (see figure 5 above). For this scenario to be successful, we have the following goal: The agent must move its body along the x axis without falling.

We need to set some parameters to fine-tune the example. The environment contains three agents linked to a single brain. Inside Academy, locate the child object "CrawlerBrain" within the Inspector window. Set the Brain type to External.

Next, open Player Settings:

  1. Go to Menu > Edit > Project Settings > Player.
  2. Under Options, go to Resolution and Presentation.

Check that "Run in Background" is selected. Then check that the resolution dialog is set to "disabled." Finally, click "Build." Save within the Python folder. Name it "Manisha1" and then save it.

Unity interface
Figure 6. Saving the build within the Python* folder.


Train the Brain

Now we will work on training the brain. To open the Anaconda prompt, use the Windows search option and type in Anaconda. Once inside the Anaconda prompt, we need to find out which environments are available.

(base) C:\Users\abhic>conda info --envs

# conda environments:
#
base                  *  C:\ProgramData\Anaconda3
idp                      C:\Users\abhic\AppData\Local\conda\conda\envs\idp
tensorflow-gpu           C:\Users\abhic\AppData\Local\conda\conda\envs\tensorflow-gpu
tf-gpu                   C:\Users\abhic\AppData\Local\conda\conda\envs\tf-gpu
tf-gpu1                  C:\Users\abhic\AppData\Local\conda\conda\envs\tf-gpu1
tf1                      C:\Users\abhic\AppData\Local\conda\conda\envs\tf1

We will pass the following command:

(base) C:\Users\abhic>activate idp

Intel Distribution for Python and Intel Optimization for TensorFlow are installed in the environment idp. Next, with idp activated, we open the cloned folder on the desktop.

(idp) C:\Users\abhic\Desktop\ml-agents>

As we have saved the .exe file in the Python subdirectory, we will locate it there.

(idp) C:\Users\abhic\Desktop\ml-agents>cd python

Using the directory command dir we can list the items in the Python subfolder:

We display the contents of the folder to make it easier to identify the files inside the python subfolder, which holds the default code and the supporting libraries used to train the ML-Agents. Because we created the build for the Unity scene, an executable named "manisha1.exe" was generated, along with a data folder named "manisha1_Data."

Directory of C:\Users\abhic\Desktop\ml-agents\python

28-05-2018  06:28                   .
28-05-2018  06:28                   ..
21-05-2018  11:34             6,635 Basics.ipynb
21-05-2018  11:34                   curricula
21-05-2018  11:34             2,685 learn.py
29-01-2018  13:48           650,752 manisha.exe
29-01-2018  13:24           650,752 manisha1.exe
28-05-2018  06:28                   manisha1_Data
21-05-2018  11:58                   manisha_Data
21-05-2018  12:00                   models
21-05-2018  11:34               101 requirements.txt
21-05-2018  11:34               896 setup.py
21-05-2018  12:00                   summaries
21-05-2018  11:34                   tests
21-05-2018  11:34             3,207 trainer_config.yaml
21-05-2018  12:00                24 unity-environment.log
21-05-2018  12:00                   unityagents
29-01-2018  13:55        36,095,936 UnityPlayer.dll
21-05-2018  11:34                   unitytrainers
18-01-2018  04:44            42,704 WinPixEventRuntime.dll
              10 File(s)     37,453,692 bytes
              10 Dir(s)  1,774,955,646,976 bytes free

Look inside the subdirectory to locate the executable "manisha1." We are now ready to use Intel Distribution for Python and Intel Optimization for TensorFlow to train the model. For training, we will use the learn.py file. The command for using Intel-optimized Python is shown below.

(idp) C:\Users\abhic\Desktop\ml-agents\python>python learn.py manisha1.exe --run-id=manisha1 --train

INFO:unityagents:{'--curriculum': 'None',
 '--docker-target-name': 'Empty',
 '--help': False,
 '--keep-checkpoints': '5',
 '--lesson': '0',
 '--load': False,
 '--run-id': 'manisha1',
 '--save-freq': '50000',
 '--seed': '-1',
 '--slow': False,
 '--train': True,
 '--worker-id': '0',
 '': 'manisha1.exe'}
INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains: 1
        Lesson number: 0
        Reset Parameters:
Unity brain name: CrawlerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 117
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 12
        Vector Action descriptions: , , , , , , , , , , ,
2018-05-28 06:57:56.872734: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
C:\Users\abhic\AppData\Local\conda\conda\envs\idp\lib\site-packages\tensorflow\python\ops\gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape."
INFO: unityagents: Hyperparameters for the PPO Trainer of brain CrawlerBrain:
        batch_size:     2024
        beta:   0.005
        buffer_size:    20240
        epsilon:        0.2
        gamma:  0.995
        hidden_units:   128
        lambd:  0.95
        learning_rate:  0.0003
        max_steps:      1e6
        normalize:      True
        num_epoch:      3
        num_layers:     2
        time_horizon:   1000
        sequence_length:        64
        summary_freq:   3000
        use_recurrent:  False
        graph_scope:
        summary_path:   ./summaries/manisha1
        memory_size:    256
INFO:unityagents: CrawlerBrain: Step: 3000. Mean Reward: -5.349. Std of Reward: 3.430.
INFO:unityagents: CrawlerBrain: Step: 6000. Mean Reward: -4.651. Std of Reward: 4.235.
The parameters above set up the training process. After the training process is complete (it can be lengthy) we get the following details in the console:
INFO: unityagents: Saved Model
INFO: unityagents: CrawlerBrain: Step: 951000. Mean Reward: 2104.477. Std of Reward: 614.015.
INFO: unityagents: CrawlerBrain: Step: 954000. Mean Reward: 2203.703. Std of Reward: 445.340.
INFO:unityagents: CrawlerBrain: Step: 957000. Mean Reward: 2205.529. Std of Reward: 531.324.
INFO:unityagents: CrawlerBrain: Step: 960000. Mean Reward: 2247.108. Std of Reward: 472.395.
INFO:unityagents: CrawlerBrain: Step: 963000. Mean Reward: 2204.579. Std of Reward: 554.639.
INFO:unityagents: CrawlerBrain: Step: 966000. Mean Reward: 2171.968. Std of Reward: 547.745.
INFO:unityagents: CrawlerBrain: Step: 969000. Mean Reward: 2154.843. Std of Reward: 581.117.
INFO:unityagents: CrawlerBrain: Step: 972000. Mean Reward: 2268.717. Std of Reward: 484.157.
INFO:unityagents: CrawlerBrain: Step: 975000. Mean Reward: 2244.491. Std of Reward: 434.925.
INFO:unityagents: CrawlerBrain: Step: 978000. Mean Reward: 2182.568. Std of Reward: 564.878.
INFO:unityagents: CrawlerBrain: Step: 981000. Mean Reward: 2315.219. Std of Reward: 478.237.
INFO:unityagents: CrawlerBrain: Step: 984000. Mean Reward: 2156.906. Std of Reward: 651.962.
INFO:unityagents: CrawlerBrain: Step: 987000. Mean Reward: 2253.490. Std of Reward: 573.727.
INFO:unityagents: CrawlerBrain: Step: 990000. Mean Reward: 2241.219. Std of Reward: 728.114.
INFO:unityagents: CrawlerBrain: Step: 993000. Mean Reward: 2264.340. Std of Reward: 473.863.
INFO:unityagents: CrawlerBrain: Step: 996000. Mean Reward: 2279.487. Std of Reward: 475.624.
INFO:unityagents: CrawlerBrain: Step: 999000. Mean Reward: 2338.135. Std of Reward: 443.513.
INFO:unityagents:Saved Model
INFO:unityagents:Saved Model
INFO:unityagents:Saved Model
INFO:unityagents:List of nodes to export :
INFO:unityagents:       action
INFO:unityagents:       value_estimate
INFO:unityagents:       action_probs
INFO:tensorflow:Restoring parameters from ./models/manisha1\model-1000000.cptk
INFO:tensorflow:Restoring parameters from ./models/manisha1\model-1000000.cptk
INFO:tensorflow:Froze 15 variables.
INFO:tensorflow:Froze 15 variables.
Converted 15 variables to const ops.

Integration of the Training Brain with the Unity Environment

The idea behind using Intel Distribution for Python is to speed up the training process. Some examples will require more time to complete the training process because of the large number of steps.

Since TensorFlowSharp is still in the experimental phase, it is disabled by default. To enable TensorFlow and make the internal brain available, follow these steps:

  1. Make sure that the TensorFlow plugin is present in the Assets folder. Within the Project tab, navigate to this path: Assets->ML-Agents->Plugins->Computer.
  2. Open Edit > Project Settings > Player to enable TensorFlow and .NET support. Set Scripting Runtime Version to Experimental (.NET 4.6).
  3. Open Scripting Define Symbols and add the following text: ENABLE_TENSORFLOW.
  4. Press the Enter key and save the project.


Bringing the Trained Model to Unity

After the training process is over, the TensorFlow framework creates a byte file for the project. Locate the model created during the training process under crawler/models/manisha1.

The bytes file generated when training completes takes its name from the executable built for the Crawler scene; that is, the file name is the name of the environment with the bytes extension.

If "manisha1.exe" is the executable file, then the byte file generated will be "manisha1.bytes," which follows the convention <env-name>.bytes.

  1. Copy the generated bytes file from the models folder to the TF Models subfolder.
  2. Open up the Unity IDE and select the crawler scene.
  3. Select the brain from the scene hierarchy.
  4. Change the type of brain to internal.
  5. Drag the .bytes file from the project folder to the graph model placeholder in the brain inspector, and hit play to run it.

The output should look similar to the screenshot below.

Unity interface
Figure 7. Executable created without the internal brain activated.

We then build the project with the internal brain. An executable is generated.

Unity interface
Figure 8. The executable after building the project with the internal brain.


Summary

Unity and Intel are lowering the entry barrier for game developers who seek more compelling AI behavior to boost immersion. Intelligent agents, each acting with dynamic and engaging behavior, offer promise for more realism and better user experiences. The use of reinforcement learning in game development is still in its early phase, but it has the potential to be a disruptive technology. Use the techniques and resources listed in this article to get started creating your own advanced game-play, and see what the excitement is all about.


Resources

Using Intel® Xeon® processors for Multi-node Scaling of TensorFlow* with Horovod*


TensorFlow* is one of the leading Deep Learning (DL) and machine learning frameworks today. In 2017, Intel worked with Google* to incorporate optimizations for Intel® Xeon® processor-based platforms using Intel® Math Kernel Library (Intel® MKL)4. Optimizations such as these with multiple popular frameworks have led to orders of magnitude improvement in performance—up to 127 times 2 higher performance for training and up to 198 times1 higher performance for inference. For TensorFlow, Intel updated the optimizations and performance results for a number of DL models running on the Intel® Xeon® Scalable processor2  3.

Intel has mainly been reporting out Intel® Optimization for TensorFlow performance improvements on single nodes2  3. However, some complex DL models train more efficiently using multi-node training configurations. They either don't fit in one machine, or their time-to-train can be significantly reduced if they are trained on a cluster of machines. Therefore, Intel has also performed scaling studies on multi-node clusters of Intel Xeon Scalable processors. This article describes distributed training performance on a cluster of Intel® Xeon® platforms using a Horovod*-based configuration option for the TensorFlow framework.

Horovod, which was developed by Uber*, uses the Message Passing Interface (MPI) as the main mechanism of communication. It uses MPI concepts such as allgather and allreduce to handle the cross-replicas communication and weight updates. OpenMPI* can be used with Horovod to support these concepts. Horovod is installed as a separate Python* package. By calling Horovod's API from the Deep Learning Neural Network's model script, a regular build of TensorFlow can be used to run distributed training. With Horovod, there is no additional source code change required in TensorFlow to support distributed training with MPI.
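To make this pattern concrete, the sketch below (a toy TensorFlow 1.x model with random data, not one of the benchmark models discussed later) shows the typical Horovod usage: initialize Horovod, wrap the optimizer so gradients are averaged with allreduce, and broadcast the initial variables from rank 0:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                    # one Horovod rank per MPI process

# Toy model standing in for a real network (random data, one dense layer).
x = tf.random_normal([32, 10])
y = tf.random_normal([32, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())   # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                          # allreduce gradients via MPI
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),                # sync initial weights from rank 0
         tf.train.StopAtStepHook(last_step=200)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)

A script like this is then launched with mpirun, exactly as shown in the commands later in this article.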

Scaling Results Using Uber Horovod* with TensorFlow* 1.7

In this section, we show the performance numbers of Intel Xeon processor optimizations for TensorFlow 1.7 + ResNet-50* and Inception-v3* training, running on up to 64 nodes containing Intel® Xeon® Gold processors. A real training dataset was used to perform these runs. As shown in the charts below, by running one MPI process per node, ResNet-50 was able to maintain at least 89.1 percent scalability for up to 64 nodes, while Inception-v3 could achieve at least 89.4 percent5. So, with the higher throughput for ResNet-50 and Inception-v3, time to train is reduced significantly. Although this study shows the scaling for up to 64 nodes, it is expected that the same scalability rate would carry over to 128 nodes.

Performance Scaling for ResNet 50 and InceptionV3
Figure 1. Up to 89 percent (ResNet-50* and Inception-v3*) of scaling efficiency for TensorFlow* 1.7 can be achieved for 64 nodes of Intel® Xeon® Gold processors using one MPI process/node.

The user can also run the same models with two MPI processes running on each node. As shown in the table below, we can get up to 17 percent and 24 percent performance improvements for ResNet-50 and Inception-v3, respectively5, with no extra hardware cost. Please note that the batch size per node remains the same as what we used for running one MPI process per node.

Model | Batch Size per Node | Gain of TensorFlow* with Horovod* versus without Horovod on Two Sockets
ResNet-50* | 128 | 17%
Inception-v3* | 128 | 24%

Thus, by running two MPI processes per node as shown in the two graphs below, ResNet-50 was able to maintain at least 94.1 percent scalability for up to 64 nodes, while Inception-v3 could achieve at least 87.4 percent5. So, with higher throughput for ResNet-50 and Inception-v3, time to train is reduced significantly, even faster than using one MPI process per node.

Performance Scaling for ResNet 50 and InceptionV3
Figure 2. Up to 94 percent of scaling (parallel efficiency) can be achieved for TensorFlow* 1.7 for 64 Intel® Xeon® Gold processors, using two MPI processes/node.

Gathering and Installing Relevant Software Tools

1. OpenMPI can be installed via the Yellowdog Updater, Modified* (YUM) package manager on recent versions of CentOS*. Some existing clusters already have OpenMPI available. In this article, we use OpenMPI 3.0.0, which can be installed following the instructions at Open MPI: Version 3.0.

2. A recent GNU Compiler Collection* (GCC) version is needed; GCC version 6.2 or newer is recommended. See GCC, the GNU Compiler Collection for the latest installation.

3. Python versions 2.7 and 3.6 have both been tested.

4. Uber Horovod supports running TensorFlow in distributed fashion. Horovod can be installed as a standalone Python package as follows:

pip install --no-cache-dir horovod (for example, horovod-0.11.3)

Alternatively, install Horovod from source.

5. The current TensorFlow benchmarks have recently been modified to use Horovod. You can obtain the benchmark code from GitHub*, and run tf_cnn_benchmarks.py as explained below.

$ git clone https://github.com/tensorflow/benchmarks
$ cd benchmarks/scripts/tf_cnn_benchmarks
$ python tf_cnn_benchmarks.py

Running TensorFlow Benchmark Using Horovod with TensorFlow

Here, we discuss the commands needed to run distributed TensorFlow using the Horovod framework. For the hardware platform, we use a dual-socket Intel® Xeon® Gold 6148 processor-based cluster system. For networking, 10 Gb Ethernet is used. Mellanox InfiniBand* or Intel® Omni-Path Architecture (Intel® OPA) can also be used for networking the cluster.

Running two MPI processes on a single node:

export LD_LIBRARY_PATH=<path to OpenMP lib>:$LD_LIBRARY_PATH
export PATH=<path to OpenMPI bin>:$PATH
export inter_op=2
export intra_op=18 {# cores per socket}
export batch_size=64 
export MODEL=resnet50 {or inception3}
export python_script={path to the tf_cnn_benchmarks.py script}

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by socket --oversubscribe --report-bindings -n 2 python $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

For one MPI process per node, the configuration is as follows. The other environment variables will be the same:

export intra_op=38
export batch_size=128 

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS --bind-to none --report-bindings  -n 1 python  $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

Please note that if you want to train models to achieve good accuracy, use the configuration flag --distortions=True. Other hyper-parameters may also need to be adjusted.

For running a model on a multi-node cluster, a similar script to the one above is used. For example, to run on a 64-node cluster (two MPI processes per node), where each node is an Intel Xeon Gold 6148 processor, the distributed training can be launched as shown below. All the export lists will be the same as above:

mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by node  --report-bindings -hostfile host_names  -n 128 python  $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

Here, the host_names file is the list of hosts that you want to run the workload on.
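As an illustration only (the host names are placeholders), an OpenMPI hostfile for the two-MPI-processes-per-node case could look like the following, with one line per node, continuing in the same pattern for all 64 nodes:

node001 slots=2
node002 slots=2
node003 slots=2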

What Distributed TensorFlow Means for Deep Learning Training on Intel® Xeon® Processors

Various efforts have been made to implement distributed TensorFlow on CPUs and graphics processing units; for example, Remote Procedure Call (gRPC), Remote Direct Memory Access (RDMA), and TensorFlow's built-in MPI support—all of these technologies are incorporated within the TensorFlow codebase. Uber Horovod is one distributed TensorFlow technology that was able to harness the power of Intel Xeon processors. It uses MPI underneath, with ring-based allreduce and allgather for the DL parameters. As shown above, Horovod on Intel Xeon processors demonstrates great scaling for existing DL benchmark models, such as ResNet-50 (up to 94 percent) and Inception-v3 (up to 89 percent) for 64 nodes5. In other words, time to train a DL network can be accelerated by as much as 57 times (ResNet-50) and 58 times (Inception-v3) using 64 Intel Xeon processor nodes, compared to a single Intel Xeon processor node. Thus, Intel recommends that TensorFlow users use Intel® Optimization for TensorFlow and Horovod MPI for multi-node training on Intel Xeon Scalable processors.

Acknowledgements

The authors (Mahmoud Abuzaina, Ashraf Bhuiyan, Wei Wang) would like to thank Vikram Saletore, Mikhail Smorkalov, and Srinivas Sridharan for their collaboration with us on using Horovod with TensorFlow.

References

1. Performance is reported at Amazing Inference Performance with Intel® Xeon® Scalable Processors

2. The results are reported at TensorFlow* Optimizations on Modern Intel® Architecture

3. The updated results are in TensorFlow* Optimizations for the Intel® Xeon® Scalable Processor

4. Refer to GitHub for more details on Intel® MKL-DNN optimized primitives

5. System configuration

TensorFlow* Source Code: https://github.com/tensorflow/tensorflow
TensorFlow Commit ID: 024aecf414941e11eb643e29ceed3e1c47a115ad
CPU:
   Thread(s) per core: 2
   Core(s) per socket: 20
   Socket(s): 2
   NUMA node(s): 2
   CPU family: 6
   Model: 85
   Model name: Intel® Xeon® Gold 6148 Processor @ 2.40GHz
   Stepping: 4
   Hyper Threading: ON
   Turbo: ON
Memory: 192GB (12 x 16GB) 2666MT/s
Disks: Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)
BIOS: SE5C620.86B.00.01.0013.030920180427
OS: Red Hat* Enterprise Linux* Server release 7.4 (Maipo)
Kernel: 3.10.0-693.21.1.0.1.el7.knl1.x86_64

Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Use Machine Learning to Detect Defects on the Steel Surface


Definition

Project overview

Surface quality is an essential parameter for steel sheet. In the steel industry, manual defect inspection is a tedious assignment; consequently, it is difficult to guarantee a flawless steel surface. To meet user requirements, vision-based automatic steel surface inspection strategies have proven to be exceptionally powerful and prevalent solutions over the past two decades1.

The input is taken from the NEU surface defect database2, which is available online. This database contains six types of defects including crazing, inclusion, patches, pitted surface, rolled-in-scale, and scratches.

Problem statement

The challenge is to provide an effective and robust approach to detect and classify metal defects using computer vision and machine learning.

Applying image processing techniques such as filtering and extracting features from the image provides a good basis for training a model that can determine which type of defect the steel plate has. This solution can even be used in real-time applications.

Metrics

The evaluation is done using an accuracy metric. The accuracy of the system is given by:

Accuracy = (number of correctly classified samples) / (total number of samples)

Because the classes are balanced, accuracy is an appropriate metric to evaluate the project. The accuracy tells us how well the algorithm is classifying the defects.

Analysis

Data exploration

The NEU surface dataset2 contains 300 pictures of each of six defect types (a total of 1,800 images). Each image is 200 × 200 pixels. The images in the dataset are given in .bmp format and are gray-level images of 40.1 KB each. A few samples are shown in figure 1.

samples of defects
Figure 1. Samples of defects (a) crazing, (b) inclusion, (c) patches, (d) pitted surface, (e) rolled-in-scale, and (f) scratches.

Exploratory visualization

The following chart shows the histogram of images per class.

histograms of sample defects
Figure 2. Histogram samples of defects: (a) crazing, (b) inclusion, (c) patches, (d) pitted surface, (e) rolled-in-scale, and (f) scratches.

An image histogram is a graphical representation of the tonal distribution in a digital image. The horizontal axis of the graph represents the intensity variations; the vertical axis represents the number of pixels at each intensity. A histogram therefore gives an overview of the image's contrast, which I used as a feature. From figure 2 it can be observed that the histogram of each class is visually distinguishable, which makes contrast an important feature to include in the feature vector.

As said earlier, the classes are well balanced, justifying accuracy as an evaluation metric.

Algorithms and techniques

Different classifiers such as k-nearest neighbors (KNN), support vector classifier (SVC), gradient boosting, random forest classifier, AdaBoost (adaptive boosting), and decision trees will be compared.

Texture features such as contrast, dissimilarity, homogeneity, energy, and asymmetry will be extracted from the gray-level co-occurrence matrix (GLCM), and used for training the classifiers.

SVM

SVM classifiers can be linear or nonlinear. A nonlinear classifier maps the input pattern into a higher-dimensional feature space: data that are linearly separable can be separated with a hyperplane, while data that are not linearly separable are handled with a kernel function, such as a higher-order polynomial. The SVM classification algorithm is based on different kernel methods; that is, the radial basis function (RBF), linear, and quadratic kernel functions. The RBF kernel is applied to two samples, x and x', which represent feature vectors in some input space, and it can be defined as:

K(x, x') = exp(−‖x − x'‖² / (2σ²))

The value of the kernel function decreases with distance, and ranges between zero (in the limit) and one (when x = x').

graph
Figure 3. Hyperplane in feature space.
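To connect the formula to code, the short NumPy sketch below (the value of sigma is chosen arbitrarily) evaluates the RBF kernel for a pair of feature vectors; it returns 1.0 when x equals x' and falls toward zero as the distance grows:

import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - x_prime) ** 2 / (2.0 * sigma ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.0, 2.0])))   # 1.0 when x equals x'
print(rbf_kernel(np.array([1.0, 2.0]), np.array([4.0, 6.0])))   # approaches 0 as distance grows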

AdaBoost algorithm

Input: Data set D = {(x1, y1), (x2, y2), ..., (xm, ym)}

Base learning algorithm L; number of learning rounds T.

Algorithm:

Initialize the weight distribution: D1(i) = 1/m.

For t = 1, ..., T:

Train a learner ht from D using distribution Dt: ht = L(D, Dt).

Measure the error of ht: Et = Σi Dt(i) · [ht(xi) ≠ yi].

If Et > 0.5 then break.

Compute the weight of the weak classifier ht(x): αt = (1/2) ln((1 − Et) / Et).

Update the distribution, where Zt is a normalization factor that makes D(t+1) a valid distribution:

D(t+1)(i) = Dt(i) · exp(−αt · yi · ht(xi)) / Zt

K-Nearest neighbor algorithm

  1. A value of K is defined (K>0), along with the new data sample.
  2. We select the K entries in our database that are nearest to the new testing sample.
  3. We find the most common classification among these entries.
  4. This is the classification we give to the new sample using the value of K.
  5. If the result is not adequate, change the value of K until the reasonable level of correctness is achieved.

Decision trees algorithm

  1. Create a root node for the tree.
  2. If all examples are positive, return leaf node ‘positive’.
  3. Else if all examples are negative, return leaf node ‘negative’.
  4. Calculate the entropy of the current state.
  5. For each attribute, calculate the entropy with respect to the attribute ‘x’.
  6. Select the attribute that has maximum value of information gain (IG).
  7. Remove the attribute that offers highest IG from the set of attributes.
  8. Repeat until we run out of all attributes or the decision tree has all leaf nodes.

Random Forest

Random forest is an ensemble of decision trees. Each tree is trained on a random subset of the data (and of the features), and the forest combines their votes, which reduces the over-fitting that is usually seen when a single decision tree is fit to the entire dataset.

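A corresponding scikit-learn sketch (hyperparameter values are illustrative, not the author's settings; X_train and the other arrays are the assumed split used throughout):

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample with random feature subsets;
# the forest predicts by majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=9)
rf.fit(X_train, y_train)
print("Random forest test accuracy:", rf.score(X_test, y_test))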

Benchmark

I uploaded a basic model to GitHub* that uses the KNN algorithm to classify the images and achieves 75.27 percent accuracy. This is the benchmark model whose accuracy I will try to improve on. The link is provided in the References under the steel_plate repository.

Methodology

Data preprocessing

No preprocessing is used on the input images, as the defects of the steel plate heavily depend on the texture of its surface and, as we are using textural features, any preprocessing method such as smoothing or sharpening will change its texture.

Implementation

The following flowchart represents the entire workflow of the project.

workflow chart
Figure 4. Project workflow

The project starts with loading the images and extracting texture features such as contrast, dissimilarity, homogeneity, energy, and asymmetry. The features and their labels are then passed to the train_test_split function from the scikit-learn library, which splits the data and labels into 80 percent for training and 20 percent for testing.

The 80 percent portion was used to train the different classifiers, and testing was done on the remaining 20 percent of the data. The model that gave the highest accuracy was then selected as the main model.
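
A minimal sketch of this step (features and labels are assumed to be the GLCM feature matrix and class labels; the random seed and the stratify option are my placeholders, not necessarily the project's settings):

from sklearn.model_selection import train_test_split

# 80/20 split; stratify keeps the six classes balanced across both sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.20, random_state=9, stratify=labels
)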

The GLCM feature extraction is given below:

GLCM is a statistical technique that characterizes the texture of an image by representing the surface as gray-level variations over a two-dimensional array of pixels. The relationships between pairs of pixel values are used to describe properties such as the contrast and the energy of a region of interest. In this project the GLCM is calculated in four directions, 0°, 45°, 90°, and 135°, and for four distances: 1, 2, 3, and 4 pixels.

GLCM is a well-established numerical technique for feature extraction. It records how often different combinations of pixel gray levels occur in an image. A co-occurrence matrix depicts the joint gray-level histogram of the image (or a region of the image) in the form of a matrix with dimensions Ng × Ng, where Ng is the number of gray levels.

Directional analysis graph
Figure 5. Directional analysis of GLCM.

An integer offset array specifies the distance between the pixel of interest and its neighbor. Each row in the array is a two-element vector that specifies the displacement between a pair of pixels. Because the offset is usually expressed as an angle, the common angles (0°, 45°, 90°, and 135°) correspond to offset vectors defined in terms of the pixel distance D.

formation of a GLCM matrix
Figure 6. Formation of GLCM matrix.

Features used in this method are as follows: contrast, dissimilarity, homogeneity, energy, and asymmetry.

Table 1. GLCM features.

Sr. No. | Feature | Formula
1. | Contrast | Σi,j P(i, j)(i − j)²
2. | Homogeneity | Σi,j P(i, j) / (1 + (i − j)²)
3. | Dissimilarity | Σi,j P(i, j)|i − j|
4. | Energy | √(Σi,j P(i, j)²)
5. | Asymmetry (ASM) | Σi,j P(i, j)²

where P(i, j) is the normalized co-occurrence matrix entry for gray levels i and j.
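
A possible scikit-image implementation of this feature extraction is sketched below. Two points are my assumptions rather than the author's documented choices: the 4 distances × 4 angles property matrix is reduced to one value per feature by averaging, and "ASM" stands in for the asymmetry feature named above. Older scikit-image releases spell the functions greycomatrix/greycoprops.

import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.io import imread
from skimage.util import img_as_ubyte

def glcm_features(path):
    img = img_as_ubyte(imread(path, as_gray=True))   # GLCM needs integer gray levels
    glcm = graycomatrix(
        img,
        distances=[1, 2, 3, 4],
        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],  # 0, 45, 90, 135 degrees
        levels=256,
        symmetric=True,
        normed=True,
    )
    props = ["contrast", "dissimilarity", "homogeneity", "energy", "ASM"]
    # Average each property over the 4 x 4 distance/angle combinations.
    return np.hstack([graycoprops(glcm, p).mean() for p in props])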

Gradient boosting combines two ideas: gradient descent and boosting as used in AdaBoost. It builds the model in a stage-wise, forward fashion and optimizes a differentiable loss function, and the algorithm is highly customizable for a specific application. Because boosting concentrates on the samples that are hardest to classify, including those near class boundaries, it helps to increase the accuracy of the classifier.

The gradient boosting algorithm in detail is as follows:

Input: Training set {(xi, yi)}, i = 1, ..., n; a differentiable loss function L(y, F(x)); and the number of iterations M.

Algorithm

  1. Initialize the model with a constant value: F0(x) = argminγ Σi L(yi, γ).
  2. For m = 1, 2, ..., M:
    • Compute the so-called pseudo-residuals: rim = −[∂L(yi, F(xi)) / ∂F(xi)], evaluated at F = Fm−1, for i = 1, ..., n.
    • Fit a base learner hm(x) to the pseudo-residuals; that is, train it using the set {(xi, rim)}, i = 1, ..., n.
    • Compute the multiplier γm by solving the following one-dimensional optimization problem: γm = argminγ Σi L(yi, Fm−1(xi) + γ hm(xi)).
    • Update the model: Fm(x) = Fm−1(x) + γm hm(x).
  3. Output: the final model FM(x).
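
scikit-learn packages this procedure as GradientBoostingClassifier; a minimal usage sketch with the assumed split from the Implementation section:

from sklearn.ensemble import GradientBoostingClassifier

# n_estimators is the number of boosting stages M; learning_rate scales each
# stage's contribution (both values shown here are scikit-learn defaults).
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb.fit(X_train, y_train)
print("Gradient boosting test accuracy:", gb.score(X_test, y_test))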

Initially, smoothing or sharpening of the image was considered as a preprocessing step. It was later observed that such preprocessing disrupts the textural features of the image, which has a negative impact on the output of the classifier. The preprocessing question was therefore resolved by omitting it and, as mentioned in the Data Preprocessing section, no preprocessing was used in this project.

Refinement

The selection of algorithms and parameter tuning is an important aspect of machine learning. In this approach, the gradient boosting algorithm was selected, which combines two machine learning ideas: gradient descent and boosting as in AdaBoost. Boosting turns weak learners into a strong learner, reducing false alarms and improving accuracy. The number of boosting stages was tuned to obtain the best accuracy.

In the gradient boosting model, the boosting factor (n_estimators) was tweaked to 80 (from the default value of 100).

Table 2. Hyperparameter values and accuracy.

n_estimators | Accuracy (%)
80 | 92.50
90 | 92.22
100 | 91.60
110 | 91.38
500 | 90.00

The default value of n_estimators is 100, which gives 91.6 percent accuracy in the initial results. When the value of n_estimators is set to 80 the accuracy increases to 92.5 percent, which is our final result.
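
The tuning can be expressed as a simple loop over candidate n_estimators values (a sketch; the random seed is a placeholder added to make the comparison repeatable):

from sklearn.ensemble import GradientBoostingClassifier

best_n, best_acc = None, 0.0
for n in (80, 90, 100, 110, 500):
    gb = GradientBoostingClassifier(n_estimators=n, random_state=9)
    gb.fit(X_train, y_train)
    acc = gb.score(X_test, y_test)
    print(n, round(acc * 100, 2))
    if acc > best_acc:
        best_n, best_acc = n, acc
print("Best n_estimators:", best_n)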

Results

The following table shows the accuracy comparison of different classifiers:

Table 3. Performance evaluation.

Sr. No | Classifier | Accuracy (%)
1 | KNN | 75.27
2 | AdaBoost | 51.11
3 | SVC | 14.72
4 | Decision Tree | 88.33
5 | Random Forest | 89.44
6 | Gradient Boosting | 92.50

accuracy comparison graph
Figure 7. Accuracy comparison graph.

From the above table and graph we can observe that gradient boosting gives the highest accuracy, 92.5 percent. The confusion matrix for testing with the gradient boosting classifier is given below.

As the extracted textural features are based on GLCM, variations in light intensities may negatively affect the result of the model.

Table 4. KFold CV results.

Sr. No | Classifier | CV accuracy (%), Folds=5, Random state=9 | CV accuracy (%), Folds=10, Random state=70 | CV accuracy (%), Folds=15, Random state=35 | CV mean accuracy (%)
1 | KNN | 75.3472 | 74.930556 | 74.722222 | 74.99999
2 | AdaBoost | 49.375000 | 47.569 | 50.416667 | 49.12022
3 | SVC | 15.486111 | 14.305556 | 13.819444 | 14.53704
4 | Decision Tree | 84.861111 | 85.625000 | 85.069444 | 85.18519
5 | Random Forest | 87.013889 | 88.819444 | 87.083333 | 87.63889
6 | Gradient Boosting | 87.708333 | 88.611111 | 88.750000 | 88.35648
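
The cross-validation protocol behind Table 4 might be reproduced along these lines (a sketch; features and labels are the full feature matrix and label vector, and shuffled folds are my assumption, since the random state only has an effect when the folds are shuffled):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

for folds, seed in ((5, 9), (10, 70), (15, 35)):
    cv = KFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = cross_val_score(
        GradientBoostingClassifier(n_estimators=80), features, labels, cv=cv
    )
    print(folds, seed, round(scores.mean() * 100, 4))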

graph of confusion matrix
Figure 8. Confusion matrix.

From the confusion matrix of the gradient boosting classifier output it is seen that out of 360 testing images 333 are correctly classified and 27 are misclassified.
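
The confusion matrix and the correct/misclassified counts can be computed as follows (gb is assumed to be the gradient boosting model trained earlier):

import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = gb.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Correctly classified:", np.trace(cm))
print("Misclassified:", cm.sum() - np.trace(cm))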

Justification

The gradient boosting classifier achieved an accuracy of 92.5 percent, which is more than the KNN benchmark model with 75.27 percent accuracy.

In KNN the data points at the boundaries of classes can be misclassified, and this is where the gradient boosting algorithm excels over KNN for this specific problem, as weak classifiers are transformed into strong classifiers.

Conclusion

In this project, a machine learning-based steel plate defect detection system was implemented.

The input images were taken from the NEU dataset [2], which is freely available.

No preprocessing was done, as mentioned in the Data preprocessing section.

The texture features were extracted by the GLCM, and extracted features were further classified into six respective classes (crazing, inclusion, patches, pitted surface, rolled-in-scale, and scratches) using different classification algorithms.

A train-test split of the extracted features was performed.

The gradient boosting classifier had the highest testing accuracy. Then, the hyperparameter of the boosting factor was tuned (which was difficult) to get even more accuracy, as mentioned in the refinement section. This approach achieved the classification accuracy of 92.5 percent.

In the future, this approach can be implemented using deep learning algorithms if the large dataset is available.

This was an interesting project, as this model can be implemented in real-life scenarios in the steel industry, which suffers from the problem of steel plate defects.

Intel® AI DevCloud development tools

Intel® AI DevCloud was used to train the network for the above implementation. Intel AI DevCloud is available for academic and personal research purposes for free and the request can be made from the Intel AI DevCloud website. The code developed can be found in this GitHub* repository.

Join the Intel® AI Academy

Sign up for the Intel® AI Academy and access essential learning materials, community, tools, and technology to boost your AI development. Apply to become an Intel® Student Ambassador and share your expertise with other student data scientists and developers.

References

  1. Yong Jie Zhao, Yun Hui Yan, Ke Chen Song, Vision-based automatic detection of steel surface defects in the cold rolling process: considering the influence of industrial liquids and surface textures, The International Journal of Advanced Manufacturing Technology, 2017, Volume 90, Number 5-8, Page 1665.
  2. NEU surface defect database, http://faculty.neu.edu.cn/yunhyan/NEU_surface_defect_database.html
  3. steel_plate GitHub* repository, https://github.com/rajendraprasadlawate/steel_plate

Exciting Innovations Showcased at Unity* Unite Beijing 2018


crowd at a conference

Unity* Unite events continue to grow in popularity, drawing developers from around the globe in increasing numbers. Having already visited Seoul and Tokyo this year, the event moved to Beijing in May, settling at the China National Convention Center. Top technology experts and industry talents from across the world provided the audience with over 80 multi-themed technical courses and activities, including keynote presentations focusing on Unity's next-generation technology developments. Intel was there as well, presenting a technical keynote that demonstrated Intel technology in four new games.

Optimizing with Intel® Graphics Performance Analyzers

Peng Tao, senior software engineer at Intel China, gave a presentation with the title of "Increasing Gaming Audience: Intel® HD Graphics Can Also Run MR Games." He outlined how Intel® Graphics Performance Analyzers (Intel® GPA) can be used for game performance optimization during development. This helps achieve the goal of running mixed reality (MR) games smoothly on integrated graphics from Intel, and helps gamers run virtual reality (VR) games on a wider range of hardware platforms, expanding the target market for VR games.

Using the MR game Space Pirate Trainer* as an example, before the use of Intel GPA, the frame-rate on a certain platform (Intel® Core™ i5 processors, Intel HD Graphics 620, Intel® NUC) was only 12 frames per second (fps), which was far from a smooth gaming experience. That low performance resulted in lagging and dizziness in players. Even when some of the special effects were removed, a frame rate of only 30 fps was achieved, still far from optimal.

After game performance optimization using the Intel GPA toolkit, some image quality was compromised, but Space Pirate Trainer achieved a rate of 60 fps on this configuration, which met the Windows* MR application requirements.

Intel GPA is a free graphics performance analysis tool that enables game developers to optimize the performance potential of their gaming platforms. The toolkit includes GPA Monitor, which connects Intel GPA with applications; System Analyzer HUD, which displays application performance indicators in real time; Graphics Frame Analyzer, which enables the captured frame files to be viewed in detail; and GPA Platform Analyzer, which enables detailed analysis of the running source code on all threads.

Intel GPA helps developers perform detailed analysis without changing the game's source code. Intel recommends the use of this tool for game performance optimization by all developers.

Intel showed how to analyze intricate game scenarios using Intel GPA during the performance optimization of Space Pirate Trainer. The demands for hardware performance were reduced by approaching the optimization from the aspects of the shader, materials processing, lighting, post-processing, CPU performance, power consumption, and other areas.

Visionary Games and Evolutionary Gaming Experience

The four games that were demonstrated at the Intel booth for Unite Beijing 2018—Seeking Dawn*, TEOT: The End of Tomorrow*, Candleman: The Complete Journey*, and Enlightenment*—were recommended by the Intel China team. Intel showed how its graphical support, architecture, and other technologies were used to improve overall game performance, and how Intel can assist game developers to optimize their gaming experience and satisfy the high demands of the gaming market.

Seeking Dawn combines elements of science fiction, survival, and exploration with different visual performances on different hardware platforms. On a gaming platform equipped with an Intel® Core™ i7 processor, physical effects and other aspects of Seeking Dawn showed considerable improvement when compared to a platform powered by an Intel Core i5 processor.

CEO Freeman introducing Seeking Dawn Game
Figure 2. Freeman, founder and CEO of Multiverse, introduced Seeking Dawn*.

Candleman is a uniquely creative, highly original game about the dreams of ordinary people. It has received positive reviews for its use of dynamic lighting, linear color space, and other visual techniques. Candleman was successfully migrated to Intel HD Graphics, resulting in smoother gameplay and more enabled visual effects than before optimization started.

Gao Ming Introducing Candleman
Figure 3. Gao Ming, game producer and co-founder of Spotlightor Interactive, introduced Candleman*.

The game TEOT: The End of Tomorrow offers realistic 3D scenes and an attractive storyline with interesting gameplay. Collaboration with Intel helped developers improve game performance, accurately detect bottlenecks, and provide more gaming solutions. Thanks to performance optimization, TEOT can now run smoothly on Intel HD Graphics, potentially offering more sales.

Convergence of the Latest Gaming Technology Trends

The interactive activities set up at the Intel booth attracted both game developers and game players. Intel showcased products and emerging technologies for upgrading visual experiences and improving performance. The Intel booth attracted over 1,000 registrations from developers. Intel conducted live interviews in every booth, and developers learned how to optimize gaming experiences via cooperation with Intel.

Large crowd at Unite Beijing 2018
Figure 4. Standing room only: The crowd at the Unite Beijing 2018 presentations.

Many leading gaming technology trends driven by Unity were also showcased at Unite Beijing 2018. The core topics of the event included:

  • Fully upgraded Unity 2018 release. Unity presented its latest 2018 version, which features improvements to the two core concepts of low-level rendering and default performance. Other improvements are in the areas of GPU Instancing support for GI (global illumination, or bounced lighting); presets for settings import and component editor; an all-new particle system upgrade; the new scriptable render pipeline (SRP) real-time rendering architecture option; and more. The new functions can turn the Unity engine into a high-end studio.
  • Unity 2018 feature analysis. Unity summarized the pioneering technologies available on its 2018 version, and provided suggestions to developers on how to utilize these new technologies. These new technologies include next-generation rendering features such as the SRP, Post-processing Stack v2, the Unity Shader Graph, and more. These technologies can help developers efficiently compile high-performance code for the C# Job System and the new generation Entity Component System. Unity also explained the new generation Unity particle system and the custom rendering texture feature, and discussed optimization tips.
  • Application of AI and machine learning in game development. Unity showcased their progress with artificial intelligence (AI) and machine learning, explaining how to use the revolutionary AI and machine learning tool Machine Learning Agents (Unity ML-Agents). Better AI features promise to bring new possibilities to game development and content creation, and help developers create smarter games and systems.
  • Efficient creations with real-time image rendering. Unity demonstrated its high quality, real-time rendering functions along with the Timeline, Cinemachine, Post-processing Stack, and other additional film production modules. Developers and artists can use Unity 2018 to create animations, add movie effects, edit scenes, and include other content while greatly reducing development and production time.

Unity* Experts Take the Stage

During the keynote presentations on May 11, several guests from Unity shared their views surrounding Unity's content creation engine, Unity's market strategy, and other topics.

Chief marketing officer Clive Downie shared some of Unity's market penetration data. He said that Unity is seeing success in VR titles across all major platforms—69 percent of VR content on the Oculus Rift* platform, 74 percent of VR content on the HTC Vive* platform, 87 percent of VR content on the Samsung* Gear VR platform, and 91 percent of content on the Microsoft HoloLens* platform—all developed using Unity.

Clive Downie from Unity
Figure 5. Chief marketing officer Clive Downie discussed Unity's impressive market penetration statistics. 

Carl Callewaert, the global director of evangelism at Unity, did a deep technology dive by introducing a series of new features in Unity, such as the all-new art tool, next-generation rendering pipeline, real-time ray tracing, GPU lightmapper, and Nested Prefabs.

Carl Callewaert from Unity
Figure 6. Unity's Carl Callewaert, global director of evangelism, discussed the new rendering pipeline.

Andy Touch, Unity's global content evangelist, presented function highlights of Unity since 2015 by comparing the contents of a Unity demo in different stages, and introduced the feature of a High Definition Render Pipeline (HDRP) through Unity's latest real-time rendering film, Book of the Dead*.

Andy Touch from Unity
Figure 7. Unity's Andy Touch introduced the High Definition Render Pipeline.

Unity Evangelist Mark Schoennagel presented the lightweight version of Unity, long-awaited by developers. As a web-based application, the new Unity core has a file size of a mere 72 KB. At the same time, Unity also optimized the asset pipeline to reduce file sizes further and to make building lightweight projects more efficient.

Mark Schoennagel from Unity
Figure 8. Unity Evangelist Mark Schoennagel discussed the new, smaller footprint for the Unity* core.

Danny Lange, Vice President of Unity AI and Machine Learning, shared new developments in the field of machine learning. Unity strives to reduce the entrance threshold for machine learning and help developers apply this technology to their games. Unity's open-source AI toolkit, Unity ML-Agents, can help developers and researchers train machine agents in real and complex environments, and help developers enter the age of smart development.

Danny Lange from Unity
Figure 9. Unity's Danny Lange, VP of AI and Machine Learning, discussed Unity ML-Agents, an open-source toolkit to help developers boost immersion and realism.

Hu Min, Customer Management Director of Unity Ads for the Greater China Region, announced the official Unity Ads direct advertising solution for China's Android market. This is an easy-to-deploy and extremely secure solution that can help Chinese developers generate revenue through advertising.

Hu Min from Unity
Figure 10. Unity's Hu Min, Customer Management Director of Unity Ads for the Greater China Region, introduced the official Unity Ads direct advertising solution.

Anuja Dharkar, head of Learning, Global Education at Unity, introduced Unity's global education certification system. Developers can utilize a variety of official channels, such as online interactive tutorials and official offline training, to fully master Unity skills and obtain official certification. Unity's Senior Operations Director of the Greater China Region, Beeling Chua, introduced Unity's education strategy in the Greater China Region.

Anuja Dharkar from Unity
Figure 11. Unity's Anuja Dharkar, head of learning, introduced Unity's global education certification system.

Zhang Zhunbo, General Manager of the Greater China Region and Global Vice President at Unity, introduced Unity's service system in China, which includes technical support, technology presentations, education, the Unity Asset Store*, industry solutions, advertising services, strategic platform cooperation, and more. He also introduced products and services that were developed locally.

Zhang Zhunbo from Unity
Figure 12. Unity's Zhang Zhunbo, GM of GC and VP of Unity, introduced Unity's service system in China.

Abundance of Demos in the Exhibition Area

Multiple gaming hardware manufacturers, game developers, game development tool providers, and other vendors set up shop in the exhibition area.

Conference attendees
Figure 13. Conference attendees had an opportunity to try out the latest gear and demos.

HTC Vive*, Windows Mixed Reality, 3Glasses*, Pico Interactive, HoloLens, Lenovo Mirage* AR, and others presented VR head-mounted displays to the public. Demo games attracted enthusiastic players and were very popular, with shooting games making up the bulk of the VR titles at the show.

With augmented reality (AR) development tools becoming widely available, plenty of AR content also appeared at the event. Automotive AR applications such as YuME and Meow!* have caught the attention of many players. Google demonstrated its ARCore* software, which has three main functions to help create realistic AR experiences: plane recognition, motion tracking, and light estimation.

Different types of games that were developed using the Unity engine were presented at the event, and many of those games were developed by independent developer teams in China. Horizontal games, puzzle-solving games, martial arts role-playing games, action role-playing games, and others were represented, all showcasing the effectiveness of the Unity engine across gaming genres.

Besides VR gaming, the introduction of VR into education and other industries was also a major focus in the exhibition area. The VR product developed by Unity Education and its application in the Ruiyu Imagination Classes at Sichuan Normal University was showcased. Developed using the Unity engine, this experimental K12 product utilizes VR technology and includes all physics, chemistry, biology, and scientific experiments from elementary school through high school (sourced from Chinese textbooks). Experiencing laboratories and experiments through VR allows students to safely walk through methodology and application; errors can be corrected immediately, and processes repeated, in order to better remember and learn.

Interesting Technical Topics

More than 80 courses and activities ran at Unite Beijing 2018, focusing on Unity's next-generation technology developments, hopefully spurring game developers to envision greater possibilities.

demo show of Unity Interface
Figure 14. Demo shown at Unite Beijing 2018.

  • High-end real-time rendering. Unity highlighted its creative experience in film and animation and introduced how to use the new generation Unity HDRP to develop film clips with realistic qualities.
  • Quality real-time rendering animation function. Jiang Yibing, Global Technical Art Supervisor at Unity, showcased the process of creating her upcoming animated short Windup*, and took the audience through the process from initial concept, to writing a captivating script, time-lining, camera setup, role creation, scene building, lighting control, and all the way to the final animation production.
  • Game production process based on image reconstruction. Zhang Liming, Technical Director of the Greater China Region at Unity, revealed the creation process of the Book of the Dead and introduced how the newly added rendering features in Unity 2018 were used to bring the real-time rendering quality to unprecedented heights.
  • Quickly creating virtual worlds in Unity. Yang Dong, Technical Director of the Unity Platform Department of the Greater China Region, explained the various function models of ProBuilder*, and how to use its features for level development.
  • Machine learning in Unity. The machine learning framework published by Unity allows developers to use the Python* API to apply machine learning capabilities to game production and various VR simulations within the Unity environment. Sun Zhipeng from Unity introduced Unity's machine learning framework and its tuning to the audience, and shared some practical cases.
  • Analysis of underlying memory usage in iOS*. A detailed analysis of underlying memory use in iOS was presented, and the use of Unity Profiler, Xcode*, and Instruments to compile the data was introduced. This allows developers to discover where memory consumption is highest within the game and to identify the corresponding optimization methods.
  • Geometry and compute shaders in Unity. The familiar rendering pipeline was reviewed as a way to introduce the application of the geometry shader, which is used in Unity to implement more realistic grass renderings.
  • Best practices for baked lightmaps. This presentation focused on baked lightmap effects, baking time, and lightmap memory usage. Some best practices for baking system tuning and optimization were discussed so that developers could obtain the optimal baking effects in the shortest amount of time while optimizing lightmap memory usage.

Emerging Technologies Introduce New Innovations into Traditional Industries

Aside from having a wide range of applications in the world of gaming, VR/AR/MR, other new and emerging technologies can also be applied to manufacturing, automotive, construction, animation, design, and education.

an architectural rendering
Figure 15. Unity technology is now being applied to traditional industries such as construction, architecture, and design.

The application of Unity's technology in traditional industries was introduced under the broad subject of industrial applications. Speakers touched on the following areas:

  • Quickly generate VR/AR workflows for industrial engineering design. Zhong Wu, Global Development Consultant at Autodesk, showed how to utilize the seamless interconnection between industry-leading cloud services technology and Unity's workflow. The goal is to quickly and efficiently solve the challenges of the engineering design industry and to create an all-new, highly efficient design data presentation in VR/AR.
  • Upgrade project management communication. Using building information model (BIM) with VR, users can upgrade their project management communication scenarios and bring forth a more intuitive project management mode to overhaul the traditional project management communication model.
  • Apply interactive MR in broadcasting. DataMesh CTO Wu Hao shared how to use Unity to create interactive MR experiences and implement broadcasting-grade live mixed-reality content.
  • Apply MR in industrial IoT. Liu Hongyu, co-founder and architect of Holoview Lab (Shanghai) Co. Ltd., shared how to apply Unity to development for industrial Internet of Things (IoT), and used it with the HoloLens mixed-reality device, Microsoft Azure*, GE Predix*, and other platforms to share the development experience with mixed-reality IoT applications.
  • Transform video technology. High-dimension video digital assets based on the Unity engine will become core elements in the various layers of video content in the future, and the use of Unity within the video industry will probably also increase.
  • Create industrial AR/VR applications. Zhou Guichun, Research Engineer at United Technologies Corporation, shared how to obtain data from BIMs, and how to analyze data from 3D models developed using Unity. He also described his experience with developing Unity animations, and explained how to design UI/UX for AR/VR applications and Unity's AR/VR development framework.

The Future is Here

Over three days, Unity demonstrated the tremendous energy and enthusiasm in the growing Unity ecosystem with cutting-edge technology and the spirit of community at its core. Speaker after speaker presented Unity's most advanced technical features, while challenging developers to realize their potential to bring revolutionary changes to various industries.

As a major collaborator with Unity, Intel demonstrated how game developers can take full advantage of multicore technology to improve their game development capabilities. Intel showcased a full line of optimization tools that work with the Unity engine, and clearly established that they will continue to help drive gaming technology and concept innovation. If you want to get in on the action at an upcoming Unite conference, check out their upcoming schedule and make your plans to attend.
