
Intel® Parallel Computing Center at Carnegie Mellon University, Silicon Valley


Carnegie Mellon University

Principal Investigator

Dr. Ole J. Mengshoel is a Principal Systems Scientist in the Department of Electrical and Computer Engineering at CMU Silicon Valley. His current research focuses on: scalable computing in artificial intelligence and machine learning; machine learning and inference in Bayesian networks; stochastic optimization; and applications of artificial intelligence and machine learning. Dr. Mengshoel holds a Ph.D. in Computer Science from the University of Illinois, Urbana-Champaign. His undergraduate degree is in Computer Science from the Norwegian Institute of Technology, Norway. Prior to joining CMU, he held research and leadership positions at SINTEF, Rockwell, and USRA/RIACS at the NASA Ames Research Center.

Description

Scalability of artificial intelligence (AI) and machine learning (ML) algorithms, methods, and software has been an important research topic for some time. In ongoing and future work at CMU Silicon Valley, we take advantage of opportunities that have emerged due to recent dramatic improvements in parallel and distributed hardware and software. With the availability of Big Data, powerful computing platforms ranging from small (smart phones, wearable computers, IoT devices) to large (elastic clouds, data centers, supercomputers), as well as large and growing businesses on the Web, the importance and impact of scalability in AI and ML is only increasing. We will now discuss a few specific results and projects.

In the area of parallel and distributed algorithms, we have developed parallel algorithms and software for junction tree propagation, an algorithm that is a workhorse in commercial and open-source software for probabilistic graphical models. On the distributed front, we have developed and are developing MapReduce-based algorithms for speeding up learning of Bayesian networks from complete and incomplete data, and experimentally demonstrated their benefits using Apache Hadoop* and Apache Spark*. Finally, we have an interest in matrix factorization (MF) for recommender systems on the Web, and have developed an incremental MF algorithm that can take advantage of Spark. Large-scale recommender systems, which are currently essential components of many Web sites, can benefit from this incremental method since it adapts more quickly to customer choices compared to traditional batch methods, while retaining high accuracy.

Caffe* is a deep learning framework originally developed at the Berkeley Vision and Learning Center. Recently, Caffe2*, a successor to Caffe, was officially released. Facebook has been the driving force in developing the open source Caffe2 framework. Caffe2 is a lightweight, modular, and scalable deep learning framework supported by several companies, including Intel. In our hands-on machine learning experience with Caffe2, we have found it to support rapid prototyping and experimentation, simple compilation, and better portability than earlier versions of Caffe.

We are experimenting with Intel’s PyLatte machine learning library, which is written in Python and is optimized for Intel CPUs. Goals of PyLatte include ease of programming, high productivity, high performance, and leveraging the power of CPUs. A CMU SV project has focused on implementing speech recognition and image classification models using PyLatte, applying deep learning with neural networks. In speech recognition experiments, we have found PyLatte easy to use, with a flexible training step and short training time.

We look forward to continuing to develop parallel, distributed, and incremental algorithms for scalable intelligent models and systems as an Intel® Parallel Computing Center at CMU Silicon Valley. We create novel algorithms, models, and applications that utilize novel hardware and software computing platforms including multi- and many-core computers, cloud computing, MapReduce, Hadoop, and Spark.

Related websites:

http://sv.cmu.edu/directory/faculty-and-researchers-directory/faculty-and-researchers/mengshoel.html
https://users.ece.cmu.edu/~olem/omengshoel/Home.html
https://works.bepress.com/ole_mengshoel/


Raw Compute Power of New Intel® Core™ i9 Processor-based Systems Enables Extreme Megatasking


The earliest computers were often pushed to the limit performing even a single task, between hammering the hard drive, swapping memory frantically, and crunching through computations. With Microsoft Windows* 3.1 and then Windows 95, multi-tasking began to take form, as systems were finally able to handle more than one program at a time. Now, with the advent of double-digit cores in a single CPU, the concept of “megatasking” is gaining traction. The latest entries for the enthusiast are in the Intel® Core™ X-Series processor family, ranging from 4 to 18 cores. These Intel® Core™ i9 processors can simultaneously handle tasks that previously required multiple complete systems—enter extreme megatasking.

Consider the challenge of simultaneously playing, recording, and streaming a Virtual Reality (VR) game. Game studios rely on video trailers to spark interest in new VR titles, but showing off the experience of a 3D game in a 2D video has always been a challenge, as a simple recording of what the player sees offers only part of the story. One way to solve this is mixed reality, which captures the player against a green screen and then blends the perspectives into a third-person view of the player immersed in that world. (For more information about this technique, refer to this article.) This often requires one PC to play and capture the game, and another PC to acquire the camera feed with the gamer. Add the idea of streaming that complete session live to a global audience of expectant fans, and you could be looking at a third system for encoding the output into a high-quality uploadable format. But an Intel team recently demonstrated that production crews can now complete all of these CPU-intensive tasks on a single Intel® Core™ i9 processor-based system, with each engaged core chugging merrily along.

Moore’s Law and System Specs

When originally expressed by Intel co-founder Gordon Moore in 1965, “Moore’s Law” predicted that the number of transistors packed into an integrated circuit would repeatedly double approximately every two years (Figure 1). While transistor counts and frequencies have increased, raw compute power is now often measured in the number of cores available. Each core acts as a CPU and can be put to work on a different task, enabling better multi-tasking. But simple multi-tasking becomes extreme megatasking with simultaneous, compute-intensive, multi-threaded workloads aligned in purpose.

Figure 1. Moore's Law expresses the accelerating rate of change for technology (source: time.com)

The calculation originally used to measure supercomputer performance now applies to desktop gaming PCs: FLOPS, or FLoating point Operations Per Second. FLOPS measure arithmetic calculations on numbers with decimal points, which are harder to perform than operations on integers. The equation is:

FLOPS = (sockets) x (cores per socket) x (cycles per second) x (FLOPS per cycle)

Picture a single-socket CPU with six cores, running at 3.46 GHz, performing either 8 single-precision or 4 double-precision FLOPS per cycle. The result would be 166 gigaflops (single-precision) and 83 gigaflops (double-precision). By comparison, in 1976, the Cray-1 supercomputer performed just 160 megaflops. The new Intel® Core™ i9-7980XE Extreme Edition Processor runs at about 4.3 GHz (faster if overclocked) and thus should calculate to 1.3 teraflops. For perspective, the world’s fastest supercomputer runs 10.65 million cores, performing at 124.5 petaflops. In 1961, a single gigaflop cost approximately USD 19 billion in hardware (around USD 145 billion today). By 2017, that cost had fallen to USD 30 million.
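As a quick sanity check of that arithmetic, here is a small Go sketch that plugs the six-core example above into the formula (illustrative only; sustained throughput depends on the workload and instruction mix):

package main

import "fmt"

// peakFLOPS evaluates the formula from the article:
// FLOPS = (sockets) x (cores per socket) x (cycles per second) x (FLOPS per cycle)
func peakFLOPS(sockets, coresPerSocket int, cyclesPerSecond float64, flopsPerCycle int) float64 {
	return float64(sockets) * float64(coresPerSocket) * cyclesPerSecond * float64(flopsPerCycle)
}

func main() {
	const clock = 3.46e9 // 3.46 GHz, as in the six-core example above

	single := peakFLOPS(1, 6, clock, 8) // 8 single-precision FLOPS per cycle
	double := peakFLOPS(1, 6, clock, 4) // 4 double-precision FLOPS per cycle

	fmt.Printf("single precision: %.0f gigaflops\n", single/1e9) // ~166 gigaflops
	fmt.Printf("double precision: %.0f gigaflops\n", double/1e9) // ~83 gigaflops
}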

To achieve that raw compute power, the Intel® Core™ i9-7980XE Extreme Edition Processor uses several technology upgrades. With up to 68 PCIe* 3.0 lanes on the platform, gamers have the ability to expand their systems with fast Intel® Solid State Drives (Intel® SSDs), up to four discrete GFX cards, and ultrafast Thunderbolt™ 3 technology solutions. Updated Intel® Turbo Boost Max Technology 3.0 improves core performance. Intel® Smart Cache has a new power-saving feature that dynamically flushes memory based on demand. The Intel Core X-series processor family is also unlocked to provide additional headroom for overclockers. New features include the ability to overclock each core individually, Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ratio controls for more stability, and VccU voltage control for extreme scenarios. Combined with tools like Intel® Extreme Tuning Utility (Intel® XTU) and Intel® Extreme Memory Profile (Intel® XMP), you have a powerful kit for maximizing performance.

Intel reports that content creators can expect up to 20 percent better performance for VR content creation, and up to 30 percent faster 4K video editing, over the previous generation of Intel® processors (see Figure 2). This means less time waiting, and more time designing new worlds and experiences. Gamers and enthusiasts will experience up to 30 percent faster extreme megatasking for gaming, over the previous generation.

Gregory Bryant, senior vice president and general manager of the Client Computing Group at Intel Corporation, told the 2017 Computex Taipei crowd that the new line of processors will unleash creative possibilities throughout the ecosystem. “Content creators can have fast image-rendering, video encoding, audio production, and real-time preview—all running in parallel seamlessly, so they spend less time waiting, and more time creating. Gamers can play their favorite game while they also stream, record and encode their gameplay, and share on social media—all while surrounded by multiple screens for a 12K experience with up to four discrete graphics cards.”

Figure 2. Intel® Core™ X-series processor family partial specifications.

Another way to measure system performance is through CPU utilization, which you can find in your own Microsoft Windows PC through Task Manager > Resource Monitor. Josh Bancroft, Intel Developer Relations Content Specialist working with the gaming and VR communities, was part of the Intel® Core™ Extreme Processors rollout at Computex Taipei in early 2017, and helped coin the term “extreme megatasking” while showing off CPU utilization. Bancroft used one of the new Core i9 X-Series processor-based PCs to show a green-screen VR mixed-reality demo, simultaneously playing a VR title at 90 fps, recording the gameplay, compositing the player into the scene from a separate camera, precisely combining and syncing the images, and streaming the result live to Twitch*.

Later, Bancroft was part of the first Intel® Core™ i9 Extreme Processor rollout at E3 in Los Angeles, where he showed the same demo on a system with 18 cores. He still recalls that event fondly: “It was really exciting to do the world’s first public demo on an 18-core i9-based system. The case was gigantic, with two water loops with this blue, opaque fluid, and really cool-looking.”

The demo, hosted by Gregory Bryant, went off smoothly, but wasn’t without tension. “When you stack those 4 or 5 extreme tasks together, you can overload a system and bring it to its knees,” Bancroft explained. But the 18 cores performed flawlessly, with the CPU utilization graphs showing what was going on under the hood. “When we turned on the recording, when we turned on the streaming, when we did everything that cranked it up, you saw those 36 graphs jump up to 90-plus percent utilization. You could see all of those threads were working really hard.”

The demo illustrated Intel’s commitment to VR, PC gaming, and multi-core processing power in one neat package. Since VR requires enormous resources to run smoothly, it’s a perfect arena for demoing new systems. Using Bancroft’s mixed-reality technique allows developers, streamers, and content creators to make trailers and show people a VR experience without actually having to put them in a headset. Best of all, one new system can replace the multiple devices previously required to pull it off.

Trailers are one of the most important tools in an indie developer’s marketing toolkit. Creating a compelling, enticing game trailer for VR is of vital importance to indies getting started on their own titles. However, the 3D experience of VR doesn’t translate well to a 2D trailer, which is where the mixed-reality technique comes in. Mixed-reality VR was pioneered by Vancouver, BC-based Northway Games*, run by husband-and-wife team Sarah and Colin Northway, who added enabling code in their Unity-based game Fantastic Contraption* (Figure 3). The ability to record what the gamer is seeing as they play, as well as how they would look in a third-person view, greatly helps market VR titles by communicating the experience. In addition, the Northways showed how entertaining their game was, by including shots of onlookers watching and laughing from a sofa.

Figure 3. Creating and streaming a mixed-reality trailer—like this one for Fantastic Contraption*—is now possible on a single PC.

Not Invented Here, Just Enhanced

Bancroft is quick to share the credit for his mixed-reality, single-machine demos, which he learned in a cramped studio, complete with scaffolding, lighting, a green screen, and multiple cameras. The Northways wrote a blog post that offered a step-by-step walkthrough of the tasks involved, and Bancroft relied on it heavily to get started. From there, he and his team came up with some additional tweaks, all developed and shared openly.

Many of the software programs require immense power; just playing a VR title for Oculus Rift* or HTC VIVE* at 90 fps is quite a task. At a lower frame-rate, players can experience dizziness, vomiting, and other physical reactions, so a machine has to start with the power to play a game properly, before engaging any more of a load.

For mixing and compositing, Bancroft is fond of MixCast*, a growing VR broadcast and presentation tool that simplifies the process of creating mixed-reality videos. Created by Blueprint Studios*—a Vancouver, BC-based leader in the interactive technology space—the tool enables dragging and dropping the MixCast VR SDK into Unity projects, so end-users can showcase their experience in real time.

In addition, Bancroft uses Open Broadcaster Software (OBS), a free and open source software program known to most streamers for compositing, recording, and live streaming. It offers high-performance, real-time audio- and video-capturing and mixing; video filters for image masking, color correction, and chroma keying; and supports streaming platforms such as Twitch*, Facebook*, and YouTube*.

Of course, there are multiple tools to create the same end result, but that’s the current software stack. A full description of Bancroft’s efforts can be found at <link to Mega-tasking step-by-step article>.

Jerry Makare is the Intel® Software TV video producer, and works closely with Josh Bancroft to create videos that test the raw-compute boundaries of extreme megatasking. He sees important benefits to using a single, powerful system for VR. “Being able to split our tasks into multiple places, especially rendering, is a big deal,” he said. “Once you start rendering, generally you end up killing your machine. There’s almost nothing else you can do. The ability for us to split these large, compute-intensive tasks like rendering and compositing into multiple buckets is a major time-saver.”

Makare is particularly eager to task an Intel® Core™ i9 processor-based system with building out a very large-scale room, using a 3-D modeling program to get a baseline for how much time it saves. He also looks forward to putting the new system to work on some real-world applications that his team can learn from.

Eye to the Future

With so much raw computing power now available, it’s exciting to think of the different ways in which these new systems could be used. Gamers can anticipate more vivid, immersive, and realistic experiences. Creating and editing video from raw, 4K footage was a complex, processing-intensive chore, but now professionals and novices alike can edit in native 4K, creating stunning visual effects, and compose music with more depth and nuance. The reach of VR extends beyond gaming into virtual walkthroughs, construction planning, city modeling, and countless simulation scenarios. Scientists in fields such as biology, geology, chemistry, medicine, and astronomy may unlock even more secrets, thanks to the raw computing power behind extreme megatasking.

Additional Resources

Getting Started with Intel® Context Sensing SDK for Linux* and Go*


Before you Begin

The Intel® Context Sensing SDK for Linux* is a Node.js*, Go*, and Python*-based framework supporting the collection, storage, sharing, analysis, and use of sensor information. 

This getting started guide contains steps to set up the broker and Go framework supported by the SDK, then run a sample provided in the SDK. 

Additionally, this document contains tutorials for creating a simple provider, a sample application that uses the provider, and a microservice to run the application, along with steps to run the microservice and publish events to the broker.

Every command or chunk of code can be copy-pasted directly from the document without any required modifications unless explicitly stated.

Requirements

Software

  • OS: Ubuntu* 14.04 or 16.04
  • Go: 1.8.3
  • Docker*: 17.0.3

Network

The document assumes Intel proxies are configured on the host machine.
Verify that you have access to the URLs below:
  • hub.docker.intel.com
  • hub.docker.com

Getting Started

Setting up the Broker

There are two options to set up the broker:

  • Dockerized: Using the context repo from hub.docker.intel.com (preferred)
  • Non-dockerized: Using the context-broker-VERSION.tgz file 

This document only covers the preferred Dockerized method. 

The section assumes you have Docker already set up with Intel credentials. (Refer: Setting up Docker)

The broker requires a running instance of MongoDB*.

  • Use Docker to pull the mongo image onto your machine:
    docker pull mongo
  • Create a container named mymongodb and run it for the very first time:
    docker run --name=mymongodb -d mongo

Note: For subsequent runs, use: docker start mymongodb

  • Pull the broker image:
    docker pull hub.docker.intel.com/context/context-broker:v0.10.5
  • Create a container named contextbroker and run it for the very first time:
    docker run --name contextbroker -it -p 8888:8888 --link mymongodb -e MONGODB_HOST=mymongodb hub.docker.intel.com/context/context-broker:v0.10.5

Note: For subsequent runs, use: docker start -i contextbroker
-i or -it is used to run in the foreground to see the output in the current terminal.

To stop the context broker instance, use CTRL+C to interrupt it when running in the foreground, or docker stop contextbroker when running in the background.

To remove the container if it's preventing the use of Docker, use: docker rm -f contextbroker

Setting up the SDK for Go

If you haven't set up the required Go environment on your machine, refer to Setting up the Go Environment.

Use the command go env to ensure both $GOPATH and $GOROOT are populated with paths for Go projects and Go distribution, respectively.

  • Download the Go X Net Package:
    go get golang.org/x/net
  • Download the Logrus* package:
    go get github.com/sirupsen/logrus

Note: In some cases you may encounter the error 'can't load package: package golang.org/x/net: no buildable Go source files in $GOPATH/src/golang.org/x/net'. Verify your setup by checking if $GOPATH/src/golang.org/x/net actually contains items from https://github.com/golang/net repo.

  • Copy the context_linux_go directory from the extracted release package to the $GOPATH/src directory.

Running an SDK Sample

Make sure a broker instance is running.

  • To run the local_ticktock sample, navigate to the $GOPATH/context_linux_go/samples/local_ticktock directory and enter: go run main.go

Note: All the providers and samples provided in the SDK can be found in the $GOPATH/context_linux_go/providers and $GOPATH/context_linux_go/samples directories respectively.

Tutorials

The tutorials showcase how to use the SDK to create a provider, a sample application that utilizes the provider, and a microservice that can run the sample application.
 

Creating a Simple Provider

Next, we'll create a provider that takes a time period in milliseconds as an option and publishes the string “Hello World” to the broker at the supplied interval.
  • Create a directory named simpleprovider in the $GOPATH/context_linux_go/providers directory. 
  • Create a file named simpleprovider.go inside the directory.
A provider must implement the functions and structs required by the context core. The steps below show the minimum needed to create a basic provider; only the createItem function is specific to this tutorial.
In the following steps, you'll be adding lines of code to the simpleprovider.go file:
  1. Encapsulate all the contents of the provider in a package:
    package simpleprovider
  2. Import the required packages, time and core:
     import (
          "context_linux_go/core"
          "time"
     )
  3. Declare a constant identifier that other providers can use to identify the data coming from our simpleprovider:
    const (
          // SimpleProviderType is the URN for data from this provider
         SimpleProviderType string = "urn:x-intel:context:thing:simpleprovider"
    )
  4. Define a schema to register with the broker. 
    This will enable the broker to recognize the URN and perform the necessary schema validation:
     // SimpleProviderSchema is the schema satisfied by this provider; the value is placed in the "data" field
     var SimpleProviderSchema = core.JSONSchema{
          "type": SimpleProviderType,
          "schema": core.JSONSchema{
               "type": "object",
               "properties": core.JSONSchema{
                    "data": core.JSONSchema{
                         "type": "string",
                    },
               },
          },
          "descriptions": core.JSONSchema{
               "en": core.JSONSchema{
                    "documentation": "Simple string producer",
                    "short_name":    "SimpleString",
               },
          },
     }
  5. Define a struct that holds an instance of the provider. 
    We will use the stopChan variable to start/stop the provider and also provide a reference to the additional options that the provider can accept: 
    // Provider holds an instance of the simple provider. 
    // Methods defined for this type must implement core.ProviderInterface
    type Provider struct {
         ticker   *time.Ticker
         stopChan chan bool
         options  *Options
    }
  6. Define the options for this provider. 
    We will supply the time interval after which the string should be published:
    // Options that are provider specific
    type Options struct {
         core.ProviderOptions
         Period int // Period of ticking in milliseconds
    }
  7. We can supply multiple URN identifiers in a single provider. Define a static function to return all the types supported in this provider:
    // Types is a static function that returns the types this Provider supports (URN and schema)
    func Types() []core.ProviderType {
         return []core.ProviderType{
              core.ProviderType{URN: SimpleProviderType, Schema: SimpleProviderSchema}}
    }
  8. Define a function that can return the Types supported:
     // Types is a provider-specific function that queries the type of a ProviderInterface instance
    func (p *Provider) Types() []core.ProviderType {
         return Types()
    }
  9. Define the New function, which sets the provider options when called from our sample:
    // New creates a new simpleprovider.Provider with the specified options
    func New(options *Options) *Provider {
         var dp Provider
         dp.options = options
         dp.stopChan = make(chan bool)
         return &dp
    }
  10. Implement the Start function. Here, we'll supply the ItemData to publish and also decide when to publish with the help of the ticker:
    // Start begins producing events on the item and error channels
    func (p *Provider) Start(onItem core.ProviderItemChannel, onErr core.ErrorChannel) {
         p.ticker = time.NewTicker(time.Millisecond * time.Duration(p.options.Period))
         go func() {
              for {
                   select {
                   case <-p.ticker.C:
                        onItem <- p.createItem()
                   case <-p.stopChan:
                        close(onItem)
                        close(onErr)
                        return
                   }
              }
         }()
    }
  11. Implement the createItem function. This function populates the ItemData with our string:
    // Generates a new simple provider item
    func (p *Provider) createItem() *core.ItemData {
         var item = core.ItemData{
              Type: SimpleProviderType,
              // Value map must match the schema
              Value: map[string]interface{}{"data": "Hello World"},
         }
         return &item
    }
  12. Implement the GetItem function:
    // GetItem returns a new simple provider item. Returns nil if itemType is not recognized
    func (p *Provider) GetItem(itemType string) *core.ItemData {
         if itemType != SimpleProviderType {
              return nil
         }
         return p.createItem()
    }
  13. Implement the Stop function to stop producing items:
    func (p *Provider) Stop() {
         p.stopChan <- true
         if p.ticker != nil {
              p.ticker.Stop()
          }
    }
  14. We must implement the GetOptions function to return a pointer to ProviderOptions in the Sensing core:
    // GetOptions returns a pointer to the core options for use within the Sensing core
    func (p *Provider) GetOptions() *core.ProviderOptions {
         return &p.options.ProviderOptions
    }

Creating a Sample Application Utilizing a Provider

A basic sample application utilizing a provider requires creating channels for onStart, onError and onItem to interface with the provider. Additionally, the Sensing API takes options, onStart, and onError as input. We can also supply options required as input for the provider itself.
  • Create a directory named simpleProviderSample in the $GOPATH/context_linux_go/samples directory
  • Create a file named main.go inside the directory. 

In the following steps, you'll be adding code to the main.go file:

  1. Encapsulate all the contents of our sample in a package:
    package main
  2. Import the required packages: core, sensing, our simpleprovider, and fmt (to print to the terminal):
     import (
     	"context_linux_go/core"
     	"context_linux_go/core/sensing"
     	"context_linux_go/providers/simpleprovider"
     	"fmt"
     )
  3. Implement the main function. We will supply the channels for onStart, onError, and onItem from the context core:
    func main() {
    	onStart := make(core.SensingStartedChannel, 5)
    	onError := make(core.ErrorChannel, 5)
    	onItem := make(core.ProviderItemChannel, 5)
  4. Supply the sensing options in the main function, such as the broker IP address and port, an indicator to publish to the broker, the name of our sample application, onStart, and onError:
    	options := core.SensingOptions{
    		Server:      "localhost:8888",
    		Publish:     true,
    		Application: "go_simpleprovider_application",
    		OnStarted:   onStart,
    		OnError:     onError,
    	}
  5. Create a new instance of Sensing and provide the sensing options in the main function:
    	sensing := sensing.NewSensing()
    	sensing.Start(options)
  6. Create an instance of the simpleprovider and supply the time period in the provider options in the main function:
    	spProvider := simpleprovider.New(&simpleprovider.Options{Period: 1000, ProviderOptions: core.ProviderOptions{Publish: true}})

    Note: The above line is a single line of code.

  7. Enable sensing and provide a reference to our provider instance. In this example, we'll print the URN type and actual data every time our provider generates ItemData. We'll stop our provider if any error is detected.
    	for {
    		select {
    		case <-onStart:
    			fmt.Println("Started sensing")
    			sensing.EnableSensing(spProvider, onItem, onError)
    		case item := <-onItem:
    			fmt.Println(item.Type, item.Value)
    		case err := <-onError:
    			fmt.Println("Error", err)
    			sensing.Stop()
    			return
    		}
    	}
    } //end of main function
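With a broker instance running, the sample can be launched just like the bundled samples: navigate to the $GOPATH/context_linux_go/samples/simpleProviderSample directory and enter go run main.go. You should see the provider's URN and its “Hello World” payload printed roughly once per second, since we set Period to 1000 milliseconds.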

Creating a Microservice

We can encapsulate an application and other dependencies inside a single service using Docker.

Dockerizing our application helps to secure the implementation (source code) and dynamically configure connections to other services, such as the broker, without modifying the source code on host machines.

  1. Create a file named SimpleProviderDockerfile in the $GOPATH/context_linux_go directory. You'll be editing this file in the steps below.

    Note: There is no extension in the name of the file.

  2. Provide the dependencies required by the SDK, as well as the Intel proxy information:
    FROM golang:1.8.3-alpine3.5
    
    RUN mkdir /app
    ADD ./samples /app/
    ADD . /go/src/context_linux_go/
    
    ENV http_proxy=http://proxy-chain.intel.com:911
    ENV https_proxy=http://proxy-chain.intel.com:912
    
    RUN apk add --no-cache git \
        && go get golang.org/x/net/websocket \
        && go get github.com/sirupsen/logrus \
        && apk del git
    
    WORKDIR /app/.
    
  3. Provide a name (simple_provider_client) and a path to our sample application (simpleProviderSample/main.go), then run the sample application:
    RUN go build -o simple_provider_client simpleProviderSample/main.go
    
    CMD ["./simple_provider_client"]
    

Running Your Microservice

Ensure the broker is running on your machine (Refer: Setting up the Broker).

  1. Build the image locally with a tag using the Dockerfile.
    docker build --tag smp:latest -f SimpleProviderDockerfile . 

    Note: The DOT at the end is required in the above command.

  2. Create a container named smp, tagged as latest. Run the container for the very first time:
    docker run --name=smp --network host -e http_proxy="" -e https_proxy="" smp:latest

    Note: For subsequent runs use: docker start -i smp
    -i or -it is used to run in the foreground to see the output in the current terminal.

To stop the smp instance, use CTRL+C to interrupt it when running in the foreground, or docker stop smp when running in the background. To remove the container if it's preventing the use of Docker, use: docker rm -f smp

Miscellaneous

For your convenience, this section covers topics that are outside the scope of this document but may be listed in the requirements.

Setting up Docker

If an install script was provided with this document, simply run it in the terminal: ./install_docker.sh. If not, complete the steps below to install Docker:

  1. Follow the Docker manual installation instructions: https://docs.docker.com/engine/installation/linux/ubuntu/#install-using-the-repository
  2. If you are behind a corporate proxy, you may need to set Docker's proxy and DNS settings: Proxy Instructions
  3. Determine your host machine's DNS servers:
    nmcli dev show | grep 'IP4.DNS'
  4. Set up daemon.json with the 'dns' key and your DNS addresses:
    Example: { "dns" : [ "10.0.0.2" , "8.8.8.8" ] }
  5. Add your user to the docker group:
    sudo groupadd docker
    sudo gpasswd -a ${USER} docker
    sudo service docker restart
    newgrp docker
  6. Make sure you have access to hub.docker.intel.com by trying to log in in the web portal: https://hub.docker.intel.com
  7. Associate Docker on your machine with your user account:
    docker login hub.docker.intel.com

Setting up the Go Environment

  1. Fetch the Golang distribution package:
    wget -c https://storage.googleapis.com/golang/go1.8.3.linux-amd64.tar.gz
  2. Extract the contents:
    sudo tar -C /usr/local -xvzf go1.8.3.linux-amd64.tar.gz
  3. Append the line below to your .bashrc file, usually located at $HOME/.bashrc:
    export PATH=$PATH:/usr/local/go/bin
  4. Apply the changes to the current session:
    source ~/.bashrc

Accessing Go Documentation in your Browser

Access the Go documentation for the SDK from your browser to view additional API information and samples that are not demonstrated in this document.

  1. Navigate to $GOPATH/context_linux_go and enter:
    godoc -http=:6060
  2. In a web browser, enter the URL:
    http://localhost:6060/
  3. Click on Packages from the menu on top of the webpage.

You should now be able to view the documentation for context_linux_go under the standard packages section of the webpage.

The Intel® Context Sensing SDK REST API


This document explains the REST API commands supported by the Intel® Context Sensing SDK.

Prerequisites

Software

  1. OS: Ubuntu* 14.04 or 16.04 or Windows® 10
  2. Linux* terminal / Windows Command line
  3. Curl*: Refer to Install Curl
  4. Postman*: (optional third-party tool): https://www.getpostman.com/

Install Curl

We use curl to run the REST API commands, both on Linux and Windows.

You can download curl from: https://curl.haxx.se/dlwiz/. Alternatively, you can install Postman as a Chrome* browser extension and use that instead.

Generic Curl Options

Generic curl options are explained below for reference.

Curl command options and their actions:

  • -v --noproxy '*': Run verbosely and ignore all proxy settings.
  • -X GET: Use the GET type of REST call. You can replace GET with POST, PUT, or other REST verbs.
  • -H "Authorization: Bearer none": Authorization parameters that the broker expects.
  • -H "Content-Type: application/json": The content type.
  • -d '{ --some JSON object-- }': The JSON body sent along with a POST/PUT type of message.

Start the Broker

  1. Make sure you have the broker set up. For steps, see Setting up the Broker.
  2. This section assumes you have Docker* already set up with Intel credentials. (Refer: Setting Up Docker)
  3. Make sure you have a running instance of MongoDB*, which is required by the broker.

Start the Terminal

  1. Start another terminal (or command line in Windows), separate from the terminal that the broker is running on.

REST APIs

The code sections below illustrate terminal commands or JSON input/output.

Note:  This is an example of a command: command. Enter commands in a terminal window, and press Enter to run them. 

  1. Get States
  2. Get Item
  3. Push States
  4. Send Command

1. GetStates

GetStates returns the current state of the bus, that is, the last known data for all endpoints and all types within each endpoint, as seen by the broker.

  • Method: GET
  • Resource: /states
  • Filter: none
  • Description: Returns all the states without any filter applied

Actual Command Line

curl -v --noproxy '*' -X GET -H "Authorization: Bearer none" http://localhost:8888/context/v1/states

Result with Response Code: 200 OK

If everything goes well, you will get the following result on the terminal that runs the broker:

no authorization

If the broker database is empty, you will receive a JSON response on your REST API terminal, containing an empty list (example below). If you want to push some states and populate this data, see: Push States.

{
    "data": {
        "kind": "states",
        "items": []
    }
}

On the other hand, if you have any data populated, you will get a JSON response that has the last received/forwarded data of all the types of every endpoint. It will look something like the following:

{
  "data": {
    "kind": "states",
    "items": [
      {
        "value": {
          "datetime": "2017-09-21T23:56:40.948Z"
        },
        "dateTime": "2017-09-21T23:56:40.948Z",
        "type": "urn:x-intel:context:thing:ticktock",
        "owner": {
          "device": {
            "id": "08:00:27:a2:a9:32:sample_ticktock",
            "runtime": null,
            "name": null
          },
          "user": {
            "id": "5514838787b1784b6b6f9e9a",
            "name": null
          },
          "application": {
            "id": "2dcdg777z7uan4tbmbch3rvd",
            "name": null
          }
        }
      },      
      {
        "value": {
          "sent_on": 1509737046088,
          "motion_detected": false,
          "device_id": "RSPSensor8"
        },
        "type": "urn:x-intel:context:retailsensingplatform:motion",
        "dateTime": "2017-11-03T19:24:06.098Z",
        "owner": {
          "device": {
            "runtime": null,
            "id": "0a:00:27:00:00:08:serviceMotionSensor",
            "name": null
          },
          "user": {
            "id": "5514838787b1784b6b6f9e9a",
            "name": null
          },
          "application": {
            "id": "2dcdg777z7uan4tbmbch3rvd",
            "name": null
          }
        }
      }
    ]
  }
}
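If you prefer to issue the same GET /states request from a Go program instead of curl, here is a minimal sketch using only the standard library; it assumes the broker is reachable at localhost:8888, as in the curl example above:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// Build the same request as the curl example: GET /states with the
	// "Authorization: Bearer none" header expected by the broker.
	req, err := http.NewRequest("GET", "http://localhost:8888/context/v1/states", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer none")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.Status)  // expect "200 OK"
	fmt.Println(string(body)) // the "states" JSON document shown above
}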

2. GetItem

GetItem returns data from a specific Item over a period of time.

  • Method: GET
  • Resource: /items
  • Filter: none
  • Description: Returns all the items

Actual Command line

curl -v --noproxy '*' -X GET -H "Authorization: Bearer none" http://localhost:8888/context/v1/items

Result with Response Code: 200 OK

If everything goes well, you will get a result on the terminal that runs the broker:

no authorization

On your Rest API Terminal, you will get a JSON response that contains the items:

{
    "data": {
        "kind": "items",
        "items": []
    }
}

3. PushStates

Allows you to push state/data to the broker. Make sure the state pushed complies with the registered JSON Schema.

  • Method: PUT
  • Resource: /states
  • Filter: none
  • Description: Pushes the current state/data to the broker

In this example, we want to push the following:

{
  "states": [
    {
      "type": "urn:x-intel:context:type:media:audio",
      "activity": "urn:activity:listening",
      "value": {
        "type": "song",
        "title": "Very interesting Song",
        "description": "Song by Metallica on Garage Inc.",
        "genre": [ "metal" ],
        "language": "eng",
        "author": "Metallica"
      },
      "dateTime": "2013-04-29T16:01:00+00:00"
    }
  ],
  "owner": {
    "device": {
      "id": "c2f6a5c0-b0f0-11e2-9e96-0800200c9a66"
    }
  }
}

Actual Command Line

curl -v --noproxy '*' -X PUT -H "Authorization: Bearer none" -H "Content-Type: application/json" http://localhost:8888/context/v1/states -d '{ "states":[{"type":"urn:x-intel:context:type:media:audio","activity":"urn:activity:listening","value":{"type":"song","title":"Very interesting Song","description":"Song by Metallica on Garage Inc.","genre":["metal"],"language":"eng","author":"Metallica"},"dateTime":"2013-04-29T16:01:00+00:00"}],"owner":{"device":{"id":"c2f6a5c0-b0f0-11e2-9e96-0800200c9a66"}} }'

Result with Response Code: 204 No Content

If everything goes well, you will get a result 204 with no content:

On your terminal running the broker, you will get no content.

On your Rest API terminal, you will get no content.
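The same push can be scripted from Go. Below is a minimal sketch, again assuming a local broker on port 8888, that sends the JSON body above with a PUT and checks for the 204 status:

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// The same state document used in the curl -d example above.
	payload := []byte(`{"states":[{"type":"urn:x-intel:context:type:media:audio","activity":"urn:activity:listening","value":{"type":"song","title":"Very interesting Song","description":"Song by Metallica on Garage Inc.","genre":["metal"],"language":"eng","author":"Metallica"},"dateTime":"2013-04-29T16:01:00+00:00"}],"owner":{"device":{"id":"c2f6a5c0-b0f0-11e2-9e96-0800200c9a66"}}}`)

	req, err := http.NewRequest("PUT", "http://localhost:8888/context/v1/states", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer none")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.StatusCode) // expect 204 (No Content)
}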

4. SendCommand

SendCommand will send a command to be executed by the broker or pass it along to be executed by an endpoint and pass the endpoint's result back to the calling service.

An example transaction: Endpoint1 <--------> Broker <---------> Endpoint2 with function retrieveitem() called by URN, such as urn:x-intel:context:command:getitem

  • Method: POST
  • Resource: /command
  • Filter: none
  • Description: Returns the result of the command executed

Actual Command Line

curl -v --noproxy '*' -X POST -H "Authorization: Bearer none" -H "Content-Type: application/json" http://localhost:8888/context/v1/command -d '{ "method": "urn:x-intel:context:command:getitem", "endpoint": { "macaddress": "0:0:0:0:0:0", "application": "sensing"}, "params": ["0:0:0:0:0:0:sensing", "urn:x-intel:context:type:devicediscovery"]}'

The -d option is used to send the JSON body. Refer to Generic Curl Options. The current body represents the following:

  • "method": "urn:x-intel:context:command:getitem": is the URN exposing the function retrieveitem() from endpoint2.
  • "endpoint": is the MAC address and application name of the service that needs to execute the method mentioned above (such as endpoint2).
    Note: A broker can also serve as an endpoint with MAC address 0:0:0:0:0:0 and application name sensing. You may also specify a different MAC address and application name for an endpoint2 in the system mentioned above.
  • params: The arguments that are expected by the function in endpoint2, such as retrieveitem(). In the following example, we expect 2 arguments: 0:0:0:0:0:0:sensing and urn:x-intel:context:type:devicediscovery. This is function-specific.

Note: You can send variations of this command by changing the endpoint, the URN for the function (method) that the endpoint supports, and the parameters that the function expects.

{
	"method": "urn:x-intel:context:command:getitem",
	"endpoint": {
		"macaddress": "0:0:0:0:0:0",
		"application": "sensing"
	},
	"params": [
		"0:0:0:0:0:0:sensing",
		"urn:x-intel:context:type:devicediscovery"
	]
}

Result with Response Code: 200 OK

If everything goes well, you will get a result on the terminal that runs the broker:

no authorization

On your Rest API terminal, you will get a JSON response that has a list of all devices registered and active with the broker.

{
  "result": {
    "body": {
      "type": "urn:x-intel:context:type:devicediscovery",
      "value": {
        "devices": [          
        ]
      }
    },
    "response_code": 200
  }
}
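For completeness, the same command can be issued from Go; a minimal sketch under the same assumptions as the earlier sketches (local broker on port 8888, "Bearer none" authorization):

package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// The same command body as the curl example above.
	payload := []byte(`{"method":"urn:x-intel:context:command:getitem","endpoint":{"macaddress":"0:0:0:0:0:0","application":"sensing"},"params":["0:0:0:0:0:0:sensing","urn:x-intel:context:type:devicediscovery"]}`)

	req, err := http.NewRequest("POST", "http://localhost:8888/context/v1/command", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer none")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.Status)  // expect "200 OK"
	fmt.Println(string(body)) // the devicediscovery result shown above
}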

Setting up Docker*

If an install script was provided with this document, simply run it in the terminal: ./install_docker.sh

If not, below are steps that need to be completed to successfully install Docker:

  1. Docker manual installation instructions: https://docs.docker.com/engine/installation/linux/ubuntu/#install-using-the-repository
  2. Determine your host machine's DNS servers: nmcli dev show | grep 'IP4.DNS'
  3. Set up daemon.json with the 'dns' key and your DNS addresses. Example: { "dns" : [ "10.0.0.2" , "8.8.8.8" ] }
  4. Add your user to the docker group:
    sudo groupadd docker
    sudo gpasswd -a ${USER} docker
    sudo service docker restart
    newgrp docker
  5. Make sure you have access to hub.docker.intel.com by trying to log in in the web portal: https://hub.docker.intel.com
  6. Associate Docker on your machine with your user account: docker login hub.docker.intel.com

Setting up the Broker

There are two options to set up the broker:

  • Dockerized: Context repo from hub.docker.intel.com (preferred)
  • Non-dockerized: context-broker-VERSION.tgz file

Notes

  • This document only covers the preferred Dockerized method.
  • The section assumes you have Docker already set up with Intel credentials. (Refer: Setting Up Docker)
  • The broker requires a running instance of MongoDB.

1. Use Docker to pull the Mongo* image onto your machine.

docker pull mongo

2. Create a container named mymongodb and run it for the very first time.

docker run --name=mymongodb -d mongo

Note: For subsequent runs use: docker start mymongodb

3. Pull the Context Linux Broker image.

docker pull hub.docker.intel.com/context/context-broker:v0.10.5

4. Create a container named contextbroker and run it for the very first time.

docker run --name contextbroker -it -p 8888:8888 --link mymongodb -e MONGODB_HOST=mymongodb hub.docker.intel.com/context/context-broker:v0.10.5

Notes

  1. For subsequent runs, use: docker start -i contextbroker
    -i
    or -it is used to run in the foreground and to see the output in the current terminal.
  2. To stop the context broker instance, use CTRL+C to interrupt when running in the foreground or docker stop contextbroker when running in the background.
  3. To remove the container if it's preventing the use of docker start, use: docker rm -f contextbroker

If you have access to our GitHub* repository (non-Dockerized)

1. Go into /broker and run:

python3 runserver.py

If everything goes well, you will get a result on the terminal that runs the broker:

Listening of Http(s)
8888

Levaux’s SenseAgent* Delivers End-to-End IoT Insight for Smart Building Management


A flexible sensor-based solution simplifies modernization for commercial real estate

"Our mission is to get these dense mesh networks into commercial buildings and high-rise residential towers—to provision the lighting and emergency management and support tracking."

—Dr. Simon Benson, CEO and founder, Levaux

 

Executive Summary

Commercial buildings face complex infrastructure and operational challenges to deploying IoT technology. With the comprehensive visibility provided by the Levaux SenseAgent* solution powered by Intel® architecture, buildings can be optimised on a per room basis, creating operational and cost efficiencies. High-resolution data can translate into capital savings, extensive capabilities, and improved occupant experiences—all for the price of lighting.

Challenges

The benefits for modernized building management systems (BMS) based on connected infrastructure are many, but attaining these is challenging for both brownfield and greenfield buildings in the IoT era.

Technology in older buildings tends to be more difficult to work with and prohibitively expensive to change. Most BMS were installed at the time of building construction and are rarely, if ever, upgraded. These BMS rely on cables, rather than the more flexible Wi-Fi or mesh networks, and wireless infrastructure is limited. Systems lack common protocols and interoperability, preventing holistic visibility into building operations. Hardwired equipment requires maintenance and programmatic upgrades to be handled manually—a labor- and cost-intensive process. Compliance management is also manual, with time-consuming testing of all emergency equipment conducted every three to twelve months.

Sensor density tends to be light, resulting in optimization based on data from limited areas of the building. Overall, older buildings are simply not getting enough data—or timely access to relevant reportage—to maximize efficiency and deliver good occupant experiences. In sum, brownfield buildings do not have the capabilities or flexibility to adjust systems based on changing occupancy and conditions, or to meet modern standards for wellness and productivity in commercial buildings.

Greenfield buildings may circumvent many of these obstacles, but they also face issues. These include the challenge of integration across often incompatible systems from multiple vendors and protocols, securing connected equipment from cyberthreats, and gathering and analyzing relevant data quickly and economically. Building management is often decentralized, prohibiting the savings of centralized, remote management.

How can all types of buildings get the advantages of IoT and establish a scalable foundation for the future? A streamlined smart sensor solution from Levaux running on an Intel® architecture-based IoT gateway is delivering the value of edge-to-cloud insight for smart building management.

Solution

Drawing on deep expertise in technology and management of complex systems from its work in the military and communications industries, Levaux has created a smart building solution designed to support both brownfield and greenfield venues. Levaux’s SenseAgent is an innovative solution that manages remote sensors monitoring a wide spectrum of key building functionality. This purpose-built solution for the building industry combines hardware, middleware, and cloud software and creates a fast, reliable connection from physical environment to sensor. Paired with cloud, the solution requires no training and simplifies BMS.

A single wireless sensor covers five core smart-building capabilities: lighting, safety, climate, security, and utilization. Each sensor gathers ten different metrics, tracking building variables such as occupancy, movement, ambient light, humidity, and temperature. The sensor is an integrated, architecturally designed ceiling fitting and eliminates the need to use numerous vendors to achieve the same breadth of functionality.

Figure 1. Levaux’s SenseAgent* covers five core capabilities for smart buildings in a single sensor

The high-density sensors can replace existing lighting controls at a cost equal to or less than most common lighting controllers and at a fraction of the capital cost of a traditional cabled lighting control solution.

Sensor data is aggregated, filtered, and processed by an Intel®-based IoT gateway, allowing for edge analytics, alerts, and notifications. Data needed for more in-depth or historical analysis is automatically sent to the cloud. The gateway’s powerful Intel® processor and storage capacity also support data backup, so building managers can safeguard and quickly access their IP. Machine learning provides recommendations based on usage patterns, automating key optimization functions. And, because Intel architecture enables parsing data needed for immediate action or longer-term analysis, the cost of transmitting all data to the cloud is reduced. The IoT gateway connects to the building’s backend and seamlessly interfaces with traditional BMS.

Figure 2. SenseAgent* combined with Intel® architecture provides effective, robust building management

The intuitive SenseAgent interface simplifies data analysis and changes. Levaux’s sensors can be programmed remotely, allowing the solution to evolve to meet changing building management requirements and opportunities.

SenseAgent's five capabilities automate service delivery and increase visibility and control:

  • Lighting / Sustainability: Efficiently manage energy consumption
  • Safety / Compliance: Proactive procedures to ensure safety of people and assets
  • Climate / Control: Predictive analytics for smart climate control
  • Security / Tracking: Integrated security tracking of people and assets
  • Utilisation / Productivity: Efficient utilisation of property and increased productivity of people

SenseAgent Benefits for Commercial Real Estate

  • Plug and play: Lighting control sensors are installed directly inline between the ceiling and luminaires. They provide power to the luminaire and lighting control via common lighting control protocols (DALI, PWM, 0–10 volt). Sensors communicate using a wireless mesh, so there is no requirement for expensive communication cabling and network infrastructure.
  • Support multiple applications and metrics: Replace existing lighting control solutions, while offering additional functionality to cover a range of applications and metrics across building operations.
  • Ease of use: The SenseAgent cloud application is a web-based application that can be accessed from any device anywhere in the world. It has been designed and built using modern user-interface guidelines and technologies to be easy to use with minimal training.
  • Wireless infrastructure: SenseAgent deployments provide buildings with dense wireless mesh networks. A routed Bluetooth* mesh network with large bandwidth capability enables building assets to exploit future applications.
  • Improve operations: Creates intelligent, more sustainable environments, minimising energy consumption and supporting increased productivity

How It Works in Brief

The end-to-end engineered SenseAgent IoT solution couples electronic hardware with purpose-designed middleware and a cloud interface for a fully vertically integrated out-of-the-box system. The wireless building sensor system is designed to perform immediately upon deployment in brownfield or greenfield sites.

The solution was created with a robust middleware using object-orientated remote procedure calls (RPC) architecture implemented in pure C++ from the ground up, in conjunction with embedded hardware and an optimised mesh network stack. This supports smart connected systems with the lowest latency.

Designed from the start to be an enterprise-class, secure, scalable, and reliable IoT communication system that does not sacrifice speed, the solution avoids lightweight communication protocols and messaging queues.

Sensors perform autonomously without the need for constant cloud connectivity, using weekly operational schedules and programs. The sensors sample data every second, sharing it directly peer-to-peer over the mesh network for local processing to control electrical equipment using logic-based decision-making.

Embedded hardware running unique dynamic firmware loads operational schedule profiles from the cloud. Dynamic firmware accepts schedule updates over the air, allowing the sensor system to adapt to changes in functional purpose and making it more suitable for edge computing and process optimisation.

Machine learning is made possible by a mesh network of sensors managed by multipoint Intel-based IoT gateways. The gateways supervise the commands on the network sensors and the flow of data to the cloud. Data that is stored and processed at the edge applies machine learning to generate knowledge to optimise sensor behaviour.

The lighting-based sensors create a wireless mesh network on the building ceiling and use existing power supplies. New applications can be added or built on top of the mesh network over time. The solution is designed to scale and evolve with building management needs and to help future-proof investments.

Figure 3. SenseAgent* innovative architecture simplifies smart building modernization

The Foundation for IoT

The Levaux solution is just one example of how Intel works closely with the IoT ecosystem to help enable smart IoT solutions based on standardized, scalable, reliable Intel® architecture and software. These solutions range from sensors and gateways to server and cloud technologies to data analytics algorithms and applications. Intel provides essential end-to-end capabilities—performance, manageability, connectivity, analytics, and advanced security—to help accelerate innovation and increase revenue for enterprises, service providers, and the building industry.

Conclusion

With the Levaux SenseAgent and Intel-based IoT gateway, building managers have the insight to maximize efficiency, proactively address maintenance, optimize environments, and improve occupant wellness and productivity. By simplifying IoT integration with an affordable, connected, end-to-end solution, Levaux and Intel are enabling the considerable advantages of making buildings smarter.

Learn More

For more information about Levaux, please visit senseagent.com or contact us at support@senseagent.com

For more information about Intel® IoT Technology and the Intel IoT Solutions Alliance, please visit intel.com/iot.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at intel.com/iot. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others

Performance Optimization of Intel® Xeon® Processor Using Intel® Data Analytics Acceleration Library


Abstract

This article provides a comparative study of the performance of the Intel® Xeon® Gold processor when the Naive Bayes algorithm from the textbook Artificial Intelligence: A Modern Approach (AIMA) by Stuart Russell and Peter Norvig, from scikit-learn* (SkLearn), and from the PyDAAL programming interface is run, showing the advantage of the Intel® Data Analytics Acceleration Library (Intel® DAAL). The accuracy of the above-mentioned varieties of the Naive Bayes classifier on Intel® Xeon® processors was calculated and compared. It was observed that the performance of Naive Bayes is considerably better with PyDAAL (multinomial) than with SkLearn and AIMA. It was also observed that performance was better with SkLearn than with AIMA.

Test and System Configuration 

Environment setup

We used the following environment setup to run the code and determine the test processor performance.

  • Processor: Intel® Xeon® Gold 6128 processor 3.40 GHz
  • System: CentOS* (7.4.1708)
  • Cores: 24
  • Storage (RAM): 92 GB
  • Python* Version: 3.6.2
  • PyDAAL Version: 2018.0.0.20170814

Test setup

We used the following conventions and methods to perform the test and compare the values:

  • To run the Naive Bayes classifier from PyDAAL, we used the Conda* virtual environment.
  • The Naive Bayes classifier described in AIMA is available in the learning_apps.ipynb file from the GitHub* code.
  • We calculated the average execution time and accuracy of learning_apps.ipynb (converted to .py) with the Naive Bayes learner from AIMA.
  • We calculated the average execution time and accuracy of learning_apps.ipynb (converted to .py) with the Naive Bayes classifiers from SkLearn and PyDAAL.
  • To calculate the average execution time, Linux* time command is used:
    • Example: time(cmd="python learning_apps.py"; for i in $(seq 10); do $cmd; done)
    • Average execution time = time/10.
  • To calculate accuracy, the accuracy_score method in SkLearn is used in all cases (see the sketch after this list).
  • Performance gain percentage = ((AIMA - PyDAAL)/AIMA) × 100 or ((SkLearn - PyDAAL)/SkLearn) × 100.
  • Performance improvement (x) = AIMA(s)/PyDAAL(s) or SkLearn(s)/PyDAAL(s).
  • The higher the value of the performance gain percentage, the better the performance of PyDAAL.
  • A performance improvement (x) value greater than 1 indicates better performance for PyDAAL.
  • Only the Naive Bayes part of the learning_apps.ipynb file is compared.
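
The following minimal sketch illustrates these conventions. The prediction arrays and timing values are placeholders rather than measured data; only the use of accuracy_score comes from the actual test procedure.

from sklearn.metrics import accuracy_score

# Placeholder predictions and timings; substitute the measured values.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
aima_s, pydaal_s = 120.0, 10.0  # hypothetical average execution times in seconds

accuracy = accuracy_score(y_true, y_pred)           # accuracy metric used in all cases
gain_percent = (aima_s - pydaal_s) / aima_s * 100   # performance gain percentage
improvement_x = aima_s / pydaal_s                   # performance improvement (x)
print(accuracy, gain_percent, improvement_x)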

Code and conditional probability

The Naive Bayes learner part of the code given in AIMA was compared to the corresponding implementation from SkLearn (Gaussian and multinomial) and PyDAAL (multinomial). The following are the relevant code samples:

AIMA
import numpy as np
from learning import *  # AIMA helpers: DataSet, NaiveBayesLearner, manhattan_distance

temp_train_lbl = train_lbl.reshape((60000,1))
training_examples = np.hstack((train_img, temp_train_lbl))

MNIST_DataSet = DataSet(examples=training_examples, distance=manhattan_distance)
nBD = NaiveBayesLearner(MNIST_DataSet, continuous=False)
y_pred = np.empty(len(test_img),dtype=np.int)
for i in range(0, len(test_img)-1):
    y_pred[i] = nBD(test_img[i])

temp_test_lbl = test_lbl.reshape((10000,1))
temp_y_pred_np = y_pred.reshape((10000,1))
SkLearn (Gaussian)
from sklearn.naive_bayes import GaussianNB

classifier=GaussianNB()
classifier = classifier.fit(train_img, train_lbl)

churn_predicted_target=classifier.predict(test_img)
SkLearn (multinomial)
from sklearn.naive_bayes import MultinomialNB

classifier=MultinomialNB()
classifier = classifier.fit(train_img, train_lbl)

churn_predicted_target=classifier.predict(test_img)
PyDAAL (multinomial)
import numpy as np
from daal.data_management import HomogenNumericTable, BlockDescriptor_Float64, readOnly
from daal.algorithms import classifier
from daal.algorithms.multinomial_naive_bayes import training as nb_training
from daal.algorithms.multinomial_naive_bayes import prediction as nb_prediction

def getArrayFromNT(table, nrows=0):
    bd = BlockDescriptor_Float64()
    if nrows == 0:
        nrows = table.getNumberOfRows()
    table.getBlockOfRows(0, nrows, readOnly, bd)
    npa = np.copy(bd.getArray())
    table.releaseBlockOfRows(bd)
    return npa

temp_train_lbl = train_lbl.reshape((60000,1))
train_img_nt = HomogenNumericTable(train_img)
train_lbl_nt = HomogenNumericTable(temp_train_lbl)
temp_test_lbl = test_lbl.reshape((10000,1))
test_img_nt = HomogenNumericTable(test_img)
nClasses=10
nb_train = nb_training.Online(nClasses)

# Pass new block of data from the training data set and dependent values to the algorithm
nb_train.input.set(classifier.training.data, train_img_nt)
nb_train.input.set(classifier.training.labels, train_lbl_nt)
# Update the partial Naive Bayes model
nb_train.compute()
model = nb_train.finalizeCompute().get(classifier.training.model)

nb_Test = nb_prediction.Batch(nClasses)
nb_Test.input.setTable(classifier.prediction.data,  test_img_nt)
nb_Test.input.setModel(classifier.prediction.model, model)
predictions = nb_Test.compute().get(classifier.prediction.prediction)

predictions_np = getArrayFromNT(predictions)

The ‘learning_apps.ipynb’ notebook from ‘aima-python-master’ is used as the reference code for the experiment. This file implements classification of the MNIST dataset using the Naive Bayes classifier in a conventional way, but it takes a long time to classify the data.

To check for better performance, the same experiment is implemented using PyDAAL, a high-performance data analytics library for Python*. PyDAAL mainly uses ‘NumericTables’, a generic data type for representing data in memory.

In the code, the data is loaded as train_img, train_lbl, test_img, and test_lbl using the function ‘load_MNIST()’. The ‘train_img’ and ‘test_img’ arrays hold the training and test data, while train_lbl and test_lbl hold the labels used for training and testing. These inputs are converted into a 'HomogenNumericTable' after checking that they are ‘C-contiguous’, because the conversion requires C-contiguous input data.
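
A minimal sketch of this conversion step is shown below. The helper function and the placeholder arrays are illustrative assumptions and not part of the original sample code.

import numpy as np
from daal.data_management import HomogenNumericTable

def to_numeric_table(array):
    # HomogenNumericTable expects a C-contiguous, two-dimensional array.
    arr = np.asarray(array, dtype=np.float64)
    if not arr.flags['C_CONTIGUOUS']:
        arr = np.ascontiguousarray(arr)  # make a C-contiguous copy if needed
    return HomogenNumericTable(arr)

# Placeholder data shaped like the MNIST arrays returned by load_MNIST().
train_img = np.random.rand(60000, 784)
train_lbl = np.random.randint(0, 10, size=(60000, 1)).astype(np.float64)
train_img_nt = to_numeric_table(train_img)
train_lbl_nt = to_numeric_table(train_lbl)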

An algorithm object (nb_train) is created to train the multinomial Naive Bayes model in online processing mode. The two pieces of input, that is, data and labels, are set using the 'input.set' member methods of the ‘nb_train’ algorithm object. Further, the 'compute()' method is used to update the partial model. After creating the model, a test object (nb_Test) is defined. The testing data set and the trained model are passed to the algorithm using the input.setTable() and nb_Test.input.setModel() methods, respectively. After finding the predictions using the ‘compute()’ method, the accuracy and the time taken for the experiment are calculated. The ‘SkLearn’ library and the ‘time’ command in Linux are used for these calculations.

Another implementation of the same code was done using the multinomial Naive Bayes classifier from SkLearn for comparison with the conventional method and PyDAAL.

On analyzing the time taken for the experiments, it is clear that PyDAAL has better time performance compared to the other methods.

  • The conditional probability distribution assumption made in AIMA is
    • A probability distribution formed by observing and counting examples.
    • If p is an instance of this class and o is an observed value, there are three main operations (see the sketch after this list):
      • p.add(o) increments the count for observation o by 1.
      • p.sample() returns a random element from the distribution.
      • p[o] returns the probability for o (as in a regular ProbDist).
  • The conditional probability distribution assumption made in Gaussian Naive Bayes is Gaussian/normal distribution.
  • The conditional probability distribution assumption made in multinomial Naive Bayes is multinomial distribution.
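
The sketch below is a toy illustration of such a counting-based distribution. It mimics the three operations listed above but is not the AIMA implementation itself.

import random
from collections import Counter

class CountingProbDist:
    # Toy counting-based distribution with the three operations listed above.
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def add(self, o):
        # Increment the count for observation o by 1.
        self.counts[o] += 1
        self.total += 1

    def sample(self):
        # Return a random element, weighted by the observed counts.
        return random.choices(list(self.counts), weights=list(self.counts.values()))[0]

    def __getitem__(self, o):
        # Probability of o, estimated from the observed counts.
        return self.counts[o] / self.total if self.total else 0.0

p = CountingProbDist()
for o in ['spam', 'ham', 'spam']:
    p.add(o)
print(p['spam'], p.sample())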

Introduction

During the test, the Intel® Xeon® Gold processor was used to run Naive Bayes from AIMA, SkLearn (Gaussian and multinomial), and PyDAAL (multinomial). To determine the performance improvement, we compared the accuracy percentage for all relevant scenarios. We also calculated the performance improvement (x) of PyDAAL relative to the others. Naive Bayes (Gaussian) was not included in this calculation because we considered it more appropriate to compare the multinomial versions of SkLearn and PyDAAL.

Observations

Intel® DAAL helps to speed up big data analysis by providing highly optimized algorithmic building blocks for all stages of data analytics (preprocessing, transformation, analysis, modeling, validation, and decision making) in batch, online, and distributed processing modes of computation.

  • Helps applications deliver better predictions faster
  • Analyzes larger data sets with the same compute resources
  • Optimizes data ingestion and algorithmic compute together for the highest performance
  • Supports offline, streaming, and distributed usage models to meet a range of application needs
  • Provides priority support―connect privately with Intel engineers for technical questions

Accuracy

We ran the Naive Bayes classifiers from AIMA, SkLearn, and PyDAAL, and observed that both PyDAAL and SkLearn (multinomial) had the same percentage of accuracy (refer to Test and System Configuration).

Figure 1 provides a graph of the accuracy values of Naive Bayes.

graph of accuracy values
Figure 1. Intel® Xeon® Gold 6128 processor—graph of accuracy values.

Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Configuration: Intel® Xeon® Gold 6128 processor 3.40 GHz; System CentOS* (7.4.1708); Cores 24; Storage (RAM) 92 GB; Python* Version 3.6.2; PyDAAL Version 2018.0.0.20170814.
Benchmark Source: Intel Corporation. See below for further notes and disclaimers.1

Performance improvement

The performance improvement (x) with respect to time was calculated for the Naive Bayes implementations (AIMA versus PyDAAL, and SkLearn versus PyDAAL), and we observed that performance (refer to Test and System Configuration) was better with PyDAAL.

Figures 2 and 3 provide graphs of the performance improvement speedup values.

AIMA versus PyDAAL
Figure 2. Intel® Xeon® Gold 6128 processor—graph of AIMA versus PyDAAL performance improvement.

Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Configuration: Intel® Xeon® Gold 6128 processor 3.40 GHz; System CentOS* (7.4.1708); Cores 24; Storage (RAM) 92 GB; Python* Version 3.6.2; PyDAAL Version 2018.0.0.20170814.
Benchmark Source: Intel Corporation. See below for further notes and disclaimers.1

SkLearn versus PyDAAL
Figure 3. Intel® Xeon® Gold 6128 processor—graph of SkLearn versus PyDAAL performance improvement.

Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Configuration: Intel® Xeon® Gold 6128 processor 3.40 GHz; System CentOS* (7.4.1708); Cores 24; Storage (RAM) 92 GB; Python* Version 3.6.2; PyDAAL Version 2018.0.0.20170814.
Benchmark Source: Intel Corporation. See below for further notes and disclaimers.1

Summary

The optimization test on the Intel Xeon Gold processor illustrates that PyDAAL takes less time (see Figure 4) and hence provides better performance (refer to Test and System Configuration) when compared to AIMA and SkLearn. In this scenario, both SkLearn (multinomial) and PyDAAL had the same accuracy. The conditional probability distribution in AIMA is a simple counting-based distribution, whereas SkLearn and PyDAAL assume a Gaussian or multinomial distribution, which explains the difference in accuracy observed.

graph of performance time.
Figure 4. Intel® Xeon® Gold 6128 processor—graph of performance time.

Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Configuration: Intel® Xeon® Gold 6128 processor 3.40 GHz; System CentOS* (7.4.1708); Cores 24; Storage (RAM) 92 GB; Python* Version 3.6.2; PyDAAL Version 2018.0.0.20170814.
Benchmark Source: Intel Corporation. See below for further notes and disclaimers.1

References

  1. AIMA code:
    https://github.com/aimacode/aima-python
  2. The AIMA data folder:
    https://github.com/aimacode/aima-python (download separately)
  3. Book:
    Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig

1Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit www.intel.com/benchmarks.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Use Transfer Learning For Efficient Deep Learning Training On Intel® Xeon® Processors


Introduction

This is an educational white paper on transfer learning, showcasing how existing deep learning models can be easily and flexibly customized to solve new problems. One of the biggest challenges with deep learning is the large number of labeled data points required to train models to sufficient accuracy. For example, the ImageNet*2 database for image recognition consists of over 14 million hand-labeled images. While the number of possible applications of deep learning systems in vision tasks, text processing, speech-to-text translation, and many other domains is enormous, very few potential users of deep learning systems have sufficient training data to create models from scratch. A common concern among teams considering the use of deep learning to solve business problems is the need for training data: “Doesn’t deep learning need millions of samples and months of training to get good results?” One powerful solution is transfer learning, in which part of an existing deep learning model is re-optimized on a small data set to solve a related, but new, problem. In fact, one of the great attractions of transfer learning is that, unlike most traditional approaches to machine learning, we can take models trained on one (perhaps very large) dataset and modify them quickly and easily to work well on a new problem (where perhaps we have only a very small dataset). Transfer learning methods are not only parsimonious in their training data requirements, but they also run efficiently on the same Intel® Xeon® processor (CPU)-based systems that are widely used for other analytics workloads, including machine learning and deep learning inference. The abundance of readily available CPU capacity in current data centers, in conjunction with transfer learning, makes CPU-based systems a preferred choice for deep learning training and inference.

Today, transfer learning appears most notably in data mining, machine learning, and their applications1. Traditional machine learning techniques attempt to learn each task from scratch, while transfer learning transfers knowledge from a previous task to a target task when the latter has less high-quality training data.
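
As a concrete illustration of this idea, the minimal sketch below re-optimizes only the final layer of a pretrained image model on a small dataset. It assumes a PyTorch*-style workflow and a hypothetical number of target classes; the white paper itself does not prescribe a specific framework, and the same pattern applies in other frameworks.

import torch
import torch.nn as nn
import torchvision.models as models

# Start from a model pretrained on a large dataset (ImageNet).
model = models.resnet18(pretrained=True)

# Freeze the existing feature-extraction layers.
for param in model.parameters():
    param.requires_grad = False

# Replace only the final classification layer for the new, smaller task.
num_new_classes = 5  # hypothetical number of classes in the target dataset
model.fc = nn.Linear(model.fc.in_features, num_new_classes)

# Only the new layer's parameters are re-optimized on the small dataset.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()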

References

  1. A Survey of Transfer Learning
  2. ImageNet*

 

Installing a New License File Manually


Use this option if you need to replace an expired license file with a new one for an existing installation.

Place the new license file "*.lic" in the following directory, making sure not to change the license file name:

  • On Windows*:
    <installation drive>\Program Files\Common Files\Intel\Licenses
    For example: "c:\Program Files\Common Files\Intel\Licenses"
    Note: If the INTEL_LICENSE_FILE environment variable is defined, copy the file to the directory specified by the environment variable instead.
  • On Linux*: /opt/intel/licenses
  • On OS X*: /Users/Shared/Library/Application Support/Intel/Licenses

Note: You will likely need administrative/root privileges to copy the license to the named directory.

Make sure to remove expired license files from the directory to ensure the correct file is being used.


Artificial Intelligence and Healthcare Data


Introduction

Health professionals and researchers have access to plenty of healthcare data. However, the implementation of artificial intelligence (AI) technology in healthcare is very limited, primarily due to a lack of awareness about AI; AI remains unfamiliar to most healthcare professionals. The purpose of this article is to introduce AI to healthcare professionals and describe its application to different types of healthcare data.

IT (information technology) professionals such as data scientists, AI developers, and data engineers are also facing challenges in the healthcare domain; for example, finding the right problem,1 lack of data availability for training of AI models, and various issues with the validation of AI models. This article highlights the various potential areas of healthcare where IT professionals can collaborate with healthcare experts to build teams of doctors, scientists, and developers, and translate ideas into healthcare products and services.

Intel provides educational software and hardware support to health professionals, data scientists, and AI developers. Based on the dataset type, we highlight a few use cases in the healthcare domain where AI was applied using various medical datasets.

Artificial Intelligence


AI is a set of intelligent techniques that enable computers to mimic human behavior. AI in healthcare uses algorithms and software to analyze complex medical data and find relationships between patient outcomes and prevention/treatment techniques.2 Machine learning (ML) is a subset of AI. It uses various statistical methods and algorithms, and enables a machine to improve with experience. Deep learning (DL) is a subset of ML.3 It takes machine learning to the next level with multilayer neural network architectures. It identifies patterns and performs other complex tasks much as the human brain does. DL has been applied in many fields such as computer vision, speech recognition, natural language processing (NLP), object detection, and audio recognition.4 Deep neural networks (DNNs) and recurrent neural networks (RNNs), examples of deep learning architectures, are utilized in improving drug discovery and disease diagnosis.5

Relationship of AI, machine learning, and deep learning.

Figure 1. Relationship of artificial intelligence, machine learning, and deep learning.

AI Health Market

According to Frost & Sullivan (a growth partnership company), the AI market in healthcare may reach USD 6.6 billion by 2021, a 40 percent growth rate. AI has the potential to reduce the cost of treatment by up to 50 percent.6 AI applications in healthcare may generate USD 150 billion in annual savings by 2026, according to an Accenture analysis. AI-based smart workforces, cultures, and solutions are consistently evolving to support the healthcare industry in multiple ways, such as:7

  • Alleviating the burden on clinicians and giving medical professionals the tools to do their jobs more effectively.
  • Filling in gaps during the rising labor shortage in healthcare.
  • Enhancing efficiency, quality, and outcomes for patients.
  • Magnifying the reach of care by integrating health data across platforms.
  • Delivering benefits of greater efficiency, transparency, and interoperability.
  • Maintaining information security.

Healthcare Data

Hospitals, clinics, and medical and research institutes generate a large volume of data on a daily basis, including lab reports, imaging data, pathology reports, diagnostic reports, and drug information. Such data is expected to increase greatly in the next few years as people expand their use of smartphones, tablets, the IoT (Internet of Things), and fitness gadgets to generate information.8 Digital data is expected to reach 44 zettabytes by 2020, doubling every year.9 The rapid expansion of healthcare data is one of the greatest challenges for clinicians and physicians. Current literature suggests that the big data ecosystem and AI are solutions for processing this massive data explosion while meeting the social, financial, and technological demands of healthcare. Analysis of such big and complicated data is often difficult and requires a high level of skill. The most challenging part is interpreting the results and making recommendations based on the outcome, which requires medical experience along with many years of medical involvement, knowledge, and specialized skill sets.

In healthcare, data is generated, collected, and stored in multiple formats, including numbers, text, images, scans, audio, and video. If we want to apply AI to our dataset, we first need to understand the nature of the data and the questions we want to answer from the target dataset. The data type helps us to formulate the neural network, algorithm, and architecture for AI modeling. Here, we introduce a few AI-based cases as examples to demonstrate the application of AI in healthcare in general. Typically, the approach can be customized based on the project and area of interest (that is, oncology, cardiology, pharmacology, internal medicine, primary care, urgent care, emergency medicine, and radiology). Below is a list of AI applications, organized by the format of the dataset, that are gaining momentum in the real world.

Healthcare Dataset: Pictures, Scans, Drawings

One of the most popular ways to generate data in healthcare is with images, such as scans (PET scan image credit: Susan Landau and William Jagust, UC Berkeley)10, tissue sections11, drawings12, and organ images13 (Figure 2A). In this scenario, specialists look for particular features in an image. A pathologist collects such images under the microscope from tissue sections (fat, muscle, bone, brain, liver biopsy, and so on). Recently, Kaggle organized the Intel and MobileODT Cervical Cancer Screening Competition to improve the precision and accuracy of cervical cancer screening using a big image data set (training, testing, and additional data sets).14 The participants used different deep learning models such as the faster region-based convolutional neural network (R-CNN) detection framework with VGG16,15 supervised semantics-preserving deep hashing (SSDH) (Figure 2B), and U-Net for convolutional networks.16 Dr. Silva achieved 81 percent accuracy on the validation test using the Intel® Deep Learning SDK and GoogLeNet* on Caffe*.16

Similarly, Xu et al. investigated a dataset of over 7,000 images of single red blood cells (RBCs) from eight patients with sickle cell disease. They selected a DNN classifier to classify the different RBC types.17 Gulshan et al. applied a deep convolutional neural network (DCNN) to more than 10,000 retinal images collected from 874 patients to detect referable diabetic retinopathy (moderate or worse) with about 90 percent sensitivity and specificity.18

Various types of healthcare image data

Figure 2. A) Various types of healthcare image data. B) Supervised semantics-preserving deep hashing (SSDH), a deep learning model, used in the Intel and MobileODT Cervical Cancer Screening Competition for image classification. Source: 10-13,16

Positron emission tomography (PET), computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound images (Figure 2A) are another source of healthcare data, where images of internal organs and tissue (such as the brain or tumors) are collected noninvasively. Deep learning models can be used to measure tumor growth over time in cancer patients on medication. Jaeger et al. applied a convolutional neural network (CNN) architecture to diffusion-weighted MRI. Based on an estimation of the properties of the tumor tissue, this architecture reduced false-positive findings and thereby decreased the number of unnecessary invasive biopsies. The researchers noticed that deep learning reduced motion and vision errors, and thus provided more stable results in comparison to manual segmentation.19 A study conducted in China showed that deep learning helped achieve 93 percent accuracy in distinguishing malignant from benign cancer on the elastograms of ultrasound shear-wave elastography of 200 patients.20,21

Healthcare Dataset: Numerical

Example of numerical data

Figure 3. Example of numerical data.

Healthcare industries collect a lot of patient- and research-related information such as age, height, weight, blood profile, lipid profile, blood sugar, blood pressure, and heart rate. Similarly, gene expression data (for example, fold change) and metabolic information (for example, levels of metabolites) are also expressed as numbers.

The literature shows several cases where neural networks were successfully applied in healthcare. For instance, Danaee and Ghaeini from Oregon State University (2017) used a deep architecture, the stacked denoising autoencoder (SDAE) model, for the extraction of meaningful features from the gene expression data of 1097 breast cancer and 113 healthy samples. This model enables the classification of breast cancer cells and the identification of genes useful for cancer prediction (as biomarkers) or as potential therapeutic targets.22 Kaggle shared the breast cancer dataset from the University of Wisconsin containing information on the radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension of the cancer cell nucleus. In the Kaggle competition, the participants successfully built a DNN classifier to predict the breast cancer type (malignant or benign).23
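
As a hands-on illustration of working with this kind of numerical data, the minimal sketch below trains a small fully connected network on the Wisconsin diagnostic dataset that ships with scikit-learn*. The network size and train/test split are illustrative choices, not those used in the Kaggle competition.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Wisconsin diagnostic breast cancer data: 30 numerical features per sample.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Small fully connected network as a stand-in for the DNN classifiers discussed above.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))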

Healthcare Dataset: Textual

Example of textual data

Figure 4. Example of textual data.

Plenty of medical information is recorded as text; for instance, clinical data (cough, vomiting, drowsiness, and diagnosis), social, economic, and behavioral data (such as poor, rich, depressed, happy), social media reviews (Twitter, Facebook, Telegram*, and so on), and drug history. NLP, a branch of AI that deals with language, translates free text into standardized data. It enhances the completeness and accuracy of electronic health records (EHRs). NLP algorithms extract risk factors from notes available in the EHR.
For example, NLP was applied to 21 million medical records. It identified 8500 patients who were at risk of developing congestive heart failure with 85 percent accuracy.24 The Department of Veterans Affairs used NLP techniques to review more than two billion EHR documents for indications of post-traumatic stress disorder (PTSD), depression, and potential self-harm in veteran patients.25 Similarly, NLP was used to identify psychosis with 100 percent accuracy in schizophrenic patients based on speech patterns.26 IBM Watson* analyzed 140,000 academic articles, more than any human could read, understand, or remember, and suggested recommendations about a course of therapy for cancer patients.24

Examples of electrogram data

Figure 5. Examples of electrogram data. Source:27,31

Healthcare Dataset: Electrogram

Architecture of deep learning with convolutional neural network model

Figure 6. Architecture of deep learning with convolutional neural network model useful in classification of EEG data. (Source: 28-29)

Electrocardiograms (ECG)27, electroencephalograms (EEG), electrooculograms (EOG), electromyograms (EMG), and sleep tests are some examples of graphical healthcare data. An electrogram records the electrical activity of a target organ (such as the heart, brain, or muscle) over a period of time using electrodes placed on the skin.

Schirrmeister et al. from the University of Freiburg designed and trained a deep ConvNet (deep learning with convolutional networks) model to decode raw EEG data, which is useful for EEG-based brain mapping.28,29 Pourbabaee et al. from Concordia University, Canada used a large volume of raw ECG time-series data to build a DCNN model. Interestingly, this model learned key features of paroxysmal atrial fibrillation (PAF), a life-threatening heart disease, and was thereby useful in PAF patient screening. This method can be a good alternative to traditional, ad hoc, and time-consuming handcrafted features.30 Sleep stage classification is an important preliminary exam for sleep disorders. Using 61 polysomnography (PSG) time series, Chambon et al. built a deep learning model for classification of sleep stages. The model showed better performance than traditional methods, with little run time and computational cost.31

Healthcare Dataset: Audio and Video

Example of audio data

Figure 7. Example of audio data.

Sound event detection (SED) deals with detecting the onset and offset times of each sound event in an audio recording and associating a textual descriptor with it. SED has recently been drawing great interest in the healthcare domain for health monitoring. Cakir et al. combined CNNs and RNNs in a convolutional recurrent neural network (CRNN) and applied it to a polyphonic sound event detection task. They observed a considerable improvement with the CRNN model.32

Videos are a sequence of images; in some cases they can be considered a time series, and in very particular cases, dynamical systems. Deep learning techniques help researchers in both the computer vision and multimedia communities boost the performance of video analysis significantly and initiate new research directions in analyzing video content. Microsoft started a research project called InnerEye* that uses machine learning technology to build innovative tools for the automatic, quantitative analysis of three-dimensional radiological images. Project InnerEye employs algorithms such as deep decision forests as well as CNNs for the automatic, voxel-wise analysis of medical images.33 Khorrami et al. built a model on videos from the Audio/Visual Emotion Challenge (AVEC 2015) using both RNNs and CNNs, and performed emotion recognition on video data.34

Healthcare Dataset: Molecular Structure

Molecular structure of 4CDG

Figure 8. Molecular structure of 4CDG (Source: rcbs.org)

Figure 8 shows a typical example of the molecular structure of a drug molecule. Generally, the design of a new molecule relies on a historical dataset of old molecules. In quantitative structure-activity relationship (QSAR) analysis, scientists try to find known and novel patterns between structures and activity. At the Merck Research Laboratory, Ma et al. used a dataset of thousands of compounds (about 5000) and built a model based on a DNN (deep neural net) architecture.35 In another QSAR study, Dahl et al. built neural network models on 19 datasets of 2,000‒14,000 compounds each to predict the activity of new compounds.36 Aliper and colleagues built a deep neural network–support vector machine (DNN–SVM) model that was trained on a large transcriptional response dataset and classified various drugs into therapeutic categories.37 Tavanaei developed a convolutional neural network model to classify tumor suppressor genes and proto-oncogenes with 82.57 percent accuracy. This model was trained on tertiary structures of proteins obtained from the Protein Data Bank.38 AtomNet* is the first structure-based DCNN. It incorporates structural target information and consequently predicts the bioactivity of small molecules. This application worked successfully to predict new, active molecules for targets with no previously known modulators.39

AI: Solving Healthcare Problems

Here are a few practical examples where AI developers, startups, and institutes are building and testing AI models:

  • As emotional intelligence indicators that detect subtle cues in speech, inflection, or gesture to assess a person’s mood and feelings
  • Help in tuberculosis detection
  • Help in the treatment of PTSD
  • AI chatbots (Florence*, SafedrugBot*, Babylon Health*, SimSensei*)
  • Virtual assistants in helping patients and clinicians
  • Verifying insurance
  • Smart robots that explain lab reports
  • Aging-based AI centers
  • Improving clinical documentation
  • Personalized medicine

Data Science and Health Professionals: A Combined Approach

Deep learning has great potential to help medical and paramedical practitioners by:

  • Reducing the human error rate40 and workload
  • Helping in diagnosis and the prognosis of disease
  • Analyzing complex data and building a report

The examination of thousands of images is complex, time consuming, and labor intensive. How can AI help?

A team from Harvard Medical School’s Beth Israel Deaconess Medical Center noticed a 2.9 percent error rate with the AI model and a 3.5 percent error rate with pathologists for breast cancer diagnosis. Interestingly, the pairing of “deep learning with pathologist” showed a 0.5 percent error rate, which is an 85 percent drop.40 Litjens et al. suggest that deep learning holds great promise in improving the efficacy of prostate cancer diagnosis and breast cancer staging. 41,42

Intel® AI Academy

Intel provides educational software and hardware support to health professionals, data scientists, and AI developers, and makes free AI training and tools available through the Intel® AI Academy.

Intel recently published a series of AI hands-on tutorials, walking through the process of AI project development, step-by-step. Here you will learn:

  • Ideation and planning
  • Technology and infrastructure
  • How to build an AI model (data and modeling)
  • How to build and deploy an app (app development and deployment)

Intel is committed to providing solutions for your healthcare projects. Please read the article on the Intel AI Academy to learn more about solutions using Intel® architecture (Intel® Processors for Deep Learning Training). In the next article, we explore examples of healthcare datasets where you will learn how to apply deep learning. Intel is committed to helping you achieve your project goals.

References

  1. Faggella, D. Machine Learning Healthcare Applications – 2018 and Beyond. Techemergence.
  2. Artificial intelligence in healthcare - Wikipedia. (Accessed: 12th February 2018)
  3. Intel® Math Kernel Library for Deep Learning Networks: Part 1–Overview and Installation | Intel® Software. (Accessed: 14th February 2018)
  4. Lecun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature521, 436–444 (2015).
  5. Mamoshina, P., Vieira, A., Putin, E. & Zhavoronkov, A. Applications of Deep Learning in Biomedicine. Molecular Pharmaceutics13, 1445–1454 (2016).
  6. From $600 M to $6 Billion, Artificial Intelligence Systems Poised for Dramatic Market Expansion in Healthcare. (Accessed: 12th February 2018)
  7. Accenture. Artificial Intelligence in Healthcare | Accenture.
  8. Marr, B. How AI And Deep Learning Are Now Used To Diagnose Cancer. Foboes
  9. Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. Available at: . (Accessed: 12th February 2018)
  10. Lifelong brain-stimulating habits linked to lower Alzheimer’s protein levels | Berkeley News. (Accessed: 21st February 2018)
  11. Emphysema H and E.jpg - Wikimedia Commons (Accessed : 23rd February 2018). https://commons.wikimedia.org/wiki/File:Emphysema_H_and_E.jpg
  12. Superficie_ustioni.jpg (696×780). (Accessed: 23rd February 2018). https://upload.wikimedia.org/wikipedia/commons/1/1b/Superficie_ustioni.jpg
  13. Heart_frontally_PDA.jpg (1351×1593). (Accessed: 27th February 2018).  https://upload.wikimedia.org/wikipedia/commons/5/57/Heart_frontally_PDA.jpg
  14. Kaggle competition-Intel & MobileODT Cervical Cancer Screening. Intel & MobileODT Cervical Cancer Screening. Which cancer treatment will be most effective? (2017).
  15. Intel and MobileODT* Competition on Kaggle*. Faster Convolutional Neural Network Models Improve the Screening of Cervical Cancer. December 22 (2017).
  16. Kaggle*, I. and M. C. on. Deep Learning Improves Cervical Cancer Accuracy by 81%, using Intel Technology. December 22 (2017).
  17. Xu, M. et al. A deep convolutional neural network for classification of red blood cells in sickle cell anemia. PLoS Comput. Biol.13, 1–27 (2017).
  18. Gulshan, V. et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA316, 2402 (2016).
  19. Jäge, P. F. et al. Revealing hidden potentials of the q-space signal in breast cancer. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics)10433 LNCS, 664–671 (2017).
  20. Ali, A.-R. Deep Learning in Oncology – Applications in Fighting Cancer. September 14 (2017).
  21. Zhang, Q. et al. Sonoelastomics for Breast Tumor Classification: A Radiomics Approach with Clustering-Based Feature Selection on Sonoelastography. Ultrasound Med. Biol.43, 1058–1069 (2017).
  22. Danaee, P., Ghaeini, R. & Hendrix, D. A. A deep learning approach for cancer detection and relevant gene indentification. Pac. Symp. Biocomput.22, 219–229 (2017).
  23. Kaggle: Breast Cancer Diagnosis Wisconsin. Breast Cancer Wisconsin (Diagnostic) Data Set: Predict whether the cancer is benign or malignant.
  24. What is the Role of Natural Language Processing in Healthcare? (Accessed: 1st February 2018)
  25. VA uses EHRs, natural language processing to spot suicide risks. (Accessed: 1st February 2018)
  26. Predictive Analytics, NLP Flag Psychosis with 100% Accuracy. (Accessed: 1st February 2018)
  27. Heart_block.png (450×651). (Accessed: 23rd February 2018)
  28. Schirrmeister, R. T. et al. Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG Short title: Convolutional neural networks in EEG analysis. (2017).
  29. Schirrmeister, R. T. et al. Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp.38, 5391–5420 (2017).
  30. Pourbabaee, B., Roshtkhari, M. J. & Khorasani, K. Deep Convolutional Neural Networks and Learning ECG Features for Screening Paroxysmal Atrial Fibrillation Patients. IEEE Trans. Syst. Man, Cybern. Syst. 1–10 (2017). doi:10.1109/TSMC.2017.2705582
  31. Chambon, S., Galtier, M. N., Arnal, P. J., Wainrib, G. & Gramfort, A. A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series. arXiv:1707.0332v2 (2017).
  32. Cakir, E., Parascandolo, G., Heittola, T., Huttunen, H. & Virtanen, T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection. IEEE/ACM Trans. Audio, Speech, Lang. Process.25, 1291–1303 (2017).
  33. Project InnerEye – Medical Imaging AI to Empower Clinicians. Microsoft
  34. Khorrami, P., Le Paine, T., Brady, K., Dagli, C. & Huang, T. S. HOW DEEP NEURAL NETWORKS CAN IMPROVE EMOTION RECOGNITION ON VIDEO DATA.
  35. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model.55, 263–274 (2015).
  36. Dahl, G. E., Jaitly, N. & Salakhutdinov, R. Multi-task Neural Networks for QSAR Predictions. (University of Toronto, Canada. Retrieved from http://arxiv.org/abs/1406.1231, 2014).
  37. Aliper, A. et al. Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm.13, 2524–2530 (2016).
  38. Tavanaei, A., Anandanadarajah, N., Maida, A. & Loganantharaj, R. A Deep Learning Model for Predicting Tumor Suppressor Genes and Oncogenes from PDB Structure. bioRxiv  October 22, 1–10 (2017).
  39. Wallach, I., Dzamba, M. & Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. 1–11 (2015). doi:10.1007/s10618-010-0175-9
  40. Kontzer, T. Deep Learning Drops Error Rate for Breast Cancer Diagnoses by 85%. September 19 (2016).
  41. Litjens, G. et al. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep.6, (2016).
  42. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal.42, 60–88 (2017).

Intel® Parallel Computing Center at Carnegie Mellon University, Silicon Valley


Carnegie Mellon University

Principal Investigator

Dr. Ole J. MengshoelDr. Ole J. Mengshoel is a Principal Systems Scientist in the Department of Electrical and Computer Engineering at CMU Silicon Valley. His current research focuses on: scalable computing in artificial intelligence and machine learning; machine learning and inference in Bayesian networks; stochastic optimization; and applications of artificial intelligence and machine learning. Dr. Mengshoel holds a Ph.D. in Computer Science from the University of Illinois, Urbana-Champaign. His undergraduate degree is in Computer Science from the Norwegian Institute of Technology, Norway. Prior to joining CMU, he held research and leadership positions at SINTEF, Rockwell, and USRA/RIACS at the NASA Ames Research Center.

Description

Scalability of artificial Intelligence (AI) and machine learning (ML) algorithms, methods, and software has been an important research topic for a while. In ongoing and future work at CMU Silicon Valley, we take advantage of opportunities that have emerged due to recent dramatic improvements in parallel and distributed hardware and software. With the availability of Big Data, powerful computing platforms ranging from small (smart phones, wearable computers, IoT devices) to large (elastic clouds, data centers, supercomputers), as well as large and growing business on the Web, the importance and impact of scalability in AI and ML is only increasing. We will now discuss a few specific results and projects.

In the area of parallel and distributed algorithms, we have developed parallel algorithms and software for junction tree propagation, an algorithm that is a work-horse in commercial and open-source software for probabilistic graphical models.  On the distributed front, we are have developed and are developing MapReduce-based algorithms for speeding up learning of Bayesian networks from complete and incomplete data, and experimentally demonstrated their benefits using Apache Hadoop* and Apache Spark*.  Finally, we have an interest in matrix factorization (MF) for recommender systems on the Web, and have developed an incremental MF algorithm that can take advantage of Spark. Large-scale recommender systems, which are currently essential components of many Web sites, can benefit from this incremental method since it adapts more quickly to customer choices compared to traditional batch methods, while retaining high accuracy.

Caffe* is a deep learning framework - originally developed at the Berkeley Vision and Learning Center. Recently, Caffe2*, a successor to Caffe, has been officially released. Facebook has been the driving force in developing developing the open source Caffe2 framework. Caffe2 is a lightweight, modular, and scalable deep learning framework supported by several companies, including Intel. In our hands-on machine learning experience with Caffe2, we have found it to support rapid prototyping and experimentation, simple compilation, and better portability than earlier versions of Caffe.

We are experimenting with Intel’s PyLatte machine earning library, which is written in Python and is optimized for Intel CPUs. Goals of PyLatte includes ease of programming, high productivity, high performance, and leveraging the power of CPUs. A CMU SV project has focused on implementation of speech recognition and image classification models using PyLatte, using deep learning with neural networks. In speech recognition experiments, we have found PyLatte to be ease to use, with a flexible training step and short training time.

We look forward to continuing to develop parallel, distributed, and incremental algorithms for scalable intelligent models and systems as an Intel® Parallel Computing Center at CMU Silicon Valley. We create novel algorithms, models, and applications that utilize novel hardware and software computing platforms including multi- and many-core computers, cloud computing, MapReduce, Hadoop, and Spark.

Related websites:

http://sv.cmu.edu/directory/faculty-and-researchers-directory/faculty-and-researchers/mengshoel.html
https://users.ece.cmu.edu/~olem/omengshoel/Home.html
https://works.bepress.com/ole_mengshoel/

Create a Persistent Memory-Aware Queue Using the Persistent Memory Development Kit (PMDK)


Introduction

This article shows how to implement a persistent memory (PMEM)-aware queue using a linked list and the C++ bindings of the Persistent Memory Development Kit (PMDK) library libpmemobj.

A queue is a first in first out (FIFO) data structure that supports push and pop operations. In a push operation, a new element is added to the tail of the queue. In a pop operation, the element at the head of the queue gets removed.

A PMEM-aware queue differs from a normal queue in that its data structures reside permanently in persistent memory, and a program or machine crash could result in an incomplete queue entry and a corrupted queue. To avoid this, queue operations must be made transactional. This is not simple to do, but PMDK provides support for this and other operations specific to persistent memory programming.

We'll walk through a code sample that describes the core concepts and design considerations for creating a PMEM-aware queue using libpmemobj. You can build and run the code sample by following the instructions provided later in the article.

For background on persistent memory and the PMDK, read the article Introduction to Programming with Persistent Memory from Intel and watch the Persistent Memory Programming Video Series.

C++ Support in libpmemobj

The main features of the C++ bindings for libpmemobj include:

  • Transactions
  • Wrappers for basic types: automatically snapshots the data during a transaction
  • Persistent pointers

Transactions

Transactions are at the core of libpmemobj operations. This is because, in terms of persistence, the current x86-64 CPUs guarantee atomicity only for 8-byte stores. Real-world apps update in larger chunks. Take, for example, strings; it rarely makes sense to change only eight adjacent bytes from one consistent string state to another. To enable atomic updates to persistent memory in larger chunks, libpmemobj implements transactions.

Libpmemobj uses undo log-based transactions instead of redo log-based transactions for visibility reasons: changes made by the user are immediately visible. This allows for a more natural code structure and execution flow, which in turn improves code maintainability. It also means that, in the case of an interruption in the middle of a transaction, all of the changes made to the persistent state will be rolled back.

Transactions have ACID (atomicity, consistency, isolation, and durability)-like properties. Here's how these properties relate to programming with the PMDK:

Atomicity: Transactions are atomic with respect to persistence; all the changes made within a transaction are committed when the transaction completes successfully, or none of them are.

Consistency: The PMDK provides functionality to enable the user to maintain data consistency.

Isolation: The PMDK library provides persistent memory-resident synchronization mechanisms to enable the developer to maintain isolation.

Durability: All of a transaction's locks are held until the transaction completes to ensure durability.

Transactions are done on a per thread basis, so the call returns the status of the last transaction performed by the calling thread. Transactions are power-safe but not thread-safe.

The <p> property

In a transaction, undo logs are used to snapshot user data. The <p> template wrapper class is the basic building block for automating snapshotting of the user data so app developers don't need to do this step manually (as is the case with the C implementation of libpmemobj). This wrapper class supports only basic types. Its implementation is based on the assignment operator and each time the variable of this wrapper class is assigned a new value, the old value of the variable is snapshotted. Use of the <p> property for stack variables is discouraged because snapshotting is a computationally intensive operation.

Persistent pointers

Libraries in PMDK are built on the concept of memory mapped files. Since files can be mapped at different addresses of the process virtual address space, traditional pointers that store absolute addresses cannot be used. Instead, PMDK introduces a new pointer type that has two fields: an ID to the pool (used to access current pool virtual address from a translation table), and an offset from the beginning of the pool. Persistent pointers are a C++ wrapper around this basic C type. Its philosophy is similar to that of std::shared_ptr.

libpmemobj Core Concepts

Root object

Making any code PMEM-aware using libpmemobj always involves, as a first step, designing the types of data objects that will be persisted. The first type that needs to be defined is that of the root object. This object is mandatory and used to anchor all the other objects created in the persistent memory pool (think of a pool as a file inside a PMEM device).

Pool

A pool is a contiguous region of PMEM identified by a user-supplied identifier called layout. Multiple pools can be created with different layout strings.

Queue Implementation using C++ Bindings

The queue in this example is implemented as a singly linked list with a head and a tail, and demonstrates how to use the C++ bindings of libpmemobj.

Design Decisions

Data structures

The first thing we need is a data structure that describes a node in the queue. Each entry has a value and a link to the next node. As per the figure below, both variables are persistent memory-aware.

Data structure map
Figure 1. Data structure describing the queue implementation.

Code walkthrough

Now, let's go a little deeper into the main function of the program. When running the code, you need to provide at least two arguments: the absolute location of the pool file and the queue operation to perform (push also takes a value to insert). The supported operations are push (insert element), pop (return and remove element), and show (return element).

if (argc < 3) {
	std::cerr << "usage: "<< argv[0]
	<< " file-name [push [value]|pop|show]"<< std::endl;
	return 1;
}

In the snippet below, we check to see if the pool file exists. If it does, the pool is opened. If it doesn't exist, the pool is created. The layout string identifies the pool that we requested to open. Here we are opening the pool with layout name Queue as defined by the macro LAYOUT in the program.

const char *path = argv[1];
queue_op op = parse_queue_op(argv[2]);
pool<examples::pmem_queue> pop;

if (file_exists(path) != 0) {
	pop = pool<examples::pmem_queue>::create(
		path, LAYOUT, PMEMOBJ_MIN_POOL, CREATE_MODE_RW);
} else {
	pop = pool<examples::pmem_queue>::open(path, LAYOUT);
}

pop is the handle to the pool, from which we can access a pointer to the root object, which is an instance of examples::pmem_queue; the create() function creates a new pmemobj pool of type examples::pmem_queue. The root object is like the root of a file system, since it can be used to reach all of the other objects in the pool (as long as these objects are linked properly and no pointers are lost due to coding errors).

auto q = pop.get_root();

Once you get the pointer to the queue object, the program checks the second argument in order to identify what type of action the queue should perform; that is, push, pop, or show.

switch (op) {
	case QUEUE_PUSH:
		q->push(pop, atoll(argv[3]));
		break;
	case QUEUE_POP:
		std::cout << q->pop(pop) << std::endl;
		break;
	case QUEUE_SHOW:
		q->show();
		break;
	default:
		throw std::invalid_argument("invalid queue operation");
}

Queue operations

Push

Let's look at how the push function is implemented to make it persistent programming-aware. As shown in the code below, the transactional code is implemented as a lambda function wrapped in a C++ closure (this makes it easy to read and follow the code). If a power failure happens the data structure does not get corrupted because all changes are rolled back. For more information how transactions are implemented in C++, read C++ bindings for libpmemobj (part 6) - transactions on pmem.io.

Allocation functions are transactional as well, and they use transaction logic to enable rollback of allocations and deletions in the persistent state; make_persistent() allocates and constructs an object, while delete_persistent() destroys it and frees the memory.

Calling make_persistent() inside a transaction allocates an object and returns a persistent object pointer. As the allocation is now part of the transaction, if it aborts, the allocation is rolled back, reverting the memory allocation back to its original state.

After the allocation, the value of n is initialized to the new value in the queue, and the next pointer is set to null.

void push(pool_base &pop, uint64_t value) {
	transaction::exec_tx(pop, [&] {
		auto n = make_persistent<pmem_entry>();

		n->value = value;
		n->next = nullptr;

		if (head == nullptr && tail == nullptr) {
			head = tail = n;
		} else {
			tail->next = n;
			tail = n;
		}
	});
}

Data structure map for push functionality
Figure 2. Data structure for push functionality.

Pop

Similar to push, the pop function is shown below. Here we need a temporary variable to store a pointer to the next pmem_entry in the queue. This is needed in order to set the head of the queue to the next pmem_entry after deleting the head using delete_persistent(). Since this is done using a transaction, it is persistent-aware.

uint64_t pop(pool_base &pop){
	uint64_t ret = 0;
	transaction::exec_tx(pop, [&] {
		if (head == nullptr)
			transaction::abort(EINVAL);

		ret = head->value;
		auto n = head->next;

		delete_persistent<pmem_entry>(head);
		head = n;

		if (head == nullptr)
			tail = nullptr;
	});

	return ret;
}

Data structure map for pop functionality.
Figure 3. Data structure for pop functionality.

Build Instructions

Instructions to run the code sample

Download the source code from the PMDK GitHub* repository:

  1. git clone https://github.com/pmem/pmdk.git

    command window with GitHub command
    Figure 4. Download source code from the GitHub* repository.

  2. cd pmdk and run make on the command line as shown below. This builds the complete source code tree.

    command window with code
    Figure 5. Building the source code.

  3. cd pmdk/src/examples/libpmemobj++/queue
  4. View command line options for the queue program:
    ./queue
  5. Push command:
    ./queue TESTFILE push 8

    Command window with code
    Figure 6. PUSH command using command line.

  6. Pop command:
    ./queue TESTFILE pop
  7. Show command:
    ./queue TESTFILE show

    Command window with code
    Figure 7. POP command using command line.

Summary

In this article, we showed a simple implementation of a PMEM-aware queue using the C++ bindings of the PMDK library libpmemobj. To learn more about persistent memory programming with PMDK, visit the Intel® Developer Zone (Intel® DZ) Persistent Memory Programming site. There you will find articles, videos, and links to other important resources for PMEM developers.

About the Author

Praveen Kundurthy is a Developer Evangelist with over 14 years of experience in application development, optimization, and porting to Intel platforms. Over the past few years at Intel, he has worked on topics spanning storage technologies, gaming, virtual reality, and Android* on Intel platforms.

Intel® Computer Vision SDK: Getting Started with the Intel® Computer Vision SDK (Intel® CV SDK)


Using Caffe* with the Intel® Deep Learning Model Optimizer

Deep Learning Model Optimizer for Caffe* requires the Caffe framework to be installed on the client machine with all relevant dependencies. Caffe should be dynamically compiled and linked. A shared library named libcaffe.so should be available in the CAFFE_HOME/build/lib directory.

For ease of reference, the Caffe* installation folder is referred to as <CAFFE_HOME> and the Model Optimizer installation folder is referred to as <MO_DIR>.

The installation path to the Model Optimizer depends on whether you use the Intel® CV SDK or Deep Learning Deployment Toolkit. For example, if you are installing with sudo, the default <MO_DIR> directory is:

  • /opt/intel/deeplearning_deploymenttoolkit_<version>/deployment_tools/model_optimizer - if you installed the Deep Learning Deployment Toolkit
  • /opt/intel/computer_vision_sdk_<version>/mo - if you installed the Intel® CV SDK

Installing Caffe

To install Caffe, complete the following steps:

  1. For convenience, set the following environment variables:
      export MO_DIR=<PATH_TO_MO_INSTALL_DIR>
      export CAFFE_HOME=<PATH_TO_YOUR_CAFFE_DIR>
  2. Go to the Model Optimizer folder:
    cd $MO_DIR/model_optimizer_caffe/
  3. To simplify the installation procedure, two additional scripts are provided in the $MO_DIR/model_optimizer_caffe/install_prerequisites folder:
    • install_Caffe_dependencies.sh - Installs the required dependencies like Git*, CMake*, GCC*, etc.
    • clone_patch_build_Caffe.sh - Installs the Caffe* distribution on your machine and patches it with the required adapters from the Model Optimizer.
  4. Go to the helper scripts folder and install all the required dependencies:
        cd install_prerequisites/
        ./install_Caffe_dependencies.sh 
  5. Install the Caffe* distribution. By default, the script installs BVLC Caffe* from the master branch of the official repository. If you want to install a different version of Caffe*, edit the following lines in the clone_patch_build_Caffe.sh script:
        CAFFE_REPO=https://github.com/BVLC/caffe.git # link to the repository with Caffe* distribution
        CAFFE_BRANCH=master # branch to be taken from the repository
        CAFFE_FOLDER=`pwd`/caffe # where to clone the repository on your local machine
        CAFFE_BUILD_SUBFOLDER=build # name of the folder required for building Caffe* 
    To launch installation, just run the following command:
        ./clone_patch_build_Caffe.sh 

NOTE: If you encounter a problem with the hdf5 library while building Caffe on Ubuntu* 16.04, see the following fix.

Once you have configured the Caffe* framework on your machine, you need to configure the Model Optimizer for Caffe* to work with it. For details, refer to the Configure Model Optimizer for Caffe* page.


Boost Visuals with Particle Parameters in Unreal Engine* 4


Particle parameters are a powerful system built into the Unreal Engine* that allows the customization of particle systems outside of Unreal Engine 4's Cascade particle editor. This tutorial creates such a system and demonstrates how you can use it to boost visual fidelity.

Why use particle parameters?

Particle parameters are essential to any game that seeks to leverage particle systems to their maximum potential. The goal is to make particle systems respond dynamically to the world around them.

Overview

In this tutorial, we use particle parameters in conjunction with CPU particles to change the lighting of a scene based on a gameplay element, in this case the fuel left in a fire (see Figure 1). As the amount of fuel in the fire decreases, so does the visual effect created by the particle system and the lighting created by the fire particles in that system. Once the fuel is completely gone, we refill it until it returns to its starting level. This creates a nice loop that demonstrates the entire range of the particle effect.

digital campfire

digital campfire

Figure 1. Campfire with particle parameters.

Adding Parameters to P_Fire

Particle parameters interface
Figure 2. Particle parameters.

To make this particle effect, we modify the P_Fire particle system included in the Unreal Engine starter content. In Figure 2, modules that we modify are highlighted in purple, and modules we add are highlighted in orange.

Modifying light

Lighting is one of the major benefits of using CPU particles and will form the core of this effect.

Settings interface
Figure 3. First flame emitter.

Select parameter distribution on the distribution drop-down menu

In the details panel of the first flame emitter in the P_Fire particle system, select Distribution Float Particle Parameter from the Brightness Over Life Distribution drop-down menu as shown at the top of Figure 3. This allows us to tie the amount of light emitted to a variable, in this case, the amount of fuel left in the fire.

Set name

The next step is to specify which particle parameter this distribution will be tied to. We'll use the name "FuelLeft". Enter this in the Parameter Name field, as shown in Figure 3.

Set mapping mode

A powerful feature of particle parameters is input mapping. This feature allows us to specify the max and min input that we will accept and to scale those values to a given range in order to make a single input parameter function seamlessly for many different modules. This capability allows us to make different parts of the particle effects scale down at different points. Effects like the sparks and embers will only start to change once the fire starts burning low, and we will set their input range to reflect that. We'll use DPM Normal for all the distributions in this tutorial as we want to both clamp the input and scale it to a particular range. This is selected under the Param Mode drop-down menu shown in Figure 3.

Set input range

Next we specify the min and max input. For this effect, we'll use 0.0 for the min and 1.0 for the max, as shown in Figure 4. This means the light from this part of the fire will scale from 0 percent fuel (fully dark) to 100 percent fuel (a nice campfire glow).

Settings interface
Figure 4. Setting input range.

Set output range

The output range lets us specify the minimum and maximum brightness for this part of the fire. Set these to the values shown in Figure 5.

Settings interface
Figure 5. Setting output range.

Set default input value

Now we need to set a default input value in case the effect is not given a value. This is done with Constant (see Figure 6). For this particle system, we'll set the default at 1.0, or a full flame.

Settings interface
Figure 6. Setting the default value.

Setting up the rest of the modules

Second emitter light

To ensure the light emitted by the fire is consistent with the particles in the particle system, we modify the light module on the second emitter as well. Change the Brightness Over Life section on the light module on the second emitter to match the values shown in Figure 7. If we didn't scale this light source as well, the fire would still emit a full glow when it is just embers.

Settings interface
Figure 7. Second emitter light.

First and second emitter scale

Presently, the amount of light that our fire produces will change with fuel, but the size of the flames will not. To change this, we add a Size Scale module to both the first and second emitters, as shown in Figure 2. This distribution will be a Vector Particle Parameter instead of a Float Particle Parameter. Since we are giving it the same parameter name as the Float Particle Parameter, Cascade copies the float value across all three fields of our vector. For both modules, we want the graphics to scale in size from 0 percent to 100 percent fuel, so the only fields we need to change are Parameter Name and Constant. Set both modules to match the values shown in Figure 8.

Settings interface
Figure 8. Emitter scale.

Smoke spawn rate

Smaller fires produce less smoke, and we can modify our particle system to reflect that. To do this, we set up a particle parameter on the rate section of the spawn module on the smoke emitter. However, unlike the previous particle parameters we set up, we only want to start scaling down the smoke spawned when we reach 40 percent fuel and below. To do this, set the Max Input to 0.4 instead of 1. Set Distribution to match the values shown in Figure 9.

Settings interface
Figure 9. Smoke spawn rate.

Embers spawn rate

Embers also scale with the size of the fire, but don't start scaling down until our fire gets really small. We'll start scaling down embers at 50 percent (0.5) for this effect. Set the Spawn Rate Distribution on the Embers emitter to match the values shown in Figure 10.

Settings interface
Figure 10. Embers spawn rate.

Distortion spawn rate

The distortion caused by the flames needs to be scaled in the same way that the flames are scaled. Since we scaled the flames from 0 percent to 100 percent fuel, we need to do the same with the distortion. Set the Spawn Rate Distribution on the Distortion emitter to match the values shown in Figure 11.

Settings interface
Figure 11. Distortion spawn rate.

Set up a blueprint

Now that our fire effect can be scaled with the amount of fuel, we need to set up a blueprint to set the amount of fuel. In this tutorial, the amount of fuel slowly depletes, and then fills back up again to demonstrate the effect. To create a blueprint for this effect, drag the particle system into the scene, and then click Blueprint/Add Script in the details panel.

Setting up the variables

For this effect we will need just two variables, as shown in Figure 12 below:

FuelLeft: A float that keeps track of how much fuel is in our fire, ranging from 1 for 100 percent fuel to 0 for 0 percent fuel. The default is set to 1, so the fire starts at full flame.

FuelingRate: A float that dictates how quickly we deplete or refill fuel. For this tutorial, we'll set the default value to -0.1 (-10 percent per second).

When both variables have been created, the variable section of the blueprint should match that of Figure 12.

Settings interface
Figure 12. Fire variables.

Changing fuel left

For this effect, we need to change the amount of fuel left every tick and apply it to the particle system. To do this, we multiply Fueling Rate by Delta Seconds, add the result to Fuel Left, and store the new value back in Fuel Left.

To apply Fuel Left to the particle system, we use the Set Float Parameter node. For the target, we use our modified P_Fire particle system component, and for Param we use Fuel Left. The parameter name needs to be the name we used in our particle system, which in this tutorial is FuelLeft.

Settings interface
Figure 13. Modifying fuel left.

Bounding fuel left

Eventually our fire will run out of fuel. In this tutorial, we want to switch to fueling the fire instead of depleting it at that point. To do this, we continue to work on the tick and check whether our new fuel value is too low (less than or equal to -0.1) or too high (greater than or equal to 1.0). The reason we set the low bounds to -0.1 is so that the fire will stay depleted for a bit before refueling. This doesn't cause any problems because any values passed to our particle system below 0 are treated as 0 due to the min input we set up.

If we find that Fuel Left is out of bounds, we multiply the Fueling Rate variable by -1. If Fuel Left was decreasing, it will start increasing on subsequent ticks, and vice versa.

Settings interface
Figure 14. Bounding fuel left.
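
If you prefer working in C++ rather than Blueprint, a rough equivalent of the tick logic above (updating, applying, and bounding the fuel) might look like the sketch below. The actor class AFireActor and its FireParticles, FuelLeft, and FuelingRate members are illustrative names, not part of the tutorial content:

#include "Particles/ParticleSystemComponent.h"

void AFireActor::Tick(float DeltaSeconds)
{
	Super::Tick(DeltaSeconds);

	// Deplete (or refill) the fuel at FuelingRate per second.
	FuelLeft += FuelingRate * DeltaSeconds;

	// Feed the value to the particle system; the parameter name must match
	// the one used in Cascade ("FuelLeft").
	FireParticles->SetFloatParameter(TEXT("FuelLeft"), FuelLeft);

	// Flip between depleting and refueling when we hit the bounds (-0.1 and 1.0).
	if (FuelLeft <= -0.1f || FuelLeft >= 1.0f)
	{
		FuelingRate *= -1.0f;
	}
}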

Intel® Parallel Computing Center at Brigham & Women’s Hospital


Principal Investigator

""Dr. Patsopoulos the past few years has been leading the genetics of multiple sclerosis. He has analyzed the raw genetic data of more than 100,000 individuals, modelling TBs of data to unravel the genetic architecture of multiple sclerosis. He has been applying and developing advanced statistical models to enable analysis of large-scale data sets with millions of genetic positions and analyzed subjects.  

Description

Leveraging our hands-on experience with large-scale genetic data sets, and the exhaustive number of analyses one can perform on them, we have designed a framework for fine-mapping. Modern genetics studies involve millions of analyzed positions in the genome, most of which are linked together. Fine-mapping is the application of algorithms to identify statistically independent positions in the genome that contribute to disease susceptibility. We have developed Effect Fine Mapping (EFM), a framework that not only identifies independent positions but also quantifies the probability that each linked genetic variant is the one truly associated with the disease. This empowers the translational study of genetic associations by providing highly accurate lists of disease-associated genetic variants. EFM can analyze millions of genetic variants and millions of subjects on any production machine, and is optimized for multi-threaded CPUs such as Intel® Xeon Phi™ processors.

Publications

List of peer-reviewed publications: http://patslab.bwh.harvard.edu/publications-full-list

Preprints: https://www.biorxiv.org/search/author1%3Apatsopoulos%20numresults%3A10%20sort%3Arelevance-rank%20format_result%3Astandard

Related Websites

Laboratory: http://patslab.bwh.harvard.edu

Software support by IPCC: http://patslab.bwh.harvard.edu/efm

Software supported by IPCC (git): https://bitbucket.org/patslab/efm

Intel® Graphics Performance Analyzers (Intel® GPA) 2018 R1 Release Notes


Thank you for choosing the Intel® Graphics Performance Analyzers (Intel® GPA), available as a standalone product and as part of Intel® System Studio.

Contents

Introduction
What's New
System Requirements and Supported Platforms
Installation Notes
Technical Support and Troubleshooting
Known Issues and Limitations
Legal Information

Introduction

Intel® GPA provides tools for graphics analysis and optimization that help make games and other graphics-intensive applications run even faster. The tools support platforms based on the latest generations of Intel® Core™ and Intel® Atom™ processor families, for applications developed for Windows*, Android*, Ubuntu*, or macOS*.

Intel® GPA provides a common and integrated user interface for collecting performance data. Using it, you can quickly see performance opportunities in your application, saving time and getting products to market faster.

For detailed information and assistance in using the product, refer to the following online resources:

  • Home Page - view detailed information about the tool, including links to training and support resources, as well as videos on the product to help you get started quickly.
  • Getting Started - get the main features overview and learn how to start using the tools on different host systems.
  • Training and Documentation - learn at your level with Getting Started guides, videos and tutorials.
  • Online Help for Windows* Host - get details on how to analyze Windows* and Android* applications from a Windows* system.
  • Online Help for macOS* Host - get details on how to analyze Android* or macOS* applications from a macOS* system.
  • Online Help for Ubuntu* Host - get details on how to analyze Android* or Ubuntu* applications from an Ubuntu* system.
  • Support Forum - report issues and get help with using Intel® GPA.

What's New

Intel® GPA 2018 R1 offers the following new features:

New Features for Analyzing All Graphics APIs

Graphics Frame Analyzer

  • API Log pane now contains a new Frame Statistic tab, and separate tabs for Resource History and Pixel History. The Resource History tab enables you to select a target resource, and in the Pixel History tab you can select pixel coordinates. 
  • API Log and Metrics can be exported now.
  • Input/Output Geometry viewer now provides additional information about the topology, primitive count, and bounding box.
  • Frame Overview pane shows full-frame FPS along with a GPU duration time.
  • Information about systems where a frame is captured and replayed is shown.

New Features for Analyzing Microsoft DirectX* Applications

Graphics Monitor

  • New User Interface is now available on Windows*
  • Remote profiling of DirectX* 9 or DirectX*10 frames is discontinued.

Graphics Frame Analyzer

  • New User Interface for DirectX* 11 frames. The following Legacy User Interface features are transferred to the new interface:
    • Render Target overdraw view
    • Shader replacement experiment allowing the user to import the HLSL shader code and view performance impacts on the entire frame
  • Default layout of D3D Buffers is now based on a specific buffer usage in a frame.
  • Samples count is shown as a parameter for 2D Multisample Textures or 2D Multisample Texture Arrays.
  • API Call arguments including structures, arrays and enums are correctly shown for DirectX11 frames.
  • API Log contains calls from the D3D11DeviceContext interface only.
  • List of bound shader resources (input elements, SRVs, UAVs, CBVs, Sampler, RTVs, DSV) is shown along with a shader code.
  • Target GPU adapter can be selected on multi-GPU machines for DirectX* 11 and DirectX* 12 frames.
  • Intel Gen Graphics Intermediate Shader Assembly (ISA) code is added for DirectX* 11 frames.
  • Input-Assembly layout is shown for DirectX*11 and DirectX*12 frames in the Geometry viewer.

New Features for Analyzing macOS Metal* Applications

Multi-Frame Analyzer

  • Ability to export the Metal source or LLVM disassembly codes for a selected shader.
  • Shader replacement experiment allowing the user to import a modified shader and view the performance impacts on the entire frame.

Many defect fixes and stability improvements

Known Issues

  • Full Intel GPA metrics are not supported on macOS* 10.13.4 for Skylake-based and Kaby Lake-based Mac Pro systems.  For full metric support, please do not upgrade to macOS* 10.13.4.
  • Metrics in the System Analyzer's system view are inaccurate for Intel® Graphics Driver for Windows* Version 15.65.4.4944. You can use Intel® Graphics Driver for Windows* Version 15.60.2.4901 instead.

System Requirements and Supported Platforms

The minimum system requirements are: 

  • Host Processor: Intel® Core™ Processor
  • Target Processor: See the list of supported Windows* and Android* devices below
  • System Memory: 8GB RAM
  • Video Memory: 512MB RAM
  • Minimum display resolution for client system: 1280x1024
  • Disk Space: 300MB for minimal product installation

Direct installation of Intel® GPA on 32-bit Windows* systems is not supported. However, if you need to analyze an application on a 32-bit Windows* target system, you can use the following workaround:

  1. Copy the 32-bit *.msi installer distributed with the 64-bit installation from your analysis system to the target system.
  2. Run the installer on the target system to install System Analyzer and Graphics Monitor.
  3. Start the Graphics Monitor and the target application on the 32-bit system and connect to it from the 64-bit host system.

For details, see the Running System Analyzer on a Windows* 32-bit System article.

The table below shows platforms and applications supported by Intel® GPA 2018 R1. Each entry lists the target system (the system where your game runs), the host system (your development system where you run the analysis), and the target applications (types of supported applications running on the target system).

  • Target: Windows* 7 SP1/8/8.1/10 | Host: Windows* 7 SP1/8/8.1/10 | Applications: Microsoft* DirectX* 9/9Ex, 10.0/10.1, 11.0/11.1/11.2/11.3
  • Target: Windows* 10 | Host: Windows* 10 | Applications: Microsoft* DirectX* 12, 12.1
  • Target: Google* Android* 4.1, 4.2, 4.3, 4.4, 5.x, 6.0 (the specific version depends on the officially released OS for commercial Android* phones and tablets; see the list of supported devices below) | Host: Windows* 7 SP1/8/8.1/10, macOS* 10.11 or 10.12, or Ubuntu* 16.04 | Applications: OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, 3.2
  • Target: Ubuntu* 16.04 | Host: Ubuntu* 16.04 | Applications: OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile
  • Target: macOS* 10.12 and 10.13 | Host: macOS* 10.12 and 10.13 | Applications: OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile, and Metal* 1 and 2

NOTE: Graphics Frame Analyzer does not currently support GPU metrics for the Intel® processor code-named Clover Trail+.

Intel® GPA does not support the following Windows* configurations: All server editions, Windows* 8 RT, or Windows* 7 starter kit.

Supported Windows* Graphics Devices

Intel® GPA supports the following graphics devices as targets for analyzing Windows* workloads. All these targets have enhanced metric support:

Target | Processor

  • Intel® UHD Graphics 630 | 8th generation Intel® Core™ processor
  • Intel® UHD Graphics 630 | 7th generation Intel® Core™ processor
  • Intel® UHD Graphics 620 | 7th generation Intel® Core™ processor
  • Intel® HD Graphics 620 | 7th generation Intel® Core™ processor
  • Intel® HD Graphics 615 | 7th generation Intel® Core™ m processor
  • Intel® HD Graphics 530 | 6th generation Intel® Core™ processor
  • Intel® HD Graphics 515 | 6th generation Intel® Core™ m processor
  • Iris® graphics 6100 | 5th generation Intel® Core™ processor
  • Intel® HD Graphics 5500 and 6000 | 5th generation Intel® Core™ processor
  • Intel® HD Graphics 5300 | 5th generation Intel® Core™ m processor family
  • Iris® Pro graphics 5200 | 4th generation Intel® Core™ processor
  • Iris® graphics 5100 | 4th generation Intel® Core™ processor
  • Intel® HD Graphics 4200, 4400, 4600, and 5000 | 4th generation Intel® Core™ processor
  • Intel® HD Graphics 2500 and 4000 | 3rd generation Intel® Core™ processor
  • Intel® HD Graphics | Intel® Celeron® processor N3000, N3050, and N3150; Intel® Pentium® processor N3700

Although the tools may appear to work with other graphics devices, these devices are unsupported. Some features and metrics may not be available on unsupported platforms. If you run into an issue when using the tools with any supported configuration, please report this issue through the Support Forum.

Driver Requirements for Intel® HD Graphics

When running Intel® GPA on platforms with supported Intel® HD Graphics, the tools require the latest graphics drivers for proper operation. You may download and install the latest graphics drivers from http://downloadcenter.intel.com/.

Intel® GPA inspects your current driver version and notifies you if your driver is out-of-date.

Supported Devices Based on Intel® Atom™ Processor

Intel® GPA supports the following devices based on Intel® Atom™ processor:

Processor Model | GPU | Android* Version | Supported Tools

  • Intel® Atom™ Z35XX | Imagination Technologies* PowerVR G6430 | Android* 4.4 (KitKat), Android* 5.x (Lollipop) | System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]
  • Intel® Atom™ Z36XXX/Z37XXX | Intel® HD Graphics | Android* 4.2.2 (Jelly Bean MR1), Android* 4.4 (KitKat), Android* 5.x (Lollipop) | System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]
  • Intel® Atom™ Z25XX | Imagination Technologies* PowerVR SGX544MP2 | Android* 4.2.2 (Jelly Bean MR1), Android* 4.4 (KitKat) | System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]
  • Intel® Atom™ x7-Z8700, x5-Z8500, and x5-Z8300 | Intel® HD Graphics | Android* 5.x (Lollipop), Android* 6.0 (Marshmallow) | System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]

Supported ARM*-Based Devices

The following devices are supported with Intel® GPA:

Model | GPU | Android* Version

  • Samsung* Galaxy S5 | Qualcomm* Adreno 330 | Android* 5.0
  • Samsung* Galaxy Nexus (GT-i9500) | Imagination Technologies* PowerVR SGX544 | Android* 4.4
  • Samsung* Galaxy S4 Mini (GT-I9190) | Qualcomm* Adreno 305 | Android* 4.4
  • Samsung* Galaxy S III (GT-i9300) | ARM* Mali 400MP | Android* 4.3
  • Google* Nexus 5 | Qualcomm* Adreno 330 | Android* 5.1
  • Nvidia* Shield tablet | NVIDIA* Tegra* K1 processor | Android* 5.1

Your system configuration should satisfy the following requirements:

  • Your ARM*-based device is running Android* 4.1, 4.2, 4.3, 4.4, 5.0, 5.1, or 6.0
  • Your Android* application uses OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, or 3.2
  • Regardless of your ARM* system type, your application must be 32-bit

For support level details for ARM*-based devices, see this article.

Installation Notes

Installing Intel® GPA 

Download the Intel® GPA installer from the Intel® GPA Home Page.

Installing Intel® GPA on Windows* Target and Host Systems

To install the tools on Windows*, download the *.msi package from the Intel® GPA Home Page and run the installer file.

The following prerequisites should be installed before you run the installer:

  • Microsoft DirectX* Runtime June 2010
  • Microsoft .NET 4.0 (via redirection to an external web site for download and installation)

If you use the product in a host/target configuration, install Intel® GPA on both systems. For more information on the host/target configuration, refer to Best Practices.

For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on Ubuntu* Host System

To install Intel® GPA on Ubuntu*, download the .tar package, extract the files, and run the .deb installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device since the tools automatically install the necessary files on the target device when you run System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on macOS* Host System

To install the tools on macOS*, download the .zip package, unzip the files, and run the .pkg installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device because the tools automatically install the necessary files on the target device when you run the System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Technical Support and Troubleshooting

For technical support, including answers to questions not addressed in the installed product, visit the Support Forum.

Troubleshooting Android* Connection Problems

If the target device does not appear when the adb devices command is executed on the client system, do the following:

  1. Disconnect the device
  2. Execute $ adb kill-server
  3. Reconnect the device
  4. Run $ adb devices

If these steps do not work, try restarting the system and running $ adb devices again. Consult the product documentation for your device to see if a custom USB driver needs to be installed.

Known Issues and Limitations

General

  • Your system must be connected to the internet while you are installing Intel® GPA.
  • Selecting all ergs might cause significant memory usage in Graphics Frame Analyzer.
  • Intel® GPA uses sophisticated techniques for analyzing graphics performance which may conflict with third-party performance analyzers. Therefore, ensure that other performance analyzers are disabled prior to running any of these tools. For third-party graphics, consult the vendor's website.
  • Intel® GPA does not support use of Remote Desktop Connection.
  • Graphics Frame Analyzer (DirectX* 9,10,11) runs best on systems with a minimum of 4GB of physical memory. Additionally, consider running the Graphics Frame Analyzer (DirectX* 9,10,11) in a networked configuration (the server is your target graphics device, and the client running the Graphics Frame Analyzer is a 64-bit OS with at least 8GB of memory).
  • On 64-bit operating systems with less than 8GB of memory, warning messages, parse errors, very long load times, or other issues may occur when loading a large or complex frame capture file.

Analyzing Android* Workloads

  • Graphics Frame Analyzer does not currently support viewing every available OpenGL/OpenGL ES* texture format.
  • Intel® GPA provides limited support for analyzing browser workloads on Android*. You can view metrics in the System Analyzer, but the tools do not support creating or viewing frame capture files or trace capture files for browser workloads. Attempting to create or view these files may result in incorrect results or program crashes.
  • Intel® GPA may fail to analyze OpenGL* multi-context games.

Analyzing Windows* Workloads

  • The Texture 2x2 experiment might work incorrectly for some DirectX* 12 workloads.
  • Intel® GPA may show offsets used in DirectX* 12 API call parameters in scientific format.
  • Render Target visualization experiments “Highlight” and “Hide” are applied to all Draw calls in a frame. As a result, some objects may disappear and/or be highlighted incorrectly.
  • Frame Analyzer may crash if the ScissorRect experiment is deselected. The application will go back to Frame File open view.
  • Downgrade from 17.2 to 17.1 might not be successful.
  • The Overdraw experiment for Render Targets with 16-bit and 32-bit Alpha channel is not supported now.
  • To view Render Targets with 16-bit and 32-bit Alpha channel, you should disable Alpha channel in the Render Targets viewer.
  • To ensure accurate measurements on platforms based on Intel® HD Graphics, profile your application in the full-screen mode. If windowed mode is required, make sure only your application is running. Intel® GPA does not support profiling multiple applications simultaneously.
  • For best results when analyzing frame or trace capture files on the same system where you run your game, follow these steps:
    • Run your game and capture a frame or trace file.
    • Shut down your game and other non-essential applications.
    • Launch the Intel® GPA.
  • To run Intel® GPA on hybrid graphics solutions (a combination of Intel® Processor Graphics and third-party discrete graphics), you must first disable one of the graphics solutions.
  • Secure Boot, also known as Trusted Boot, is a security feature in Windows* 8 enabled in BIOS settings which can cause unpredictable behavior when the "Auto-detect launched applications" option is enabled in Graphics Monitor Preferences. Disable Secure Boot in the BIOS to use the auto-detection feature for analyzing application performance with Intel® GPA. The current version of the tools can now detect Secure Boot, and warns you of this situation.
  • To view the full metric set with the tools for Intel® Processor Graphics on systems with one or more third-party graphics device(s) and platforms based on Intel® HD Graphics, ensure that Intel is the preferred graphics processor. You can set this in the Control Panel application for the third-party hardware. Applications running under Graphics Monitor and a third-party device show GPU metrics on DirectX* 9 as initialized to 0 and on DirectX* 10/11 as unavailable.
  • When using the Intel® GPA, disable the screen saver and power management features on the target system running the Graphics Monitor — the Screen Saver interferes with the quality of the metrics data being collected. In addition, if the target system is locked (which may happen when a Screen Saver starts), the connection from the host system to the target system will be terminated.
  • Intel® GPA does not support frame capture or analysis for:
    • applications that execute on the Debug D3D runtime system
    • applications that use the Reference D3D Device
  • System Analyzer HUD may not operate properly when applications use copy-protection, anti-debugging mechanisms, or non-standard encrypted launching schemes.
  • Intel® GPA provides analysis functionality by inserting itself between your application and Microsoft DirectX*. Therefore, the tools may not work correctly with certain applications which themselves hook or intercept DirectX* APIs or interfaces.
  • Intel® GPA does not support Universal Windows Platform applications where the graphics API uses compositing techniques such as HTML5 or XAML interop. Only traditional DirectX* rendering is supported. To work around this limitation, port your application as a Desktop application, and then use the full Intel® GPA suite of tools.
  • In some cases, the Overview tab in Graphics Frame Analyzer (DirectX* 9,10,11) can present GPU Duration values higher than Frame Duration values measured during game run time. This could be a result of Graphics Frame Analyzer (DirectX* 9,10,11) playing the captured frame back in off-screen mode which can be slower than on-screen rendering done in the game.

    To make playback run on-screen, use this registry setting on the target system: HKEY_CURRENT_USER\Software\Intel\GPA\16.4\ForceOnScreenPlaybackForRemoteFA = 1 and connect to the target with Graphics Frame Analyzer (DirectX* 9,10,11) running on a separate host. If these requirements are met, the playback runs in on-screen mode on the target. If the frame was captured from the full-screen game, but playback renders it in a windowed mode, try pressing Alt+Enter on the target to switch playback to full-screen mode.

  • Frame capture using Graphics Monitor runs best on 64-bit operating systems with a minimum of 4GB of physical memory.
    On 32-bit operating systems (or 64-bit operating systems with <4GB of memory), out of memory or capture failed messages can occur.
  • Scenes that re-create resource views during multi-threaded rendering have limited support in the current Intel® GPA version, and might have issues with frame replays in Graphics Frame Analyzer.

*Other names and brands may be claimed as the property of others.

** Disclaimer: Intel disclaims all liability regarding rooting of devices. Users should consult the applicable laws and regulations and proceed with caution. Rooting may or may not void any warranty applicable to your devices.


Intel Software Engineers Assist with Unreal* Engine 4.19 Optimizations


Unreal Engine Logo

The release of Epic’s Unreal* Engine 4.19 marks a new chapter in optimizing for Intel technology, particularly in the case of optimizing for multicore CPUs. In the past, game engines traditionally followed console design points, in terms of graphics features and performance. In general, most games weren’t optimized for the CPU, which can leave a lot of PC performance sitting idle. Intel’s work with Unreal Engine 4 seeks to unlock the potential of games as soon as developers work in the engine, to fully take advantage of all the extra CPU computing power that a PC platform provides.

Intel's enabling work for Unreal Engine version 4.19 delivered the following:

  • Increased the number of worker threads to match a user’s CPU
  • Increased the throughput of the cloth physics system
  • Integrated support for Intel® VTune™ Amplifier

Each of these advances enables Unreal Engine users to take full advantage of Intel® Architecture and harness the power of multicore systems. Systems such as cloth physics, dynamic fracturing, and CPU particles, along with enhanced interoperability with Intel tools such as Intel VTune Amplifier and the Intel C++ Compiler, will all benefit. This white paper discusses the key improvements in detail and gives developers more reasons to consider the Unreal Engine for their next PC title.

Unreal* Engine History

Back in 1991, Tim Sweeney founded Epic MegaGames (later dropping the “Mega”) while still a student at the University of Maryland. His first release was ZZT*, a shareware puzzle game. He wrote the game in Turbo Pascal using an object-oriented model, and one of the happy results was that users could actually modify the game’s code. Level editors were already common, but this was a big advance.

In the years that followed, Epic released popular games such as Epic Pinball*, Jill of the Jungle*, and Jazz Jackrabbit*. In 1995, Sweeney began work on a first-person shooter to capitalize on the success of games such as DOOM*, Wolfenstein*, Quake*, and Duke Nukem*. In 1998, Epic released Unreal*, probably the best-looking shooter of its time, offering more detailed graphics and capturing industry attention. Soon, other developers were calling and asking about licensing the Unreal Engine (UE) for their own games.

In an article for IGN in 2010, Sweeney recalled that the team was thrilled by the inquiries, and said their early collaboration with those partners defined the style of their engine business from day one. They continue to use, he explained, “a community-driven approach, and open and direct communication between licensees and our engine team.” By focusing on creating cohesive tools and smoothing out technical hurdles, their goal was always to unleash the creativity of the gaming community. They also provided extensive documentation and support, something early engines often lacked.

Today, the UE powers most of the top revenue-producing titles in the games industry. In an interview with VentureBeat in March 2017, Sweeney said developers have made more than USD 10 billion to date with Unreal games. “We think that Unreal Engine’s market share is double the nearest competitor in revenues,” Sweeney said. “This is despite the fact that Unity* has more users. This is by virtue of the fact that Unreal is focused on the high end. More games in the top 100 on Steam* in revenue are Unreal, more than any other licensable engine competitor combined.”

Intel Collaboration Makes Unreal Engine Better

Game developers who currently license the UE can easily take advantage of the optimizations described here. The work will help them grow market share for their games by broadening the range of available platforms, from laptops and tablets with integrated graphics to high-end desktops with discrete graphics cards. The optimizations will benefit end users on most PC-based systems by ensuring that platforms can deliver high-end effects such as dynamic cloth and interactive physics. In addition, optimized Intel tools will continue to make Intel Architecture a preferred platform of choice.

According to Jeff Rous, Intel Developer Relations Engineer, the teams at Intel and Epic Games have collaborated since the late 1990s. Rous has personally worked on UE optimization for about six years, involving extensive collaboration and vibrant communication with Epic engineers over email and conference calls, as well as visits to Epic headquarters in North Carolina two or three times a year for week-long deep dives. He has worked on specific titles, such as Epic’s own Fortnite* Battle Royale, as well as UE code optimization.

Prior to the current effort, Intel worked closely with Unreal on previous UE4 releases. There is a series of optimization tutorials at the Intel® Developer Zone, starting with the Unreal* Engine 4 Optimization Tutorial, Part 1. The tutorials cover the tools developers can use inside and outside of the engine, as well as some best practices for the editor, and scripting to help increase the frame rate and stability of a project.

Intel® C++ Compiler Enhancements

For UE 4.12, Intel added support for the Intel C++ Compiler into the public engine release. Intel C++ Compilers are standards-based C and C++ tools that speed application performance. They offer seamless compatibility with other popular compilers, development environments, and operating systems, and they boost application performance through superior optimizations, single instruction multiple data (SIMD) vectorization, integration with Intel® Performance Libraries, and support for the latest OpenMP* 5.0 parallel programming models.

Scalar and vectorized loop versions

Figure 1: Scalar and vectorized loop versions with Intel® Streaming SIMD Extensions, Intel® Advanced Vector Extensions, and Intel® Advanced Vector Extensions 512.

Since UE 4.12, Intel has continued to keep the code base up to date, and tests on the Infiltrator workload show significant improvements in frame rates.

Texture compression improvement

UE4 also launched with support for Intel’s fast texture compressor. ISPC stands for Intel® SPMD (single program, multiple data) program compiler, which allows developers to easily target multicore processors and new and future instruction sets through the use of a code library. Before the ISPC texture compression library was integrated, ASTC (Adaptive Scalable Texture Compression), the newest and most advanced texture compression format, would often take minutes to compress per texture. On the Sun Temple* demo (part of the UE4 sample scenes pack), the time it took to compress all textures went from 68 minutes to 35 seconds, with better quality than the reference encoder used previously. This allows content developers to build their projects faster, saving hours per week of a typical developer’s time.

Optimizations for UE 4.19

Intel’s work specifically with UE 4.19 offers multiple benefits for developers. At the engine level, optimizations improve scaling mechanisms and tasking, and ensure that the rendering process does not become a bottleneck due to CPU utilization.

In addition, the many middleware systems employed by game developers will also benefit from optimizations. Physics, artificial intelligence, lighting, occlusion culling, virtual reality (VR) algorithms, vegetation, audio, and asynchronous computing all stand to benefit.

To help understand the benefits of the changes to the tasking system in 4.19, an overview of the UE threading model is useful.

UE4 threading model

Figure 2 represents time, going from left to right. The game thread runs ahead of everything else, while the render thread is one frame behind the game thread. Whatever is displayed thus runs two frames behind.

Game, render, audio threading model of Unreal Engine 4

Figure 2: Understanding the threading model of Unreal Engine 4.

Physics work is generated on the game thread and executed in parallel. Animation is also evaluated in parallel. Evaluating the animation in parallel was used to good effect in the recent VR title, Robo Recall*.

The game thread, shown in Figure 3, handles updates for gameplay, animation, physics, networking, and most importantly, actor ticking.

Developers can control the order in which objects tick by using Tick Groups. Tick Groups don’t provide parallelism, but they do allow developers to control dependent behavior to better schedule parallel work. This is vital to ensure that any parallel work does not cause a game thread bottleneck later.
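
As a small illustration (the class name AMyDestructibleProp is hypothetical, not something defined by the engine), an actor opts into a tick group from its constructor:

#include "GameFramework/Actor.h"

AMyDestructibleProp::AMyDestructibleProp()
{
	// Tick before physics so the work this actor kicks off is ready when the
	// physics scene is simulated; other groups include TG_DuringPhysics,
	// TG_PostPhysics, and TG_PostUpdateWork.
	PrimaryActorTick.bCanEverTick = true;
	PrimaryActorTick.TickGroup = TG_PrePhysics;
}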

Game thread and related jobs illustration

Figure 3: Game thread and related jobs.

As shown below in Figure 4, the render thread handles generating render commands to send to the GPU. Basically, the scene is traversed, and then command buffers are generated to send to the GPU. The command buffer generation can be done in parallel to decrease the time it takes to generate commands for the whole scene and kick off work sooner to the GPU.

breaking draw calls into chunks

Figure 4: The render thread model relies on breaking draw calls into chunks.

Each frame is broken down into phases that are done one after another. Within each phase, the render thread can go wide to generate the command lists for that phase:

  • Depth prepass
  • Base pass
  • Translucency
  • Velocity

Breaking the frame into chunks enables farming them into worker tasks with a parallel command list that can be filled up with the results of those tasks. Those get serialized back and used to generate draw calls. The engine doesn’t join worker threads at the call site, but instead joins at sync points (end of phases), or at the point where they are used if fast enough.

Audio thread

The main audio thread is analogous to the render thread, and acts as the interface for the lower-level mixing functions by performing the following tasks:

  • Evaluating sound queue graphs
  • Building wave instances
  • Handling attenuation, and so on

The audio thread is the thread that all user-exposed APIs (such as Blueprints and Gameplay) interact with. The decoding and source-worker tasks decode the audio information, and also perform processing such as spatialization and head-related transfer function (HRTF) unpacking. (HRTF is vital for players in VR, as the algorithms allow users to detect differences in sound location and distance.)

The audio hardware thread is a single platform-dependent thread (for example, XAudio2* on Microsoft Windows*), which renders directly to the output hardware and consumes the mix. This isn’t created or managed by UE, but the optimization work will still impact thread usage.

There are two types of tasks—decoding and source worker.

  • Decoding: decodes a block of compressed source files. Uses double buffering to decode compressed audio as it's being played back.
  • Source Worker: performs the actual source processing for sources, including sample rate conversion, spatialization (HRTF), and effects. The Source Worker is a configurable number in an INI file.
    • If you have four workers and 32 sources, each will mix eight sources.
    • The Source Worker is highly parallelizable, so you can increase the number if you have more CPU power.

Robo Recall was also the first title to ship with the new audio mixing and threading system in the Unreal Engine. In Robo Recall, for example, the head-related transfer function took up nearly half of the audio time.

CPU worker thread scaling

Prior to UE 4.19, the number of available worker threads on the task graph was limited and did not take Intel® Hyper-Threading Technology into account. On systems with more than six cores, this left entire cores sitting idle. Creating the right number of worker threads on the task graph (UE’s internal work scheduler) allows content creators to scale visual-enhancing systems such as animation, cloth, destruction, and particles beyond what was possible before.

In UE 4.19, the number of worker threads on the task graph is calculated based on the user’s CPU, up to a current max of 22 per priority level:

if (NumberOfCoresIncludingHyperthreads > NumberOfCores)
{
	NumberOfThreads = NumberOfCoresIncludingHyperthreads - 2;
}
else
{
	NumberOfThreads = NumberOfCores - 1;
}

The first step in parallel work is to open the door to the possibility that a game can use all of the available cores. This is a fundamental issue to make scaling successful. With the changes in 4.19, content can now do so and take full advantage of enthusiast CPUs through systems such as cloth physics, environment destruction, CPU-based particles, and advanced 3D audio.

Hardware thread utilization

Figure 5: Unreal Engine 4.19 now has the opportunity to utilize all available hardware threads.

In the benchmarking example above, the system is at full utilization on an Intel® Core™ i7-6950X processor at 3.00 GHz, tested using a synthetic workload.

Destruction benefits

One benefit from better thread utilization in multicore systems is in destruction. Destruction systems use the task graph to simulate dynamic fracturing of meshes into smaller pieces. A typical destruction workload consists of a few seconds of extensive simulation, followed by a return to the baseline. Better CPUs with more cores can keep the pieces around longer, with more fracturing, which greatly enhances realism.

Rous believes there is more that developers can do with destruction and calls it a good target for improved realism with the proper content. “It’s also easy to scale-up destruction, by fracturing meshes more and removing fractured chunks after a longer length of time on a more powerful CPU,” he said. “Since destruction is done through the physics engine on worker threads, the CPU won’t become the rendering bottleneck until quite a few systems are going at once.”

Simulation of dynamic fracturing of meshes

Figure 6: Destruction systems simulate dynamic fracturing of meshes into small pieces.

Cloth System Optimization

Cloth systems are used to add realism to characters and game environments via a dynamic 3D mesh simulation system that responds to the player, wind, or other environmental factors. Typical cloth applications within a game include player capes or flags.

The more realistic the cloth system, the more immersive the gaming experience. Generally speaking, the more cloth systems enabled, the more realistic the scene.

Developers have long struggled to make cloth systems appear realistic; without convincing cloth simulation, characters are restricted to tight clothing, and any effect of wind blowing through clothing is lost. Modeling cloth well, however, has historically been a challenge.

Early attempts at cloth systems

According to Donald House at Texas A&M University, the first important computer graphics model for cloth simulation was presented by Jerry Weil in 1986. House and others presented an entire course on “Cloth and Clothing in Computer Graphics,” and described Weil’s work in detail. Weil developed “a purely geometric method for mimicking the drape of fabric suspended at constraint points,” House wrote. There were two phases in Weil’s simulation process. First, geometrically approximate the cloth surface with catenary curves, producing triangles of constraint points. Then, by applying an iterative relaxation process, the surface is smoothed by interpolating the original catenary intersection points. This static draping model could also represent dynamic behavior by applying the full approximation and relaxation process once, and then successively moving the constraint points slightly and reapplying the relaxation phase.

Around the same time, continuum models emerged that used physically based approaches to cloth behavior modeling. These early models employed continuum representations, modeling cloth as an elastic sheet. The first work in this area is a 1987 master’s thesis by Carl R. Feynman, who superimposed a continuum elastic model on a grid representation. Due to issues with simulation mesh sizes, cloth modeling using continuum techniques has difficulty capturing the complex folding and buckling behavior of real cloth.

Particle models gain traction

Particle models gained relevance in 1992, when David Breen and Donald House developed a non-continuum interacting particle model for cloth drape, which “explicitly represents the micro-mechanical structure of cloth via an interacting particle system,” as House described it. He explained that their model is based on the observation that cloth is “best described as a mechanism of interacting mechanical parts rather than a substance, and derives its macro-scale dynamic properties from the micro-mechanical interaction between threads.” In 1994 it was shown how this model could be used to accurately reproduce the drape of specific materials, and the Breen/House model has been expanded from there. One of the most successful of these models was by Eberhard, Weber, and Strasser in 1996. They used a Lagrangian mechanics reformulation of the basic energy equations suggested in the Breen/House model, resulting in a system of ordinary differential equations from which dynamics could be calculated.

The dynamic mesh simulation system is the current popular model. It responds to the player, wind, or other environmental factors, and results in more realistic features such as player capes or flags.

The UE has undergone multiple upgrades to enhance cloth systems; for example, in version 4.16, APEX Cloth* was replaced with NVIDIA’s NvCloth* solver. This low-level clothing solver is responsible for the particle simulation that runs clothing and allows integrations to be lightweight and very extensible, because developers now have direct access to the data.

More triangles, better realism

In UE 4.19, Intel engineers worked with the UE team to further optimize the cloth system and improve throughput. Cloth simulations are treated like other physics objects and run on the task graph’s worker threads. This allows developers to scale content on multicore CPUs and avoid bottlenecks. With the changes, the number of cloth simulations usable in a scene has increased by approximately 30 percent.

Cloth is simulated in every frame, even if the player is not looking at that particular point; simulation results will determine if the cloth system shows up in a player’s view. Cloth simulation uses the CPU about the same amount from frame to frame, assuming more systems aren’t added. It’s easily predictable and developers can tune the amount they’re using to fit the available headroom.

Examples of cloth systems

Figure 7: Examples of cloth systems in the Content Examples project.

For the purposes of the graphs in this document, the cloth actors used have 8,192 simulated triangles per mesh, and were all within the viewport when the data was captured. All data was captured on an Intel® Core™ i7-7820HK processor.

 CPU Usage

Figure 8: Different CPU usages between versions of Unreal Engine 4, based on number of cloth systems in the scene.

 frames per second

Figure 9: Difference in frames per second between versions of Unreal Engine 4 based on number of cloth systems in the scene.

Enhanced CPU Particles

Particle systems have been used in computer graphics and video games since the very early days. They’re useful because motion is a central facet of real life, so modeling particles to create explosions, fireballs, cloud systems, and other events is crucial to develop full immersion.

High-quality features available to CPU particles include the following:

  • Light emission
  • Material parameter control
  • Attractor modules

Particles on multicore systems can be enhanced by using CPU systems in tandem with GPU ones. Such a system easily scales—developers can keep adding to the CPU workload until they run out of headroom. Engineers have found that pairing CPU particles with GPU particles can improve realism by adding light casting, allowing light to bounce off objects they run into. Each system has inherent limitations, so pairing them results in a system greater than the sum of their parts.

CPU particles emitting light

Figure 10: CPU Particles can easily scale based on available headroom.

Intel® VTune™ Amplifier Support

The Intel VTune Amplifier is an industry-standard tool to determine thread bottlenecks, sync points, and CPU hotspots. In UE 4.19, support for Intel VTune Amplifier ITT markers was added to the engine. This allows users to generate annotated CPU traces that give deep insight into what the engine is doing at all times.

ITT APIs have the following features:

  • Control application performance overhead based on the amount of traces that you collect.
  • Enable trace collection without recompiling your application.
  • Support applications in C/C++ and Fortran environments.
  • Support instrumentation for tracing application code.

Users can take advantage of this new functionality by launching Intel VTune Amplifier and running a UE workload through the UI with the -VTune switch. Once inside the workload, simply type Stat Namedevents on the console to begin outputting the ITT markers to the trace.

Intel VTune Amplifier trace in Unreal Engine 4.19

Figure 11: Example of annotated Intel VTune Amplifier trace in Unreal Engine 4.19.
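
For reference, the underlying ITT task API (shipped with Intel VTune Amplifier as ittnotify.h) can also be called directly from game or engine code to add custom regions to the trace; the domain and task names below are illustrative only:

#include <ittnotify.h>

static __itt_domain *g_domain = __itt_domain_create("MyGame.Simulation");
static __itt_string_handle *g_updateTask = __itt_string_handle_create("UpdateWorld");

void UpdateWorld()
{
	__itt_task_begin(g_domain, __itt_null, __itt_null, g_updateTask); // region start appears on the VTune timeline
	// ... per-frame simulation work ...
	__itt_task_end(g_domain);                                         // region end
}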

Conclusion

Improvements involved solving technical challenges at every layer: the engine, the middleware, the game editor, and the game itself. Rather than working on a title-by-title basis, engine improvements benefit the whole Unreal developer ecosystem. The advances in 4.19 address CPU workload challenges throughout the ecosystem in the following areas:

  • More realistic destruction, thanks to more breakpoints per object.
  • More particles, leading to better animated objects such as vegetation, cloth, and dust particles.
  • More realistic background characters.
  • More cloth systems.
  • Improved particles (for example, physically interacting with character, NPCs, and environment).

As more end users migrate to powerful multicore systems, Intel plans to pursue a roadmap that will continue to take advantage of higher core counts. Any thread-bound systems or bottlenecked operations are squarely in the team’s crosshairs. Developers should be sure to download the latest version of the UE, engage at the Intel Developer Zone, and see for themselves.

Further Resources

Unreal* Engine 4 Optimization Guide

CPU Optimizations for Cloth Simulations

Setting up Destructive Meshes

CPU Scaling Sample

Using Google Blocks* for Prototyping Assets in VR


House render

This article discusses how to use Google Blocks* to quickly model objects in virtual reality (VR), improving the workflow for your VR projects. Before we delve into using the software, let's look at a typical workflow.

Workflows

Typical workflow for VR development

A typical VR project is an iterative combination of code and assets. Usually there’s a period of preproduction to concept interactivity and assets (via sketches). However, once production begins, you often face many starts and stops, because the code-development side must wait for finished assets and vice versa.

This situation is made worse when developing for VR, because VR's unique visual perspective creates issues of scale, placement, and detail that don't exist in 3D projects presented on a 2D monitor. So you must deal with more back and forth on asset development to get ideas right. You're also often left with a low-quality visual presentation in the prototyping stages, which can hinder attracting funding or interest.

Benefits of using VR to prototype

Creating prototype model assets in VR itself creates a much smoother flow for both code and asset developers. This approach allows code developers to more quickly design and adjust the rough versions of what they need, have playable objects, and provide reference material to asset developers, compared to what sketches alone can provide. Modeling normally requires specialized skills in using many different tools and camera perspectives in order to work on a 3D object using 2D interfaces, lending precision and detail but at the cost of heavy workloads.

On the other hand, modeling in VR lets you work in the natural 3D environment by actually moving and shaping in the real-world, room-scale environment. In this way, 3D modeling is more like working with clay than it is adjusting vertices in a modeling app. This approach is not only much faster when creating low-detail prototype assets, but also much more accessible to people with no modeling skill whatsoever.

Benefits of Google Blocks low poly models

Google Blocks provides a simple “build by creating and modifying primitives” approach for VR and allows quick volumetric “sketching.” This combination of modified shapes and simple colored surfaces also lends itself to an aesthetic style that is clean-looking and low poly. This means much better performance, which is going to be extremely useful during the prototyping stage where performance will not yet be optimized. The clean look can even be used as-is for a fairly good-looking presentation.

Revised workflow for VR development

The new workflow for production shifts from start and stop “waterfall” development to one where the code team can provide prototypes of anything they need during the prototyping stage without waiting on the asset team. This approach allows the asset team to take the prototype models and preproduction sketches and develop finished assets that can be simply swapped into the working prototype that the code team has already put in place.

It’s easy to think you can just use primitives within a development tool like Unity* software to do all the prototype “blocking” you need, but the lack of actual rough models can lead to difficulty in developing proper interactions and testing. You will often find your progress hindered when trying to build things out using cubes, spheres, and cylinders. With the new workflow, you can quickly obtain shapes closer to finished assets and provide additional direction for the development of finalized assets as a result of what you learn during interaction development.

Tools Overview

Let’s lay out the tools we’ll use for development. All of them, except the HTC Vive*, are free. As mentioned, we’ll use the Google Blocks app to build models in VR using the HTC Vive. Next we’ll export and share the models using the Google Poly* cloud service. Then we’ll use the models in Unity software for actual development. Finally we’ll work with the models in Blender* for finalized asset development. You can also swap Blender with Maya*, or with whatever your preferred 3D modeling app is, and replace Unity with the Unreal Engine* (also free-ish).

Using Google Blocks

Because we will be using the HTC Vive for our VR head-mounted display (HMD), first download Steam*, and then install Blocks from here:
http://store.steampowered.com/app/533970/Blocks_by_Google/

When you start Blocks for the first time, you’ll get a tutorial for making an ice cream cone with all the fixings:
https://www.youtube.com/watch?v=kTCcM5sRz74&feature=youtu.be

This tutorial is a good introduction to using the basics in Blocks.

  • Moving your object
  • Creating and scaling shapes
  • Painting objects
  • Selecting and copying
  • Erasing

You have many more tools and options at your disposal through the simple interface provided. Here’s an overview of all of them:
https://youtu.be/41IljbEcGzQ

Tools list

Tools list

  • Shapes: Cone, Sphere, Cube, Cylinder, Torus
  • Strokes: Shapes that are three-sided or more
  • Paintbrush: Color options on back side of the Palette, Paint Objects, or Faces
  • Hand: Select, Grab, Scale, Copy, Flip, Group/Ungroup
  • Modify: Reshape, Subdivide, Extrude
  • Eraser: Object, Face

Palette controls

  • Tutorial, Grid, Save/Publish, New scene, Colors

Extra controls

  • Single Grip button to move scene, both Grip buttons to zoom/rotate
  • Left/Right on Left Trackpad to Undo/Redo
  • Left Trigger to create symmetrically
  • Right Trigger to place things

Files (left controller menu button)

  • Yours, Featured, Liked, Environments

There is also an option, initiated with your mouse on the desktop, to import a reference image (or images) that you can place in the scene. This is a great way to go from preproduction sketches to prototyping without having to rely on memory. To import an image, click the Add reference image button in the top center of the screen on your desktop.

To import an image

Google Poly

Before we go any further into modeling in Blocks, let’s take a look at Poly, Google’s online “warehouse” of published models: https://youtu.be/qnY4foiOEgc

Poly incorporates published works from both Tilt Brush and Blocks, but you can browse specifically for either by selecting Blocks from the sidebar. Take a moment to browse and search to see the kinds of things you can make using shapes and colors without any textures needed. As you browse, be sure to click Like on models you find interesting, to make them easy to find inside Blocks.

Like inside Blocks

Also be sure to note which models are under a sharing or editing license. You can only modify and publish changes to a model if it is marked remixable. Currently any models marked remixable require you to also credit the author, so be sure to do so if you use any remixable models as a base for something in an actual project.

Remix content

Now let’s take a look at the specific Blocks model you’ll be using as a base to build content for: https://poly.google.com/view/bvLXsDt9mww

Render

After you’ve Liked it, to load it easily from inside Blocks, click the menu button on the left controller, and then click the heart option to see your Liked models. Then select the house and press Insert.

Blocks

Starting out, the house will be scaled like a diorama. To make it bigger, grab it with the grab tool (the hand) and then press and hold Up on the right Trackpad (the +) to scale up. Once the size is what you want, you can use the controller grips to move and rotate it.

Rendering performance may suffer depending on your computer due to the complexity of the model, but this shows you an example of how such a complex scene is composed and lets you even modify it for your own purposes. Once we start using Unity software we will be using a prefab version I’ve modified to reduce complexity and add collision boxes, saving time. Now let’s set up the Unity software with VR implemented so we have a place to put what we make.

Unity* Software Project Setup

Importing plug-ins

First we’ll create a new project (save it wherever you like). Next we’ll need to import the SteamVR* plug-in to support the Vive: https://assetstore.unity.com/search?q=steamvr

SteamVR

You may have to tweak your project settings to meet its recommendations. Click Accept all to do this the first time. It may ask you again after this, unless you’ve modified your project settings. Just close the dialog in the future.

Next we’ll grab a great toolkit called VRTK, which makes movement and object interactions easier: https://assetstore.unity.com/search/?q=vrtk

We won’t cover the details of how to use the VRTK plug-in for this tutorial, but we will be using a sample scene to handle some things like touchpad walking. The VRTK plug-in is fantastic for making quick interactions available to the things you make in Blocks. To find out more about using VRTK, watch this video:

VRTK virtual reality toolkit

We’ll use the sample scene “017_CameraRig_TouchpadWalking” as a quick starting scene, so load that scene up and delete everything except the following:

  • Directional Light
  • Floor
  • [VRTK_SDKManager]
  • [VRTK_Scripts]

Sample scene

Next, scale the X/Z of the Floor larger to give you more space to walk around and place things.

scale the X/Z

Importing the prefab

Grab the prefab of the house model we looked at earlier to import into your scene: https://www.dropbox.com/s/87l8k23pc3h4a40/house.unitypackage?dl=0

Be sure to move the house to be pretty flush with the ground:

move the house

Then you should be able to run the project, put on your HMD, and walk around in the scene using the touchpad to move. If touchpad movement is uncomfortable for you, you could instead import this into a different example scene (all the example scenes are labeled) that uses another form of locomotion, such as teleporting.

If you want to skip these set-up steps and go directly to importing the house, you can download a copy of the fully set-up Unity software project here: https://www.dropbox.com/s/cehr2wxhi6nmh6c/blocks.zip?dl=0

Outside of colliding, right now you can’t interact with anything when you move through the scene. To see the original FBX that came from Poly, in the Test Objects folder, open model 1. Now you have a nice little VR scene to start adding things to.

Building Objects

Now that the scene is set up, we have to decide what to prototype to add to it. Depending on what we want to add, there are two approaches: build something from scratch or remix something from Poly. For demonstration purposes, we’ll do both.

Let’s say we want to turn the house into a kind of cooking and training simulator, and we need more interactive props with shapes that are a little more complex than primitives can offer. We’ll start with a simple remix of two components to create a frying pan with a lid.

Combining objects

We’ll use the following two remixable models to mix into one, so be sure to Like them:

Let’s combine them: https://youtu.be/SbjSs_rcFbk

You might be wondering why we couldn’t just bring both objects into Unity software and do the mixing of objects there. This highlights the importance of the two download options you can get from Poly: OBJ and FBX. When you download an OBJ, the entire model is a single mesh, but with an FBX the model is a group of all the individual shape pieces.

This makes a difference in how you can use the model within the Unity software. Having an entire mesh as a single object can be useful when putting on mesh colliders and setting up objects to be interactive via VRTK. The two models we are remixing are available only as OBJ files, so we can’t modify the individual parts within the Unity software. However our new model will be available as both (sometimes it takes time for the OBJ option to show up under download).

Now let’s download the FBX and import it into the Unity software using drag and drop, as we would with any other file, and then we’ll drop the object into the scene.

Testing the scene

Once you’ve got the object placed, click play to hop into the scene and check the scale. If the scale doesn’t look right, select the whole model group to adjust it. Now you have a simple object to use for interactive prototyping with VRTK that will be much more effective than using a couple of unmodified primitives. You also have a great starting point for refining the model or adding extra details like textures.

What’s really cool is that you could also have inserted the house object into Blocks first and modeled the pan (or any other accessory) while having the full house for reference, and then deleted the single house object before saving and publishing, without even having to go into the Unity software.

Creating a model

Now let’s look at making a quick, simple model—a cheese grater—from scratch. This model is more than a simple primitive, but not overly complex. We’ll also make a block of cheese with the kind of crumbly edge you get when you slice it. You’ll notice that I used the menu button on the right wand to group and ungroup objects for easier manipulation, and I also used the modify tool for both edges and faces. See Video:

Because you can import this model as an FBX into the Unity software, you can easily separate the cheese from the grater for different interactions but do it as a single import.

If you prefer using Unreal Engine over the Unity software, first read about FBX workflows with that engine: https://docs.unrealengine.com/latest/INT/Engine/Content/FBX/index.html

Working with a Modeler

Importing into blender

Once you have the FBX or OBJ files you or a modeler want to work with in a full modeling package, you can import them into a free program such as Blender.

Blender

Blender

Blender

From there you can edit the model, animate it, or perform other operations.

Exporting for Poly

If you want to be able to share the model using Poly again, you can export it as an OBJ (with an MTL materials file).

model using Poly

Next, click the upload OBJ button on the Poly site.

button on the Poly site

Finally, drag and drop the .obj and .mtl files onto the page, and then click Publish to publish the model.

drag and drop the .obj

The disadvantage, however, is that the model is a single mesh OBJ, and you also can’t use it in Blocks to remix or view, so it’s useful mostly as a way to quickly share a full model. But this can also be a great way for a modeler to show you the work in progress (as well as allow you to download the OBJ for testing in the Unity software). Keep in mind that files uploaded this way won’t show up in Blocks even if you Like them. So pay attention to objects you Like to see if they say “Uploaded OBJ File,” because that won’t be usable in Blocks.

Review

Let’s review what we covered in this article.

  • Take preproduction reference images into Blocks when you have them.
  • Quickly “sketch” usable models in Blocks.
    • Use remixable models as starting points when it makes sense.
    • You can bring in other models to use for reference or scaling purposes.
  • Publish your models to Poly (you can choose to make them unlisted).
  • Download the OBJ or FBX, depending on your needs.
  • Import the model into Unity software for prototyping.
  • Share the Poly page with a modeler so they can modify the OBJ and FBX or simply use it as a reference along with the preproduction sketch and even screenshots (or a playable version) of the Unity software scene to begin developing a finalized asset in a tool like Blender.
  • The modeler can also use Poly as a way to provide you with quick previews that you can download and insert in your scene.
  • Rinse and repeat to quickly build your commercial or gaming project!

In the future, Google Blocks may also incorporate animation (see this article: https://vrscout.com/news/animating-vr-characters-google-blocks/ ) so watch for that to make your future workflow even more awesome.

Character Modeling


modeled character

Introduction

Character modeling is the process of creating a character within the 3D space of computer programs. The techniques for character modeling are essential for third- and first-person experiences within film, animation, games, and VR training programs. In this article, I explain how to design with intent, how to make a design model-ready, and the process of creating your model. In later lessons, we will continue to finish the model using retopologizing techniques.

characters within the 3D space

Design and Drawing

The first step to designing a character is to understand its purpose in the application or scene. For example, if the character is being created for a first-person training program, you may only need to model a pair of floating hands.

Additionally, for film, games, and VR, the character's design is key. The design must fit into the world and also visually describe the character's personality. If they have big, wide eyes, they're probably cartoony and cute. If they wear one sock higher than the other, they might be quirky or stressed. Let the design tell a story about what kind of person the character is.

Below I've provided a sample design of the character I will be modeling throughout the article. With it, I've provided a breakdown that explains how his design affects your perception of his character.

Simple Design Breakdown

  • Round shapes indicate that the character is nice and friendly; you want the audience to like this character.
  • Big eyes show youth and make the character cute; also very expressive.
  • Details like the propeller hat and the striped shirt indicate that he's fun and silly.

Model-Ready: Static vs Animated

design of the character model

Once you have a design, it's important to distinguish whether your character is static or animated. This will determine how you go about creating the blueprints for your character model. These blueprints are called orthographic drawings. Orthographic drawings are front, side, and top drawings of your model. You may see these types of drawings for 2D animation or concept art. However, orthographic drawings for 3D character models are different. Below I will explain the different requirements for static and animated orthographic drawings.

Animated

An animated model must be set up properly for rigging. The following requirements are necessary for a character to be bound to a rig:

  • The drawings must be done in a T pose or A pose
  • They must have a slight bend at the knees and arms
  • Fingers and legs must be spread apart
  • They must have a blank expression

Skipping any of these steps will make it difficult to achieve clean results with rigging and animation. I've provided some example orthographic, T-pose drawings of the character I will be modeling.

T-pose drawings of the character

Static

A static model, like a statue or action figure, will hold the same pose, so it doesn't need a rig. Rather, it just needs to be modeled in the pose and expression the design calls for. The only requirement for your orthographic views is that the drawings must represent the pose and expression of the finished character model from all orthographic angles.

static model

Notice that for both the animated and static drawings, the side and front views of the body line up with one another. This is important to ensure that the model will be proportionally correct when these blueprints guide you through the modeling process. To continue, save each orthographic view as its own .jpg or .png file.

orthographic view
Orthographic View

Now, you're ready to continue onto the modeling section! Since head modeling tends to be more difficult, I've chosen to focus on head modeling for the majority of the section. However, I believe once you are able to understand how to model the head, creating the rest of the character will come easily. Additionally, the same techniques will apply, and I will continue to guide you with step-by-step processes and images.

Modeling

Setup

Now that your orthographic drawings are done you can bring them into your 3D program of choice. To do so, you'll bring them in as image planes. As you can see by the images below, the drawings on the image planes line up accordingly with one another. This is essential. A little bit of difference is okay, but if they're far off, the image planes can warp the proportions of your character model. Once you have your planes in place, we are officially ready to begin modeling.

image planes

Tips Before You Start

The three keys to character modeling are symmetry, simplicity, and toggling.

  • Symmetry: Throughout each piece of the body, it's important for us to have symmetry to maintain proper functionality for animation.
  • Simplicity: Never start with a dense mesh. Starting with a low polygon count will allow you to easily shape the mesh. For instance, in the video I start with a cube, three subdivisions across the depth, width, and height.
  • Toggling: It's important to toggle mesh-smoothing on and off. Often, messy geometry will appear clean while the mesh is smoothed. 

I do all three of these processes throughout the head modeling video. Watching it will help you understand how these techniques fit into the workflow. Now let's get started!

Modeling The Head

For modeling the head, we are going to go through four stages. These stages will apply to creating the head and the rest of the body.

  1. Low-Poly Stage: shape a low-poly primitive object (a cube, for instance) into the piece of the body you are creating.
  2. Pre-Planning Stage: increase the polygon count and continue to shape the mesh.
  3. Planning Stage: plan a space for the details, like the facial features on a head model.
  4. Refinement Stage: tweak and add topology as you see fit so you are able to match your design.

the four stages for modeling the head

Stages one and two

To begin, I'm going to shape a low-poly cube into my character's head. As you watch the video, you'll notice that I use the Translate tool to shape the head, as well as the Insert Edgeloop tool and the Smooth button for further detailing. I like using Insert Edgeloop when I need more topology in a particular area. The Smooth button, on the other hand, helps when I want to increase the topology of the entire mesh while maintaining the smooth-mesh shape.

Stage three and four

Now that there's more topology, we can begin planning for the eyes, nose, and mouth. You'll be using your orthographic drawings to guide you on the placement for each of these. Again, it's important to follow along with the video so that you can see the process. From here, the steps that follow are:

  1. Plan/shape the vertices of your mesh for the facial features.
  2. Extrude to build a space for the eye sockets, mouth and nose.
  3. Continue to form shape without adding more topology.
  4. Slowly add or extrude polygons.
  5. Use sculpt tools or soft-selection to match the mesh and orthographic drawings as best as possible.
  6. Repeat steps three and four a few times until your topology matches your drawings.

planning the eyes, nose, and mouth

During this stage, I like using the Edgeslide tool, so that when I translate the vertices the head shape will not be altered. Next, you can move onto the refinement stage. After you've finished refining the facial features, you can begin to model the eyes.

Modeling the Eyes

The next step is to make eyeballs that fit inside the head, and for the sockets to fit around them. The process is as follows:

  1. Make a sphere.
  2. Move and uniformly scale the sphere to fit roughly inside the socket.
  3. Rotate the sphere 90 degrees so that the pole is facing outward.
  4. Adjust the socket as necessary so that it rests on the eyeball.
  5. Shape the iris, pupil, and cornea as demonstrated below.
  6. Select the new group and scale the group -1 across the x axis.

Follow the steps as guided with the images below.

modeling the eyes
Figure 1. Uniformly scale, and translate a sphere to roughly fit inside the socket.
Then rotate the sphere 90 degrees so that the sphere's pole is facing outward.
Figure 2. Adjust the socket to fit around the eye.

Duplicate the eye. The "eye" mesh we do not edit will be the cornea.

creating the iris and pupil
Figure 3. Pick a sphere, select the edges as shown, and scale.
Figure 4. Translate the edges back to fit inside the cornea. Now we've created the iris.
Figure 5. Select the inner faces, and extrude inward to create the pupil.

Group the eye pieces. Rename the group and then proceed to duplicate it.

group and scale eye piece
Figure 6. Select the new group and scale the group -1 across the x axis.

Now, the only things missing from the head are the eyelids, ears, and neck. However, we won't be doing those until we finish retopologizing our model. As for the hair and eyebrows, I typically like to create low-poly simple shapes.

Patching a Mistake

If it's your first time making a model, it's possible you ran into several complications throughout this process. Below I've provided some possible problems with their corresponding solutions.

  • My symmetry tool isn't working properly.
    • This is an indication of asymmetry. Go through the following steps to troubleshoot the problem.
    1. Check for and delete extra vertices and faces.
    2. Make a duplicate and hide or move the original. Next, you'll need to delete half of the duplicate's faces, ensure the vertices that cut down the middle of the mesh are in line with the axis of symmetry, and then use the mirror tool across the axis of symmetry.
    3. Delete the object's history.
  • My mesh is asymmetrical.
    • This sometimes happens when you move vertices after forgetting to turn symmetry back on.
    1. Make a duplicate and hide or move the original. Next, you'll need to delete half of the duplicate's faces, ensure the vertices that cut down the middle of the mesh are in line with the axis of symmetry, and then use mirror tool across the axis of symmetry.
  • I can't get my character's eyes to fit inside both socket and head.
    • This is likely the case for eyes that have an oval shape or are really far spread apart. For these instances you'll probably need to use a lattice deformer on your geometry. Animating a texture map is also a possible solution.
  • When I group and mirror the mesh, it doesn't mirror.

Arms and Edgeloop Placement

Next, I'm going to make another complex piece of geometry: the arm. Before I explain my process, it's important to understand the importance of edgeloop placement. Edgeloops not only allow you to add topology, but also allow the mesh to bend when it's rigged. At least three edgeloops are needed at joints such as the knuckles, elbows, shoulders, and knees.

Also, remember how we drew a slight bend in the character's arm? You'll need to model that bend. When the character is rigged, the joints will be placed along that bend; this helps the IK joints figure out which way to bend. However, if the joints are placed in a straight line, the joints could bend backwards, giving your character a broken arm or leg.

Modeling The Arms

My process for modeling the arms starts with the fingers and works backwards. I find that doing it this way makes the end mesh cleaner. Following this order, I've simplified the process into four stages:

  1. Finger Stage: model all fingers and thumb.
  2. Palm Stage: model the palm.
  3. Attaching Stage: attach the fingers and thumb to the hand.
  4. Arm Stage: extrude and shape the arm.

Now that you have a basic understanding of our goal, here are the detailed steps with images to show the process.

Stage one

add edge loops to model fingers

Figure 1. Make a low-poly cube to model a finger, toggle views to match your drawings.

add edge loops to model fingers

Figure 2. Add edgeloops at the knuckles, and refine.

duplicate and adjust mesh across fingers

Figure 3. Duplicate, tweak, and translate the finger model to create the other fingers.

model the thumb

Figure 4. Model the thumb from a low-poly cube. Refine the thumb. Toggle views to check their placement; then combine the fingers and thumb into one mesh.

Stage two

steps to create and shape the palm
Figure 6. Create a cube with the proper amount of subdivisions to attach the fingers.
Figure 7. Delete every other edgeloop (for simplicity) and shape the palm.

Note: Doing it this way makes it easy to push the shape of the palm at a lower subdivision, and it ensures that there will be enough geometry to attach the fingers when we increase the topology.

Stage three

steps to shape the palm
Figure 8. Add back in the palm's topology.
Figure 9. Combine the palm and finger mesh.
Figure 10. Attach the fingers.

Note: I prefer using the Target Weld tool to attach the palm to the fingers.

usage of Target Weld tool
Figure 11. Clean the geometry.

Stage four

extrude the arm
Figure 12. Extrude the arm.

extrude the arm
Figure 13. Add edgeloops, and ensure that the mesh is hollow.

Once the arm is made, we can duplicate it onto the other side like we did for the eyeballs. Here's a refresher of the steps:

  1. Duplicate and group the arm.
  2. Scale the group -1 across the x axis and ungroup the arm.

Modeling The Body

At this point, you've learned most of the techniques needed to finish your character model! The rest of the body follows steps similar to those we took to model the head and arms. If you follow along with the video and these steps, you'll be in good shape.

  1. Ask yourself, "What primitive mesh will work best for each one?"
    • For example: a cylinder works great for pant legs, but a cube could work better for a shoe.
  2. Create a vague plan.
    • For example: "I'm going to use the cylinder to create one pant leg, finish the left side of the pants, then use the mirror tool to finish the model."
  3. Move, scale and edit the low-poly, primitive mesh to match with the orthographic drawings.
  4. Slowly add or extrude polygons.
  5. Use sculpt tools or soft-selection to match the mesh and orthographic drawings as best as possible.
  6. Repeat steps four and five a few times until your topology matches your drawings.
  7. Mirror your model if needed!

Below, are some example images I've provided for each of the remaining parts of the body.

shirt
Shirt

shorts
Shorts

legs
Leg(s)

shoes
Shoe(s)

Great! Your model has been made! However, before we move on, you need to double check these things to make sure you are ready to move onto retopologizing.

  • Is your model symmetrical?
  • Do your knees and arms have a bend? (only applies if character will be rigged)
  • Have you modeled everything for this character?
  • Does your character relatively match your drawings?

If none of these questions bring up concerns, then you are ready to move onto the following article for character retopology.

Resources

Thank you for continuing with me throughout this article-video combination. Since it's best to learn from multiple sources, I intend to provide resources that have helped me and my peers on our curiosity journeys. Here are some listed below.

Helpful YouTube* Channels for Everything Related to 3D:

  • Maya* Learning Channel
  • Pixologic Learning Channel
  • Blender* Guru
  • Blender
  • James Taylor (MethodJTV*)

Other Character Modeling Resources:

  • Lynda.com*
  • Pluralsight*
  • AnimSchool*

Other Rigging Resources:

  • Rapid Rig* and Mixamo* (auto-rigging)
  • Pluralsight*
  • AnimSchool

Unreal Engine 4 Parallel Processing School of Fish


Nikolay Lazarev

Integrated Computer Solutions, Inc.

General Description of the Flocking Algorithm

The implemented flocking algorithm simulates the behavior of a school, or flock, of fish. The algorithm contains four basic behaviors:

  • Cohesion: Fish search for their neighbors in a radius defined as the Radius of Cohesion. The current positions of all neighbors are summed. The result is divided by the number of neighbors. Thus, the center of mass of the neighbors is obtained. This is the point to which the fish strive for cohesion. To determine the direction of movement of the fish, the current position of the fish is subtracted from the result obtained earlier, and then the resulting vector is normalized.
  • Separation: Fish search for their neighbors in a radius defined as the Separation Radius. To calculate the motion vector of an individual fish in a specific separation direction from a school, the difference in the positions of the neighbors and its own position is summed. The result is divided by the number of neighbors and then normalized and multiplied by -1 to change the initial direction of the fish to swim in the opposite direction of the neighbors.
  • Alignment: Fish search for their neighbors in a radius defined as the Radius of Alignment. The current speeds of all neighbors are summed, then divided by the number of neighbors. The resulting vector is normalized.
  • Reversal: All of the fish can only swim in a given space, the boundaries of which can be specified. The moment a fish crosses a boundary must be identified. If a fish hits a boundary, then the direction of the fish is changed to the opposite vector (thereby keeping the fish within the defined space).

These four basic principles of behavior for each fish in a school are combined to calculate the total position values, speed, and acceleration of each fish. In the proposed algorithm, the concept of weight coefficients was introduced to increase or decrease the influence of each of these three modes of behavior (cohesion, separation, and alignment). The weight coefficient was not applied to the behavior of reversal, because fish were not permitted to swim outside of the defined boundaries. For this reason, reversal had the highest priority. Also, the algorithm provided for maximum speed and acceleration.

According to the algorithm described above, the parameters of each fish were calculated (position, velocity, and acceleration). These parameters were calculated for each frame.
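As a compact illustration of how these pieces fit together, here is a minimal per-fish update sketch using the variable names introduced in the next section; it mirrors the annotated code below rather than replacing it:

     // cohesion, separation, and alignment are the normalized direction vectors computed
     // from the neighbor searches; kCoh, kSep, kAlign, maxAccel, maxVel, and DeltaTime are
     // the weights and limits described below.
     FVector accel = (cohesion * kCoh + separation * kSep + alignment * kAlign)
                         .GetClampedToMaxSize(maxAccel);                    // clamp acceleration
     velocity = (velocity + accel * DeltaTime).GetClampedToMaxSize(maxVel); // clamp speed
     position += velocity * DeltaTime;                                      // integrate position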

Source Code of the Flocking Algorithm with Comments

To calculate the state of fish in a school, double buffering is used. Fish states are stored in an array of size N x 2, where N is the number of fish, and 2 is the number of copies of states.
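A hypothetical C++ layout for this array is sketched below; the field names mirror the code that follows, but the actual project may declare the container differently:

     // One state record per fish; two copies per fish enable double buffering.
     struct FFishState
     {
         int32   instanceId;     // index of the corresponding instanced static mesh instance
         FVector position;
         FVector velocity;
         FVector acceleration;
     };

     TArray<TStaticArray<FFishState, 2>> agents;   // agents[fishNum][stateIndex]
     int32 currentStatesIndex  = 0;                // copy written this frame
     int32 previousStatesIndex = 1;                // copy read from (last frame's results)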

The algorithm is implemented using two nested loops. In the internal nested loop, the direction vectors are calculated for the three types of behavior (cohesion, separation, and alignment). In the external nested loop, the final calculation of the new state of the fish is made based on calculations in the internal nested loop. These calculations are also based on the values of the weight coefficients of each type of behavior and the maximum values of speed and acceleration.

External loop: at each iteration of the loop, a new state for each fish is calculated. The following arguments are passed to the lambda function:

  • agents: array of fish states
  • currentStatesIndex: index of the array where the current states of each fish are stored
  • previousStatesIndex: index of the array where the previous states of each fish are stored
  • kCoh: weighting factor for cohesion behavior
  • kSep: weighting factor for separation behavior
  • kAlign: weighting factor for alignment behavior
  • rCohesion: radius in which neighbors are sought for cohesion
  • rSeparation: radius in which neighbors are sought for separation
  • rAlignment: radius in which neighbors are sought for alignment
  • maxAccel: maximum acceleration of fish
  • maxVel: maximum speed of fish
  • mapSz: boundaries of the area in which fish are allowed to move
  • DeltaTime: elapsed time since the last calculation
  • isSingleThread: parameter that indicates in which mode the loop will run

ParallelFor can be used in either of two modes, depending on the state of the isSingleThread Boolean variable:

     ParallelFor(cnt, [&agents, currentStatesIndex, previousStatesIndex, kCoh, kSep, kAlign, rCohesion, rSeparation, 
            rAlignment, maxAccel, maxVel, mapSz, DeltaTime, isSingleThread](int32 fishNum) {

Initializing directions with a zero vector to calculate each of the three behaviors:

     FVector cohesion(FVector::ZeroVector), separation(FVector::ZeroVector), alignment(FVector::ZeroVector);

Initializing neighbor counters for each type of behavior:

     int32 cohesionCnt = 0, separationCnt = 0, alignmentCnt = 0;

Internal nested loop. Calculates the direction vectors for the three types of behavior:

     for (int i = 0; i < cnt; i++) {

Each fish should ignore (not calculate) itself:

     if (i != fishNum) {

Calculate the distance between the position of a current fish and the position of each other fish in the array:

     float distance = FVector::Distance(agents[i][previousStatesIndex].position, agents[fishNum][previousStatesIndex].position);

If the distance is less than the cohesion radius:

     if (distance < rCohesion) {

Then the neighbor position is added to the cohesion vector:

     cohesion += agents[i][previousStatesIndex].position;

The value of the neighbor counter is increased:

     cohesionCnt++;
     }

If the distance is less than the separation radius:

     if (distance < rSeparation) {

The difference between the position of the neighbor and the position of the current fish is added to the separation vector:

     separation += agents[i][previousStatesIndex].position - agents[fishNum][previousStatesIndex].position;

The value of the neighbor counter is increased:

     separationCnt++;
     }

If the distance is less than the radius of alignment:

     if (distance < rAlignment) {

Then the velocity of the neighbor is added to the alignment vector:

     alignment += agents[i][previousStatesIndex].velocity;

The value of the neighbor counter is increased:

     alignmentCnt++;
                      }
             }

If neighbors were found for cohesion:

     if (cohesionCnt != 0) {

Then the cohesion vector is divided by the number of neighbors and its own position is subtracted:

     cohesion /= cohesionCnt;
     cohesion -= agents[fishNum][previousStatesIndex].position;

The cohesion vector is normalized:

     cohesion.Normalize();
     }

If neighbors were found for separation:

     if (separationCnt != 0) {

The separation vector is divided by the number of neighbors and multiplied by -1 to change the direction:

            separation /= separationCnt;
            separation *= -1.f;

The separation vector is normalized:

              separation.Normalize();
     }

If neighbors were found for alignment:

     if (alignmentCnt != 0) {

The alignment vector is divided by the number of neighbors:

            alignment /= alignmentCnt;

The alignment vector is normalized:

            alignment.Normalize();
     }

Based on the weight coefficients of each of the possible types of behavior, a new acceleration vector is determined, limited by the value of the maximum acceleration:

agents[fishNum][currentStatesIndex].acceleration = (cohesion * kCoh + separation * kSep + alignment * kAlign).GetClampedToMaxSize(maxAccel);

The Z component of the acceleration vector is set to zero:

   agents[fishNum][currentStatesIndex].acceleration.Z = 0;

To the previous velocity of the fish, the product of the new acceleration vector and the time elapsed since the last calculation is added:

     agents[fishNum][currentStatesIndex].velocity += agents[fishNum][currentStatesIndex].acceleration * DeltaTime;

The velocity vector is limited to the maximum value:

     agents[fishNum][currentStatesIndex].velocity =
                 agents[fishNum][currentStatesIndex].velocity.GetClampedToMaxSize(maxVel);

To the previous position of the fish, the product of the new velocity vector and the time elapsed since the last calculation is added:

     agents[fishNum][currentStatesIndex].position += agents[fishNum][currentStatesIndex].velocity * DeltaTime;

The current fish is checked to be within the specified boundaries. If yes, the calculated speed and position values are saved. If the fish has moved beyond the boundaries of the region along one of the axes, then the value of the velocity vector along this axis is multiplied by -1 to change the direction of motion:

agents[fishNum][currentStatesIndex].velocity = checkMapRange(mapSz,
               agents[fishNum][currentStatesIndex].position, agents[fishNum][currentStatesIndex].velocity);
               }, isSingleThread);
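The checkMapRange helper is not listed in the article; a hypothetical implementation consistent with the behavior described above (and with the boundary handling in the compute shader later in this article) might look like this:

     // Flip the velocity component along any axis on which the fish has left the allowed box.
     FVector checkMapRange(const FVector& mapSz, const FVector& position, const FVector& velocity)
     {
         FVector result = velocity;
         if (position.X > mapSz.X || position.X < -mapSz.X) { result.X *= -1.f; }
         if (position.Y > mapSz.Y || position.Y < -mapSz.Y) { result.Y *= -1.f; }
         if (position.Z > mapSz.Z || position.Z < -mapSz.Z) { result.Z *= -1.f; }
         return result;
     }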

For each fish, collisions with world-static objects, like underwater rocks, should be detected, before new states are applied:

     for (int i = 0; i < cnt; i++) {

To detect collisions between fish and world-static objects:

            FHitResult hit(ForceInit);
            if (collisionDetected(agents[i][previousStatesIndex].position, agents[i][currentStatesIndex].position, hit)) {

If a collision is detected, then the previously calculated position should be undone. The velocity vector should be changed to the opposite direction and the position recalculated:

                    agents[i][currentStatesIndex].position -= agents[i][currentStatesIndex].velocity * DeltaTime;
                    agents[i][currentStatesIndex].velocity *= -1.0;
                    agents[i][currentStatesIndex].position += agents[i][currentStatesIndex].velocity * DeltaTime;
            }
     }
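The collisionDetected helper is also not shown. A minimal sketch, assuming a UWorld pointer is available (the project presumably obtains it from the owning actor) and that world-static geometry is what should block the fish, could use a simple line trace:

     // Line trace from the previous position to the newly computed one against world-static objects.
     bool collisionDetected(UWorld* world, const FVector& start, const FVector& end, FHitResult& hit)
     {
         FCollisionQueryParams params;
         params.bTraceComplex = false;   // simple collision is sufficient for this check
         return world->LineTraceSingleByChannel(hit, start, end, ECC_WorldStatic, params);
     }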

Having calculated the new states of all fish, these updated states will be applied, and all fish will be moved to a new position:

     for (int i = 0; i < cnt; i++) {
            FTransform transform;
            m_instancedStaticMeshComponent->GetInstanceTransform(agents[i][0].instanceId, transform);

Set up a new position of the fish instance:

     transform.SetLocation(agents[i][0].position);

Turn the fish head forward in the direction of movement:

     FVector direction = agents[i][0].velocity; 
     direction.Normalize();
     transform.SetRotation(FRotationMatrix::MakeFromX(direction).Rotator().Add(0.f, -90.f, 0.f).Quaternion());

Update instance transform:

            m_instancedStaticMeshComponent->UpdateInstanceTransform(agents[i][0].instanceId, transform, false, false);
     }

Redraw all the fish:

     m_instancedStaticMeshComponent->ReleasePerInstanceRenderData();

     m_instancedStaticMeshComponent->MarkRenderStateDirty();

Swap indexed fish states:

      swapFishStatesIndexes();
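swapFishStatesIndexes is straightforward; a hypothetical one-liner, assuming the two state-index variables used throughout the loop, simply exchanges which copy is treated as current:

     // After the swap, this frame's results become the "previous" states read next frame.
     void swapFishStatesIndexes()
     {
         Swap(currentStatesIndex, previousStatesIndex);   // UE's Swap(); std::swap works as well
     }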

Complexity of the Algorithm: How Increasing the Number of Fish Affects Performance

Suppose that the number of fish participating in the algorithm is N. To determine the new state of each fish, the distance to all the other fish must be calculated (not counting additional operations for determining the direction vectors for the three types of behavior). The initial complexity of the algorithm will be O(N²). For example, 1,000 fish will require 1,000,000 operations.

Figure 1

Figure 1: Computational operations for calculating the positions of all fish in a scene.

Compute Shader with Comments

Structure describing the state of each fish:

     struct TInfo{
              int instanceId;
              float3 position;
              float3 velocity;
              float3 acceleration;
     };

Function for calculating the distance between two vectors:

     float getDistance(float3 v1, float3 v2) {
              return sqrt((v2[0]-v1[0])*(v2[0]-v1[0]) + (v2[1]-v1[1])*(v2[1]-v1[1]) + (v2[2]-v1[2])*(v2[2]-v1[2]));
     }

     RWStructuredBuffer<TInfo> data;

     [numthreads(1, 128, 1)]
     void VS_test(uint3 ThreadId : SV_DispatchThreadID)
     {

Total number of fish:

     int fishCount = constants.fishCount;

This variable, created and initialized in C++, determines the number of fish calculated in each graphics processing unit (GPU) thread (by default: 1):

     int calculationsPerThread = constants.calculationsPerThread;

Loop for calculating fish states that must be computed in this thread:

     for (int iteration = 0; iteration < calculationsPerThread; iteration++) {

Thread identifier. Corresponds to the fish index in the state array:

     int currentThreadId = calculationsPerThread * ThreadId.y + iteration;

The current index is checked to ensure it does not exceed the total number of fish (this is possible, since more threads can be started than there are fish):

     if (currentThreadId >= fishCount)
            return;

To calculate the state of fish, a single double-length array is used. The first N elements of this array are the new states of fish to be calculated; the second N elements are the older states of fish that were previously calculated.

Current fish index:

    int currentId = fishCount + currentThreadId;

Copy of the structure of the current state of fish:

     TInfo currentState = data[currentThreadId + fishCount];

Copy of the structure of the new state of fish:

     TInfo newState = data[currentThreadId];

Initialize direction vectors for the three types of behavior:

     float3 steerCohesion = {0.0f, 0.0f, 0.0f};
     float3 steerSeparation = {0.0f, 0.0f, 0.0f};
     float3 steerAlignment = {0.0f, 0.0f, 0.0f};

Initialize neighbors counters for each type of behavior:

     float steerCohesionCnt = 0.0f;
     float steerSeparationCnt = 0.0f;
     float steerAlignmentCnt = 0.0f;

Based on the current state of each fish, direction vectors are calculated for each of the three types of behaviors. The cycle begins with the middle of the input array, which is where the older states are stored:

     for (int i = fishCount; i < 2 * fishCount; i++) {

Each fish should ignore (not calculate) itself:

     if (i != currentId) {

Calculate the distance between the position of current fish and the position of each other fish in the array:

     float d = getDistance(data[i].position, currentState.position);

If the distance is less than the cohesion radius:

     if (d < constants.radiusCohesion) {

Then the neighbor’s position is added to the cohesion vector:

     steerCohesion += data[i].position;

And the counter of neighbors for cohesion is increased:

            steerCohesionCnt++;
     }

If the distance is less than the separation radius:

     if (d < constants.radiusSeparation) {

Then the difference between the position of the neighbor and the position of the current fish is added to the separation vector:

     steerSeparation += data[i].position - currentState.position;

The counter of the number of neighbors for separation increases:

            steerSeparationCnt++;
     }

If the distance is less than the alignment radius:

     if (d < constants.radiusAlignment) {

Then the velocity of the neighbor is added to the alignment vector:

     steerAlignment += data[i].velocity;

The counter of the number of neighbors for alignment increases:

                          steerAlignmentCnt++;
                   }
            }
     }

If neighbors were found for cohesion:

   if (steerCohesionCnt != 0) {

The cohesion vector is divided by the number of neighbors and its own position is subtracted:

     steerCohesion = (steerCohesion / steerCohesionCnt - currentState.position);

The cohesion vector is normalized:

            steerCohesion = normalize(steerCohesion);
     }

If neighbors were found for separation:

     if (steerSeparationCnt != 0) {

Then the separation vector is divided by the number of neighbors and multiplied by -1 to change the direction:

     steerSeparation = -1.f * (steerSeparation / steerSeparationCnt);

The separation vector is normalized:

            steerSeparation = normalize(steerSeparation);
     }

If neighbors were found for alignment:

     if (steerAlignmentCnt != 0) {

Then the alignment vector is divided by the number of neighbors:

     steerAlignment /= steerAlignmentCnt;

The alignment vector is normalized:

           steerAlignment = normalize(steerAlignment);
     }

Based on the weight coefficients of each of the three possible types of behaviors, a new acceleration vector is determined, limited by the value of the maximum acceleration:

     newState.acceleration = (steerCohesion * constants.kCohesion + steerSeparation * constants.kSeparation
            + steerAlignment * constants.kAlignment);
     newState.acceleration = clamp(newState.acceleration, -1.0f * constants.maxAcceleration,
            constants.maxAcceleration);

The Z component of the acceleration vector is set to zero:

     newState.acceleration[2] = 0.0f;

To the previous velocity vector, the product of the new acceleration vector and the time elapsed since the last calculation is added. The velocity vector is limited to the maximum value:

     newState.velocity += newState.acceleration * variables.DeltaTime;
     newState.velocity = clamp(newState.velocity, -1.0f * constants.maxVelocity, constants.maxVelocity);

To the previous position of the fish, the product of the new velocity vector and the time elapsed since the last calculation is added:

     newState.position += newState.velocity * variables.DeltaTime;

The current fish is checked to be within the specified boundaries. If yes, the calculated speed and position values are saved. If the fish has moved beyond the boundaries of the region along one of the axes, then the value of the velocity vector along this axis is multiplied by -1 to change the direction of motion:

                   float3 newVelocity = newState.velocity;
                   if (newState.position[0] > constants.mapRangeX || newState.position[0] < -constants.mapRangeX) {
                          newVelocity[0] *= -1.f;
                   }    

                   if (newState.position[1] > constants.mapRangeY || newState.position[1] < -constants.mapRangeY) {
                          newVelocity[1] *= -1.f;
                   }
                   if (newState.position[2] > constants.mapRangeZ || newState.position[2] < -3000.f) {
                          newVelocity[2] *= -1.f;
                   }
                   newState.velocity = newVelocity;

                   data[currentThreadId] = newState;
            }
     }         

Table 1: Comparison of algorithms (frames per second for each implementation and the number of computing operations per frame).

Fish  | CPU SINGLE (FPS) | CPU MULTI (FPS) | GPU MULTI (FPS) | Computing Operations
100   | 62 | 62 | 62 | 10000
500   | 62 | 62 | 62 | 250000
1000  | 62 | 62 | 62 | 1000000
1500  | 49 | 61 | 62 | 2250000
2000  | 28 | 55 | 62 | 4000000
2500  | 18 | 42 | 62 | 6250000
3000  | 14 | 30 | 62 | 9000000
3500  | 10 | 23 | 56 | 12250000
4000  | 8  | 20 | 53 | 16000000
4500  | 6  | 17 | 50 | 20250000
5000  | 5  | 14 | 47 | 25000000
5500  | 4  | 12 | 35 | 30250000
6000  | 3  | 10 | 31 | 36000000
6500  | 2  | 8  | 30 | 42250000
7000  | 2  | 7  | 29 | 49000000
7500  | 1  | 7  | 27 | 56250000
8000  | 1  | 6  | 24 | 64000000
8500  | 0  | 5  | 21 | 72250000
9000  | 0  | 5  | 20 | 81000000
9500  | 0  | 4  | 19 | 90250000
10000 | 0  | 3  | 18 | 100000000
10500 | 0  | 3  | 17 | 110250000
11000 | 0  | 2  | 15 | 121000000
11500 | 0  | 2  | 15 | 132250000
12000 | 0  | 1  | 14 | 144000000
13000 | 0  | 0  | 12 | 169000000
14000 | 0  | 0  | 11 | 196000000
15000 | 0  | 0  | 10 | 225000000
16000 | 0  | 0  | 9  | 256000000
17000 | 0  | 0  | 8  | 289000000
18000 | 0  | 0  | 3  | 324000000
19000 | 0  | 0  | 2  | 361000000
20000 | 0  | 0  | 1  | 400000000

Figure 2

Figure 2: Comparison of algorithms.

Laptop Hardware:
CPU: Intel® Core™ i7-3632QM processor, 2.2 GHz with turbo boost up to 3.2 GHz
GPU: NVIDIA GeForce* GT 730M
RAM: 8 GB DDR3*

Detecting Diabetic Retinopathy Using Deep Learning on Intel® Architecture


Abstract

Diabetic retinopathy (DR) is one of the leading causes of preventable blindness and affects people across the globe. Detecting it is a time-consuming, manual process. This experiment aims to automate preliminary DR detection based on a retinal image of a patient's eye. A TensorFlow*-based implementation uses convolutional neural networks to take a retinal image, analyze it, and learn the characteristics of an eye that shows signs of diabetic retinopathy in order to detect the condition. A simple transfer learning approach, using an Inception* v3 model pre-trained on the ImageNet* dataset, was used to train and test on a retina dataset. The experiments were run on Intel® Xeon® Gold processor-powered systems. The tests resulted in a training accuracy of about 83 percent and a test accuracy of approximately 77 percent (refer to Configurations).

Introduction

Diabetic retinopathy (DR) is one of the leading causes of preventable blindness. It affects up to 40 percent of diabetic patients, with nearly 100 million cases worldwide as of 2010. Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment. The objective of this experiment is to develop an automated method for DR screening. Referring eyes that show signs of DR to an ophthalmologist for further evaluation and treatment would help reduce the rate of vision loss by enabling timely and accurate diagnoses.

Continued research in the deep learning space has resulted in the evolution of many frameworks to solve the complex problems of image classification, detection, and segmentation. These frameworks have been optimized for the specific hardware on which they run, for better accuracy, reduced loss, and increased speed. Intel has optimized the TensorFlow* library for better performance on Intel® Xeon® Gold processors. This paper discusses training and inference for the DR detection problem, built using the Inception* v3 architecture with the TensorFlow framework on Intel® processor-powered clusters. A transfer learning approach was used by taking the weights of an Inception v3 model trained on the ImageNet* dataset and using those weights on a retina dataset to train, validate, and test.

Document Content

This section describes in detail the end-to-end steps, from choosing the environment, to running the tests on the trained DR detection model.

Choosing the Environment

Hardware 

The experiments were performed on an Intel Xeon Gold processor-powered system; its configuration is listed in the following table:

Components | Details
Architecture | x86_64
CPU op-mode(s) | 32 bit, 64 bit
Byte order | Little-endian
CPU(s) | 24
Core(s) per socket | Six
Socket(s) | Two
CPU family | Six
Model | 85
Model name | Intel® Xeon® Gold 6128 processor @ 3.40 GHz
RAM | 92 GB

Table 1. Intel® Xeon® Gold processor configuration.

Software 

An Intel® optimized TensorFlow framework along with Intel® Distribution for Python* were used as the software configuration.

Software/Library | Version
TensorFlow* | 1.4.0 (Intel® optimized)
Python* | 3.6 (Intel optimized)

Table 2. On Intel® Xeon® Gold processor.

The listed software configurations were available for the chosen hardware environments, so no source build of TensorFlow was necessary.

Dataset

The dataset is a small, curated subset of images that was created from Kaggle's Diabetic Retinopathy Detection challenge’s train dataset. The dataset contains a large set of high-resolution retina images taken under a variety of imaging conditions. A left and right field is provided for every subject. Images are labeled with a subject ID as well as either left or right (for example, 1_left.jpeg is the left eye of patient ID 1). As the images are from different cameras, they may be of different quality in terms of exposure and focus sharpness. Also, some of the images are inverted. The data also has noise in both images and labels.

The presence of disease in each image is labeled as 0 or 1, as follows:

        0: No Disease

        1: Disease

The dataset provided is split into training set (90 percent files) and test set (10 percent files) for this experiment.

Inception* v3 Architecture

The Inception v3 architecture was designed to improve the utilization of computing resources inside a deep neural network. The main idea behind Inception v3 is to approximate a sparse structure with spatially repeated dense components, and to use dimension reduction, as in a network-in-network architecture, to keep the computational complexity in bounds, but only where required. The computational cost of Inception v3 is also much lower than that of other topologies such as AlexNet, VGGNet*, and ResNet*. More information on Inception v3 is given in Rethinking the Inception Architecture for Computer Vision3. The Inception v3 architecture is shown in the following figure:

Figure 1. Inception* v3 model3.

To accelerate the training process, transfer learning was applied using an Inception v3 model pre-trained on the ImageNet dataset. The pre-trained model has already learned from that data and stored the knowledge in the form of weights. These weights are used directly as initial weights and are readjusted when the model is retrained on the retina dataset. The pre-trained model was downloaded from the link given in reference 4.
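
As a rough illustration of this step, the snippet below downloads and unpacks the archive from reference 4 using only the Python standard library; the local directory name is an assumption for the example.

import os
import urllib.request
import zipfile

MODEL_URL = ("https://storage.googleapis.com/download.tensorflow.org/"
             "models/inception_dec_2015.zip")
MODEL_DIR = "inception"  # assumed local directory for the pre-trained model

os.makedirs(MODEL_DIR, exist_ok=True)
archive_path = os.path.join(MODEL_DIR, "inception_dec_2015.zip")

# Download the pre-trained Inception v3 archive if it is not already present.
if not os.path.exists(archive_path):
    urllib.request.urlretrieve(MODEL_URL, archive_path)

# Extract the frozen graph files next to the archive.
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall(MODEL_DIR)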

Execution Steps

This section describes the steps followed in the end-to-end process of training, validating, and testing the retinopathy detection model on Intel® architecture.

These steps include:

  1. Preparing input
  2. Model training
  3. Inference

Preparing Input

Image Directories

The dataset was downloaded from the Nomikxyz/retinopathy-dataset repository on GitHub1.

  • The files were extracted and separated into different directories based on the DR class.
  • A total of 2,063 images were separated from the master list into diseased and non-diseased folders (a minimal sketch of this sorting step follows this list).
  • There were 1,857 JPEG retina images for training, 206 images for testing, and a .CSV file that records the disease level for the training images.
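
The sketch below illustrates one way this sorting could be scripted. It assumes a labels CSV with image and level columns, as in the original Kaggle challenge (which uses a 0-4 severity scale); the file and folder names are assumptions for the example.

import csv
import os
import shutil

IMAGE_DIR = "train_images"          # assumed folder with the extracted JPEGs
OUTPUT_DIRS = {0: "retina/non_diseased", 1: "retina/diseased"}
LABELS_CSV = "trainLabels.csv"      # columns assumed: image, level

for folder in OUTPUT_DIRS.values():
    os.makedirs(folder, exist_ok=True)

with open(LABELS_CSV) as f:
    for row in csv.DictReader(f):
        # Collapse the 0-4 severity levels into the binary labels used in
        # this experiment: 0 = no disease, 1 = disease.
        label = 0 if int(row["level"]) == 0 else 1
        src = os.path.join(IMAGE_DIR, row["image"] + ".jpeg")
        if os.path.exists(src):
            shutil.copy(src, OUTPUT_DIRS[label])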

Processing and Data Transformations

  • Images in the training and test datasets have very different resolutions, aspect ratios, and colors; they are cropped in various ways, and some are of very low quality or out of focus.
  • To help improve the results during training, the images are augmented through simple distortions such as crops, scales, and flips.
  • Because the images were of varying sizes, they were cropped and resized to 299 pixels wide by 299 pixels high, the input size expected by Inception v3 (see the resizing sketch after this list).
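
The following sketch shows one way the resizing and a simple horizontal-flip augmentation could be done with the Pillow library. It is illustrative only, and the directory names are assumptions; retrain.py can also apply its own distortions when the corresponding flags are set.

import os
from PIL import Image

INPUT_DIR = "retina/diseased"       # assumed input folder
OUTPUT_DIR = "retina/diseased_299"  # assumed output folder
SIZE = (299, 299)                   # input size expected by Inception v3

os.makedirs(OUTPUT_DIR, exist_ok=True)

for name in os.listdir(INPUT_DIR):
    if not name.lower().endswith(".jpeg"):
        continue
    img = Image.open(os.path.join(INPUT_DIR, name)).convert("RGB")
    resized = img.resize(SIZE, Image.BILINEAR)
    resized.save(os.path.join(OUTPUT_DIR, name))

    # Simple augmentation: also save a horizontally flipped copy.
    flipped = resized.transpose(Image.FLIP_LEFT_RIGHT)
    flipped.save(os.path.join(OUTPUT_DIR, "flip_" + name))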

Model Training

Transfer learning reduces training time by starting from a model that is already fully trained on a set of categories, such as ImageNet, and retraining it from the existing weights for new classes. In this experiment, we retrained only the final layer from scratch, leaving all the other layers untouched. The following command accesses the training images and trains the model to detect diseased images.

The retrain.py script was run on the retina dataset as follows:

python retrain.py \
  --bottleneck_dir=bottlenecks \
  --how_many_training_steps=300 \
  --model_dir=inception \
  --output_graph=retrained_graph.pb \
  --output_labels=retrained_labels.txt \
  --image_dir=<>

This script loads the pre-trained Inception v3 model, removes the old top layer, and trains a new top layer on the retina images. Although there were no retina classes or images among the original ImageNet classes when the full network was trained, with transfer learning the lower layers, which have learned to distinguish generic features (for example, edges or color blobs), can be reused for other recognition tasks without modification.
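
For readers who prefer a programmatic view of this idea, the sketch below expresses the same "freeze the base, retrain a new top layer" approach with the Keras API in a recent TensorFlow release; it is a minimal sketch, not the retrain.py implementation used in the experiment, and the retina/ data directory is an assumption.

from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Load Inception v3 with ImageNet weights, dropping its original top layer.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
for layer in base.layers:
    layer.trainable = False  # keep the pre-trained feature extractor frozen

# New top layer for the two retina classes (diseased / non-diseased).
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(2, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# "retina/" is an assumed directory with one subfolder per class.
train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "retina/", target_size=(299, 299), batch_size=100, class_mode="categorical")
model.fit(train_gen, epochs=3)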

Retraining with Bottlenecks

TensorFlow computes all the bottleneck values as the first step of training. In this step, it analyzes every image on disk and calculates its bottleneck values. Bottleneck is an informal term for the layer just before the final output layer that actually performs the classification. This penultimate layer outputs a set of values that is informative enough for the classifier to distinguish between all the classes it has been asked to recognize. Retraining only the final layer works on new classes because the kind of information needed to distinguish between the 1,000 ImageNet classes is often also useful for distinguishing new kinds of objects, such as retinas, traffic signals, or accidents.

The bottleneck values are then cached because they are needed for each iteration of training. Their computation is fast relative to full training because TensorFlow reuses the frozen, pre-trained part of the model. Since every image is reused multiple times during training and calculating each bottleneck takes a significant amount of time, caching these values on disk (in the bottleneck directory) avoids recalculating them repeatedly and speeds up training.
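
The caching idea itself is simple, as the rough sketch below illustrates: compute the penultimate-layer activations once per image, save them to disk, and later train only the small classifier on the cached arrays. It reuses the Keras feature extractor and the assumed paths from the previous sketch and is not the retrain.py implementation.

import os
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.preprocessing import image

# Frozen feature extractor whose pooled output plays the role of the bottleneck.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

BOTTLENECK_DIR = "bottlenecks"   # assumed cache directory
IMAGE_DIR = "retina/diseased"    # assumed image directory
os.makedirs(BOTTLENECK_DIR, exist_ok=True)

for name in os.listdir(IMAGE_DIR):
    cache_path = os.path.join(BOTTLENECK_DIR, name + ".npy")
    if os.path.exists(cache_path):
        continue  # bottleneck already cached; skip the expensive forward pass
    img = image.load_img(os.path.join(IMAGE_DIR, name), target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    np.save(cache_path, base.predict(x)[0])  # 2048-dim bottleneck vector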

Training

After the bottleneck computation is complete, the actual training of the top layer of the network begins. During the run, the following outputs show the progress of the training:

  • Training accuracy is the percentage of images in the current training batch that were labeled with the correct class.
  • Validation accuracy is the percentage of correctly labeled images in a randomly selected group of images drawn from a separate set.
  • Cross entropy is the loss function that indicates how well the learning process is progressing (a small numeric sketch of these metrics follows this list).
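
To make these metrics concrete, here is a small NumPy sketch, with made-up numbers, of how accuracy and cross entropy are computed for a batch of two-class predictions; it is purely illustrative.

import numpy as np

# Hypothetical softmax outputs for a batch of 4 images (columns: non-diseased, diseased).
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7],
                  [0.6, 0.4],
                  [0.2, 0.8]])
labels = np.array([0, 1, 1, 1])  # ground-truth class indices

# Accuracy: fraction of images whose highest-probability class matches the label.
accuracy = np.mean(np.argmax(probs, axis=1) == labels)

# Cross entropy: average negative log-probability assigned to the true class.
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print("accuracy:", accuracy)          # 0.75 for this batch
print("cross entropy:", cross_entropy)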

Training was run on the 2,063 images with a batch size of 100 for 300 steps/iterations, and we observed a training accuracy of 83.0 percent (refer to Configurations).

Testing

We ran label_image.py with the trained model on the 206 test images using the following script and observed a testing accuracy of about 77.2 percent.

python -m scripts.label_image \
    --graph=tf_files/retrained_graph.pb  \
    --image=<>
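
For reference, a minimal TensorFlow 1.x sketch of what such a classification step looks like is shown below. The tensor names (DecodeJpeg/contents:0 and final_result:0) and the test image path are assumptions based on the graph that the stock retrain.py script produces.

import tensorflow as tf

GRAPH_PATH = "tf_files/retrained_graph.pb"
LABELS_PATH = "tf_files/retrained_labels.txt"
IMAGE_PATH = "test/1_left.jpeg"  # assumed test image

# Load the retrained, frozen graph.
with tf.gfile.FastGFile(GRAPH_PATH, "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

labels = [line.strip() for line in tf.gfile.GFile(LABELS_PATH)]

with tf.Session() as sess:
    image_data = tf.gfile.FastGFile(IMAGE_PATH, "rb").read()
    softmax = sess.graph.get_tensor_by_name("final_result:0")
    predictions = sess.run(softmax, {"DecodeJpeg/contents:0": image_data})[0]

    # Print the probability for each class (diseased / non-diseased).
    for label, prob in sorted(zip(labels, predictions), key=lambda p: -p[1]):
        print("%s: %.3f" % (label, prob))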

Figure 2. Diseased versus non-diseased probability.

Conclusion

In this paper, we explained how retinopathy detection was trained and tested using transfer learning, where the weights of an Inception v3 model trained on the ImageNet dataset were reused. These weights were readjusted when the model was retrained in an Intel Xeon Gold processor-powered environment. The experiment can be extended by applying different optimization algorithms, changing learning rates, and varying input sizes to further improve accuracy.

About the Author

Lakshmi Bhavani Manda and Ajit Kumar Pookalangara are part of the Intel team working on artificial intelligence (AI) evangelization.

Configurations 

For the performance figures referenced in the Abstract and Training sections:

        Hardware: refer to Hardware under Choosing the Environment

        Software: refer to Software under Choosing the Environment

        Test performed: executed on the remaining 10 percent of the images using the trained model

For more information, go to the Product Performance site.

References

1. Curated dataset:
https://github.com/Nomikxyz/retinopathy-dataset

2. TensorFlow for Poets tutorial:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets

3. Rethinking the Inception Architecture for Computer Vision:
https://arxiv.org/pdf/1512.00567v3.pdf

4. Pre-trained Inception v3 model:
https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015.zip

Related Resources

TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Build and Install TensorFlow* on Intel® Architecture: https://software.intel.com/en-us/articles/build-and-install-tensorflow-on-intel-architecture
