
IoT: Three Launches Within One Week!


I want to give a big THANK YOU to our rock stars, Claire Conant, Oksana Benedick, Jeremy Crawford, and Tracy Johnson for the three recent launches in the IoT segment, which include the Intel® Cyclone® 10 LP FPGA Evaluation Kit, the UP Squared* Grove* IoT Development Kit, and Arduino* Create. I appreciate the many weeks of hard work and the excellent collaboration that made this a success! Go team!!!

Intel® Cyclone® 10 LP FPGA Evaluation Kit (Launched October 26, 2017)

UP Squared* Grove* IoT Development Kit (Launched November 2, 2017)

Arduino* Create (Launched November 2, 2017)

Thank you!

Nam Ngo | Project Manager


Hyperscan and Snort* Integration


Abstract

Hyperscan, an advanced regular expression matching library, is well suited to network solutions such as deep packet inspection (DPI), intrusion prevention systems (IPS), intrusion detection systems (IDS), and next-generation firewalls (NGFW).

Snort* is one of the most widely used open source IDS/IPS products, the core part of which involves a large amount of literal and regular expression matching work. This article describes the integration of Hyperscan to Snort to improve its overall performance. The integration code is available under Downloads at 01.org's Hyperscan site.

Snort* Introduction

Figure 1: The Architecture of Snort*.

As shown in Figure 1, Snort has five major parts. The packet decoder receives packets from different network interfaces and performs an initial analysis of them. The preprocessor is a plug-in for further processing of the decoded packets; its functions include HTTP URI normalization, packet defragmentation, TCP flow reassembly, and so on. The core of Snort is the detection engine, which matches packets against the configured rules. Rule matching is critical to the overall performance of Snort. Once a match succeeds, the detection engine notifies the logging and alerting system based on the behavior defined in the rules, and the system outputs the alert or log accordingly. Users can also define an output module to save alerts or logs in a specific form, such as a database or XML file.

Integration of Hyperscan

Figure 2: Hyperscan and Snort* Integration.

As shown in Figure 2, Hyperscan and Snort integration focuses on four aspects:

Literal Matching

Rules can specify particular literals to match, and Snort searches packet payloads for these literals using the Boyer-Moore algorithm. We replaced this algorithm with Hyperscan to improve matching performance.

PCRE Matching

Snort uses Perl Compatible Regular Expressions (PCRE) as its regular expression matching engine. Hyperscan supports most PCRE syntax, but not a few backtracking and assertion constructs. However, Hyperscan provides a PCRE prefiltering capability: such PCRE rules can be transformed into a form that Hyperscan can compile. The matches produced by the actual rule are a subset of the matches generated by the transformed rule; therefore, Hyperscan can be used as a prefilter. If the prefilter produces no matches, the actual rule cannot match either. If there is a match, a PCRE scan confirms whether the match is real. Because Hyperscan's overall performance is much higher than PCRE's, prefiltering avoids the excessive cost of running PCRE matching on every packet.
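
To make the prefilter flow concrete, here is a minimal C sketch (illustrative only, not the actual Snort integration code): the rule is compiled with Hyperscan's HS_FLAG_PREFILTER, and the exact PCRE is evaluated only when Hyperscan reports a candidate match. The helper names and the simplified error handling are our own.

    #include <stdio.h>
    #include <hs.h>      /* Hyperscan */
    #include <pcre.h>    /* PCRE, used to confirm candidate matches */

    /* Context shared with the Hyperscan match callback. */
    struct prefilter_ctx {
        pcre *re;          /* compiled exact PCRE rule */
        const char *data;  /* packet payload being scanned */
        int length;
        int confirmed;     /* set when PCRE confirms a real match */
    };

    /* Called by Hyperscan when the prefilter pattern reports a candidate:
     * fall back to an exact PCRE scan of the payload to confirm it. */
    static int on_candidate(unsigned int id, unsigned long long from,
                            unsigned long long to, unsigned int flags, void *ctx)
    {
        struct prefilter_ctx *p = ctx;
        int ovector[30];
        if (p->re && pcre_exec(p->re, NULL, p->data, p->length,
                               0, 0, ovector, 30) >= 0)
            p->confirmed = 1;
        return 1;  /* non-zero return stops scanning after the first candidate */
    }

    /* Returns 1 on a confirmed match, 0 on no match, -1 on a compile error. */
    int scan_with_prefilter(const char *pattern, const char *packet, int len)
    {
        hs_database_t *db = NULL;
        hs_compile_error_t *cerr = NULL;
        if (hs_compile(pattern, HS_FLAG_PREFILTER | HS_FLAG_SINGLEMATCH,
                       HS_MODE_BLOCK, NULL, &db, &cerr) != HS_SUCCESS) {
            fprintf(stderr, "hs_compile: %s\n", cerr->message);
            hs_free_compile_error(cerr);
            return -1;
        }

        const char *pcre_err;
        int pcre_off;
        struct prefilter_ctx ctx = {
            pcre_compile(pattern, 0, &pcre_err, &pcre_off, NULL), packet, len, 0
        };

        hs_scratch_t *scratch = NULL;
        hs_alloc_scratch(db, &scratch);

        /* Prefilter pass: if Hyperscan finds nothing, the exact rule cannot match. */
        hs_scan(db, packet, (unsigned int)len, 0, scratch, on_candidate, &ctx);

        hs_free_scratch(scratch);
        hs_free_database(db);
        if (ctx.re)
            pcre_free(ctx.re);
        return ctx.confirmed;
    }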

Multiple Literal Matching

Another important matching process in Snort is multiple literal matching, which filters out rules that cannot possibly match so that fewer rules need to be checked individually in detail, thereby improving overall matching performance. Snort uses the Aho-Corasick algorithm for multiple literal matching. We replaced this algorithm with Hyperscan and improved performance significantly.
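
As a hedged illustration of the API involved (not the actual Snort fast-pattern matcher), the C program below compiles a few made-up literals into one Hyperscan database with hs_compile_multi() and scans a buffer, reporting the rule-group ID of every hit:

    #include <stdio.h>
    #include <hs.h>

    /* Reports the rule-group ID of every literal that matched. */
    static int on_literal(unsigned int id, unsigned long long from,
                          unsigned long long to, unsigned int flags, void *ctx)
    {
        printf("literal for rule group %u matched ending at offset %llu\n", id, to);
        return 0;  /* keep scanning: we want every candidate rule group */
    }

    int main(void)
    {
        /* Note: hs_compile_multi() interprets patterns as regular expressions,
         * so literal strings containing regex metacharacters must be escaped. */
        const char *literals[] = { "GET /admin", "cmd\\.exe", "union select" };
        unsigned int ids[]     = { 101, 102, 103 };
        unsigned int flags[]   = { HS_FLAG_CASELESS, HS_FLAG_CASELESS, HS_FLAG_CASELESS };

        hs_database_t *db = NULL;
        hs_compile_error_t *err = NULL;
        if (hs_compile_multi(literals, flags, ids, 3, HS_MODE_BLOCK,
                             NULL, &db, &err) != HS_SUCCESS) {
            fprintf(stderr, "hs_compile_multi: %s\n", err->message);
            hs_free_compile_error(err);
            return 1;
        }

        hs_scratch_t *scratch = NULL;
        hs_alloc_scratch(db, &scratch);

        const char payload[] = "POST /index HTTP/1.1 ... UNION SELECT password ...";
        hs_scan(db, payload, sizeof(payload) - 1, 0, scratch, on_literal, NULL);

        hs_free_scratch(scratch);
        hs_free_database(db);
        return 0;
    }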

HTTP Preprocessing

In addition to the integration of matching algorithms for the detection engine, Hyperscan is also applied in the preprocessor. When doing HTTP preprocessing, we use Hyperscan to search keywords to further accelerate the preprocessing.

Performance

Snort version: 2.9.8.2
Hyperscan version: 4.3.1
Snort ruleset: Snortrules-snapshot-2983

Table 1: Snort and Hyperscan software setup.

Our performance testing was done on Snort 2.9.8.2 and Hyperscan 4.3.1, as shown in Table 1, with the default Snort ruleset Snortrules-snapshot-2983, which has 8863 rules. HTTP enterprise traffic was sent from an IXIA* traffic generator to Snort during testing. Figure 3 shows the single-core, single-thread performance comparison between the original Snort and the Hyperscan-accelerated Snort on an Intel® Xeon® processor E5-2658 v4 @ 2.30 GHz. We can see that Hyperscan greatly improves the matching performance of Snort: overall performance is about six times higher than that of the original Snort.

We compared the memory footprint of the original Snort and the Snort optimized by Hyperscan. The original Snort has to convert all the rules into a Trie structure for the Aho-Corasick algorithm, and this takes up a lot of memory. Hyperscan has its own optimized matching engine, which greatly reduces memory consumption during the matching process. As shown in Figure 4, in this test the overall memory footprint of the original Snort is more than 12 times larger than that of Snort optimized by Hyperscan.

Figure 3: Performance comparison between original Snort* and Hyperscan-integrated Snort.

 

Figure 4: Memory consumption comparison between original Snort* and Hyperscan-integrated Snort.

Conclusion

Hyperscan-integrated Snort outperforms the original Snort in both overall performance and memory footprint. Hyperscan has demonstrated its strength in large-scale rule matching, which makes it very suitable for products based on rule matching, such as DPI, IDS, IPS, and NGFW.

About the Author

Xiang Wang is a software engineer working on Hyperscan at Intel. His major areas of focus include automata theory and regular expression matching. He works on a pattern matching engine optimized by Intel architecture that is at the core of DPI, IDS, IPS, and firewalls in the network security domain.

Intel® MPI Library Release Notes

$
0
0

This page provides the current Release Notes for Intel® MPI Library. The notes are categorized by year, from newest to oldest, with individual releases listed within each year.

Click a version to expand it into a summary of new features and changes in that version since the last release, and access the download buttons for the detailed release notes, which include important information, such as pre-requisites, software compatibility, installation instructions, and known issues.

You can copy a link to a specific version's section by clicking the chain icon next to its name.

All files are in PDF format - Adobe Reader* (or compatible) required.
To get product updates, log in to the Intel® Software Development Products Registration Center.
For questions or technical support, visit Intel® Software Developer Support.

2018

Update 1

Linux* Release Notes | Windows* Release Notes

  • Startup performance improvements.
Initial Release

Linux* Release Notes | Windows* Release Notes

  • Hydra startup improvements.
  • Improved support for Intel® Omni-Path Architecture.
  • Removed support for the Intel® Xeon Phi™ coprocessor (code named Knights Corner).
  • New deprecations.

2017

Update 4

Linux* Release Notes | Windows* Release Notes

  • Performance tuning for processors based on Intel® microarchitecture codenamed Skylake and for Intel® Omni-Path Architecture.
  • Deprecated support for the IPM statistics format.
Update 3

Linux* Release Notes | Windows* Release Notes

  • Hydra startup improvements.
  • Default fabrics list change.
Update 2

Linux* Release Notes | Windows* Release Notes

  • New environment variables: I_MPI_HARD_FINALIZE and I_MPI_MEMORY_SWAP_LOCK.
Update 1

Linux* Release Notes | Windows* Release Notes

  • PMI-2 support for SLURM*, improved SLURM support by default.
  • Improved mini-help and diagnostic messages, man1 pages for mpiexec.hydra, hydra_persist, and hydra_nameserver.
  • New deprecations.
Initial Release

Linux* Release Notes | Windows* Release Notes

  • Support for the MPI-3.1 standard.
  • New topology-aware collective communication algorithms.
  • Effective MCDRAM (NUMA memory) support.
  • Controls for asynchronous progress thread pinning.
  • Performance tuning.
  • New deprecations.

5.1

Update 3 Build 223

Linux* Release Notes

  • Fix for issue with MPI_Abort call on threaded applications (Linux* only)
Update 3

Linux* Release Notes | Windows* Release Notes

  • Fixed shared memory problem on Intel® Xeon Phi™ processor (codename: Knights Landing)
  • Added new algorithms and selection mechanism for nonblocking collectives
  • Added new psm2 option for Intel® Omni-Path fabric
  • Added I_MPI_BCAST_ADJUST_SEGMENT variable to control MPI_Bcast
  • Fixed long count support for some collective messages
  • Reworked the binding kit to add support for Intel® Many Integrated Core Architecture and support for ILP64 on third party compilers
  • The following features are deprecated in this version of the Intel MPI Library. For a complete list of all deprecated and removed features, visit our deprecation page.
    • SSHM
    • MPD (Linux*)/SMPD (Windows*)
    • Epoll
    • JMI
    • PVFS2
Update 2

Linux* Release Notes | Windows* Release Notes

  • Intel® MPI Library now supports YARN* cluster manager (Linux* only)
  • DAPL library UCM settings are automatically adjusted for MPI jobs of more than 1024 ranks, resulting in more stable job start-up (Linux* only)
  • ILP64 support enhancements, support for MPI modules in Fortran 90
  • Added the direct receive functionality for the TMI fabric (Linux* only)
  • Single copy intra-node communication using Linux* supported cross memory attach (CMA) is now default (Linux* only)
Update 1

Linux* Release Notes | Windows* Release Notes

  • Changes to the named-user licensing scheme. See more details in the Installation Instructions section of the Intel® MPI Library Installation Guide.
  • Various bug fixes for general stability and performance.
Initial Release

Linux* Release Notes | Windows* Release Notes

  • Added support for OpenFabrics Interface* (OFI*) v1.0 API
  • Added support for Fortran* 2008
  • Updated the default value for I_MPI_FABRICS_LIST
  • Added brand new Troubleshooting chapter to the Intel® MPI Library User's Guide
  • Added new application-specific features in the Automatic Tuner and Hydra process manager
  • Added support for the MPI_Pcontrol feature for improved internal statistics
  • Increased the possible space for MPI_TAG
  • Changed the default product installation directories
  • Various bug fixes for general stability and performance

Winners Announced! Highly Talented CERN Student Researchers Innovate on HPC, AI & IoT in the Modern Code Developer Challenge


CERN openlab and Intel are pleased to announce the winners of the Intel® Modern Code Developer Challenge! The announcement was made today at ‘SC17’, the International Conference for High Performance Computing, Networking, Storage, and Analysis, in Denver, Colorado, USA.


Two winners were selected: Elena Orlova, for her work on improving particle collision simulation algorithms, and Konstantinos Kanellis, for his work on cloud-based biological simulations.

A Challenge for CERN Openlab Summer Students

CERN openlab is a unique public-private partnership between CERN and leading companies, helping accelerate development of the cutting-edge ICT solutions that make the laboratory’s ground-breaking physics research possible. Intel has been a partner in CERN openlab since 2001, when the collaboration was first established.

Each year, CERN openlab runs a highly competitive summer-student programme that sees 30-40 students from around the world come to CERN for nine weeks to do hands-on ICT projects involving the latest industry technologies.

This year, five students were selected to take part in the Intel® Modern Code Developer Challenge. This competition showcases the students’ blogs about their projects — all of which make use of Intel technologies or are connected to broader collaborative initiatives between Intel and CERN openlab. You can find additional information about these projects on a dedicated website that also features audio and video interviews.

“We are thrilled to support these students through the Modern Code Developer Challenge,” says Michelle Chuaprasert, Director, Developer Relations Division at Intel. “Intel's partnership with CERN openlab is part of our continued commitment to education and building the next generation of scientific coders that are using high-performance computing, artificial intelligence, and Internet-of-things (IoT) technologies to have a positive impact on people’s lives across the world.”

Selecting a Winner from Five Challenging Projects

The competition featured students working on exciting challenges within both high-energy physics and other research domains.


At the start of the challenge, the plan was for a panel of judges to select just one of the five students as the winner and to invite said winner to present their work at the SC17 conference. However, owing to the high quality of the students’ work, the judges decided to select two winners, both of whom received full funding from Intel to travel to the USA and present their work. 

Smash-simulation Software

Elena Orlova, a third-year student in applied mathematics from the Higher School of Economics in Moscow, Russia, was selected as one of the two winners. Her work focused on teaching algorithms to be faster at simulating particle-collision events.

Physicists widely use a software toolkit called GEANT4 to simulate what will happen when a particular kind of particle hits a particular kind of material in a particle detector. This toolkit is so popular that researchers use it in other fields to predict how particles will interact with other matter, such as in assessing radiation hazards in space, in commercial air travel, in medical imaging, and in optimizing scanning systems for cargo security.

An international team, led by researchers at CERN, is developing a new version of this simulation toolkit known as GeantV. This work is supported by a CERN openlab project with Intel on code modernization. GeantV will improve physics accuracy and boost performance on modern computing architectures.

The team behind GeantV is implementing a deep learning tool intended to make simulations faster. Orlova worked to write a flexible mini-application to help train the deep neural network on distributed computing systems.

“I’m really glad to have had this opportunity to work on a breakthrough project like this with such cool people,” says Orlova. “I’ve improved my skills, gained lots of new experience, and have explored new places and foods. I hope my work will be useful for further research.”

Cells In the Cloud


Konstantinos Kanellis, a final-year undergraduate in the Department of Electrical and Computer Engineering at the University of Thessaly, Greece, is the other Modern Code Developer Challenge winner, for his work related to BioDynaMo. BioDynaMo is one of CERN openlab’s knowledge-sharing projects (another part of CERN openlab’s collaboration with Intel on code modernization). The project’s goal is to develop methods for ensuring that scientific software makes full use of the computing potential offered by today’s cutting-edge hardware technologies. This joint effort by CERN, Newcastle University, Innopolis University, and Kazan Federal University aims to design and build a scalable and flexible platform for rapid simulation of biological tissue development.

The project focuses initially on the area of brain tissue simulation, drawing inspiration from existing, but low-performance, software frameworks. By using the code to simulate development of both normal and diseased brains, neuroscientists hope to learn more about the causes of — and identify potential treatments for — disorders such as epilepsy and schizophrenia.

In late 2015 and early 2016, the algorithms, originally written in Java*, were ported to C++. Once porting was completed, work began to optimise the code for modern computer processors and co-processors. However, to address ambitious research questions, more computational power was needed. Future work will attempt to adapt the code to run on high-performance computing resources over the cloud.

Kanellis’s work focused on adding network support for the single-node simulator and prototyping the computation management across many nodes. “Being a summer student at CERN was a rich and fulfilling experience. It was exciting to work on an interesting and ambitious project like BioDynaMo,” says Kanellis. “I feel honoured to have been chosen as a winner, and that I've managed to deliver something meaningful that can make an impact in the future.”

ICT Stars of the Future


Alberto Di Meglio, head of CERN openlab, will present more details about these projects, as well as details about the entire competition, in a talk at SC17. The other three projects featured in the challenge focused on using machine learning techniques to better identify the particles produced by collision events, integrating IoT devices into the control systems for the LHC, and helping computers get better at recognising objects in satellite maps created by UNITAR, a UN agency hosted at CERN.

“Training the next generation of developers — the people who can produce the scientific code that makes world-leading research possible — is of paramount importance across all scientific fields,” says Di Meglio. “We’re pleased to partner with Intel on this important cause.”

For more information, please visit the Intel® Modern Code Developer Challenge website. Also, if you’re a student and are interested in joining next year’s CERN openlab Summer Student Programme, please visit the dedicated page on our website (applications will open in December).

Missing components when installing Intel Parallel Studio XE 2018 update 1 via Intel Software Manager


Description: When attempting to install Intel® Parallel Studio XE 2018 Update 1 using the Intel® Software Manager, if not all required components are found, the installer installs only the available components and skips those that were not found.

To install all required components, use one of the following:

Workaround #1

  • Close the installer
  • Use the “Download Only” button in Intel® Software Manager to download the full web image
  • Press the “Install” button in Intel® Software Manager after the download has finished

Workaround #2

  • Close the installer
  • Go to the <user\home>\Downloads\Intel\parallel_studio_xe_2018_update1_*_edition_online_setup_image\config\ folder and remove the file “offline_installation.ind”
  • Launch the installer again from <user\home>\Downloads\Intel\parallel_studio_xe_2018_update1_*_edition_online_setup_image\setup.exe
  • The installer will download all missing components

Workaround #3

  • Close the installer
  • Go to the Intel® Registration Center website (http://registrationcenter.intel.com) and download the online or offline installer of the required product update.

Have questions?

Check out the Installation FAQ
See our Get Help page for your support options

VR Developer Tutorial: Testing and Profiling the Premium VR Game


Introduction

In terms of processing power and user experience, virtual reality (VR) systems fall into three types: premium, mainstream, and entry-level. Premium VR represents high-end VR and includes products on the market that rely on highly configured, high-performance PCs or game consoles. The main VR peripherals that support premium VR are HTC Vive*, Oculus Rift*, and Sony PlayStation* VR.

The hardware performance of mainstream VR does not match that of high-end VR, but it still uses PC processors for VR computing. Entry-level VR comprises mobile VR devices, such as Gear* VR and Google Cardboard*, and includes VR glasses and all-in-one headsets that use mobile phone chips as their computing devices.

This article describes methods for testing and profiling VR games based on HTC Vive* and Oculus Rift on a PC. Compared to traditional PC games, VR games differ in gameplay design, input mode, and performance requirements. Gameplay and input are not within the scope of this tutorial; instead, we look at how the performance requirements of a VR game differ from those of a traditional game.

The number of pixels processed per second is an important measure of the VR experience. The screen resolution of the current HTC Vive and Oculus Rift CV1 is 2160x1200, and the actual rendering must supersample to offset the resolution loss caused by lens distortion; for the HTC Vive and Oculus Rift CV1, the required oversampling is as high as 140 percent. As a result, VR pixel throughput reaches a surprising 457 million pixels per second.
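
To see roughly where that figure comes from (assuming the 140 percent oversampling applies to each axis of the render target), the arithmetic is:

    $2160 \times 1.4 \;\times\; 1200 \times 1.4 \;\times\; 90\ \text{FPS} \approx 4.57 \times 10^{8}$ pixels per second.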

Performance testing and analysis are important parts of VR game development. These tasks help achieve the necessary requirements and ensure full utilization of the CPU and GPU processing capabilities. Before testing, disable Asynchronous Spacewarp (ASW) in the Oculus runtime and asynchronous reprojection in SteamVR* so that the runtimes' frame-rate compensation does not interfere with the subsequent performance analysis.

To disable ASW, in the Oculus SDK, you can open and run Program Files\Oculus\Support\oculus-diagnostics\OculusDebugTool.exe.


Figure 1. Asynchronous Spacewarp configuration.

To disable the reprojection function in SteamVR, use Settings/Performance.


Figure 2. SteamVR* configuration.

Tools for VR

Software tools play a vital role in testing and analyzing VR games. The main tools for these tasks include Fraps*, GamePP*, Unreal Engine* console commands, the Windows* Assessment and Deployment Kit (ADK), SteamVR frame timing, and Intel® Graphics Performance Analyzers (Intel® GPA).

The Fraps FPS (Frames Per Second) Counter

The Fraps FPS (frames per second) counter is a traditional test and frame-time tool, which developers can use to test the maximum frame rate, minimum frame rate, and the average frame rate over a period of time. The results can be easily imported into an Excel* file to generate graphics (see the top of Figure 3). As shown in the graph, we can see whether the frame rate change is smooth throughout the entire process. In addition, Fraps is handy for taking screenshots, which can be saved for reporting purposes.



Figure 3. (Top) Fraps FPS (frames per second) shows the frame rate change over time. (Bottom) The maximum frame rate, minimum frame rate, and average frame rate of the game’s frame rate change generated in Fraps over time.

The bottom of Figure 3 shows the maximum, minimum, and average frame rates that Fraps recorded over the test period.

As the data shows, the frame rate of this VR game scene is low most of the time—only about 45 FPS—and does not meet HTC's requirements. At this level of performance, the player will experience dizziness or motion sickness. (To prevent such discomfort, HMD manufacturers require VR games to maintain a stable frame rate of 90 FPS.) At this point, we can use GPUVIEW in the Windows ADK to determine whether the problem is due to the GPU or the CPU.

Benchmark Utility GamePP*

Although Fraps is freeware, it has not been updated for a long time. GamePP* is a similar benchmark utility from China (http://gamepp.com/) that you can use instead. When this utility runs, a tool window is automatically displayed at the top of the game window, showing FPS, CPU temperature, GPU utilization, CPU, graphics card, and memory usage, and so on, in real time (see Figure 4). Another limitation of Fraps is that it cannot be used to test DirectX* 12 games; for those, you can use another tool, PresentMon, to collect FPS data.


Figure 4. GamePP* real-time data display interface.

The tool window at the top of the game window provides real-time monitoring of the running game and its performance. But both Fraps and the GamePP utility are designed for traditional games and can only be displayed on a monitor.

Unreal Engine Tool

VR gamers wearing helmets or head-mounted displays (HMDs), who cannot see a game’s data changes in real time on the monitor, have two options if they want to see real-time performance data in a helmet:

  • If the game is based on Unreal Engine, use the stat FPS console command.
  • Use the SteamVR frame timing method, which can see real-time performance data in both the helmet and the display monitor. The SteamVR frame timing settings can be found here: https://developer.valvesoftware.com/wiki/SteamVR/Frame_Timing

Figure 5 shows the SteamVR frame timing data.


Figure 5. The missed frame in the head-mounted display.

When dropped frames occur in a game scene, a Missed Frames box is displayed on the HMD, as shown in Figure 5. The denser the red bars in this box, the more frequently frames were dropped.


Figure 6. The CPU and GPU running on the PC.

Figure 6 shows more detailed data on the PC display. You can also show this readout in the HMD by enabling the headset display option. In the data readout, blue indicates the GPU rendering time and tan indicates the GPU idle time.

As shown in Figure 6, the GPU rendering time of some frames exceeds 11.11 milliseconds (ms); those frames miss the Vsync deadline and become missed frames, so they cannot reach 90 FPS. The SteamVR frame timing tool tells us more about being GPU bound, but it cannot determine whether the CPU render thread failed to submit rendering commands in time (creating a GPU bubble) or the GPU rendering time itself was too long.

Unreal Engine's Console Command Tool

If the game was developed with Unreal Engine 4 and is a development build—not a release build—you can view the game's real-time performance data using the engine's console commands.

You can press the ~ button in the game to display the command line window. The following are some of the common console commands:

  • Stat FPS: Displays FPS per frame and frame time. This command is easy to use in VR because the frame rate is displayed on the VR headset. It is convenient for testers to observe the real-time performance of the game in use.
  • Stat Unit: Shows the total time of each frame in the game, time consumed by the game logic thread, game render thread time consumption, and GPU time consumption (see Figure 7). In general, if the time consumption of a frame is close to that of the logical thread, the bottleneck is in the logical threads. However, if the time consumption of a frame is close to the rendering thread time, the bottleneck is in the rendering thread. If both times are closer to the GPU time, the bottleneck is on the graphics card.


Figure 7. Screenshot of the Stat Unit command.

Stat SceneRendering: Shows the various parameter values on the game’s render thread (see Figure 8).


Figure 8. Screenshot of the stat SceneRendering command.

Stat Game: Shows the real-time view of parameter values running on the game logic thread, such as artificial intelligence (AI), physics, blueprint, memory allocation, and so on (see Figure 9).


Figure 9. Screenshot of the Stat Game command.

Stat GPU: Shows the time parameters of the GPU main render content in each frame in real time (see Figure 10).


Figure 10. Screenshot of the Stat GPU command.

Stat InitViews: Shows the time and efficiency data that culling takes (see Figure 11).


Figure 11. Screenshot of the Stat InitViews command.

Stat LightRendering: Displays the render time required for lighting and shading (see Figure 12).


Figure 12. Screenshot of the Stat LightRendering command.

Additional stat commands can be referenced from the official Unreal Engine webpages.

GPUVIEW Tool

Further analysis can be done using GPUVIEW and Windows Performance Analyzer (WPA) in the Windows ADK. GPUVIEW is a powerful tool (for more information, refer to https://graphics.stanford.edu/~mdfisher/GPUView.html).

Of the above commands, Stat Unit gives a preliminary indication of whether a frame is GPU bound or CPU bound. Sometimes the results are misleading, such as when a CPU thread causes a bubble in the middle of a GPU frame. In that case, the GPU rendering time reported by Stat Unit is actually the sum of the real rendering time and the bubble time, and both GPUVIEW and WPA analysis are needed.

For example, as shown in Figure 13, the middle of each frame has a 2 ms bubble during which the GPU is idle. The frame's actual rendering time is less than 11 ms, but with the bubble it exceeds the 11.1 ms budget required for 90 FPS, so the following frame misses the Vsync deadline and a frame drop occurs.


Figure 13. GPUVIEW interface: The middle of each frame has a 2 ms bubble

At this point, we can open the same Merged.etl file in WPA and use the timeline to find the bubble's time window, then locate which CPU thread is busiest and what is running on that thread at that time (see Figure 14).


Figure 14. Windows* Performance Analyzer interface: the time window of the bubble through the timeline

If GPUVIEW shows that the GPU rendering time of a frame is more than 11.11 ms, the frame is GPU bound, and Intel® GPA can then be used to analyze which parts of the pipeline are overloaded.

Intel® GPA

Intel GPA is a powerful, free graphics performance analyzer tool, which can be downloaded at https://software.intel.com/en-us/gpa. Intel GPA includes the following independent tools:

  • System Analyzer: Real-time display of game performance indicators.
  • Graphics Frame Analyzer: The frame analyzer.
  • Platform Analyzer: Locates the CPU and GPU workloads.
  • Graphics Trace Analyzer: Captures detailed event traces for other analyzer analysis.

The Intel GPA Graphics Frame Analyzer is used in conjunction with GPUVIEW. With it you can inspect the draw calls, render targets, texture maps, overdraw, and shaders of a given frame. By simplifying the shaders, you can design experiments to detect which part of the rendering affects performance and identify the key areas to optimize (see Figure 15).


Figure 15. Interface of the Intel® Graphics Performance Analyzers Graphics Frame Analyzer.

Case Study

Let’s use an example to showcase how we can test and analyze a VR game.

You can use the dxdiag command to view the machine configuration before testing:

CPU: Intel® Core™ i7-6700K processor, 4.00 GHz
GPU: NVIDIA GeForce* GTX 1080
Memory: 1x8 GB DDR3
OS: Windows* 10 Pro 64-bit (10.0, Build 10586)
Driver: 22.21.13.8233

First, we run the Fraps test for a period of time and plot the frame rate changes. As shown in Figure 16, some scenes in the first half of the test reach 90 FPS, but for most of the latter half the frame rate fluctuates around 45 FPS, which does not meet the required standard. Further analysis is required.


Figure 16. FPS display.

Use the Unreal Engine console stat FPS command to find a scene with a low frame rate and analyze it. If the game is changing too fast to grab the data, you can use the console command PAUSE to pause the game, which makes it easier to open the tools you need. Combined with the values reported by the stat Unit command, there are probably bottlenecks in both the CPU rendering thread and the GPU (see Figure 17).


Figure 17. Screenshot of stat unit command.

We must use GPUVIEW and WPA for simultaneous analysis.


Figure 18. GPU rendering time.

The first thing we can see from GPUVIEW is that the rendering time of a frame is 13.69 ms, over 11.11 ms, so the performance is not likely to reach 90 FPS (see Figure 18).

Next, we see that there is about 10 ms on the CPU where only the audio thread is running (see Figure 19). The other threads are essentially idle, which means the CPU resources are not being fully used; this leaves headroom to spend CPU time on richer effects, such as more AI, physics, materials, or particle effects.


Figure 19. CPU idle time.

WPA confirms this: the game and render threads are basically idle.


Figure 20. CPU running thread displayed on the Windows* Performance Analyzer.

Using Intel GPA, you can see that there are fewer than 1,000 draw calls, which is a reasonable number.


Figure 21. All the draw calls in the Intel® Graphics Performance Analyzers Frame Analyzer.

Select all the render targets and run the Frame Analyzer experiments to get a rough picture of where the frame's time is spent.

Test Target            Before the Test    After the Test
2x2 Textures           60.4 FPS           71.4 FPS
1x1 Scissor Rect       60.4 FPS           160.9 FPS
Simple Pixel Shader    60.4 FPS           133 FPS

The 2x2 textures experiment uses simple textures instead of the textures in the real scene. The experiment shows that simple textures do not bring a significant performance improvement, so texture optimization can be ignored.

The 1x1 scissor rect experiment removes the pixel processing stage from the rendering pipeline. This experiment shows a significant performance improvement.

The simple pixel shader experiment, as the name suggests, replaces the original pixel shaders with a simplified one, and again the experiment shows a large performance improvement.

The experiments above show that the pixel processing work in the GPU rendering pipeline is rather heavy. Another way to see which operations in a frame take most of the time is to enter ToggleDrawEvents at the Unreal Engine 4 console.

This command attaches the specific function names to each draw call; after capturing a frame with Intel GPA, the time spent by each draw call is shown in the Intel GPA Frame Analyzer.


Figure 22. Intel® Graphics Performance Analyzers shows the execution functions of each Draw call.

Below is a table of a few time-consuming modules. This information will help you focus on the modules that have a high time ratio and selectively do the optimization.

Module            Time Spent Ratio
BasePass          15 percent
Lights            32.4 percent
PostProcessing    15.9 percent

For more detailed Intel GPA analysis, please refer to another article from Intel® Developer Zone: https://software.intel.com/en-us/android/articles/analyze-and-optimize-windows-game-applications-using-intel-inde-graphics-performance

Summary

When the hardware is not yet fully capable of delivering the performance required for an immersive VR experience, optimization is one way to still achieve a high-quality game. Finding the bottlenecks during game optimization requires the combined use of the various tools and methods described in this article. The article also showed, through a variety of experiments and parameter adjustments, how to locate a game's CPU or GPU performance bottlenecks and thereby improve the game experience.

Intel® CPUs for Deep Learning Training


Overview

In November 2017, UC Berkeley, U-Texas, and UC Davis researchers trained ResNet-50* in a record time of 31 minutes and AlexNet* in a record time of 11 minutes on CPUs to state-of-the-art accuracy1. These results were obtained on Intel® Xeon® Scalable processors (formerly code named Skylake-SP). The main reasons for this performance are:

  1. The compute and memory capacity of Intel Xeon Scalable processors
  2. Software optimizations in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and in the popular deep learning frameworks
  3. Recent advancements in distributed training algorithms for supervised deep learning workloads

This level of performance shows that Intel Xeon processors are an excellent hardware platform for deep learning training. Data scientists can now use their existing general-purpose Intel Xeon processor clusters for deep learning training, while continuing to use them for deep learning inference, classical machine learning, and big data workloads. They can get excellent deep learning training performance from a single server node, and can further reduce the time to train by using more server nodes, scaling nearly linearly to hundreds of nodes.

In this four-part article, we explore each of the three main factors contributing to this record-setting speed and provide examples of industry use cases.

Part 1: Compute and Memory Capacity of Intel Xeon® Scalable Processors

Training deep learning models often requires significant compute. For example, training ResNet-50 [2] requires a total of about one exa (10^18) single-precision operations1. Hardware capable of high compute throughput can reduce the training time if high utilization is achieved. High utilization requires high-bandwidth memory and clever memory management to keep the compute units on the chip busy3. The new generation of Intel Xeon processors provides these features: a large core count at high processor frequency, fast system memory, a large per-core mid-level cache (MLC or L2 cache), and new SIMD instructions, making it an excellent platform for training deep learning models. In Part 1, we review the main hardware features of the Intel Xeon Scalable processors, including compute and memory, and compare their performance to previous generations of Intel Xeon processors for deep learning workloads.

In July 2017, Intel launched the Intel Xeon Scalable processor family built on 14 nm process technology. The Intel Xeon Scalable processors support up to 28 physical cores (56 threads) per socket (up to 8 sockets) at 2.50 GHz processor base frequency and 3.80 GHz max turbo frequency, and six memory channels with up to 1.5 TB of 2,666 MHz DDR4 memory. The top-bin Intel Xeon Platinum 8180 processor provides up to 199 GB/s of STREAM Triad performance on a 2-socket system [a, b]. For inter-socket data transfers, the Intel Xeon Scalable processors introduce the new Ultra Path Interconnect (UPI), a coherent interconnect that replaces QuickPath Interconnect (QPI) and increases the data rate to 10.4 GT/s per UPI port, with up to 3 UPI ports in a 2-socket configuration4.

Additional improvements include a 38.5 MB shared non-inclusive last-level cache (LLC or L3 cache), that is, memory reads fill directly to the L2 and not to both the L2 and L3, and 1MB of private L2 cache per core. The Intel Xeon Scalable processor core now includes the 512-bit wide Fused Multiply Add (FMA) instructions as part of the larger 512-bit wide vector engine with up to two 512-bit FMA units computing in parallel per core (previously introduced in the Intel Xeon Phi™ processor product line)1 [4]. This provides a significant performance boost over the previous 256-bit wide AVX2 instructions in the previous Intel Xeon processor v3 and v4 generations (formerly codename Haswell and Broadwell, respectively) for both training and inference workloads.

The Intel Xeon Platinum 8180 processor provides up to 3.57 TFLOPS (FP32) and up to 5.18 TOPS (INT8) per socket2. The 512-bit wide FMAs essentially double the FLOPS that the Intel Xeon Scalable processors can deliver and significantly speed up single-precision matrix arithmetic. Comparing SGEMM and IGEMM performance, we observe 2.3x and 3.4x improvements, respectively, over the previous Intel Xeon processor v4 generation [c,e]. Comparing performance on a full deep learning model, we observed with the ResNet-18 model and the neon framework a 2.2x training and a 2.4x inference throughput improvement using FP32 over the previous Intel Xeon processor v4 generation [d,f].
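
As a rough back-of-envelope check of the FP32 figure (28 cores, 2 FMA units per core, 16 FP32 lanes per 512-bit FMA, 2 operations per FMA, and an assumed effective all-core AVX-512 frequency of about 2.0 GHz, which is our assumption rather than a published specification):

    $28 \times 2 \times 16 \times 2 \times 2.0\,\text{GHz} \approx 3.6\ \text{TFLOPS per socket}$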

Part 2: Software optimizations in Intel MKL-DNN And The Main Frameworks

Software optimization is essential to high compute utilization and improved performance. Intel Optimized Caffe* (sometimes referred to as Intel Caffe), TensorFlow*, Apache* MXNet*, and Intel® neon™ are optimized for training and inference. Optimizations with other frameworks such as Caffe2*, CNTK*, PyTorch*, and PaddlePaddle* are also a work in progress. In Part 2, we compare the performance of Intel optimized vs non-Intel optimized models; we explain how the Intel MKL-DNN library enables high compute utilization; we discuss the difference between Intel MKL and Intel MKL-DNN; and we explain additional optimizations at the framework level that further improve performance.

Two years ago, deep learning performance was sub-optimal on Intel® CPUs because software optimizations were limited and compute utilization was low. Deep learning scientists incorrectly assumed that CPUs were not good for deep learning workloads. Over the past two years, Intel has diligently optimized deep learning functions, achieving high utilization and enabling deep learning scientists to use their existing general-purpose Intel CPUs for deep learning training. By simply setting a configuration flag when building the popular deep learning frameworks (the framework will automatically download and build Intel MKL-DNN by default), data scientists can take advantage of the Intel CPU optimizations.

Using Intel Xeon processors with the Intel MKL-DNN library can provide over a 100x performance increase. For example, inference across all available CPU cores on AlexNet, GoogleNet v1, ResNet-50, and GoogleNet v3 with Apache MXNet on the Intel Xeon processor E5-2666 v3 (c4.8xlarge AWS* EC2* instance) provides 111x, 109x, 66x, and 92x higher throughput, respectively [5]. Inference across all CPU cores on AlexNet with Caffe2 on the Intel Xeon processor E5-2699 v4 provides 39x higher throughput6. Training AlexNet, GoogleNet, and VGG* with TensorFlow on the Intel Xeon processor E5-2699 v4 provides 17x, 6.7x, and 40x higher throughput, respectively7. Training AlexNet across all CPU cores with Intel Optimized Caffe and Intel MKL-DNN on the Intel Xeon Scalable Platinum 8180 processor has 113x higher throughput than BVLC*-Caffe without Intel MKL-DNN on the Intel Xeon processor E5-2699 v3 [d,g]. Figure 1 compares the training throughput across multiple Intel Xeon processor generations with the Intel MKL-DNN library.

Figure 1: Training throughput of Intel Optimized Caffe across Intel Xeon processor v2 (formerly code named Ivy Bridge)h, Intel Xeon processor v3 (formerly code named Haswell)g, Intel Xeon processor v4 (code named Broadwell)e, Intel Xeon Gold processorf, and Intel Xeon Platinum processors (formerly code named Skylake)d with AlexNet using batch size (BS) 256, GoogleNet v1 BS=96, ResNet-50 BS=50, and VGG-19 BS=64. Intel® MKL-DNN provides significant performance gains starting with the Intel Xeon processor v3, when AVX2 instructions were introduced, and another significant jump with the Intel Xeon Scalable processors, when AVX-512 instructions were introduced.

At the heart of these optimizations are the Intel® Math Kernel Library (Intel® MKL)8 and the Intel MKL-DNN library9. There are a variety of deep learning models, and they may seem very different. However, most models are built from a limited set of building blocks, known as primitives, that operate on tensors. Some of these primitives are inner products, convolutions, rectified linear units (ReLU), and batch normalization, along with the functions necessary to manipulate tensors. These building blocks, or low-level deep learning functions, have been optimized for the Intel Xeon product family inside the Intel MKL library. Intel MKL is a library that contains many mathematical functions, only some of which are used for deep learning. In order to provide a more targeted deep learning library and to collaborate with deep learning developers, Intel MKL-DNN was released open source under an Apache 2 license with all the key building blocks necessary to build complex models. Intel MKL-DNN allows industry and academic deep learning developers to distribute the library and contribute new or improved functions. Intel MKL-DNN is expected to lead in performance, as all new optimizations will be introduced in Intel MKL-DNN first.

Deep learning primitives are optimized in the Intel MKL-DNN library by incorporating prefetching, data layout, cache-blocking, data reuse, vectorization, and register-blocking strategies9. High utilization requires that data be available when the execution units (EUs) need it. This means prefetching data and reusing data already in cache instead of fetching the same data from main memory multiple times. For cache-blocking, the goal is to maximize the computation on a given block of data that fits in cache, typically in the MLC. The data layout is arranged consecutively in memory so that access in the innermost loops is as contiguous as possible, avoiding unnecessary gather/scatter operations. This results in better utilization of cache lines (and hence bandwidth) and improves prefetcher performance. As we loop through a block, we constrain the outer dimension of the block to be a multiple of the SIMD width, and the innermost dimension loops over groups of SIMD width to enable efficient vectorization. Register blocking may be needed to hide the latency of the FMA instructions3.
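
The C fragment below is a purely illustrative sketch of the blocking idea applied to a matrix multiplication, the core of an inner-product primitive; it is not the Intel MKL-DNN implementation, and the block size is a hypothetical value chosen so a tile of each operand stays in the per-core L2 cache:

    #define BLOCK 64  /* hypothetical tile edge chosen so the working set fits in the per-core L2 (MLC) */

    /* Illustrative cache-blocked C += A * B for row-major n-by-n matrices.
     * The innermost j loop walks a contiguous block with unit stride, which the
     * compiler can vectorize (e.g., with AVX-512); the blocking keeps the A, B,
     * and C tiles resident in cache so they are reused instead of re-fetched. */
    void gemm_blocked(const float *A, const float *B, float *C, int n)
    {
        for (int ib = 0; ib < n; ib += BLOCK)
            for (int kb = 0; kb < n; kb += BLOCK)
                for (int jb = 0; jb < n; jb += BLOCK)
                    for (int i = ib; i < ib + BLOCK && i < n; ++i)
                        for (int k = kb; k < kb + BLOCK && k < n; ++k) {
                            float a = A[i * n + k];  /* reused across the whole j block */
                            for (int j = jb; j < jb + BLOCK && j < n; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }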

Additional parallelism across cores is important for high CPU utilization, such as parallelizing across a batch using OpenMP*. This requires improving the load balance so that each core is doing an equal amount of work and reducing synchronization events across cores. Efficiently using all cores in a balanced way requires additional parallelization within a given layer.
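
A hedged sketch of that batch-level parallelism with OpenMP is shown below; forward_one() is a hypothetical per-sample routine standing in for the real layer computation:

    #include <stddef.h>

    /* Illustrative batch-level parallelism: each OpenMP thread processes an
     * equal share of the mini-batch. forward_one() is a hypothetical per-sample
     * routine standing in for the real layer computation. */
    void forward_batch(const float *inputs, float *outputs,
                       int batch_size, int sample_size,
                       void (*forward_one)(const float *in, float *out))
    {
        #pragma omp parallel for schedule(static)
        for (int b = 0; b < batch_size; ++b)
            forward_one(inputs + (size_t)b * sample_size,
                        outputs + (size_t)b * sample_size);
    }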

These sets of optimizations ensure that all the key deep learning primitives, such as convolution, matrix multiplication, and batch normalization, are efficiently vectorized for the latest SIMD instructions and parallelized across the cores10. Intel MKL-DNN primitives are implemented in C, with C and C++ API bindings, for the most widely used deep learning functions11:

  • Direct batched convolution
  • Inner product
  • Pooling: maximum, minimum, average
  • Normalization: local response normalization across channels (LRN), batch normalization
  • Activation: rectified linear unit (ReLU), softmax
  • Fused primitives: convolution+ReLU
  • Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale
  • Coming soon: Long short-term memory (LSTM) and Gated recurrent units (GRU)

There are multiple deep learning frameworks, such as Caffe, TensorFlow, MXNet, and PyTorch. Modifications (code refactoring) at the framework level are required to take full advantage of the Intel MKL-DNN primitives. The frameworks carefully replace calls to existing deep learning functions with the appropriate Intel MKL-DNN APIs, preventing the framework and the Intel MKL-DNN library from competing for the same threads. During setup, the framework manages layout conversions from the framework to Intel MKL-DNN and allocates temporary arrays if the output and input data layouts do not match. To improve performance, graph optimizations may be required to keep conversions between different data layouts to a minimum. During the execution step, the data is fed to the network in a plain layout like BCWH (batch, channel, width, height) and is converted to a SIMD-friendly layout. As data propagates between layers, the data layout is preserved, and conversions are made only when an operation that is not supported by Intel MKL-DNN must be performed [10].
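
To make the idea of a SIMD-friendly layout concrete, here is an illustrative C reorder from a plain NCHW-style layout to one in which channels are stored in contiguous blocks of 16 (often written nChw16c). It is a sketch of the kind of conversion a framework performs, not the Intel MKL-DNN reorder primitive, and it assumes the channel count is a multiple of 16:

    #include <stddef.h>

    /* Illustrative reorder from a plain NCHW layout to a blocked layout in which
     * groups of 16 channels are stored contiguously (often written nChw16c), so
     * the innermost dimension maps onto a 16-wide FP32 SIMD register. This is a
     * sketch of the kind of conversion a framework performs, not the Intel
     * MKL-DNN reorder primitive; C is assumed to be a multiple of 16. */
    void nchw_to_nChw16c(const float *src, float *dst, int N, int C, int H, int W)
    {
        const int CB = 16;  /* channel block = FP32 lanes in a 512-bit register */
        for (int n = 0; n < N; ++n)
            for (int cb = 0; cb < C / CB; ++cb)
                for (int h = 0; h < H; ++h)
                    for (int w = 0; w < W; ++w)
                        for (int c = 0; c < CB; ++c)
                            dst[((((size_t)n * (C / CB) + cb) * H + h) * W + w) * CB + c] =
                                src[(((size_t)n * C + cb * CB + c) * H + h) * W + w];
    }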

Figure 2: Operation graph flow. Intel MKL-DNN primitives are shown in blue. Framework optimizations attempt to reduce layout conversions so that the data stays in the Intel MKL-DNN layout across consecutive primitive operations. Image courtesy of [10].

Part 3: Advancements in Distributed Training Algorithms For Deep Learning

Training a large deep learning model often takes days or even weeks. Distributing the computational requirement among multiple server nodes can reduce the time to train. However, regardless of the hardware used, there are algorithmic challenges to this, and recent advances in distributed algorithms mitigate some of them. In Part 3, we review the gradient descent and stochastic gradient descent (SGD) algorithms and explain the limitations of training with very large batches; we discuss model and data parallelism; we review synchronous SGD (SSGD), asynchronous SGD (ASGD), and allreduce/broadcast algorithms; finally, we present recent advances that enable larger-batch SSGD training and present state-of-the-art results.

In supervised deep learning, input data is passed through the model and the output is compared to the ground truth or expected output. A penalty or loss is then computed. Training the model involves adjusting the model parameters to reduce this loss. There are various optimization algorithms that can be used to minimize the loss function such as gradient descent, or variants such as stochastic gradient descent, Adagrad, Adadelta, RMSprop, Adam, etc.

In gradient descent (GD), also known as steepest descent, the loss function for a particular model defined by the set of weights w is computed over the entire dataset. The weights are updated by moving in the direction opposite to the gradient; that is, moving towards the local minimum: updated-weights = current-weights – learning-rate * gradient.

In stochastic gradient descent (SGD), more precisely called mini-batch gradient descent, the dataset is broken into several batches. The loss is computed with respect to a batch, and the weights are updated using the same update rule as gradient descent. There are other variants that speed up the training process by accumulating velocity (known as momentum) in the direction opposite to the gradients, or that reduce the data scientist's burden of choosing a good learning rate by automatically adapting the learning rate based on the norm of the gradients. An in-depth discussion of these variants can be found elsewhere12.
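
As a concrete but purely illustrative example, a mini-batch SGD update with classical momentum might look like the C function below (the gradient for the current batch is assumed to have been computed already; the names are hypothetical):

    #include <stddef.h>

    /* Illustrative mini-batch SGD update with classical momentum. The gradient
     * for the current batch is assumed to have been computed into grad[] already;
     * names and layout are hypothetical, not taken from any framework. */
    void sgd_momentum_update(float *weights, float *velocity, const float *grad,
                             size_t n, float learning_rate, float momentum)
    {
        for (size_t i = 0; i < n; ++i) {
            /* accumulate velocity in the direction opposite to the gradient */
            velocity[i] = momentum * velocity[i] - learning_rate * grad[i];
            weights[i] += velocity[i];
        }
    }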

The behavior of SGD approaches that of GD as the batch size increases, and the two become identical when the batch size equals the entire dataset. There are three main challenges with GD (which SGD shares when the batch size is very large). First, each step is computationally expensive, as it requires computing the loss over the entire dataset. Second, learning slows near saddle points or areas where the gradient is close to zero. Third, according to Intel and Northwestern researchers13, the optimization space appears to have many sharp minima. Gradient descent does not explore the optimization space but rather moves towards the local minimum directly underneath its starting position, which is often a sharp minimum. Sharp minima do not generalize. While the overall loss function with respect to the test dataset is similar to that of the training dataset, the actual cost at a sharp minimum may be very different. A cartoonish way to visualize this is shown in Figure 3, where the loss function with respect to the test dataset is slightly shifted from the loss function with respect to the training dataset. This shift means that models converging to a sharp minimum have a high cost with respect to the test dataset; that is, they do not generalize well to data outside the training set. On the other hand, models that converge to a flat minimum have a low cost with respect to the test dataset, meaning they generalize well to data outside the training set.

Figure 3: In this cartoonish figure, the loss function with respect to the test dataset is slightly shifted from the loss function with respect to the training dataset. The sharp minimum has a high cost with respect to the test loss function. Image courtesy of [13].

Small-batch SGD (SB-SGD) resolves these issues. First, SB-SGD is computationally inexpensive, so each iteration is fast. Second, it is extremely unlikely to get stuck at a saddle point with SB-SGD, since the gradients with respect to some of the batches in the training set are likely nonzero even if the gradient with respect to the entire training set is zero. Third, it is more likely to find a flat minimum, since SB-SGD better explores the solution space instead of moving towards the local minimum directly underneath its starting position. On the other hand, very small or tiny batches are also not ideal, because it is difficult to achieve high CPU (or GPU) utilization with them. This becomes more problematic when the computational workload of a small batch is distributed across several worker nodes. Therefore, it is important to find a batch size large enough to maintain high CPU utilization but not so large that it runs into the issues of GD. This becomes even more important for the synchronous data-parallel SGD discussed below.

Efficiently distributing the workload across several worker nodes can reduce the overall time to train. The two main techniques used are model parallelism and data parallelism. In model parallelism, the model is split among the worker nodes, with each node working on the same batch. Model parallelism is used in practice when the memory requirements exceed a worker's memory. Data parallelism is the more common approach and works best for models with fewer weights. In data parallelism, the batch is split among the worker nodes, with each node holding the full model and processing a piece of the batch, known as the node-batch. Each worker node computes the gradient with respect to its node-batch. These gradients are then aggregated using an allreduce algorithm (a list of allreduce options is discussed below) to compute the gradient with respect to the overall batch. The model weights are then updated, and the updated weights are broadcast to each worker node. At the end of each iteration, or cycle through a batch, all the worker nodes have the same updated model; that is, the nodes are synchronized. This is known as synchronous SGD (SSGD).
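
A minimal sketch of one such SSGD step using MPI is shown below. It is illustrative only; the buffer names and the simple allreduce-then-update structure are our assumptions, not the code used in the cited work:

    #include <mpi.h>
    #include <stddef.h>

    /* Illustrative synchronous data-parallel SGD step. Each rank has computed
     * node_grad[] over its node-batch; the gradients are summed across ranks
     * with MPI_Allreduce, averaged, and every rank applies the same update, so
     * all copies of the model stay synchronized. Buffer names are hypothetical. */
    void ssgd_step(float *weights, const float *node_grad, float *global_grad,
                   size_t n, float learning_rate, MPI_Comm comm)
    {
        int world_size;
        MPI_Comm_size(comm, &world_size);

        /* Sum the node gradients across all worker nodes. */
        MPI_Allreduce(node_grad, global_grad, (int)n, MPI_FLOAT, MPI_SUM, comm);

        for (size_t i = 0; i < n; ++i)
            weights[i] -= learning_rate * (global_grad[i] / (float)world_size);
    }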

Asynchronous SGD (ASGD) alleviates the overhead of synchronization. However, ASGD has additional challenges: it requires more tuning of hyperparameters such as momentum, and it requires more iterations to train. Furthermore, it does not match single-node results, making it more difficult to debug. In practice, ASGD has not been shown to scale and retain accuracy on large models. Stanford, LBNL, and Intel researchers have shown that an ASGD/SSGD hybrid approach can work, with the nodes clustered in up to 8 groups: updates within a group are synchronous, and updates between groups are asynchronous. Going beyond 8 groups reduces performance due to the ASGD challenges14.

One strategy for communicating gradients is to appoint one node as the parameter server, which computes the sum of the node gradients, updates the model, and sends the updated weights back to each worker. However, there is a bottleneck in sending and receiving all of the gradients using one parameter server. Unless ASGD is used, a parameter server strategy is not recommended.

Allreduce and broadcast algorithms are used to communicate and sum the node gradients and then broadcast the updated weights. There are various allreduce algorithms, including tree, butterfly, and ring. Butterfly is optimal for latency, scaling at O(log(P)) steps, where P is the number of worker nodes, and combines allreduce and broadcast. Ring is optimal for bandwidth; for large data communication it scales at O(1) with the number of nodes. In bandwidth-constrained clusters, e.g., using 10 GbE, ring allreduce is usually the better algorithm. A detailed explanation of the ring allreduce algorithm can be found elsewhere15.

Figure 4:Various communication strategies. Butterfly All-reduce is optimal for latency. Ring All-reduce is optimal for bandwidth. Image courtesy of Amir Gholami, Peter Jin, Yang You, Kurt Keutzer, and the PALLAS group at UC Berkeley.

In November 2014, Jeff Dean spoke of Google's research goal of reducing training time from six weeks to a day16. Three years later, CPUs were used to train AlexNet in 11 minutes! This was accomplished by using larger batch sizes, which allow distributing the computational workload to 1000+ nodes. To scale efficiently, the communication of the gradients and updated weights must be hidden behind the computation of those gradients.

Increasing the overall batch size is possible with these techniques: 1) proportionally increasing the batch size and learning rate; 2) slowly increasing the learning rate during the initial part of training (known as warm-up learning rates); and 3) having a different learning rate for each layer in the model using the layer-wise adaptive rate scaling (LARS) algorithm. Let’s go through each technique in more detail.

The larger the batch size, the more confidence one has in the gradient, and therefore a larger learning rate can be used. As a rule of thumb, the learning rate is increased in proportion to the increase in batch size4,17. This technique allowed UC Berkeley researchers to increase the batch size from 256 to 1024 with the GoogleNet model and scale to 128 K20-GPU nodes, reducing the time-to-train from 21 days to 10.5 hours18, and Intel researchers to increase the batch size from 128 to 512 with the VGG-A model and scale to 128 Intel Xeon processor E5-2698 v3 nodes19.

A large learning rate can lead to divergence (the loss increases with each iteration rather than decreasing), particularly during the initial training phase, because the norm of the gradients is much greater than the norm of the weights at that stage20. This is mitigated by gradually increasing the learning rate during the initial training phase, for example during the first 5 epochs, until the target learning rate is reached. This technique allowed Facebook* researchers to increase the batch size from 256 to 8,192 with ResNet-50 and scale to 256 P100 GPUs, reducing the time-to-train from 29 hours (using 8 P100 GPUs) to 1 hour21. It also allowed SURFsara and Intel researchers to scale to 512 2-socket Intel Xeon Platinum processors, reducing the ResNet-50 time-to-train to 40 minutes22.
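A minimal sketch of these two techniques combined is shown below; the specific values (a base learning rate of 0.1 for a batch of 256, a 5-epoch warm-up, an 8K target batch) are illustrative assumptions rather than the values used in the cited papers.

    def scaled_lr(epoch, base_lr=0.1, base_batch=256, batch=8192, warmup_epochs=5):
        # Linear scaling rule: grow the reference learning rate with the batch size.
        target_lr = base_lr * batch / base_batch
        if epoch < warmup_epochs:
            # Gradual warm-up: ramp linearly from base_lr to target_lr over the first
            # few epochs, avoiding divergence while gradient norms are still large.
            return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
        return target_lr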

Researchers at NVIDIA* observed that the ratio of the norm of the gradients to the norm of the weights varies greatly across the layers of a model20. They proposed having a different learning rate for each layer, inversely proportional to this ratio. This technique (combined with the ones above) allowed them to increase the batch size to 32K.
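The LARS idea can be sketched as follows; the trust coefficient value is an assumption, and momentum and weight decay, which the full algorithm includes, are omitted, so this is not the complete method from the paper.

    import numpy as np

    def lars_update(w, grad, global_lr, trust_coef=0.001, eps=1e-9):
        # LARS sketch: each layer's local learning rate is proportional to
        # ||w|| / ||grad||, so layers whose gradients are large relative to their
        # weights take correspondingly smaller steps.
        local_lr = trust_coef * np.linalg.norm(w) / (np.linalg.norm(grad) + eps)
        return w - global_lr * local_lr * grad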

UC Berkeley, U-Texas, and UC Davis researchers used these techniques to achieve record training times: AlexNet in 11 minutes and ResNet-50 in 31 minutes on Intel CPUs at state-of-the-art accuracy1. They used 1024 and 1600 2-socket Intel Xeon Platinum 8160 processor servers, respectively, with the Intel MKL-DNN library and the Intel Optimized Caffe framework5. SURFsara and Intel researchers trained ResNet-50 in 42 minutes on 1024 2-socket Intel Xeon Platinum 8160 processors to state-of-the-art accuracy23.

Part 4: Commercial Use Cases

The additional computation and memory capacity, coupled with the software optimizations and the advances in distributed training described above, are enabling industries to use their existing Intel Xeon processors for deep learning workloads. In Part 4, we present two commercial use cases: one at an Intel assembly and test factory, and one from Honeywell's drone-based inspection service.

An Intel assembly and test factory used Intel Optimized Caffe on Intel Xeon processors to improve package fault detection in silicon manufacturing. The project aimed to reduce the human review rate for package cosmetic damage at the final inspection point, while keeping the false negative ratio at the same level as the human rate. The input was package photos, and the goal was to perform binary classification on each of them, indicating whether the package was rejected or passed. The GoogleNet model was modified for this task. Using 8 Intel Xeon Platinum 8180 processors connected via 10 Gb Ethernet, training was completed within 1 hour. The false negative rate consistently met the expected human-level accuracy. Automating this process saved 70% of the inspectors' time. Details of this project are found elsewhere24.

Honeywell recently launched its first commercial unmanned aerial vehicle (UAV) inspection service25. Honeywell successfully used Faster-RCNN* with Intel Optimized Caffe for solar panel defect detection. Training used 300 original solar panel images augmented with 36-degree rotations. Training on an Intel Xeon Platinum 8180 processor took 6 hours and achieved a detection accuracy of 96.3% despite some adverse environmental conditions. Inference performance is 188 images per second. This is a general solution that can be applied to various inspection services in the market, including oil and gas inspection (pipeline seepage and leakage), utilities inspection (transmission lines and substations), and emergency crisis response.

Conclusion

Intel’s newest Xeon Scalable processors along with optimized deep learning functions in the Intel MKL-DNN library provide sufficient compute for deep learning training workloads (in addition to inference, classical machine learning, and other AI algorithms). Popular deep learning frameworks are now incorporating these optimizations, increasing the effective performance delivered by a single server by over 100x in some cases. Recent advances in distributed algorithms have also enabled the use of hundreds of servers to reduce the time-to-train from weeks to minutes. Data scientists can now use their existing general-purpose Intel Xeon processor clusters for deep learning training as well as continue using them for deep learning inference, classical machine learning, and big data workloads. They can get excellent deep learning training performance using a single Intel CPU node, and further reduce the time-to-train by using multiple CPU nodes, scaling near-linearly to hundreds of nodes.

Footnotes

  1. Available in Intel Xeon Platinum processors, Intel Xeon Gold 6000 series processors, and the Intel Xeon Gold 5122 processor.
  2. The raw compute can be calculated as AVX-512-frequency * number-of-cores * number-of-FMAs-per-core * 2-FLOPS-per-FMA * SIMD-vector-length / number-of-bits-in-numerical-format. The Intel Xeon Platinum 8180 has 28 * 2 * 2 * 512/32 = 1792 FP32 operations per cycle, i.e., 1792 * AVX-512-frequency peak FLOPS (GFLOPS when the frequency is expressed in GHz). The AVX-512 frequencies for multiple SKUs can be found at https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html. The frequencies shown correspond to FP64 operations; the frequencies for FP32 may be slightly higher than the ones shown. For deep learning workloads, the AVX-512 max turbo frequency may not be sustained when running high-FLOPS workloads.
  3. The FMA latency in the Intel Xeon Scalable processors is 4 clock cycles per FMA (it was 5 in the previous Intel Xeon processor generation). An Intel Xeon Scalable processor with 2 FMA units requires at least 8 registers to hide these latencies. In practice, blocking by 10 registers is desired, e.g., at least 8 for the data and at least 1 for the deep learning model weights.
  4. This does not work as we approach large batch sizes of 8K and above26. Beyond 8K, the learning rate should be increased in proportion to the square root of the increase in batch size.
  5. The researchers added some modifications that will be committed to the main Intel Optimized Caffe branch.

Acknowledgements

A special thanks to the performance team for collecting and reviewing the data: Deepthi Karkada, Adam Procter, William (Prune) Wickart, Vikram Saletore, and Banu Nagasundaram, and to the wonderful reviewers Alexis Crowell, Chase Adams, Eric Dahlen, Wei Li, Akhilesh Kumar, Mike Pearce, R. Vivek Rane, Dave Hill, Mike Ferron-Jones, Allyson Klein, Frank Zhang, and Israel Hirsh for ensuring technical correctness and improving the clarity and content of the article.

About the authors

Andres Rodriguez, PhD, is a Sr. Principal Engineer with Intel’s Artificial Intelligence Products Group (AIPG) where he designs AI solutions for Intel’s customers and provides technical leadership across Intel for AI products. He has 13 years of experience working in AI. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He holds over 20 peer reviewed publications in journals and conferences, and a book chapter on machine learning.

Frank Zhang is the Intel Optimized Caffe product manager in the Intel Software and Service Group, where he is responsible for product management of the Intel Optimized Caffe deep learning framework's development, product releases, and customer support. He has more than 20 years of industry experience in software development at companies including NEC, TI, and Marvell. Frank graduated from the University of Texas at Dallas with a master's degree in Electrical Engineering.

Jiong Gong is a senior software engineer with Intel’s Software and Service Group, where he is responsible for the architectural design of Intel Optimized Caffe, making optimizations that show its performance advantage on both single-node and multi-node IA platforms. Jiong has more than 10 years of industry experience in system software and AI. He graduated from Shanghai Jiao Tong University with a master's degree in computer science and holds 4 US patents on AI and machine learning.

Chong Yu is a software engineer in the Intel Software and Service Group, working on Intel Optimized Caffe framework development and optimization on IA platforms. Chong won an Intel Fellowship and joined Intel 5 years ago. He obtained a master's degree in information science and technology from Fudan University. Chong has published 20 journal papers and holds 2 Chinese patents. His research areas include artificial intelligence, robotics, 3D reconstruction, remote sensing, and steganography.

Configuration Details

a. STREAM: 1-Node, 2 x Intel Xeon Platinum 8180 processor on Neon City with 384 GB Total Memory on Red Hat Enterprise Linux* 7.2-kernel 3.10.0-327 using STREAM AVX 512 Binaries. Data Source: Request Number: 2500, Benchmark: STREAM - Triad, Score: 199 Higher is better

b. SGEMM: 1-Node, 1 x Intel® Xeon® Platinum 8180 processor (38.5M Cache, 2.50 GHz) on Lightning Ridge SKX with 384 GB Total Memory (12 slots / 32 GB / 2666 MT/s / DDR4 RDIMM) on Red Hat Enterprise Linux* 7.3 (kernel 3.10.0-514.el7.x86_64) using ic17 update2. Vendor: Intel, Nodes: 1, Sockets: 1, Cores: 28, Logical processors: 56, BIOS Version: SE5C620.86B.01.00.0412.020920172159, HT: No, Turbo: Yes. Data Source: Request Number: 2594, Benchmark: SGEMM, Score: 3570.48 GF/s. Higher is better

c. SGEMM, IGEMM proof point:

SKX: Intel Xeon Platinum 8180 CPU, 28 cores per socket, 2 sockets (only 1 socket was used for experiments), TDP frequency 2.5 GHz, BIOS version SE5C620.86B.01.00.0412.020920172159, Wolf Pass platform, Ubuntu 16.04, 384 GB memory, memory speed achieved 2666 MHz.

BDX: Intel(R) Xeon(R) CPU E5-2699 v4, 22 cores per socket, 2 sockets (only 1 socket was used for experiments), TDP frequency 2.2 GHz, BIOS version GRRFSDP1.86B.0271.R00.1510301446, Cottonwood Pass platform, Red Hat 7.0, 64 GB memory, memory speed achieved 2400 MHz.

d. Platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=56, CPU Freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance

Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.

TensorFlow: (https://github.com/tensorflow/tensorflow), commit id 207203253b6f8ea5e938a512798429f91d5b4e7e. Performance numbers were obtained for three convnet benchmarks: alexnet, googlenetv1, vgg(https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow) using dummy data. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425, interop parallelism threads set to 1 for alexnet, vgg benchmarks, 2 for googlenet benchmarks, intra op parallelism threads set to 56, data format used is NCHW, KMP_BLOCKTIME set to 1 for googlenet and vgg benchmarks, 30 for the alexnet benchmark. Inference measured with --caffe time -forward_only -engine MKL2017option, training measured with --forward_backward_only option.

MxNet: (https://github.com/dmlc/mxnet/), revision 5efd91a71f36fea483e882b0358c8d46b5a7aa20. Dummy data was used. Inference was measured with “benchmark_score.py”, training was measured with a modified version of benchmark_score.py which also runs backward propagation. Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425.

Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.

e. Platform: 2S Intel® Xeon® Gold 6148 CPU @ 2.40GHz (20 cores), HT disabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC).

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact‘, OMP_NUM_THREADS=40, CPU Freq set with cpupower frequency-set -d 2.4G -u 3.7G -g performance

Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). Intel C++ compiler ver. 17.0.2 20170213, Intel MKL small libraries version 2018.0.20170425. Caffe run with “numactl -l“.

TensorFlow: (https://github.com/tensorflow/tensorflow), commit id 207203253b6f8ea5e938a512798429f91d5b4e7e. Performance numbers were obtained for three convnet benchmarks: alexnet, googlenetv1, vgg(https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow) using dummy data. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425, interop parallelism threads set to 1 for alexnet, vgg benchmarks, 2 for googlenet benchmarks, intra op parallelism threads set to 40, data format used is NCHW, KMP_BLOCKTIME set to 1 for googlenet and vgg benchmarks, 30 for the alexnet benchmark. Inference measured with --caffe time -forward_only -engine MKL2017option, training measured with --forward_backward_only option.

MxNet: (https://github.com/dmlc/mxnet/), revision 5efd91a71f36fea483e882b0358c8d46b5a7aa20. Dummy data was used. Inference was measured with “benchmark_score.py”, training was measured with a modified version of benchmark_score.py which also runs backward propagation. Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425.

Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.

f. Platform: 2S Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz (22 cores), HT enabled, turbo disabled, scaling governor set to “performance” via acpi-cpufreq driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC).

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0‘, OMP_NUM_THREADS=44, CPU Freq set with cpupower frequency-set -d 2.2G -u 2.2G -g performance

Caffe: (http://github.com/intel/caffe/), revision f96b759f71b2281835f690af267158b82b150b5c. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, Intel MKL small libraries version 2017.0.2.20170110.

TensorFlow: (https://github.com/tensorflow/tensorflow), commit id 207203253b6f8ea5e938a512798429f91d5b4e7e. Performance numbers were obtained for three convnet benchmarks: alexnet, googlenetv1, vgg(https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow) using dummy data. GCC 4.8.5, Intel MKL small libraries version 2018.0.20170425, interop parallelism threads set to 1 for alexnet, vgg benchmarks, 2 for googlenet benchmarks, intra op parallelism threads set to 44, data format used is NCHW, KMP_BLOCKTIME set to 1 for googlenet and vgg benchmarks, 30 for the alexnet benchmark. Inference measured with --caffe time -forward_only -engine MKL2017option, training measured with --forward_backward_only option.

MxNet: (https://github.com/dmlc/mxnet/), revision e9f281a27584cdb78db8ce6b66e648b3dbc10d37. Dummy data was used. Inference was measured with “benchmark_score.py”, training was measured with a modified version of benchmark_score.py which also runs backward propagation. Topology specs from https://github.com/dmlc/mxnet/tree/master/example/image-classification/symbols. GCC 4.8.5, Intel MKL small libraries version 2017.0.2.20170110.

Neon: ZP/MKL_CHWN branch commit id:52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking in mkl mode. ICC version used : 17.0.3 20170404, Intel MKL small libraries version 2018.0.20170425.

g. Platform: 2S Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz (18 cores), HT enabled, turbo disabled, scaling governor set to “performance” via intel_pstate driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.el7.x86_64. OS drive: Seagate* Enterprise ST2000NX0253 2 TB 2.5" Internal Hard Drive.

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0‘, OMP_NUM_THREADS=36, CPU Freq set with cpupower frequency-set -d 2.3G -u 2.3G -g performance

Intel Caffe: (http://github.com/intel/caffe/), revision b0ef3236528a2c7d2988f249d347d5fdae831236. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, MKLML version 2017.0.2.20170110.

BVLC-Caffe: https://github.com/BVLC/caffe, Inference & Training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before trainingBVLC Caffe (http://github.com/BVLC/caffe), revision 91b09280f5233cafc62954c98ce8bc4c204e7475 (commit date 5/14/2017). BLAS: atlas ver. 3.10.1.

h. Platform: 2S Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz (12 cores), HT enabled, turbo enabled, scaling governor set to “performance” via intel_pstate driver, 256GB DDR3-1600 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.21.1.el7.x86_64. SSD: Intel® SSD 520 Series 240GB, 2.5in SATA 6Gb/s, 25nm, MLC.

Performance measured with: Environment variables: KMP_AFFINITY='granularity=fine, compact,1,0‘, OMP_NUM_THREADS=24, CPU Freq set with cpupower frequency-set -d 2.7G -u 3.5G -g performance

Caffe: (http://github.com/intel/caffe/), revision b0ef3236528a2c7d2988f249d347d5fdae831236. Inference measured with “caffe time --forward_only” command, training measured with “caffe time” command. For “ConvNet” topologies, dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. Topology specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet, AlexNet, and ResNet-50), https://github.com/intel/caffe/tree/master/models/default_vgg_19 (VGG-19), and https://github.com/soumith/convnet-benchmarks/tree/master/caffe/imagenet_winners (ConvNet benchmarks; files were updated to use newer Caffe prototxt format but are functionally equivalent). GCC 4.8.5, Intel MKL small libraries version 2017.0.2.20170110.

Bibliography

  1. Y. You, et al., “ImageNet training in minutes.” Nov. 2017. https://arxiv.org/pdf/1709.05011.pdf
  2. K. He, et al., “Deep residual learning for image recognition.” Dec. 2015. https://arxiv.org/abs/1512.03385
  3. N. Rao, “Comparing dense compute platforms for AI.” June 2017. https://www.intelnervana.com/comparing-dense-compute-platforms-ai/
  4. D. Mulnix, “Intel® Xeon® processor scalable family technical overview.” Sept. 2017. https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview
  5. A. Rodriguez and J. Riverio, “Deep learning at cloud scale: improving video discoverability by scaling up Caffe on AWS.” AWS Re-Invent, Nov. 2016. https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-deep-learning-at-cloud-scale-improving-video-discoverability-by-scaling-up-caffe-on-aws-mac205
  6. A. Rodriguez and N. Sundaram, “Intel and Facebook collaborate to boost Caffe2 performance on Intel CPUs.” Apr. 2017. https://software.intel.com/en-us/blogs/2017/04/18/intel-and-facebook-collaborate-to-boost-caffe2-performance-on-intel-cpu-s
  7. E. Ould-Ahmed-Vall, et al., “TensorFlow optimizations on modern Intel® architecture.” Aug. 2017. https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
  8. Intel® MKL https://software.intel.com/en-us/mkl
  9. Intel® MKL-DNN. https://github.com/01org/mkl-dnn
  10. V. Pirogov and G. Federov, “Introducing DNN primitives in Intel® Math Kernel Library.” Oct. 2016. https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl
  11. https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
  12. S. Ruder, “An overview of gradient descent optimization algorithms.” June 2017. http://ruder.io/optimizing-gradient-descent/
  13. N. Keskar, et al., “On large-batch training for deep learning: generalization gap and sharp minima.” Feb. 2017. https://arxiv.org/abs/1609.04836
  14. T. Kurth, et al., “Deep learning at 15PF: supervised and semi-supervised classification for scientific data.” Aug. 2017. https://arxiv.org/pdf/1708.05256.pdf
  15. A. Gibiansky, “Bringing HPC techniques to deep learning.” Feb. 2017. http://research.baidu.com/bringing-hpc-techniques-deep-learning/
  16. J. Dean, “Large scale deep learning.” Nov. 2014. https://www.slideshare.net/hustwj/cikm-keynotenov2014
  17. A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks.” Apr. 2014. https://arxiv.org/pdf/1404.5997.pdf
  18. F. Iandola, et al., “FireCaffe: near-linear acceleration of deep neural network training on compute clusters.” Jan. 2016. https://arxiv.org/abs/1511.00175
  19. D. Das, et al., “Distributed deep learning using synchronous stochastic gradient descent.” Feb. 2016. https://arxiv.org/abs/1602.06709
  20. Y. You, I. Gitman, and B. Ginsburg, “Large batch training of convolutional networks.” https://arxiv.org/pdf/1708.03888.pdf
  21. P. Goyal, et al., “Accurate, large minibatch SGD: training ImageNet in 1 hour.” June 2017. https://arxiv.org/abs/1706.02677
  22. V. Codreanu, D. Podareanu and V. Saletore, “Achieving deep learning Training in less than 40 minutes on ImageNet-1K & best accuracy and training time on ImageNet-22K & Places-365 with scale-out Intel® Xeon®/Xeon Phi™ architectures.” Sep. 2017. https://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/
  23. V. Codreanu, D. Podareanu, and V. Saletore, “Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train.” Nov. 2017. https://arxiv.org/abs/1711.04291
  24. “Manufacturing package fault detection using deep learning.” https://software.intel.com/en-us/articles/manufacturing-package-fault-detection-using-deep-learning
  25. “Honeywell launches UAV industrial inspection service, teams with Intel on innovative offering.” Sept. 2017. https://www.honeywell.com/newsroom/pressreleases/2017/09/honeywell-launches-uav-industrial-inspection-service-teams-with-intel-on-innovative-offering

Intel® System Studio Release Notes


Driving Higher Performance for Advanced Structural Analysis


Altair RADIOSS* and Intel® Xeon® Scalable processors combine to provide an advanced solution stack for improved crashworthiness, safety, and manufacturability of automotive structural designs.

Altair RADIOSS* takes advantage of Intel Xeon Scalable processors to optimize highly non-linear problems in structural analysis, potentially helping customers to:

  • Accelerate simulations to obtain results faster and run more variations
  • Run more complex simulations to obtain richer insights
  • Reduce time to market for faster design and development

Download complete solution brief (PDF)

Taboola Optimizes Artificial Intelligence for Smarter Content Recommendations


Hundreds of billions of recommendations to a billion users a month, powered by an optimized combination of hardware from Intel and TensorFlow* software


In the world of online, always-on and on-demand media, being slow is the same as not being there at all.

Taboola’s software-as-a-service (SaaS) solution delivers 360 billion content and article recommendations to over a billion unique users every month on thousands of publishers’ sites. Each one is served in a tenth of a second by an artificial intelligence-based (AI) recommendation engine that analyzes contextual information about web visitors and their preferences. Getting this right drives the clicks, views and shares that are the foundation of modern publishing; getting it wrong, or not delivering at all, undermines this key source of revenue and disrupts user experience.

Download complete case study (PDF)

Carestream Delivers Fast Access to Clinical Imaging from the Cloud


Overcoming network latency and bandwidth challenges on the way to giving healthcare providers a cloud-based, FDA-compliant enterprise imaging solution

Fast diagnoses save lives and minimize uncertainty for patients, so doctors and physicians need on-demand access to critical information wherever they are and on multiple form factor devices.


Carestream’s Clinical Collaboration Platform already delivers medical images from magnetic resonance imaging (MRI) and computerized tomography (CT) scans, as well as nuclear medicine and any other images generated across the enterprise, directly to doctors' mobiles, tablets, and PCs. When it decided to pilot rendering these high-resolution images on the server side rather than the client side, to accommodate a cloud-based model, it turned to Intel® Xeon® Platinum 8168 processors and Intel® QuickAssist Technology (Intel® QAT) to do so without compromising performance.
 

Download complete case study (PDF)

Boss Key’s Lawbreakers: The Return of Cliff Bleszinski


The original article is published by Intel Game Dev on VentureBeat*: Boss Key’s Lawbreakers: The return of Cliff Bleszinski. Get more game dev news and related topics from Intel on VentureBeat.

Screenshot of game Lawbreakers

Among the pantheon of video game superstars, Cliff Bleszinski, founder of Boss Key Productions and former Creative Director at Epic Games, holds a unique position. Still only 42 years old, he has racked up a resume that is the envy of many, with multi-million-selling hits headlined by the Unreal series and Gears of War. Given that his career began when he was 15, it seemed that Bleszinski had been around the industry for as long as anyone could remember, so his eventual retirement came as little surprise.

It didn’t last. It was just two years, in fact, until Bleszinski announced his return with a new company and a new game. "I was bored," he admits. "I missed the creation of games, from the inception all the way through." His downtime, when not traveling for fun, was spent playing other games. "I dipped into CSGO," he says, "and my wife is really good at them…but I got bored. And there really hasn't been anything else that captured my attention like that."

The brief tenure outside the industry didn't dampen Bleszinski's popularity with gamers and the games press, in large part for his unfailing pleasantness and good humor, willingness to answer questions with real opinions, and live a remarkably public life through social media.

As a result, building a company from scratch, convincing a team to uproot lives and families and move from around the country to Raleigh, NC, and designing a new game in a crowded marketplace all happened under the gaming public's intense scrutiny. Undaunted, the studio has grown to 65 employees and seen the new game take shape over the course of three years.

Screenshot of game Lawbreakers
Above: Innovative use of changing gravity and gameplay functions like blindfiring ensures Lawbreakers should stand out from the crowd.

So was born Lawbreakers, a competitive shooter entering the crowded arena of first-person shooters, newly dominated by games like Blizzard's Overwatch. "The game started as my baby," says Bleszinski, "and when we began with the concept art, it immediately took shape."

Given his background, it's no surprise that Bleszinski is keenly aware of all the big-brand players among this popular genre. "There's no lack of shooters, and Halo, Call of Duty, and Battlefield will all keep going given the popularity of character-based action games," he says. Understanding that, Bleszinski explains that this new game started life as a form of spiritual successor to Unreal Tournament 3 and Quake. "Once players mastered Quake III, more people came in to play these games, but then Quake Champions added even more with effects, personalities, and more," he adds.

That has led to a process of constant iteration for Lawbreakers, which provides a fast-paced playing field for two teams competing in four thematically familiar but stylistically distinct game modes. "We were asking questions like 'what happens if you shoot in zero gravity?' that led the maps in some crazy directions, and 'we want to see a character that shoots electricity from its hands'."

While Bleszinski retains the role of creative lead on Lawbreakers, directing gameplay from his distinct and defined personal perspective (a situation he says had changed during his end days at Epic Games, where the scope of big business resulted in a diluting of that personal touch, and which precipitated his retirement) he now adds a crucial second responsibility at Boss Key: CEO.

Screenshot of game Lawbreakers
Above: Team battles in fast-paced FPS modes pit Lawbreakers against Blizzard's powerhouse Overwatch (and others).

Leading From The Front

"It's been about making important core decisions," he says of the responsibilities, "and being available to employees so I can talk about their 401K or whatever. And if someone is going through something personal, we can cut them a bit of slack."

Always comfortable in front of a crowd or speaking to the press, running the company has changed some of the public-facing events that made Bleszinski the poster-child for "famous" video game developers during the height of the Unreal and Gears campaigns. He cites presenting the original Gears of War at Grauman's Chinese Theater as a career highlight, and now accepts that "While I will do my share of events, I'm reducing the number of boondoggles — like going to the DICE conference in Vegas, that kind of stuff."

The change of focus these responsibilities bring is in part due to a settled, married life, and the maturity of having enjoyed tremendous success and recognizing that it's not just about designing chainsaws on guns or new movement techniques and map designs, but considering the lives of employees. "I'm nervous, but in a good way," he says of the days dealing with the pressures of the game design and execution, publishing deals, and promotion.

While Boss Key is a new studio, its employees include many industry veterans whom Bleszinski persuaded to move their talents to North Carolina, just another example of the pressures of being the boss. The move to blend veteran and enthusiastic new staff already seems to have paid off, with Lawbreakers passing the "Plays Great on Intel" performance tests using Intel integrated graphics. Additionally, the netcode has been independently reviewed as some of the best in the world.

These new responsibilities and pressures were evident in how quickly Bleszinski responded when asked what he hoped his legacy in this industry might be.

"I want the perception to be of a game designer who actually did something, who could throw ideas and see them happen," he says. And how other people might perceive and write about him? "Not too cool for the nerdy kids, and not too nerdy for the cool kids!" It's a line that Bleszinski has used before, emblematic of his thoughtful consideration of the communications with press and the wider gaming audience.

While Bleszinski's journey now looks impressively distant from the kid who appeared in Nintendo Power magazine for topping the leaderboard in Super Mario Bros., it's clearly far from over.

Retirement: The Sequel may have to wait a while.

An Introduction to Neural Networks With an Application to Games


The original article is published by Intel Game Dev on VentureBeat*: An introduction to neural networks with an application to games. Get more game dev news and related topics from Intel on VentureBeat.

By: Dean Macri

Pirate animated by neural network

Introduction

Speech recognition, handwriting recognition, face recognition: just a few of the many tasks that we as humans are able to quickly solve but which present an ever increasing challenge to computer programs. We seem to be able to effortlessly perform tasks that are in some cases impossible for even the most sophisticated computer programs to solve. The obvious question that arises is "What's the difference between computers and us?".

We aren't going to fully answer that question, but we are going to take an introductory look at one aspect of it. In short, the biological structure of the human brain forms a massive parallel network of simple computation units that have been trained to solve these problems quickly. This network, when simulated on a computer, is called an artificial neural network or neural net for short.

Figure 1 shows a screen capture from a simple game that I put together to investigate the concept. The idea is simple: there are two players each with a paddle, and a ball that bounces back and forth between them. Each player tries to position his or her paddle to bounce the ball back towards the other player. I used a neural net to control the movement of the paddles and through training (we'll cover this later) taught the neural nets to play the game well (perfectly to be exact).

Image Simple Ping-pong Game for Experimentation
Figure 1:Simple Ping-pong Game for Experimentation

In this article, I'll cover the theory behind one subset of the vast field of neural nets: back-propagation networks. I'll cover the basics and the implementation of the game just described. Finally, I'll describe some other areas where neural nets can be used to solve difficult problems. We'll begin by taking a simplistic look at how neurons work in your brain and mine.

Neural Network Basics

Neurons in the Brain
Shortly after the turn of the 20th century, the Spanish anatomist Ramón y Cajal introduced the idea of neurons as components that make up the workings of the human brain. Later, work by others added details about axons, or output connections between neurons, and about dendrites, which are the receptive inputs to a neuron, as seen in Figure 2.

Image of Simplified Representation of a Real Neuron
Figure 2:Simplified Representation of a Real Neuron

Put simplistically, a neuron functionally takes many inputs and combines them to produce either an excitatory or inhibitory output in the form of a small voltage pulse. The output is then transmitted along the axon to many inputs (potentially tens of thousands) of other neurons. With approximately 10^10 neurons and 6×10^13 connections in the human brain¹, it's no wonder that we're able to perform the complex processes we do. In nervous systems, massive parallel processing compensates for the slow (millisecond+) speed of the processing elements (neurons).

In the remainder of this article, we'll cover how artificial neurons, based on the model just described, can be used to mimic behaviors common to humans and other animals. While we can't simulate 10 billion neurons with 60 trillion connections, we can give you a simple worthy opponent to enrich your game play.

Artificial Neurons

Using the simple model just discussed, researchers in the middle of the 20th century derived mathematical models for simulating the workings of neurons within the brain. They chose to ignore several aspects of real neurons, such as their pulse-rate decay, and came up with an easy-to-understand model. As illustrated in Figure 3, a neuron is depicted as a computation block that takes inputs (X0, X1, …, Xn) and weights (W0, W1, …, Wn), multiplies them and sums the results to produce an induced local field, v, which then passes through a decision function, φ(v), to produce a final output, y.

Image Mathematical model of a neuron
Figure 3:Mathematical model of a neuron

Put in the form of a mathematical equation, this reduces to:

y = φ(v), where the induced local field v = W0·X0 + W1·X1 + … + Wn·Xn

I introduced two new terms, induced local field and decision function, while describing the components of this model so let's take a look at what these mean. The induced local field of a neuron is the output of the summation unit, as indicated in the diagram. If we know that the inputs and the weights can have values that range from -∞ to +∞, then the range of the induced local field is the same. If just the induced local field was propagated to other neurons, then a neural network could perform only simple, linear calculations. To enable more complex computation, the idea of a decision function was introduced. McCulloch and Pitts introduced one of the simplest decision functions in 1943. Their function is just a threshold function that outputs one if the induced local field is greater than or equal to zero and outputs zero otherwise. While some simple problems can be solved using the McCulloch-Pitts model, more complex problems require a more complex decision function. Perhaps the most widely used decision function is the sigmoid function given by:

φ(v) = 1 / (1 + e^(-v))

The sigmoid function has two important properties that make it well-suited for use as a decision function:

  • It is everywhere differentiable (unlike the threshold function), which enables an easy way to train networks, as we'll see later.
  • Its output includes ranges that exhibit both nonlinear and linear behavior.

Other decision functions, like the hyperbolic tangent φ(v)=tanh(v), are sometimes used as well. For the examples we'll cover, we'll use the sigmoid decision function unless otherwise noted.
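To make the neuron model concrete, here is a small Python sketch of the threshold and sigmoid decision functions and of a neuron's output y = φ(v); the function names are illustrative and are not taken from the sample code mentioned later in this article.

    import math

    def threshold(v):
        # McCulloch-Pitts decision function: 1 if the induced local field is >= 0.
        return 1 if v >= 0 else 0

    def sigmoid(v):
        # Sigmoid decision function: smooth and everywhere differentiable,
        # squashing the induced local field into the range (0, 1).
        return 1.0 / (1.0 + math.exp(-v))

    def neuron_output(inputs, weights, phi=sigmoid):
        # y = phi(v), where the induced local field v is the weighted sum of the inputs.
        v = sum(x * w for x, w in zip(inputs, weights))
        return phi(v)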

Connecting the Neurons

We've covered the basic building blocks of neural networks with our look at the mathematical model of an artificial neuron. A single neuron can be used to solve some relatively simple problems, but for more complex problems we have to examine a network of neurons, hence the term: neural network.

A neural network consists of one or more neurons connected into one or more layers. For most networks, a layer contains neurons that are not connected to one another in any fashion. While the interconnect pattern between layers of the network (its "topology") may be regular, the weights associated with the various inter-neuron links may vary drastically. Figure 4 shows a three-layer network with two nodes in the first layer, three nodes in the second layer, and one node in the third layer. The first-layer nodes are called input nodes, the third-layer node is called an output node, and nodes in the layers in between the input and output layers are called hidden nodes.

Image of a three layer neural network
Figure 4:A Three-Layer Neural Network

Notice the input labeled, x6, on the first node in the hidden layer. The fixed input (x6) is not driven by any other neurons, but is labeled as being a constant value of one. This is referred to as a bias and is used to adjust the firing characteristics of the neuron. It has a weight (not shown) associated with it, but the input value will never change. Any neuron can have a bias added by fixing one of its inputs to a constant value of one. We haven't covered the training of a network yet, but when we do, we'll see that the weight affecting a bias can be trained just like the weights of any other input.

The neural networks we'll be dealing with will be structurally similar to the one in Figure 4. A few key features of this type of network are:

  • The network consists of several layers. There is one input layer and one output layer with zero or more hidden layers
  • The network is not recurrent which means that the outputs from any node only feed inputs of a following layer, not the same or any previous layer.
  • Although the network shown in Figure 4 is fully connected, it is not necessary for every neuron in one layer to feed every neuron in the following layer.

Neural Networks for Computation

Now that we've taken a brief look at the structure of a neural network, let's take a quick look at how computation can be performed using a neural network. Later in the paper we'll learn how to go about adjusting weights or training a network to perform a desired computation.

At the simplest level, a single neuron produces one output for a given set of inputs and the output is always the same for that set of inputs. In mathematics, this is known as a function or mapping. For that neuron, the exact relationship between inputs and outputs is given by the weights affecting the inputs and by the particular decision function used by the neuron.

Let's look at a simple example that's commonly used to illustrate the computational power of neural networks. For this example, we will assume that the decision function used is the McCulloch-Pitts threshold function. We want to examine how a neural network can be used to compute the truth table for an AND logic gate. Recall that the output of an AND gate is one if both inputs are one and zero otherwise. Figure 5 shows the truth table for the AND operator.

Image of data table
Figure 5:Truth Table for AND Operator

We want to construct a neural network that has two inputs, one output, and calculates the truth table given in Figure 5.

image
Figure 6:Neuron for Computing an AND Operation

Figure 6 shows a possible configuration of a neuron that does what we want. The decision function is the McCulloch-Pitts threshold function mentioned previously. Notice that the bias weight (w0) is -0.6. This means that if both X1 and X2 are zero then the induced local field, v, will be -0.6 resulting in a 0 for the output. If either X1 or X2 is one, then the induced local field will be 0.5+(-0.6)= -0.1 which is still negative resulting in a zero output from the decision function. Only when both inputs are one will the induced local field go non-negative (0.4) resulting in a one output from the decision function.
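The walk-through can be checked with a few lines of Python, reusing the neuron_output and threshold helpers sketched earlier; the 0.5 input weights are taken from the induced-local-field arithmetic above.

    # Truth table of the Figure 6 neuron: bias weight -0.6 (given) and input
    # weights of 0.5 on X1 and X2 (implied by the arithmetic in the walk-through).
    and_weights = [-0.6, 0.5, 0.5]            # [bias, w1, w2]; the bias input is fixed at 1
    for x1 in (0, 1):
        for x2 in (0, 1):
            y = neuron_output([1, x1, x2], and_weights, phi=threshold)
            print(x1, x2, '->', y)            # prints 1 only when x1 = x2 = 1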

While this use of a neural network is overkill for the problem and has a fairly trivial solution, it's the start of illustrating an important point about the computational abilities of a single neuron. We're going to examine this problem and another one to understand the concept of linearly separable problems.

Look at the "graph" in Figure 7. Here, the x-axis corresponds to input 0 and the y-axis corresponds to input 1. The outputs are written into the graph and correspond to the truth table from Figure 5. The gray shaded area represents the region of values that produce a one as output (if we assume the inputs are valid along the real line from zero to one).

Image of graph
Figure 7:Graph of an AND Function

The key thing to note is that there is a line (the lower left slope of the gray triangle) that separates input values that yield an output of one from input values that yield an output of zero. Problems for which such a "dividing line" can be drawn (such as the AND problem), are classified as linearly separable problems.

Now let's look at another Boolean operation, the exclusive-or (XOR) operation as given in Figure 8.

Image of data table
Figure 8:Truth Table for the XOR Operator

Here, the output is one only if one, but not both, of the inputs is one. The "graph" of this operator is shown in Figure 9.

Image of graph
Figure 9:Graph of an XOR Function

Notice that the gray region surrounding the "one" outputs is separated from the zero outputs by not one line, but two lines (the lower and upper sloping lines of the gray region). This problem is not linearly separable. If we try to construct a single neuron that can calculate this function, we won't succeed.

Early researchers thought that this was a limitation of all computation using artificial neurons. Only with the addition of multiple layers was it realized that neurons with linear decision boundaries could be combined to solve problems that are not linearly separable. Figure 10 shows a simple, three-neuron network that can solve the XOR problem. We're still assuming that the decision function is the McCulloch-Pitts threshold function.

image
Figure 10:Network for Calculating XOR Function

All the weights are fixed at 1.0 with the exception of the weight labeled as w=-2. For completeness, let's quickly walk through the outputs for the four different input combinations.

  • If both inputs are 0, then neurons 0 and 1 both output 0 (because of their negative biases). Thus, the output of neuron 2 is also 0 due to its negative bias and zero inputs.
  • If X0 is 1 and X1 is 0, then neuron 0 outputs 0, neuron 1 outputs 1 (because 1.0+(-0.5)=0.5 is greater than 0) and neuron 2 then outputs a 1 also.
  • If X0 is 0 and X1 is 1, then neuron 0 outputs 0, neuron 1 outputs 1, and neuron 2 outputs 1.
  • Finally, if both inputs are 1, then neuron 0 outputs a 1 that becomes a -2 input to neuron 2 (because of the negative weight). Neuron 1 outputs a 1 which combines with -2 and the -0.5 bias to produce an output of 0 from neuron 2.

The takeaway from this simple example is that to solve non-linearly separable problems, multi-layer networks are needed. In addition, while the McCulloch-Pitts threshold function works fine for these easy-to-solve problems, a more mathematically friendly (i.e., differentiable) decision function is needed to solve most real-world problems. We'll now get into the way a neural network can be trained (rather than being programmed or structured) to solve a particular problem.
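For completeness, the Figure 10 network can also be verified in code, again reusing the helpers from the earlier sketch. The biases of -0.5 on neurons 1 and 2 come from the walk-through; the text only constrains neuron 0's bias to lie between -2 and -1, so -1.5 is an assumed value.

    # Figure 10 network: all weights are 1.0 except the -2 weight from neuron 0 to neuron 2.
    def xor_net(x0, x1):
        h0 = neuron_output([1, x0, x1], [-1.5, 1.0, 1.0], phi=threshold)   # neuron 0
        h1 = neuron_output([1, x0, x1], [-0.5, 1.0, 1.0], phi=threshold)   # neuron 1
        # Neuron 2: weight -2 on neuron 0's output, +1 on neuron 1's output.
        return neuron_output([1, h0, h1], [-0.5, -2.0, 1.0], phi=threshold)

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, '->', xor_net(a, b))   # prints 0, 1, 1, 0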

Learning Processes

Let's go way back to the definition of the output of a single neuron (we've added a parameter for a particular set of data, k):

yj(k) = φ(vj(k)) = φ( Σi wji(k)·xi(k) )

Equation 1

Note here that xi = yi, the output from neuron i, if neuron j is not an input neuron. Also, wji is the weight connecting the output of neuron i to an input of neuron j.

We want to determine how to change the values of the various weights, w(k), when the output, y(k), doesn't agree with the result we expect or require from a given set of inputs, x(k). Formally, let d(k) be the desired output for a given set of inputs, k. Then, we can look at the error function, e(k)=d(k)-y(k). We want to modify the weights to reduce the error (ideally to zero). We can look at the error energy as a function of the error:

ε(k) = (1/2)·e(k)²

Equation 2

Adjusting the weights now becomes a problem of minimizing ε(k). We want to look at the gradient of the error energy with respect to the various weights, ∂ε(k)/∂wji(k). Combining Equation 1 and Equation 2 and using the chain rule (and recalling that yj(k) = φ(vj(k)) and vj(k) = Σi wji(k)·yi(k)), we can expand this derivative into something more manageable:

∂ε(k)/∂wji(k) = [∂ε(k)/∂ej(k)] · [∂ej(k)/∂yj(k)] · [∂yj(k)/∂vj(k)] · [∂vj(k)/∂wji(k)]

Equation 3

Each of the terms in Equation 3 can be reduced so we get:

∂ε(k)/∂wji(k) = -ej(k)·φ'(vj(k))·yi(k)

Equation 4

Where φ'() signifies differentiation with respect to the argument. Adjustments to the weights can be written using the delta rule:

Δwji(k) = -η·∂ε(k)/∂wji(k) = η·ej(k)·φ'(vj(k))·yi(k)

Equation 5

Here, η is a learning-rate parameter that varies from 0 to 1. It determines the rate at which the weights are changed as we move down the error gradient. If η is 0, no learning will take place. We can re-write Equation 5 in terms of what is known as the local gradient, δj(k):

Δwji(k) = η·δj(k)·yi(k)

Equation 6

Here, the local gradient is δj(k) = ej(k)·φ'(vj(k)).

Equation 6 can be used directly to update the weights of a neuron in the output layer of a neural network. For neurons in the hidden and input layers of a network, the calculations are slightly more complex. To calculate the weight changes for these neurons, we use what is known as the back-propagation formula. I won't go through the details of the derivation, but the formula for the local gradient reduces to:

δj(k) = φ'(vj(k)) · Σn δn(k)·wnj(k)

Equation 7

In this formula, wnj(k) represents the weight connecting the output of neuron j to an input of neuron n. Once we've calculated the local gradient, δj, for this neuron, we can use Equation 6 to calculate the weight changes.

To compute the weight changes for all the neurons in a network, we start with the output layer. Using Equation 6 we first compute the weight changes for all the neurons in the output layer of the network. Then, using Equation 6 and Equation 7 we compute the weight changes for the hidden layer closest to the output layer. We use these equations again for each additional hidden layer working from outputs toward inputs and from right to left, until weight changes for all the neurons in the network have been calculated. Finally we apply the weight changes to the weights, at which point we can recompute the network output to see if we've gotten closer to the desired result.
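The whole procedure (forward pass, local gradients via Equations 6 and 7, then the delta-rule weight updates) can be condensed into a short, self-contained Python/NumPy sketch. It is illustrative rather than production code; the layer sizes, learning rate, and epoch count are assumptions, and it is trained here on the XOR problem from earlier.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    class Layer:
        def __init__(self, n_in, n_out, rng):
            # One weight per input plus one bias weight per neuron (bias input fixed at 1).
            self.w = rng.uniform(-1.0, 1.0, size=(n_out, n_in + 1))

        def forward(self, x):
            self.x = np.append(x, 1.0)       # remember the inputs (with bias) for training
            self.y = sigmoid(self.w @ self.x)
            return self.y

    def train_step(layers, x, d, eta=0.5):
        # Forward pass through every layer.
        y = x
        for layer in layers:
            y = layer.forward(y)
        # Backward pass: Equation 6 for the output layer, Equation 7 for hidden layers.
        deltas = None
        for idx in reversed(range(len(layers))):
            layer = layers[idx]
            phi_prime = layer.y * (1.0 - layer.y)          # sigmoid derivative phi'(v)
            if idx == len(layers) - 1:
                delta = (d - layer.y) * phi_prime          # e(k) * phi'(v)  (output layer)
            else:
                nxt = layers[idx + 1]
                # Sum of downstream local gradients weighted by the connecting weights
                # (the bias column is dropped: it connects to no upstream neuron).
                delta = phi_prime * (nxt.w[:, :-1].T @ deltas)
            layer.dw = eta * np.outer(delta, layer.x)      # delta rule (Equation 6)
            deltas = delta
        # Apply the weight changes only after the full backward pass.
        for layer in layers:
            layer.w += layer.dw
        return y

    # Train a 2-4-1 network on the XOR truth table, presenting the patterns in random
    # order and updating after each one (method 2 above). Convergence is not
    # guaranteed for every random seed; more epochs may be needed.
    rng = np.random.default_rng(0)
    net = [Layer(2, 4, rng), Layer(4, 1, rng)]
    patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    for epoch in range(10000):
        for i in rng.permutation(4):
            x, d = patterns[i]
            train_step(net, np.array(x, float), np.array(d, float))

    for x, d in patterns:
        y = np.array(x, float)
        for layer in net:
            y = layer.forward(y)
        print(x, '->', round(float(y[0]), 2))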

Network training can occur in several different ways:

  • The weight changes can be accumulated over several input patterns and then applied after all input patterns have been presented to the network.
  • The weight changes can be applied to the network after each input pattern is presented.

Method 1 is most commonly used. When method 2 is used, the patterns are presented to the network in a random order. This is necessary to keep the network from possibly becoming "trapped" in some presentation-order-sensitive local energy minimum.

Before looking at an example problem, let me wrap up this section by noting that I've only discussed one type of learning process: back-propagation using error-correction learning. Other types of learning processes include memory-based learning, Hebbian learning and competitive learning. Refer to the references at the end of this article for more information on these techniques.

Putting Neural Nets to Work

Let's take a closer look at the game I described in the introduction. Figure 11 shows a screen capture of the game after several generations of training.

Image of Ping-Pong Sample Application
Figure 11:Ping-Pong Sample Application

The training occurs by shooting the ball from the center with a random direction (the speed is fixed). The neural network is given as input the (x,y) position of the ball as well as the direction and the y position of the paddle (either red or blue depending upon which paddle the ball is heading towards). The network is trained to output a y direction that the paddle should move.

I created a three-layered network with three nodes in the input layer, ten nodes in the hidden layer, and one node in the output layer. The input nodes each get the same five inputs corresponding to the (x,y) position and direction of the ball and the y position of the paddle. These nodes are fully connected to the nodes in the hidden layer, which are in turn fully connected to the output node. Figure 12 shows the layout of the network with inputs and outputs. Weights, biases, and decision functions are not shown.

Image of Neural Network for Ping Pong Game
Figure 12:Neural Network for Ping Pong Game

The network learns to move the paddle in the same y-direction that the ball is heading. After several thousand generations of training, the neural network learns to play perfectly (the exact number of generations varies because the network weights are initialized to random values).
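A sketch of how one training generation might look, reusing the Layer class and train_step function from the back-propagation example above, follows. The encodings are assumptions (positions normalized to [0, 1], the ball direction given as dx and dy, and a desired output of 1 when the ball moves in the positive y direction), and for simplicity the five values feed a 5-10-1 network directly rather than reproducing the article's three-input-node arrangement.

    import numpy as np

    rng = np.random.default_rng(1)
    pong_net = [Layer(5, 10, rng), Layer(10, 1, rng)]   # 5 inputs, 10 hidden, 1 output

    def train_generation(ball_x, ball_y, ball_dx, ball_dy, paddle_y):
        x = np.array([ball_x, ball_y, ball_dx, ball_dy, paddle_y])
        d = np.array([1.0 if ball_dy > 0 else 0.0])     # move the paddle the way the ball moves
        return train_step(pong_net, x, d)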

I experimented with using a paddle speed that was slower than the speed of the ball so that the networks would have to do some form of prediction. With the network from Figure 12 some learning took place but the neural nets weren't able to learn to play perfectly. Some additional features would have to be added to the network to enable it to fully learn this problem.

In this example, the neural network is the only form of AI that the computer controlled opponent has. By varying the level of training, the computer opponent can vary from poor play to perfect play. Deciding when to stop the training is a non-trivial challenge. One easy solution would be to train the network for some number of iterations up front (say 1000) and then each time the human player wins, train the network an additional 100 iterations. Eventually this would produce a perfect computer-controlled opponent, but should also produce a progressively more challenging opponent.

Non-trivial Applications of Neural Nets

While the ping-pong example provides an easy to understand application of neural nets to artificial intelligence, real-world problems require a bit more thought. I'll briefly mention a few possible uses of neural nets, but realize that there isn't going to be a one-size-fits-all neural network that you can just plug into your application and solve all your problems. Good solutions to specific problems require considerable thought and experimentation with what variables or "features" to use as network input and outputs, what size and organization of network to use, and what training sets are used to train the network.

Using a neural network for the complete AI in a game probably isn't going to work well for anything beyond the simple ping-pong example previously discussed. More likely than not, you're going to use a traditional state machine for the majority of AI decisions but you may be able to use neural nets to complement the decisions or to enhance the hard-coded state machine. An example of this might be a neural net that takes as input such things as health of the character, available ammunition, and perhaps health and ammunition of the human opponent. Then, the network could decide whether to fight or flee at which point the traditional AI would take over to do the actual movement, path-finding, etc. Over several games, the network could improve its decision making process by examining whether each decision produced a win or a loss (or maybe less globally, an increase or decrease in health and/or ammunition).

One area that intrigues me and which has had some research devoted to it is the idea of using neural networks to perform the actual physics calculations in a simulation². I think this has promise because training a neural network is ultimately a process of finding a function that fits several sets of data. Given the challenge of creating physical controllers for physically simulated games, I think neural networks are one possibility for solutions there as well.

The use of neural nets for pattern recognition of various forms is their ultimate strength. Even in the problems described, the nets would be used for recognizing patterns, whether health and ammunition, forces acting on an object, or something else, and then taking an appropriate action. The strength lies in the ability of the neural nets to be trained on a set of well-known patterns and then extract meaningful decisions when presented with unknown patterns. This ability to extrapolate from existing data to new data can be applied to the areas of speech recognition, handwriting recognition and face recognition mentioned in the introduction. It can also be beneficial in "fuzzy" areas like finding trends in stock market analysis.

Conclusion

I've tried to keep the heavy math to a minimum, the details about sample code pretty much out of the picture, and still provide a solid overview of back-propagation neural networks. Hopefully this article has provided a simple overview of neural networks and given you some simple sample code to examine to see if neural networks might be worth investigating for decision-making in your applications. I'd recommend checking out some of the references to gain a more solid understanding of all the quirks of neural networks. I encourage you to experiment with neural networks and come up with novel ways in which they can add realism to upcoming game titles or enhance your productivity applications. 

Character Animation: Skeletons and Inverse Kinematics

The original article is published by Intel Game Dev on VentureBeat*: Character Animation: Skeletons and Inverse Kinematics. Get more game dev news and related topics from Intel on VentureBeat.

By: Steve Pitzel

Introduction

So you’ve built your Gothic castle-it’s time for your hero to dash into play and defend the battlements. Your gaming engine supports skeletons, an animation system called Inverse Kinematics (IK), and complex hierarchical setups for control. You’re eager to try them all. Great! But if you’ve never used a skeleton before, or even applied IK to a jointed model, you’re in for some surprises. Some of them you’ll love, some you may not…at least not at first.

This paper introduces some basic and intermediate principles for using skeletons with both Inverse Kinematics and a top-down rotation system to move animated characters called Forward Kinematics. Along the way, it may answer some of your burning questions. When you’re done reading, hopefully you’ll feel energized enough to leap over any current roadblocks.

Forward Kinematics vs. Inverse Kinematics

Not long ago, game characters were a lot like shellfish: they were basically piles of rock-hard segments. The best animators could hope for was a platform that could handle basic hierarchies, allowing them to position and keyframe all those segments at once. Forget realistically bending an elbow or flexing a muscle-it just wasn’t going to happen. Without the processing power available today to handle real-time surface deformations, games featured a lot of robots and armor.

Advanced Processor Performance Enables Computer-generated Skeletons Animated with Inverse Kinematics

For a while, game developers suffered in silence. Then PCs became faster and smarter, which allowed game engines to grow faster and smarter too. Advanced processing performance enabled animators to use computer-generated skeletons not only to hold segmented characters together, but to actually deform the skins of those characters. IK became the control system of choice.

The Scoop on Hierarchies

To understand Inverse Kinematics, it’s important to understand basic hierarchies and Forward Kinematics. Outside of straight target-to-target morphing, animators almost always use hierarchies of one sort or another to animate their characters. A hierarchy is a parent-child-sibling relationship. The process for assigning one object as the parent or child of another is often called “parenting” and sometimes “grouping.” In the case of a human leg, the parent of the upper leg would be the hip, the lower leg would be a child of the upper leg, and the foot would be the child of the lower leg. The right upper leg would share a sibling relationship with the left, and both would be children of the hips. (See Figure 1.)

 


Figure 1. Basic lower body setup and hierarchy for Forward Kinematics

Another school of thought on hierarchies uses the “inverted tree” model, where the parent (the hip or pelvis) is referred to as the “root.” While no part of a hierarchy is ever really referred to as the “trunk,” the children, and their children, become “branches.” (See Figure 1.)

The Trouble with Forward Kinematics

If you want to move a character’s hand under a basic hierarchy, you would first rotate the upper arm, then the forearm, and finally the hand itself until the entire limb is in place. This “top-down” system of rotation is called Forward Kinematics, and it’s great for basic animation.

Animation seems simple with Forward Kinematics and basic hierarchies-until your character has to do something like walk without her feet sliding across the floor, or stay in place while turning her body to look behind her. In those cases, you have two choices, neither of which is aesthetically pleasing:

  • Rotate the entire hierarchy (making the feet slide uselessly).
  • Rotate only the top of the hierarchy (the hip) leaving the other pieces behind (very painful to watch).

At that point, Forward Kinematics (and your character) falls apart. Planting a walking character’s foot is almost impossible, because the single most difficult thing for an object at the bottom of a hierarchy to do is to stay in one place while the objects above it move. The illusion of stability actually takes constant readjustment. (See Figure 2.)


Figure 2. The difficulty with basic hierarchies and Forward Kinematics is that when you turn the entire hierarchy, the feet slide. Keep the feet planted and the result can be painful.

The Advent of Computer-generated Skeletons using Inverse Kinematics

Luckily, today game animators are able to use many of the same animation techniques once reserved only for feature film. In the feature film world, constantly readjusting a dinosaur’s toes to compensate for even slight movements in the ankle, knees, thighs and hips above them led to slips and slides on the movie screen, which measured in feet rather than in pixels. It became clear that something other than Forward Kinematics was needed to realistically animate a character. Another cinematic problem-believable depiction of the creasing and bulging of joints as a character moves-could only be solved using a model made from a continuous mesh. But how do you make something solid bend and walk? Turning to an earlier staple of film special effects, Stop Motion* software developers incorporated an armature system, or skeleton, into their computer-generated (CG) models and developed a system called Inverse Kinematics to control the skeletons. (See Figure 3.)


Figure 3. Computer-generated skeleton

Conceptually, the CG skeleton is easy to understand. It mimics our own physical skeleton-a rigid structure with tendons at each joint to hold it all together. It is the inverted tree hierarchy, in fact, with the parent often called the “root” of the skeleton. The root is usually located at the character’s natural center of gravity; this is often the base of the spine for bipeds. What makes this system better than a basic (non-skeletal) hierarchy are the tendons, implied but unseen in computer 3D programs. As in a hierarchy built from non-skeletal objects, a skeleton can be manipulated top-down through Forward Kinematics. But perhaps the biggest benefit of the skeleton is its ability to transfer control from the top of the hierarchy to the bottom. Since it works or “solves” from the bottom up (the inverse of Forward Kinematics), the process is called “Inverse Kinematics,” or IK. This is a much more intuitive and much less time-consuming process than dealing with top-down rotations.

NOTE: Some animation packages allow the application of Inverse Kinematics to “non-skeletal” objects such as cubes, spheres and null objects, although creating and operating IK chains built this way is usually much more complex.


Figure 4. Simple bone setup with IK handle. Bones are often depicted wider near the root end and narrower toward the effector. The green triangle formed by the bend of the “elbow,” the top of the shoulder, and the wrist, indicates the elbow’s rotation plane. This plane is adjusted to allow the elbow to swing outward.

The bottommost link of a skeletal chain is often called a goal, or end-effector. (See Figure 4.) (Be aware that these names can mean different things in different programs.) In the case of segmented models, the “flesh” of your character, the actual limbs, torso, head, etc., are then parented to the nearest bone (or joint) of the skeleton. In the case of single-mesh polygonal or seamless NURBS models, you “skin” or “envelope” your model over the underlying skeleton.

NOTE: A character’s limbs, head, skin, or clothing are often referred to as its “geometry.” Geometry can be rendered, while skeletons are usually invisible to the renderer.

With skeletons, instead of animating the geometry, the skeleton itself is animated, and the geometry follows suit. Since the bones have invisible tendons between them, the joints won’t separate when only part of the hierarchy is rotated. What’s more, you can apply IK to the skeletal chain.

Advantages of Inverse Kinematics

So what does all this mean to an animator? It means you can yank the bottom of the chain-the hand, for instance-and all of the bones above the hand automatically rotate into position. It’s a real plus when a character has to reach for something, as in picking an apple from a tree, for example. Rather than mentally computing how far to rotate the shoulder, upper arm, elbow and wrist to get that hand into position, the animator simply places the hand on the apple and the rest of the arm follows.
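
To make the idea of solving from the bottom up concrete, here is a small generic sketch of an analytic two-bone IK solve in a plane (a law-of-cosines approach shown for illustration only; it is not how any particular package implements its solver):

import math

def two_bone_ik(target_x, target_y, upper_len, lower_len):
    # Clamp the target distance so the chain can always reach it (and avoid dividing by zero).
    dist = max(min(math.hypot(target_x, target_y), upper_len + lower_len - 1e-6), 1e-6)

    # Law of cosines gives the interior angle at the elbow; the bend is pi minus that angle.
    cos_elbow = (upper_len**2 + lower_len**2 - dist**2) / (2 * upper_len * lower_len)
    elbow_bend = math.pi - math.acos(max(-1.0, min(1.0, cos_elbow)))

    # Shoulder rotation: aim at the target, then subtract the offset caused by the bent elbow.
    cos_offset = (upper_len**2 + dist**2 - lower_len**2) / (2 * upper_len * dist)
    shoulder = math.atan2(target_y, target_x) - math.acos(max(-1.0, min(1.0, cos_offset)))

    return shoulder, elbow_bend   # joint rotations, in radians, for one of the two mirror solutions

Given only the wrist (end-effector) position, the solver returns the shoulder and elbow rotations directly, which is exactly the work the animator no longer has to do by hand.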

An additional bonus to using an underlying skeleton is that the bones and joints provide a natural control structure for deforming the surface. With deformation, individual control points on the geometry are moved in relation to the bones and joints. This is a marked improvement over moving an entire piece of geometry at once.

When a model is “skinned” or “enveloped” over a skeleton, either Inverse or Forward Kinematics can be used to animate it. There are still benefits to using Forward Kinematics on some parts of your skeleton. Some packages allow the use of both Forward and Inverse Kinematics on the same skeletal chain and even allow you to switch between them on the fly. Others allow you to switch only with great effort on your part. To do this, you usually have to employ constraints (think of constraints as magnetic forces that can be keyframed on and off), expressions (mathematical formulas), or a combination of both.

Bone Basics

First let’s take a look at software packages, then at our own bones and joints, then we’ll look at how they’re controlled.

Software Packages

Skeletons are neither represented nor controlled the same way in every application. Some packages, such as IZWare Mirai*, Alias Wavefront Maya* and PowerAnimator*, allow you to create complete, fully attached skeletons with branching limbs. At the time of this paper’s publishing, Softimage 3D* and XSI* do not, but they give you the means to glue together your arm chains, leg chains and neck chains through parenting and a separate control system called “constraints.”


Figure 5. When geometry is “skinned” or “enveloped,” surface control points are controlled by the bone(s) nearest them. When bones share control over a surface, very smooth tucks and bends are possible. But even if surface deformation isn’t your desired result, skeletons are great for holding segmented models together, their joints providing natural pivot points for all those pieces.

Some packages, while offering an Inverse Kinematics system, do not provide bones at all. Instead, you build your own skeleton out of primitive objects or assign skeletal properties to individual parts of each character. Others, like Softimage 3D, Hash Animation: Master*, and Discreet 3d Studio Max*, provide the option of doing both, allowing you to incorporate other objects, such as nulls (usually a center or pivot with no renderable geometry) or dummy objects, into their normal skeletal hierarchies. We’ll call all of these objects “nulls” here for sake of brevity. These nulls can be scaled, rotated, or translated, moving the control points of your character’s mesh right along with them. This technique is commonly used in Softimage 3D for things like muscle bulges and chest expansion. (See Figure 6.)


Figure 6. Simple Softimage 3D* muscle bulge setup using a “null” object parented directly into the skeletal hierarchy (purple object and box). Translation, rotation and scaling of the null will affect the position and bulging of the muscle and can easily be tied to elbow rotation through expressions. The black box in the schematic to the left shows the arm geometry as the parent of this family. The blue boxes are the root joint, elbow and wrist joints, which rotate the “bones” between them. Note that in Softimage 3D, joints are selectable objects, and bones are not. In Softimage XSI*, both are selectable.

Individual bones are built with their own set of local axes, usually (but not always) X, Y and Z. As with geometry, the placement and orientation of these axes determine the pivot point of the bone as well as its rotation direction. Generally bones can be thought of as two pieces welded together-the bone itself and the joint that allows the bone to rotate. It’s important to note, however, that not all packages make this distinction. Softimage XSI and 3D, for example, refer to the entire bone segment as a joint. In Maya and in IZWare Mirai, the joint is an attached, but separately selectable object. This allows deforming objects such as “flexors” to be placed either at the joint itself or along the length of a bone. Flexors can facilitate realistic creasing, such as on the inside of the elbow (a good candidate for joint positioning), and muscle bulging (the flexor positioned along the bone).

In some cases, axes are fixed-the user is not allowed to change the orientation of the bone’s center. In Softimage 3D, the X-axis always faces down the length of the bone. Other packages, like Maya, allow you to choose the orientation of your center. These packages even let you rotate that center freely, regardless of the bone’s orientation, although doing so haphazardly can lead to unpredictable results.

Why does it matter which way the pivots are oriented? Professional animation isn’t only about doing things correctly-it’s also about doing them quickly. If all of your bones are oriented the same way (we’ll go with positive X running down the length of each), it means all of your fingers, knees, elbows and vertebrae bend forward or backward around the same axis. With most setups, that axis is Z. This makes creating expressions a snap. You don’t have to guess or use trial and error to know which way your character is going to move.
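
For example, with every joint bending around local Z, a single muscle-bulge expression in the spirit of the Figure 6 setup can be reused on any elbow. Written here as plain Python for illustration rather than any package's actual expression language:

def bicep_bulge_scale(elbow_rotation_z_degrees, max_bulge=0.3):
    # Map an elbow bend of 0-90 degrees onto a scale factor of 1.0-1.3 for the muscle null.
    bend = max(0.0, min(elbow_rotation_z_degrees / 90.0, 1.0))
    return 1.0 + max_bulge * bend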

An Example of Surface Deformation using Bones

Hold your hand out in front of you with your palm facing up. Watch your forearm. Now, without moving your shoulder, rotate your hand until your thumb points at the ceiling. That not-so-subtle twist you see between wrist and elbow would be impossible (or at least very painful) if your forearm rotated as one solid piece. Notice that the flesh closer to your hand rotates right along with your wrist joint, but that the effect falls off toward your elbow. With surface deformation, each control point on the overlaid geometry is controlled by the nearest joint(s) or bone(s).

This means that characters no longer need be loose piles of rock-hard segments. Elbows can bend naturally; the biceps can even bulge and flatten. (See Figure 5.)

Bone Control

All right, so we’re back where we started: you’ve purchased an application with a complete skeletal and IK system. Your problems are over, right? Well, not exactly. The bad news is that an IK system is not always more intuitive than Forward Kinematics. For things like picking an apple or planting feet on a floor during a walk cycle, IK is intuitive and easy to use, provided you’ve set it up properly. That “setup” is where working with bones and IK can be trying. The setup takes some time to understand.

Where to Begin

Look at your own joints. Before you’re in too deep, consider how your body works and then remember that a computer-generated (CG) skeleton is not exactly a human skeleton-it’s a representation of one. Although animation tools are getting closer and closer in their ability to reflect reality, it’s still up to you, the animator, to make your animations believable.

Think of the various ways your joints work. Your elbows and knees move differently than the joints in your shoulders and hips, which work a little differently from each other, and all are different from the vertebrae in your back and neck. Generally, animation packages give you one or two types of bones with which to work. They also provide a few different options for controlling the bones. Although we humans have several hundred bones supporting our feet, legs, hands and organs, there is rarely a need to match our internal skeleton bone for bone when building a CG skeleton. That method is often counterproductive from an animation standpoint: having too many bones fighting for control of your surface is like having too many cooks in the kitchen. Compromise. Use the fewest number of bones possible. Again, this is meant to be believable, not real.

Joint Types and Control

To be convincing, you need two kinds of joints: ones that can swing all the way around in the socket (which is really more than any of yours do without excruciating pain), and joints that only bend in one direction, like your elbow. Softimage 3D handles this by giving the first (root) joint of every skeletal chain the ability to spin freely on all axes, more or less imitating a ball-and-socket joint. After that, you have the choice of building either a “3D” skeleton, which treats every joint like the first one, or a “2D” skeleton which, after the ball-like root, only bends in one direction along one plane, more like your elbow. The “3D” skeleton is useful for long-necked creatures or things like seaweed and chains. The “2D” skeleton, which starts off with a ball-and-socket joint and ends up with elbow-like joints, is really best for arms and legs. It’s also best when you’re using no more than two links at a time.

Dealing with Challenges

What if you have to use more than two links at a time? Let’s talk about some common challenges, like adding a third link for things like feet or using IK to animate a character picking an apple.

The Third Link Dilemma: What About Adding Feet?

Two links are fairly straightforward in a chain controlled by IK, but the foot represents a third link. How do you deal with adding feet? In Softimage 3D, the ankle is really two things: the end of the leg chain and the start of the foot chain. To put these two together you’ll need to parent the foot chain beneath the end-effector of the leg chain or use a “constraint.” We’ll be discussing constraints later. Let’s talk about why a chain is difficult to control with IK after two links.

Strictly speaking, a chain controlled by IK is controlled by the bottommost link or joint of the chain. Any movement of that bottom joint, or end-effector, rotates every bone to some degree all the way up to the top of the hierarchy.

Consider a chain with two leg bones (thigh and calf) where the ankle is the bottom joint and end-effector. Pushing the ankle straight up results in a bending of the knee and a rotation of the thigh at the hip, a fairly straightforward motion. Let’s add another link: a bone for the foot. That same push, this time with the end-effector on the ball of the foot, now results in an ankle bend, a knee bend and a rotation of the thigh at the hip. But the order in which each joint bends and the amount each rotates will vary greatly with only slight adjustments, making this setup difficult to control. (See Figure 7.)


Figure 7. Even though deformation is smoother if more bones are present, it’s best to break up the actual IK control of those bones to chains of no more than two links. The leftmost setup (purple arrow) has a one-link foot chain parented to a two-link leg chain in Softimage 3D*. This setup gives you an end-effector on the ankle and another on the toe. This is preferable to using only one end-effector on a three-link chain. Generally in IK, using more bones means less control.

What was an intuitive, easily controlled motion with two bones goes completely haywire with three or more. This is one reason you’ll see many chains parented or constrained together in a typical Softimage 3D skeletal setup. Here’s where you look to the IK solutions in your software. Despite the creative use of constraints and parenting it requires, Softimage 3D offers a solid IK solution for 3D figure animation.

Other packages provide different solutions. Maya and Mirai offer the ability to create complete skeletons from pelvis to toe. (See Figure 8.) Maya handles the unpredictability of long chains through the use of separate “solvers” placed onto the chains themselves. The animator declares the start and end of a solver’s influence simply by placing them where they’re needed. To easily control an arm, the solver could be stretched from shoulder to wrist. Another solver might be placed from the wrist to the base of the middle finger. Animation of the hand and arm involves manipulating each solver separately.

Maya also offers a special “spline” solver for animating long-necked creatures. With this solver, a curve (spline) is drawn through the vertebrae of the neck. As the curve is manipulated, the vertebrae rotate to take on the shape of the curve-a tremendous solution for sea creatures!

IZWare (formerly Nichimen) animation software has worked behind the scenes since Walt Disney’s Tron* and is famous for its polygonal modeling packages. IZWare Mirai, a newcomer on the “off-the-shelf” animation package scene, helps beginning character animators get up and animating in no time. This is because IZWare Technologies has done so much work up front. Mirai not only allows the creation of complete branching skeletons as in Maya, it also provides ready-made skeletons that are perfectly suitable for many character situations (including humans, dogs and even ticks).


Figure 8. Ready-made models and skeletons from Mirai. A boy, his dog, and one really huge tick!

Solving Automatic Mirroring and Negative-rotation Problems

IZWare has also solved some of the technical director’s headaches by adding in a lot of expressions and character controls for automatically mirroring (or even opposing) limb movements. Examples of these movements include the opposing arm and leg swings of a walk or run cycle (one limb forward, one back) and a conductor raising his arms. You only need to animate one arm or leg and then choose whether you wish to animate the opposing limb automatically or not. This also corrects the negative-rotation problem that comes up when skeletons are built in halves and then mirror-copied in other programs. (For example, since the center of each mirror-copied joint has also been “mirrored,” the joints naturally rotate in opposite directions.)

Dealing with IK Constraints

Okay, you built your skeleton, and you’re using IK to control his feet and keep them planted on the floor. You’ve also used IK on his hands. He takes a step forward, and notice what happens to his arms. (See Figure 9.)


Figure 9. Positions for end-effectors in IK are keyframed in world space, not in reference to the parent’s location (as in a simple hierarchy). That’s actually what makes planting a foot with IK so easy-but it doesn’t work well for wrist positions, which need to relate back to the skeleton. The easy fix is to constrain the wrist end-effectors to null objects or locators (crosshairs in the picture to the right) and then parent those locators back into the hierarchy.

This is where constraints come in. It turns out that it’s rarely a good idea to keyframe an end-effector directly. Doing so actually causes most of the problems you’ll find with IK and can prevent your character from accomplishing many of those nifty moves you’ve so carefully plotted on your storyboard.

Constraints can best be described as powerful magnetic forces that can be turned off and on at will. They allow one object (even null objects) to affect other objects. Different constraints can do different things. Constraints almost always act with, and upon, the centers or pivots of objects, which makes the placement of an object’s center or pivot point extremely important. Although some packages have more types of constraints than others, three basic constraints are staples:

  • Aim or directional
  • Orientation
  • Point or positional

Aim or Directional

An aim or directional constraint causes the “affected” object to constantly aim an axis (some packages let you choose which) at the center or pivot of a “target” object. Think of a crowd at Wimbledon: faces always following the tennis ball as it bounds over the net and back.

Orientation

An orientation constraint is a little more like synchronized swimmers. As one swimmer turns, so does the other. The center orientation of the affected object matches that of the target object.

Point or Positional

A point or positional constraint is the one most often used with IK. This constraint cements one object directly over another, based on the center position of each. As one object moves, the other is forced to go along.

Objects used as constraint targets can be parented to other objects, which means they are subject to the rules of a basic hierarchy. In a hierarchy, their position and movement is relative to the parent.

The Apple-picking Problem

Constraints are why my skeleton’s hands are pinned back in Figure 9. This is because of one of the major differences between a skeletal hierarchy and a basic object or geometry hierarchy. It’s also why successful IK animation usually involves both types. As an example of several objects in a hierarchy, consider the case of planes flying in formation with the lead plane as the parent. If we decide to have one of the planes drift higher or lower than the group, we can keyframe that movement and it’ll continue flying along with the group as it maneuvers around them. That’s because its movement is relative to the movement of its parent object-the lead plane.

In a basic hierarchy, the position of a child is relative to the position of that child’s parent. That’s why the feet slide when a character held together solely by hierarchies (without bones) rotates his or her hips.

Not so on an Inverse Kinematics chain. The position of an end-effector (the bottommost child in an IK hierarchy) is relative to world space. That’s why a foot animated with IK stays put. Unfortunately, it’s also why hands animated with IK naturally try to reach out to the point in space where they were keyframed.

IK is a natural for foot placement. But what about that character picking an apple? Isn’t that action more easily done with IK? Yes. It’s also best done with the help of a basic hierarchy and constraints.

We want to use IK because it’s much easier to place the hand on the apple than it is to rotate the shoulders, then the upper arm, the lower arm, etc. But we also want our character to walk up to the apple tree without his hands trailing back in space behind him. How do we do it?

The Apple-picking Solution

We use IK point constraints along with a basic hierarchy:

  • Take two objects-squares and circles are good since they are only splines and won’t render, so you don’t have to remember to hide them later. We’ll use circles.
  • Place one circle on one wrist (end-effector of the arm chain) and the other circle on the other wrist. If you’re really into it, use point constraints to put the circles directly on the wrists and then turn those constraints off.
  • Make those circles children of your character’s torso (group or parent them to the torso). This means they will go wherever the torso goes and any move they make will be relative to the torso. Now constrain the wrist end-effectors to their respective circles. The idea is to keyframe the constraint objects, not the end-effectors.

The end result is the best of both worlds: the hands move along with the torso as your character walks, and when you place the constraining circle onto the apple, the hand reaches right along with it, giving you the animation benefits of IK.

Using hierarchies of constraint objects to power your IK-driven skeletons can solve about 95 percent of your character animation problems. Everything from two characters playing catch to a rider falling from or jumping up onto a horse can be done with some form of animated constraint hierarchies and IK. With the addition of a little math in the form of expressions to control complex behaviors like foot rotation, you’ll have the tools you need to handle just about anything.

Conclusion

So power-down the robots and hang up the armor-single-mesh characters and surface deformation are the way to go now. We’ve covered hierarchies, Inverse Kinematics versus Forward Kinematics, and many of the major animation software packages that offer them. We’ve also solved some fairly common (and nasty) animation problems using IK and constraints.

Now it’s up to you to experiment on your own. First make sure your engine supports IK and constraint hierarchies, and then have at it. Remember, even if your 3D-software package isn’t mentioned in this paper, most have IK systems, hierarchies, and constraints or links of some sort, and price isn’t always a factor. Explore what you’ve got and get out and animate!

For More Information

Listed below, you’ll find the URLs for the tools and software packages discussed in this primer. You’ll also want to check out the Intel® Software Directory and Software Download Store for an extensive list of tools and solutions.

  • Maya* evolved from Alias Power Animator* and Wavefront Explorer*, and is now owned by AutoDesk*, along with Discreet’s 3d Studio Max*
  • Hash Animation: Master*
  • Smoother Animation with Intel® Pentium® III Processors (NURBS models)
  • Softimage 3D and XSI*

Using BigDL to Build Image Similarity-Based House Recommendations

Overview

This paper introduces an image-based house recommendation system built by MLSListings* and Intel® using BigDL1 on Microsoft Azure*. Built on Intel’s BigDL distributed deep learning framework, the recommendation system supports the home buying experience through efficient index and query operations across millions of house images. Users can select a listing photo and have the system recommend listings with similar visual characteristics that may be of interest. The image similarity search has the following additional requirements:

  • Recommend houses based on title image characteristics and similarity. Most title images are front exterior, while others can be a representative image for the house.
  • Low latency API for online querying (< 0.1s).

Background

MLSListings Inc., the premier Multiple Listing Service (MLS) for real estate listings in Northern California, is collaborating with Intel and Microsoft to integrate artificial intelligence (AI) into their authorized trading platform to better serve its customers. Together, the technologies enhance the home buying search process using visual images through an integration between Real Estate Standard Organization (RESO) APIs and Intel’s BigDL open source deep learning library for Apache Spark*. The project is paving the road for innovation in advanced analytics applications for the real estate industry.

A large number of problems in the computer vision domain can be solved by ranking images according to their similarity. For instance, e-retailers show customers products that are similar to items from past purchases in order to sell more online. Practically every industry sees this as a game changer, including the real estate industry, as it has become increasingly digital over the past decade. More than 90 percent of homebuyers search online in the process of seeking a property2. Homeowners and real estate professionals provide information on house characteristics such as location, size, and age, as well as many interior and exterior photos for real estate listing searches. However, due to technical constraints, the enormous amount of information in the photos cannot be extracted and indexed to enhance search or serve real estate listing results. In fact, “show me similar homes” is a top wish-list request among users. By tapping into the available reservoir of image data to power web and mobile digital experiences, the opportunity to drive greater user satisfaction from improved search relevancy is now a reality.

Enter the Intel BigDL framework. As an emerging distributed deep learning framework, BigDL provides easy and integrated deep learning capabilities for big data communities. With a rich set of support for deep learning applications, BigDL allows developers to write their deep learning applications as standard Spark programs, which can directly run on top of existing Apache Spark or Apache Hadoop* clusters.

Overview of Image Similarity

In the research community, image similarity can mean either semantic similarity or visual similarity. Semantic similarity means that both images contain the same category of objects. For example, a ranch house and a traditional house are similar in terms of category (both houses), but may look completely different. Visual similarity, on the other hand, does not care about the object categories but measures how much images look like each other from a visual perspective; for example, an apartment image and a traditional house image may be quite similar.


For semantic similarity, usually it's an image classification problem, and can be efficiently resolved with the popular image perception models like GoogLeNet*3 or VGG*4.

For visual similarity, there have been many techniques applied across the history:

  • SIFT, SURF, color histogram5
    Conventional feature descriptors can be used to compare image similarity. The SIFT feature descriptor is invariant to uniform scaling, orientation, and illumination changes, which makes it useful for applications like finding a small image within a larger image.
  • pHash6
    This mathematical algorithm analyzes an image's content and represents it using a 64-bit number fingerprint. Two images’ pHash values are close to one another if the images’ content features are similar.
  • Image embedding with convolutional neural networks (convnet)7
    Finding the image embedding from the convnet; usually it’s the first linear layer after the convolution and pooling.
  • Siamese Network or Deep Ranking 8
    A more thorough deep learning solution, but the resulting model depends heavily on the training data and may lose generality.

Solution with BigDL

To recommend houses based on image similarity, we first compare the query image (the selected listing photo) with the title images of candidate houses. Next, a similarity score for each candidate house is generated, and only the top results are chosen based on ranking. Working with domain experts, we developed the following measure for calculating image similarity between house images.

    For each image in the candidates, compare with the query image {

        class score:  both house front? (binary classification)

        tag score:    compare important semantic tags (multinomial classification)

        visual score: visual similarity score; higher is better

        final score = class score (decisive, ~1) + tag score (significant, ~0.3) + visual score (in [0, 1])
    }
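
As a concrete illustration of how the three components might be combined for a single candidate, here is a small sketch (the specific weights and the precomputed values passed in are illustrative assumptions, not the production code):

def final_score(class_match, matching_tags, visual_similarity):
    # class_match: True when the query and candidate are both classified as house fronts (decisive)
    # matching_tags: how many important semantic tags (style, story) agree (significant)
    # visual_similarity: cosine similarity of the two image embeddings, in [0, 1]
    class_score = 1.0 if class_match else 0.0
    tag_score = 0.15 * matching_tags          # ~0.3 when both tags agree
    return class_score + tag_score + visual_similarity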

In this project, both semantic similarity and visual similarity were used. BigDL provides a rich set of functionalities to support training or inference image similarity models, including:

  • Providing useful image readers and transformers based on Apache Spark and OpenCV* for parallel image preprocessing on Spark.
  • Natively supporting the Spark ML* Estimator/Transformer interface, so that users can perform deep learning training and inference within the Spark ML pipeline.
  • Providing convenient model fine-tuning support and a flexible programming interface for model adjustment.
  • Allowing users to load pretrained Caffe*, Torch*, or TensorFlow* models into BigDL for fine-tuning or inference.

Semantic Similarity Model

For semantic similarity, three image classification models are required in the project.

Model 1. Image classification: determines whether the title image shows the house front exterior. We need to distinguish whether or not the title image is the house front. The model is fine-tuned from GoogLeNet v1 pretrained on the Places* dataset (https://github.com/CSAILVision/places365). We used the Places dataset for the training.

The model training is done with the DLClassifier* in BigDL. We loaded the Caffe model pretrained on the Places dataset, with the last two layers (Linear(1024 -> 365) and SoftMax) removed from the Caffe model definition. Then a new linear layer with classNum outputs was added to train the classification model we required.
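
A minimal sketch of that fine-tuning step, assuming BigDL's Python layer API (file names are placeholders, and the exact DLClassifier module path varies between BigDL versions):

from bigdl.nn.layer import Model, Sequential, Linear, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion

# Placeholder paths; the prototxt already has the original Linear(1024 -> 365)
# and SoftMax layers removed, as described above.
caffe_def = "deploy_googlenet_places365.prototxt"
caffe_weights = "googlenet_places365.caffemodel"
class_num = 2  # e.g., house front vs. not house front for Model 1

# Load the GoogLeNet v1 weights pretrained on the Places dataset into BigDL.
pretrained = Model.load_caffe_model(caffe_def, caffe_weights)

# Append a new classification head sized for our problem.
model = Sequential().add(pretrained).add(Linear(1024, class_num)).add(LogSoftMax())
criterion = ClassNLLCriterion()

# The model and criterion can then be wrapped in BigDL's DLClassifier and fit as a
# standard Spark ML Estimator on a DataFrame of preprocessed images.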

Model 2. Image classification: House style (contemporary, ranch, traditional, Spanish). Similar to 1, the model is fine-tuned from pretrained GoogLeNet v1 on the Places dataset. We sourced the training dataset from photos for which MLSListings have been assigned copyrights.

Model 3. Image classification: House story (single story, two story, three or more stories). Similar to 1, the model is fine-tuned from pretrained GoogLeNet v1 on the Places dataset. We sourced the training dataset from photos for which MLSListings have been assigned copyrights.

Visual Similarity Model

We need to compute visual similarity to derive a ranking score.

For each query, the user inputs an image for comparison against the thousands of candidate images, and the top 1,000 results must be returned within 0.1 second. To meet the latency requirement, we perform a direct comparison against precalculated image features.

We first built an evaluation dataset to choose the best option for image similarity computation. Each record in the evaluation dataset is a triplet (query image, positive image, negative image), where the positive image is more similar to the query image than the negative image is. For each record, we can evaluate different similarity functions:

    if (similarity(query image, positive image) > similarity(query image, negative image))
        correct += 1
    else
        incorrect += 1

Of the four methods listed above for computing image similarity, Siamese Network or Deep Ranking appear to be more precise, but due to the lack of training data to support meaningful models the results were inconclusive. With the help of the evaluation dataset we tried the remaining three methods; both SIFT and pHash produced unreasonable results. We suspect this is because neither of them can represent the essential characteristics of real estate images.

Using image embeddings from the deep learning models pretrained on the Places dataset, the expected precision level was achieved:

Network      Feature              Precision
Deepbit*     1024 binary output   80%
GoogLeNet*   1024 floats          84%
VGG-16       25088 floats         93%

Similarity(m1, m2) = cosine(embedding(m1), embedding(m2)).

After L2 normalization, cosine similarity can be computed very efficiently. While the VGG-16 embedding has a clear advantage, we also tried an SVM model trained on the evaluation dataset to assign a different weight to each of the embedding features, but this gave only limited improvement, and we were concerned that the SVM model might not generalize well to real-world images.
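
For illustration, a minimal NumPy sketch of the precompute-and-dot-product idea (array shapes and names are assumptions for this example):

import numpy as np

def l2_normalize(x, axis=-1):
    # Divide each embedding by its L2 norm so cosine similarity reduces to a dot product.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def top_k_similar(query_embedding, candidate_embeddings, k=1000):
    # query_embedding: (25088,) VGG-16 feature for the query image
    # candidate_embeddings: (num_candidates, 25088) precalculated features
    q = l2_normalize(query_embedding)
    c = l2_normalize(candidate_embeddings)
    scores = c.dot(q)                      # cosine similarities for every candidate
    top = np.argsort(-scores)[:k]          # indices of the k most similar candidates
    return top, scores[top]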

Image Similarity-Based House Recommendations

The complete data flow and system architecture is displayed as follows:

Image of data flow and system architecture

In production, the project can be separated into three parts:

  1. Model training (offline)
    The model training mainly refers to the semantic models (GoogLeNet v1 fine-tuned on the Places dataset) and also finding the proper embedding for visual similarity calculation. Retraining may happen periodically depending on model performance or requirement changes.
  2. Image inference (online)
    With the trained semantic models (GoogLeNet v1) in the first step and the pretrained VGG-16, we can convert the images to tags and embeddings, and save the results in a key-value cache. (Apache HBase* or SQL* can also be used).

    Image conversion map and flow

    All the existing images and new images need to go through the inference step above and be converted into a table structure, as shown:

    The inference process can happen periodically (for example, once a day) or be triggered by a new image upload from a real estate listing entry. Each production image only needs to go through the inference process once. With the indexed image tagging and similarity feature, fast query performance is supported in a high concurrency environment.
  3. API serving for query (online)
    The house recommendation system exposes a service API to its upstream users. Each query sends a query image and candidate images as parameters. With the indexed image information shown in the table above, we can quickly finish the one-versus-many query. For cosine similarity, processing is very efficient and scalable.

Demo

We provided two examples from the online website:

Example 1

Images of houses listings online

Example 2

Images of houses listings online

Summary

This paper described how to build a house recommendation system based on image analysis, utilizing Intel’s BigDL library on Microsoft Azure and integrated with MLSListings through RESO APIs. Three deep learning classification models were trained and fine-tuned from pretrained Caffe models in order to extract the important semantic tags from real estate images. We further compared different visual similarity computation methods and found image embedding from VGG to be the most helpful inference model in our case. As an end-to-end industry example, we demonstrated how to leverage deep learning with BigDL to enable image recognition innovation for the real estate industry.

References

  1. Intel-Analytics/BigDL, https://github.com/intel-analytics/BigDL.
  2. Vision-Based Real Estate Price Estimation, https://arxiv.org/pdf/1707.05489.pdf.
  3. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going Deeper with Convolutions. CoRR, vol. abs/1409.4842, 2014, http://arxiv.org/abs/1409.4842.
  4. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In: ICLR. 2014. p. 1–14. arXiv:arXiv:1409.1556v6.
  5. Histogram of Oriented Gradients, https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients.
  6. pHash, The Open Source Perceptual Hash Library, https://www.phash.org/.
  7. Convolutional Neural Networks (CNNs / ConvNets), http://cs231n.github.io/convolutional-networks/.
  8. J. Wang. Learning Fine-Grained Image Similarity with Deep Ranking. https://research.google.com/pubs/archive/42945.pdf.

Fog Data Processing POC with the Up Squared* Board

Intro

With the amount of continuously generated data on the rise, the cost to upload and store that data in the cloud is increasing. Data is being gathered faster than it is stored, and immediate action is often required. Sending all the data to the cloud can result in latency and presents risks when internet connectivity is intermittent. Fog computing, also known as edge computing, involves processing data at the edge for immediate analysis and decisions. Hence it can help reduce network load by sending only critical, processed data to the cloud. This proof of concept (POC) will explore fog computing in a digital signage scenario where the display records the number of people viewing it throughout the day. OpenCV will be used to search the camera feed for faces, and Dash* by Plotly* will be used to graph the results in real time without the cloud.

Figure 1: Up Squared* Board

Set-up

Hardware

The POC uses the Up Squared* board, a high performance maker board with an Apollo Lake processor, with Ubuntu* 16.04 LTS as the OS. For more information on the Up Squared board, please visit: http://www.up-board.org/upsquared/.

The camera used is the Logitech* C270 Webcam with HD 720P which connects to the board by USB.

Software

OpenCV needs to be installed on the Up Squared board to process the images. See the OpenCV documentation for instructions on how to install it on Ubuntu 16.04.

The processed data will be graphed with Dash by Plotly to create a web application in Python*. For more information on Dash: https://plot.ly/dash/

To install Plotly, Dash, and the needed dependencies:

pip install dash==0.19.0  # The core dash backend
pip install dash-renderer==0.11.1  # The dash front-end
pip install dash-html-components==0.8.0  # HTML components
pip install dash-core-components==0.14.0  # Supercharged components
pip install plotly --upgrade  # Plotly graphing library used in examples

#dependencies
pip install pandas
pip install flask
apt-get install sqlite3 libsqlite3-dev

Face Detection

Face detection is done in Python with OpenCV using the Haar Cascades frontal face algorithm. Each detected face is individually tracked in the frame so that each viewer can be counted, and this data is recorded to a SQLite database with the face ID and timestamps. Possible expansions would be adding facial recognition or using other algorithms to detect viewer demographics.

The code, seen below, connects to sqlite3 to initialize the database faces.db. The tables inside the database are only created the first time the code is run, hence the creation is inside a try clause. OpenCV then connects to the camera and loops over the feed looking for faces. When it finds one or more faces, it sends the array of face rectangles to the face_tracking method, which gives each face an ID and then tracks it based on position. Each Face class object stores its current x and y location as well as the time it was last seen and its face ID. If a face disappears for more than 2 seconds, it is aged out of the array and it is assumed the person moved on or looked away for too long; this also helps with any in-between frames where OpenCV might fail to detect a face that is really there. The face ID and the timestamp it was seen are written to the database, and the total number of viewer faces at that timestamp is written to another table.

Figure 2: Data in the visitorFace table with time stamps and ID

#!/usr/bin/env python

from __future__ import print_function

# Import Python modules
import numpy as np
import cv2
import Face
import sys
import os
import pandas as pd
import sqlite3
from datetime import datetime
import math

#array to store visitor faces in
facesArray = []

#connect to sqlite database
conn = sqlite3.connect('/home/upsquare/Desktop/dash-wind-streaming-master/Data/faces.db')
c = conn.cursor()

# Create tables
try:
    c.execute('''CREATE TABLE faces
                (date timestamp, time timestamp, date_time timestamp, numFaces INT)''')
    c.execute('''CREATE TABLE visitorFace
                (date_time timestamp, faceID INT)''')
except:
    print('tables already exist')

#find the last visitor faceID to start counting from
df = pd.read_sql_query('select MAX(faceID) FROM visitorFace', conn)
lastFid = df.iloc[0]['MAX(faceID)']
if lastFid is None:
    fid = 1
else:
    fid = 1 + int(df.iloc[0]['MAX(faceID)'])

try:
    # Checks to see if OpenCV can be found
    ocv = os.getenv("OPENCV_DIR")
    print(ocv)
except KeyError:
    print('Cannot find OpenCV')


# Setup Classifiers
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

#draw rectangles around faces
def draw_detections(img, rects, thickness = 2):
    for x, y, w, h in rects:
        pad_w, pad_h = int(0.15*w), int(0.05*h)
        cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), thickness)

#track visitor faces based on their location
def face_tracking(rects):
    global fid
    global facesArray
    numFaces = len(rects)
    #age faces out when no faces are detected
    if len(rects) == 0:
        for index, f in enumerate(facesArray):
            if (datetime.now() - f.getLastSeen()).seconds > 2:
                print("[INFO] face removed " + str(f.getId()))
                facesArray.pop(index)
    for x, y, w, h in rects:
        new = True
        xCenter = x + w/2
        yCenter = y + h/2
        for index, f in enumerate(facesArray):
            #age the face out
            if (datetime.now() - f.getLastSeen()).seconds > 2:
                print("[INFO] face removed " + str(f.getId()))
                facesArray.pop(index)
                new = False
                break
            dist = math.sqrt((xCenter - f.getX())**2 + (yCenter - f.getY())**2)
            #found an update to existing face
            if dist <= w/4 and dist <= h/4:
                new = False
                f.updateCoords(xCenter, yCenter, datetime.now())
                c.execute("INSERT INTO visitorFace VALUES (strftime('%Y-%m-%d %H:%M:%f')," + str(f.getId()) + ")")
                break
        #add a new face
        if new == True:
            print("[INFO] new face " + str(fid))
            f = Face.Face(fid, xCenter, yCenter, datetime.now())
            facesArray.append(f)
            fid += 1
            c.execute("INSERT INTO visitorFace VALUES (strftime('%Y-%m-%d %H:%M:%f')," + str(f.getId()) + ")")
    print(len(facesArray))
    c.execute("INSERT INTO faces VALUES (DATE('now'),strftime('%H:%M:%f'),strftime('%Y-%m-%d %H:%M:%f'), " + str(len(facesArray)) + ")")

try:
    # Initialize Default Camera
    webcam = cv2.VideoCapture(0)
    # Check if Camera initialized correctly
    success = webcam.isOpened()
    if success == True:
        print('Grabbing Camera ..')
    elif success == False:
        print('Error: Camera could not be opened')

    while(True):
        # Read each frame in video stream
        ret, frame = webcam.read()
        # Perform operations on the frame here
        # First convert to Grayscale
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Next run filters
        gray = cv2.equalizeHist(gray)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5, minSize=(30, 30), flags=cv2.CASCADE_SCALE_IMAGE)
        print('Number of faces detected: ' + str(len(faces)))
        #send the faces to be tracked for visitorFace faces
        face_tracking(faces)

        #draw on the cv viewing window
        out = frame.copy()
        draw_detections(out, faces)
        cv2.imshow('Facetracker', out)

        #commit the sqlite insertions to the database
        conn.commit()

        # Wait for Esc Key to quit
        if cv2.waitKey(5) == 27:
            break
    # Release all resources used
    webcam.release()
    cv2.destroyAllWindows()

except cv2.error as e:
    print('Please correct OpenCV Error')

Code 1: faceCounting.py code for detecting faces

Below is the Face class for each visitor face in the facesArray. 

class Face:
    path = []
    def __init__(self, id, x, y, lastSeen):
        self.id = id
        self.x = x
        self.y = y
        self.lastSeen = lastSeen
    def getId(self):
        return self.id
    def getX(self):
        return self.x
    def getY(self):
        return self.y
    def getLastSeen(self):
        return self.lastSeen
    def updateCoords(self, newX, newY, newLastSeen):
        self.x = newX
        self.y = newY
        self.lastSeen = newLastSeen

Code 2: face.py for Face Class

Processing and Graphing

To visualize the data after doing some data processing on it, Dash will be used to create a graph that will update itself as new data comes in. The graph will also be in a web application running at localhost:8050. This way insights and information can be seen directly in the fog.

In addition to graphing the number of faces data, the data will be smoothed using Pandas exponentially weighted moving average (EWMA) method to create a less jagged line. 

    #query the datetime and the number of faces seen at that time
    df = pd.read_sql_query('SELECT numFaces, date_time from faces;'
                            , con)
    #do a EWMA mean to smooth the data
    df['EWMA']= df['numFaces'].tail(tail+100).ewm(span=200).mean()

Code 3: Code snippet to query the number of faces and do EWMA smoothing

Figure 3: Face Counting Data graph and EWMA smoothed data

From the visitorFace table, the face IDs and timestamps will be used to calculate the time someone looked at the display, sometimes referred to as dwelling time. This type of data could be monetized to drive revenue of ads at peak traffic times and report ad impact/reach. The dwelling time is calculated by getting the first and last timestamp for each face ID and then finding the difference between them. 

#query the datetime and the number of faces seen at that time, we need this for the next query
    dfFaces = pd.read_sql_query('SELECT numFaces, date_time from faces;'
                            , con)
    #calculate the viewing time of each visitor face
    #query the first date time and last date time that a visitor face was seen from the tail of the dfFaces data
    #taking the data from the tail will make the Viewing Time Data graph and the Face Counting Data Graph line up
    df = pd.read_sql_query("""select faceID, MIN(date_time) as First_Seen , MAX(date_time) as Last_Seen
		from (select * from visitorFace Where date_time > "%s") group by faceID;""" %(min(dfFaces['date_time'].tail(tail)))
		, con)
    #calculate the seconds a visitor face viewed the display
    df['VD']= (pd.to_datetime(df['Last_Seen']) - pd.to_datetime(df['First_Seen'])).dt.seconds

Code 4: Code snippet to query the viewing time for each visitor face

Figure 4: Scatter plot of each visitor face viewing time

The Dash graph application itself has a few key pieces: the page layout, the callback for the update interval, and the graph components. The layout for this application has four main graphs with four headers: Face Counting Data, Viewing Time Data, Daily Face Count, and Hourly Face Count. The interval will call the callback every second to pull the last 1000 data entries from the sqlite database to graph, hence the graph will update itself as new data is added to the database. For the graph components, each line or scatter dot is called a trace, and the overall graph’s axis ranges are defined.

from __future__ import print_function
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output, State, Event
import plotly.plotly as py
from plotly.graph_objs import *
from flask import Flask
import numpy as np
import pandas as pd
import os
import sqlite3
import datetime as dt
from datetime import timedelta

#number of most recent datapoints to grab
tail = 1000

app = dash.Dash()

#dash page layout
app.layout = html.Div([
    html.Div([
        html.H2("Face Counting Data")
    ]),
    html.Div([
        html.Div([
            dcc.Graph(id='num-faces'),
        ]),
        dcc.Interval(id='face-count-update', interval=1000),
    ]),
    html.Div([
        html.H2("Viewing Time Data")
    ]),
    html.Div([
        html.Div([
            dcc.Graph(id='viewing-data'),
        ]),
        dcc.Interval(id='viewing-data-update', interval=1000),
    ]),
    html.Div([
        html.H2("Daily and Hourly Face Count")
    ]),
    html.Div([
        html.Div([
            dcc.Graph(id='num-faces-daily'),
        ]),
        dcc.Interval(id='face-count-daily-update', interval=1000),
    ]),
    html.Div([
        html.H2("Hourly Face Count")
    ]),
    html.Div([
        html.Div([
            dcc.Graph(id='num-faces-hourly'),
        ]),
        dcc.Interval(id='face-count-hourly-update', interval=1000),
    ]),
], style={'padding': '0px 10px 15px 10px','marginLeft': 'auto', 'marginRight': 'auto', "width": "1200px",'boxShadow': '0px 0px 5px 5px rgba(204,204,204,0.4)'})



#callback method for face counting data graph interval
@app.callback(Output('num-faces', 'figure'), [],
              [],
              [Event('face-count-update', 'interval')])
def gen_num_faces():
    con = sqlite3.connect("./Data/faces.db")
    #query the datetime and the number of faces seen at that time
    df = pd.read_sql_query('SELECT numFaces, date_time from faces;'
                            , con)
    #do a EWMA mean to smooth the data
    df['EWMA']= df['numFaces'].tail(tail+100).ewm(span=200).mean()

    trace = Scatter(
	x=df['date_time'].tail(tail),
        y=df['numFaces'].tail(tail),
        line=Line(
            color='#42C4F7'
        ),
        hoverinfo='x+y',
        mode='lines',
        showlegend=False,
        name= '# Faces'
    )

    trace2 = Scatter(
	x=df['date_time'].tail(tail),
        y=df['EWMA'].tail(tail),
        line=Line(
            color='#FF4500'
        ),
        hoverinfo='x+y',
        mode='lines',
        showlegend=False,
        name= 'EWMA'
    )

    layout = Layout(
        height=600,
        xaxis=dict(
            range=[min(df['date_time'].tail(tail)),
                   max(df['date_time'].tail(tail))],
            showgrid=True,
            showline=True,
            zeroline=False,
            fixedrange=True,
            nticks=max(8,10),
            title='Date Time'
        ),
        yaxis=dict(
            range=[0,
                   max(df['numFaces'].tail(tail))],
            showline=True,
            fixedrange=True,
            zeroline=False,
            nticks=max(2,max(df['numFaces'].tail(tail))),
	    title='Faces'
        ),
        margin=Margin(
            t=45,
            l=50,
            r=50,
            b=100
        )
    )

    return Figure(data=[trace,trace2], layout=layout)


#callback method for viewing time data graph interval
@app.callback(Output('viewing-data', 'figure'), [],
              [],
              [Event('viewing-data-update', 'interval')])
def gen_viewing_data():
    con = sqlite3.connect("./Data/faces.db")
    #query the datetime and the number of faces seen at that time, we need this for the next query
    dfFaces = pd.read_sql_query('SELECT numFaces, date_time from faces;'
                            , con)
    #calculate the viewing time of each visitor face
    #query the first date time and last date time that a visitor face was seen from the tail of the dfFaces data
    #taking the data from the tail will make the Viewing Time Data graph and the Face Counting Data Graph line up
    df = pd.read_sql_query("""select faceID, MIN(date_time) as First_Seen , MAX(date_time) as Last_Seen
		from (select * from visitorFace Where date_time > "%s") group by faceID;""" %(min(dfFaces['date_time'].tail(tail)))
		, con)
    #calculate the seconds a visitor face viewed the display
    df['VD']= (pd.to_datetime(df['Last_Seen']) - pd.to_datetime(df['First_Seen'])).dt.seconds

    #declare the traces array to hold a trace for each visitor face
    traces = []
    for i in range(len(df.index)):
        traces.append(Scatter(
            x= [df['First_Seen'][i],df['Last_Seen'][i]],
            y= [i],
            hoverinfo='name',
            mode='markers',
            opacity=0.7,
            marker={'size':df['VD'][i],'line': {'width':0.5,'color':'white'}
            },
            showlegend=False,
            name= df['VD'][i]
        ))
    layout = Layout(
        height=600,
        xaxis=dict(
            range=[min(dfFaces['date_time'].tail(tail)),
                   max(dfFaces['date_time'].tail(tail))],
            showgrid=True,
            showline=True,
            zeroline=False,
            fixedrange=True,
            nticks=max(8,10),
            title='Date Time'
        ),
        yaxis=dict(
            range=[0,len(df.index)],
            showline=True,
            fixedrange=True,
            zeroline=False,
            nticks=max(2,len(df.index)/4),
	    title='Viewer'
        ),
        margin=Margin(
            t=45,
            l=50,
            r=50,
            b=100
        )
    )

    return Figure(data=traces, layout=layout)


#callback method for daily face count graph interval
@app.callback(Output('num-faces-daily', 'figure'), [],
              [],
              [Event('face-count-daily-update', 'interval')])
def gen_num_faces_daily():
    weekAgo = dt.datetime.today() - timedelta(days=6)

    con = sqlite3.connect("./Data/faces.db")
    #query the first date time and last date time that a visitor face was seen
    df = pd.read_sql_query("""select faceID, MIN(date_time) as First_Seen , MAX(date_time) as Last_Seen
		from visitorFace where date_time >= "%s" group by faceID;""" %weekAgo , con)
    #reset the index to do a groupby
    df= df.set_index(['First_Seen'])
    df.index = pd.to_datetime(df.index)
    #group the visitor faces by day and record the number of faces per day
    dfDaily = df.groupby(pd.TimeGrouper('D')).size().reset_index(name='counts')

    trace = Scatter(
	x=dfDaily['First_Seen'],
        y=dfDaily['counts'],
        line=Line(
            color='#42C4F7'
        ),
        hoverinfo='x+y',
        mode='lines',
        name= 'Daily'
    )

    layout = Layout(
        height=600,
        xaxis=dict(
            range=[min(dfDaily['First_Seen']),
                   max(dfDaily['First_Seen'])],
            showgrid=True,
            showline=True,
            zeroline=False,
            fixedrange=True,
            nticks=max(2,len(dfDaily.index)),
            title='Date Time'
        ),
        yaxis=dict(
            range=[min(0, min(dfDaily['counts'])),
                   max(dfDaily['counts'])],
            showline=True,
            fixedrange=True,
            zeroline=False,
            nticks=max(4,len(dfDaily.index)),
	    title='Faces'
        ),
        margin=Margin(
            t=45,
            l=50,
            r=50,
            b=100
        )
    )

    return Figure(data=[trace], layout=layout)

#callback method for hourly face count graph interval
@app.callback(Output('num-faces-hourly', 'figure'), [],
              [],
              [Event('face-count-hourly-update', 'interval')])
def gen_num_faces_hourly():
    weekAgo = dt.datetime.today() - timedelta(days=6)
    con = sqlite3.connect("./Data/faces.db")
    #query the first date time and last date time that a visitor face was seen
    df = pd.read_sql_query("""select faceID, MIN(date_time) as First_Seen , MAX(date_time) as Last_Seen
		from visitorFace where date_time >= "%s" group by faceID;""" %weekAgo , con)
    #reset the index to do a groupby
    df= df.set_index(['First_Seen'])
    df.index = pd.to_datetime(df.index)
    #group the visitor faces by day and record the number of faces per hour
    dfHourly = df.groupby(pd.TimeGrouper('H')).size().reset_index(name='counts')


    trace = Scatter(
	x=dfHourly['First_Seen'],
        y=dfHourly['counts'],
        line=Line(
            color='#42C4F7'
        ),
        hoverinfo='x+y',
        mode='lines'
    )

    layout = Layout(
        height=600,
        xaxis=dict(
            range=[min(dfHourly['First_Seen']),
                   max(dfHourly['First_Seen'])],
            showgrid=True,
            showline=True,
            zeroline=False,
            fixedrange=True,
            nticks=14,
            title='Date Time'
        ),
        yaxis=dict(
            range=[min(0, min(dfHourly['counts'])),
                   max(dfHourly['counts'])],
            showline=True,
            fixedrange=True,
            zeroline=False,
            nticks=max(2,max(dfHourly['counts'])/10),
	    title='Faces'
        ),
        margin=Margin(
            t=45,
            l=50,
            r=50,
            b=100
        )
    )

    return Figure(data=[trace], layout=layout)

external_css = ["https://cdnjs.cloudflare.com/ajax/libs/skeleton/2.0.4/skeleton.min.css","https://fonts.googleapis.com/css?family=Raleway:400,400i,700,700i","https://fonts.googleapis.com/css?family=Product+Sans:400,400i,700,700i"]


for css in external_css:
    app.css.append_css({"external_url": css})

if __name__ == '__main__':
    app.run_server()

Code 5: app.py code for Dash by Plotly web graph

The above graphs give a close-up view of the traffic in front of the display. To get a higher-level picture of what is going on, let's aggregate the data and look at how many visitor faces were seen per day and per hour over the last week.

weekAgo = dt.datetime.today() - timedelta(days=6)

con = sqlite3.connect("./Data/faces.db")
#query the first date time and last date time that a visitor face was seen
df = pd.read_sql_query("""select faceID, MIN(date_time) as First_Seen, MAX(date_time) as Last_Seen
    from visitorFace where date_time >= "%s" group by faceID;""" % weekAgo, con)
#reset the index to do a groupby
df = df.set_index(['First_Seen'])
df.index = pd.to_datetime(df.index)
#group the visitor faces by day and by hour and record the number of faces in each bucket
dfDaily = df.groupby(pd.TimeGrouper('D')).size().reset_index(name='counts')
dfHourly = df.groupby(pd.TimeGrouper('H')).size().reset_index(name='counts')

Code 6: Code snippet to query face count daily and hourly

Figure 5: Daily Face Count graph

Figure 6: Hourly Face Count graph

 

Summary

This concludes the proof of concept for gathering and analyzing data in the fog. Running the face tracking application and the Dash application at the same time shows the data graphing in real time. From here, any of the processed data is ready to send to the cloud for long-term storage.

 

About the author

Whitney Foster is a software engineer at Intel in the Software Solutions Group, working on scale-enabling projects for the Internet of Things.

 

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries.

 

*Other names and brands may be claimed as the property of others

© 2017 Intel Corporation.

Potential 'NaN' results with Windows users on MKL 2018.0.1


Certain Intel® Distribution for Python* 2.7 users on Windows may experience occasional NaN results when computing the remainder from division of two floating-point numbers, as computed by the function fmod. This affects math.fmod, Python's built-in % operator, numpy.fmod, and uses of fmod through the numexpr package.

The issue is caused by a bug in the Microsoft Visual Studio 2008* run-time library; see the Microsoft Support page for more details.

The unexpected NaN output is triggered when a computation preceding the call to fmod sets the floating-point underflow bit in the FE (floating-point exception) mask.

This issue may be exposed by a call to an Intel® Math Kernel Library (Intel® MKL) VML function while using MKL 2018.0.1.

In that case, downgrading MKL to version 2018.0.0 provides a workaround.
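
To confirm which MKL build your environment is actually using, one option is the short check below. This is a minimal sketch and assumes the mkl-service package (which typically ships with Intel® Distribution for Python) is installed; if it is not, the import will fail and you can rely on your package manager's listing instead.

import mkl
# Prints a string such as "Intel(R) Math Kernel Library Version 2018.0.1 ..."
print(mkl.get_version_string())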

To test whether you may be affected by this issue, please execute:

import numpy as np, numexpr as ne
gr = np.linspace(0., 27.2, 10000)
e = ne.evaluate("exp(-x*x)", local_dict={'x': gr}) # trigger call to MKL that results in a denormal number to set the said bit
assert np.allclose( 5.0 % 2.0, 1.0), "You may be affected, please try downgrading MKL to 2018.0.0"

MobileNets on Intel® Movidius™ Neural Compute Stick and Raspberry Pi* 3


Introduction

Deep learning at the edge gives innovative developers across the globe the opportunity to create architectures and devices that solve problems and deliver innovative solutions, such as Google's Clips camera with an Intel Movidius VPU inside. An edge device typically should be portable and use low power while delivering a scalable architecture for the deep learning neural network it runs. This article showcases one such deep learning edge solution, pairing the popular Raspberry Pi* 3 single board computer with the Intel® Movidius™ Neural Compute Stick.

With several classification networks available, a scalable family of networks offers out-of-the-box customization to match a user's power, performance, and accuracy requirements. One such family is Google's MobileNets*, which exposes two variables that can be tuned to create a custom network that uses low power while maintaining the performance and accuracy these types of devices need.

* Data from Google’s Blog

The first variable is the size of the input image. As shown in the Google Research Blog article on Google's MobileNets*, the complexity and accuracy vary with input size. Google also provided pre-trained ImageNet* classification checkpoints for various sizes in the same article.

The second variable is called the depth multiplier. Even though the structure of the network remains the same, changing the depth multiplier changes the number of channels in each layer. This affects the complexity of the network, which in turn affects its accuracy and frame rate. In general, the higher the frame rate of a network, the lower its accuracy.
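
As a rough illustration of how these two variables interact, the computational cost of a MobileNet scales approximately with the square of the depth multiplier and the square of the input resolution. The helper below is a back-of-the-envelope sketch written for this article, not part of MobileNets or the NCSDK, and the exact multiply-accumulate count still depends on the layer structure.

def relative_cost(depth_multiplier, input_size, base_depth=1.0, base_size=224):
    # Multiply-accumulate count scales roughly with the squares of the
    # depth multiplier and the input resolution (see the MobileNets paper).
    return (depth_multiplier / base_depth) ** 2 * (input_size / base_size) ** 2

print(relative_cost(0.50, 160))  # ~0.13, i.e. roughly 1/8 of MobileNet_1.0_224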

The information below will walk you through how to set up and run the NCSDK, how to download NCAppZoo, and how to run MobileNet* variants on the Intel Movidius Neural Compute Stick. Finally, we demonstrate the usage of the benchmarkncs app from the NCAppZoo, which lets you collect the performance of one or many Intel Movidius Neural Compute Sticks attached to an application processor like Raspberry Pi 3.

Items Required

Raspberry Pi 3 with Power Supply and Storage (a case is suggested)

  1. Raspberry Pi 3 case

  2. Raspberry Pi 3 Model B

  3. SD card (Min 32GB)

  4. HDMI monitor or TV for display

  5. Keyboard and mouse

  6. Intel Movidius Neural Compute Stick

Procedure

Step 1: Install the latest Raspberry Pi 3 Raspbian OS

Step 2: Connect the Intel Movidius Neural Compute Stick to Raspberry Pi 3

Step 3: Install Intel Movidius Neural Compute SDK (NCSDK):

Use the following to download and install the NCSDK:

git clone https://github.com/movidius/ncsdk
cd ncsdk
make install
cd ..

Step 4: Clone the NCAppZoo repository

git clone https://github.com/movidius/ncappzoo
cd ncappzoo

Step 5: Use the benchmarkncs.py to collect performance for MobileNets

cd apps/benchmarkncs
./mobilenets_benchmark.sh | grep FPS
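
Beyond the benchmark script, the sketch below shows what a single MobileNet inference on the stick looks like using the NCSDK v1 Python API (mvncapi). The graph file name ("graph"), the test image name ("image.jpg"), and the input scaling are assumptions for illustration only; adjust them to match the network you compiled with mvNCCompile.

import numpy as np
import cv2
from mvnc import mvncapi as mvnc

# Find and open the first attached Neural Compute Stick
devices = mvnc.EnumerateDevices()
if not devices:
    raise RuntimeError("No Intel Movidius Neural Compute Stick found")
device = mvnc.Device(devices[0])
device.OpenDevice()

# Load a MobileNet graph previously compiled with mvNCCompile
with open('graph', 'rb') as f:
    graph_blob = f.read()
graph = device.AllocateGraph(graph_blob)

# Prepare the input: resize to the network input size and scale to roughly [-1, 1]
img = cv2.imread('image.jpg')
img = cv2.resize(img, (224, 224)).astype(np.float16)
img = (img - 127.5) * 0.007843

# Run inference and read back the class probabilities
graph.LoadTensor(img, 'user object')
output, _ = graph.GetResult()
print('Top-1 class index:', int(np.argmax(output)))

graph.DeallocateGraph()
device.CloseDevice()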

Results

Given this scalable family of networks, one can find the perfect network with the required accuracy and performance. The following graph (data from Google’s blog) shows the tradeoffs of accuracy vs. performance for the ImageNet classification. The unofficial performance (FPS) on the Intel Movidius Neural Compute Stick is also shown on the same graph.

* Network Accuracy Data from Google’s Blog

As you can see from the above graph, the higher-end MobileNet with depth multiplier = 1.0 and input image size 224x224, which has a Top-5 accuracy of 89.5%, runs at 9x the speed (FPS) when an Intel Movidius Neural Compute Stick is attached to the Raspberry Pi 3, compared to running the same network natively on the Raspberry Pi 3 CPU.

Raspberry Pi has been very successful in bringing a wonderful platform to the developer community. While it is possible to do inferencing at a reasonable frame rate on the Raspberry Pi 3 alone, the NCS brings an order of magnitude more performance and makes the platform much better suited to running CNN-based neural networks. As the data shows, using the Intel Movidius Neural Compute Stick with the Raspberry Pi 3 increases MobileNets inference performance by roughly 718% to 1254%.

 

Analog Gauge Reader using OpenCV in Python


This sample application takes an image or frame of an analog gauge and reads the value using computer vision. It consists of two parts: calibration and measurement. During calibration, the user gives the application an image of the gauge to calibrate, and it prompts the user to enter the range of values in degrees. It then uses these calibrated values in the measurement stage to convert the angle of the dial into a meaningful reading.
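
The core of the measurement stage is a simple linear interpolation from the needle angle to a gauge reading. The sketch below assumes calibration has already produced the minimum and maximum dial angles and their corresponding gauge values; the function name and the example numbers are illustrative and not taken from the sample code.

def angle_to_value(angle, min_angle, max_angle, min_value, max_value):
    # Linearly map the needle angle (degrees) onto the gauge's value range
    angle_span = max_angle - min_angle
    value_span = max_value - min_value
    return min_value + (angle - min_angle) / float(angle_span) * value_span

# Example: a gauge calibrated from 45 degrees (value 0) to 315 degrees (value 200)
print(angle_to_value(180, 45, 315, 0, 200))  # -> 100.0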

What you’ll learn

  • Circle detection
  • Line detection

Gather your materials

  • Python 2.7 or greater
  • OpenCV version 3.3.0 or greater
  • A picture of a gauge to try (or you can use the sample one provided)

This article continues on GitHub:

https://github.com/intel-iot-devkit/python-cv-samples/tree/master/examples/analog-gauge-reader 

Intel® Security Dev API: Go Developer Guide


Introduction

Intel® Security Dev API is an API library that makes it easy for application developers to provision RSA private keys onto Trusted Platform Modules (TPMs) using the Go programming language.

APIs


This release of the API library provides a Go wrapper that allows access to the following APIs:

  • Secure Data API: Set of API functions that enables you to protect data on the local device, even from the application itself.
  • Crypto API: Cryptographic utilities. This release provides RSA signing.

 

Related Documentation


Get Started Guide: Describes how to install the SDK and open a Hello World code sample.
Code samples: The SDK contains complete code samples, including source files and makefiles, in $GOPATH/src/intel.com/isec/samples/. The samples demonstrate the tasks described in this Developer Guide and more.
FAQs: Have a question? Browse the FAQs for answers.
Release Notes: Describe known issues.

For more documentation, such as developer guides for other programming languages, see the Intel® Security Dev API website.


Specifications

Target Devices


Intel® Security Dev API is compatible with the following device configurations.

  • Trusted Platform Modules (TPMs), version 2.0 or higher.

Development Environment


Intel® Security Dev API is compatible with the following development environment.

Hardware: Your development and deployment environments must feature a Trusted Platform Module (TPM) that supports the TPM 2.0 hardware standard and a compatible TPM Software Stack (TSS), such as Intel® TSS.

Operating system: Ubuntu 16.04 64-bit Desktop Version

 


Hardware Security

Software to encrypt and protect sensitive information, such as data and algorithms, is readily available for almost all operating systems. Unfortunately, the security provided by software can only be as strong as the operating system hosting the software security solution. Sophisticated malware can attack vulnerabilities in the operating system and obtain the privileges of the operating system itself. With these privileges, the attacker can freely access sensitive information that the operating system was tasked to protect.

Hardware security provides an additional layer of protection by isolating sensitive information from the operating system.


Intel® Security Dev API enables you to use the following hardware security technologies:

  • Trusted platform module (TPM)

Trusted Platform Module (TPM)


A trusted platform module (TPM) is a microcontroller, typically affixed to the motherboard of a computing device. All TPMs meet the standards of the Trusted Computing Group (TCG).

The API library supports TPMs that meet the TPM 2.0 standard and are FIPS-compliant. Using the APIs, you can protect an RSA private key and use the key to sign data. The signing process occurs inside the TPM, so that the key is not exposed to the operating system.



Secure Data API

On a trusted platform module (TPM) enabled device, you can use the Secure Data API to protect RSA private keys.

To protect a key, you create a Secure Data object, consisting of the following components:

  • Secret data: The key that requires protection from disclosure and modification.
  • Non-secret data (optional): Public information associated with the secret data. For example, a description of the key.
  • Protection policy: A set of attributes defining the security protection level of the data, particularly access level and allowed usage.

Once you have created the Secure Data object, you can reference it via the handle returned by the creation API. You can use the protected data by providing the handle to other API calls. In other words, your application exposes the handle rather than the actual data, reducing the risk of disclosure should your application become compromised. The protection policy defines which API calls your application is allowed to use. For example, the API provides get and set calls that allow the application to read and update the data, respectively, but the protection policy set by the initial data creator might prohibit the use of those calls.

Intel trusted logic running on the OS provides the RSA private key to the TPM. The TPM generates a wrapper key, encrypts the RSA private key with the wrapper key, and provides the encrypted RSA private key to Intel trusted logic. Intel trusted logic creates a Secure Data object of the encrypted RSA private key.

When the data is not in use (also referred to as "at-rest data"), you can seal the Secure Data object for storage.

Sealing


Sealing is hardware-based encryption for at-rest protection of data.

By using the API, you first create a Secure Data object to protect the data while the data is in use. When the data is at-rest, for example, in need of storage, you use the API to seal the Secure Data object. On TPM enabled devices, Intel trusted logic running on the OS uses AES-GCM to seal the Secure Data object. The key used for sealing is defined in the protection policy. You can specify local protection only.

A local protection policy is used to ensure that unsealing is only possible on the same device that the data was sealed on. The sealing key material is a TPM generated and protected storage key. The access password in Startup, if provided, is also added to the set of information used to derive the sealing key.

The resulting sealed Secure Data object is called a "sealed blob." To decrypt the data and begin to use it again, you use the API to recreate the Secure Data object. Before attempting to recreate the object, you must make sure the sealed blob is in the correct environment. In the case of local protection, the sealed blob must be on the device it originated from.

Create a Secure Data Object


The following summarizes how to use the Secure Data API in an application. The key size must be 2048 bit.

  1. Import the isec Go package:
    import "intel.com/isec"
  2. Create a protection policy by using the NewLocalPolicy function:
    policy, err := isec.NewLocalPolicy(isec.DataUsageRsaPublicKey, isec.DataAccessUpdate, isec.DataFlagsNoVersionTracking)
    if err != nil {
         fmt.Printf("Error string for '%+v' should be nil \n", err)
    }
  3. Create an RSA key pair using local policies for the public and private keys:
    pubKeyPolicy, err := isec.NewLocalPolicy(isec.DataUsageRsaPublicKey, isec.DataAccessUse, isec.DataFlagsNoVersionTracking)
    if err != nil {
        logError("NewLocalPolicy returned with ", err)
        return nil, nil, err
    }
    priKeyPolicy, err := isec.NewLocalPolicy(isec.DataUsageRsaPrivateKey, isec.DataAccessUse, isec.DataFlagsNoVersionTracking)
    if err != nil {
        logError("NewLocalPolicy returned with ", err)
        return nil, nil, err
    }
    pubKey, priKey, isecerr := isec.GenerateKeyPair(isec.KeyGenTypeRSA, 2048, isec.KeyGenMethodDefault, "RRP_HW_Root_Key", pubKeyPolicy, priKeyPolicy)
    if isecerr != nil {
        fmt.Printf("Key Generate failed with: %v", isecerr)
    }
    fmt.Printf("pubKey is %x\n", pubKey.handle)
    fmt.Printf("priKey is %x\n", priKey.handle)
  4. Create a Secure Data object of the key by using the NewSecureDataWithPolicy function and reference the handle of the policy object you created in step 2:
    nonsecretdata := []byte("this is nonSecretData")
    data := []byte("this is secret data")
    sd, err := isec.NewSecureDataWithPolicy(data, nonsecretdata, policy)
    if err != nil {
        fmt.Printf("isec.NewSecureDataWithPolicy failed with %v", err)
    }
    fmt.Printf("SecureData object is %v", sd)

Get a Sealed Blob


You can seal the Secure Data object for storage. You can also recreate the Secure Data object from the sealed blob.

  1. Call GetSealedBlobSize and reference the handle of the Secure Data object.
    size, err := sd.GetSealedBlobSize()
  2. Call GetSealedBlob and reference the handle of the Secure Data object.
    data, err := sd.GetSealedBlob()
  3. At this point, the Secure Data handle is no longer needed. Call sd.Close() to delete the memory resources associated with the handle.
    err = sd.Close()
    if err != nil {
    fmt.Printf("Error string for '%+v' should be nil \n", err)
    }
  4. Write your own code to place the blob in the desired location.
  5. To unseal the data, call RecreateFromSealed and reference the buffer pointer and size. This call recreates the Secure Data object.

    You can use the new handle in other API calls:

    data := []byte("secret data")

    cipher, err := isec.DataAsymmetricEncrypt(data, pubKey, isec.EncryptionPaddingPKCS1_v15)
    if err != nil {
        fmt.Printf("Asymmetric encrypt failed with %+v", err)
    }
    sealed, err := priKey.GetSealedBlob()
    if err != nil {
        fmt.Printf("Get sealed blob failed with %+v", err)
    }
    recreatedKey, err := isec.RecreateFromSealed(sealed)
    if err != nil {
        fmt.Printf("Recreate from sealed failed with %+v", err)
    }
    plainTxt, err := isec.DataAsymmetricDecrypt(cipher, recreatedKey, isec.EncryptionPaddingPKCS1_v15)
    if ok := bytes.Compare(data, plainTxt); ok == 0 {
        fmt.Printf("Recreate from sealed blob success")
    } else {
        fmt.Printf("Recreate from sealed failed, expected %v, received %v", data, plainTxt)
    }

Get the Non-Secret Data


You can get the non-secret data in a Secure Data object.

  1. Call GetNonSecretSize and reference the handle of the Secure Data object.
    obtainedDataSize, err := sd.GetNonSecretSize()
  2. Call GetNonSecretData().
    data, err := sd.GetNonSecretData()

Manage Memory


When Secure Data objects are no longer needed, you must delete the memory resources associated with the objects. Call Close() on the object you want to delete. This call prevents race conditions by using synchronization primitives to block execution until the object is no longer in use by another thread.

err := sd.Close()

When a Secure Data object is an input of another Secure Data object, you must destroy the objects in the reverse order in which they were created. In other words, if Object A is an input of Object B, destroy Object B, then destroy Object A.

 

Protection Policies


Each Secure Data object contains a protection policy. The protection policy is a set of attributes that define allowed usage of the secret data (plaintext), the encryption method for sealing the Secure Data object, and more.

When the Secure Data object is in use, the protection policy is included in the form of an object. When you seal the object, the protection policy is packed as part of the sealed blob. When you recreate the object from the sealed blob, the protection policy is reconstructed and enforced.

For TPM enabled devices, you can use only the local protection policy: NewLocalPolicy. See Create a Secure Data Object for the appropriate attributes.


Crypto API

The Crypto API provides common cryptographic utilities and is intended for cryptography experts. The API supports trusted platform module (TPM) enabled devices only.

RSA Signing


You can sign a hash with an RSA private key on TPM enabled devices. RSA signing is performed inside the TPM. The output is signed data.

In this release, Intel® Security Dev API does not provide the ability to hash data.

The RSA signing function requires that you input the RSA private key as a Secure Data object. For more information, see Create a Secure Data Object.

To sign a hash with an RSA private key, call HashSign with the following parameters:

  • For the []byte parameter, provide the size of the data in bytes. The size must be 20 or 32 bytes.
  • For the hashed parameter, provide the pointer to the data buffer.
  • For the privateKey parameter, provide the Secure Data handle of the RSA private key.
  • For the padding parameter, provide the padding type.
  • For the hash parameter, provide the size of the buffer to contain the result.

Legal and Disclaimers

INTEL CONFIDENTIAL

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.


Intel, the Intel logo, and Intel Core are trademarks of Intel Corporation in the U.S. and/or other countries.
