
Intel® Advisor issue with resolving debug symbols


Problem:

An issue was discovered in Intel® Advisor where the tool was taking a long time to resolve symbols. During the finalization phase, Intel® Advisor does not progress past the message:

“Processing information for instruction mix report”.

Solution:

The issue is fixed in Intel® Advisor 2018 Update 1.

To work around the issue in earlier releases:

  1. Edit the file:

    Linux*: <advisor-install-dir>/config/collector/include/common.xsl

    Windows*: <advisor-install-dir>\config\collector\include\common.xsl

  2. Remove or comment out the XML block containing:

    <transformation name="Instruction Mix" boolean:deferred="true">

  3. Re-run the Survey analysis.

Tutorial: Using Inference to Accelerate Computer Vision Applications


Introduction

This tutorial will walk you through the basics of using the Deep Learning Deployment Toolkit's Inference Engine (included in the Intel® Computer Vision SDK). Here, inference is the process of using a trained neural network to infer meaning from data (e.g., images). In the code sample that follows, a video is fed frame by frame to the Inference Engine (our trained neural network), which then outputs a result (the classification of an image). Inference can be done using various neural network architectures (AlexNet*, GoogleNet*, etc.). This example uses a Single Shot MultiBox Detector (SSD) on a GoogleNet model. For an example of how SSD is used, see this article on the Intel® Developer Zone.

The Inference Engine requires that the model be converted to IR (Intermediate Representation) files. This tutorial will walk you through the basics of taking an existing model (GoogleNet) and converting it to IR files using the Model Optimizer.

The result of this tutorial is that you will see inference in action on a video by detecting multiple objects, such as people or cars. Here's an example of what you might see on a sample image.

So What's Different About Running a Neural Network on the Inference Engine?

  • The Inference Engine optimizes inference, allowing a user to run deep learning deployments significantly faster on Intel® architecture. For more information on performance on Intel® Processor Graphics, see this article.
  • Inference can run on hardware other than the CPU, such as the built-in Intel® GPU or an Intel® FPGA accelerator card.

How Does the Inference Engine Work?

The Inference Engine takes a representation of a neural network model and optimizes it to take advantage of advanced Intel® instruction sets in the CPU, and also makes it compatible with the other hardware accelerators (GPU and FPGA). To do this, the model files (e.g., .caffemodel, .prototxt) are given to the Model Optimizer which then processes the files and outputs two new files: a .bin and .xml. These two files are used instead of the original model files when you run your application. In this example, the .bin and .xml files are provided.

In the above diagram, IR stands for Intermediate Representation, which is just a name for the .xml and .bin files that are inputs to the Inference Engine.

What You’ll Learn

  • How to install the OpenCL™ Runtime Package
  • How to install the Intel® Computer Vision SDK
  • How to generate the .bin and .xml (IR) files needed by the Inference Engine from a Caffe* model
  • How to run the Inference Engine using the generated IR files in a C++ application
  • How to compare the performance of the CPU vs. the GPU

Gather your materials

  • 5th or greater Generation Intel® Core™ processor. You can find the product name in Linux* by running the ‘lscpu’ command. The ‘Model name:’ contains the information about the processor.

Note: The generation number is embedded into the product name, right after the ‘i3’, ‘i5’, or ‘i7’. For example, the Intel® Core™ i5-5200U processor and the Intel® Core™ i5-5675R processor are both 5th generation, and the Intel® Core™ i5-6600K processor and the Intel® Core™ i5-6360U processor are both 6th generation.

  • Ubuntu* 16.04.3 LTS
  • In order to run inference on the integrated GPU:
    • A processor with Intel® Iris® Pro graphics or HD Graphics
    • No discrete graphics card installed (required by the OpenCL™ platform). If you have one, make sure to disable it in BIOS before going through this installation process.
    • No drivers for other GPUs installed, or libraries built with support for other GPUs

 

This article continues here on GitHub.

Enhancing High-Performance Computing with Persistent Memory Technology


Introduction

Persistent memory (PMEM) technology is set to revolutionize the world of in-memory computing by bringing massive amounts (up to 6 terabytes (TB) per two-socket system) of byte-addressable non-volatile memory (NVM) at speeds close to those of dynamic random access memory (DRAM) for a fraction of DRAM’s cost. The most impactful benefits for in-memory computing include reduced application start-up time (no need to recreate memory data structures) and increased memory capacity. Given these developments, the question arises as to whether high-performance computing (HPC) can also take advantage of PMEM technology.

This article addresses this question by dividing the potential space of impact into three areas: system, middleware, and application. It provides general information, potential architectural source code changes, and real-world application examples for each area. This article doesn’t cover every possible case, and since this is a new technology, the examples shown here are still a work in progress.

Overview of Persistent Memory (PMEM) Technology

PMEM in Perspective

PMEM technology can be thought of as the latest evolution in the journey of NVM technologies. NVM technologies range from classic magnetic tapes, hard disk drives, and floppy disks, to read-only memory chips and optical disks, to the latest solid state drives (SSDs) in the market today. The common factor among all these technologies has always been the larger capacity but also poorer performance when compared to DRAM. This has, historically, created the two levels of system storage (primary versus secondary) that we are familiar with today.

Primary storage is designed to be very fast in order to feed the CPU with all the “hot” data it needs during computation, while secondary storage is designed for parking “cold” data and programs that are not needed at the moment but that need to survive when power is turned off. Although secondary storage can be used to store “hot” data, and in fact is used as such when DRAM is not large enough (for example, when swapping memory pages to disk), this approach is undesirable due to the non-negligible performance impact. Simply put, primary storage is fast, small, and volatile, while secondary storage is slow, large, and persistent.

With that in mind, we can see why one of the key differences in the design between the two levels is data access granularity. While primary storage allows CPUs to address and randomly access single bytes of data, the unit of data access in secondary storage is usually a block of 4 KB (sometimes even greater). This bulk access to data is needed to compensate for access latencies, which are orders of magnitude larger than those of primary storage. This difference in access granularity between primary and secondary storage is also responsible for the need to create two data models for applications:

  • For primary storage, the data model is more complex and richer, such as trees, heaps, linked lists, hash tables, and so on
  • For secondary storage, the data model is less flexible, such as serialized data structures in markup languages (for example, XML), comma-separated values files, structured query language tables, and so on

How is PMEM Different?

The revolutionary aspect of PMEM is that it will be byte-addressable and fast (like primary storage) without sacrificing too many of the benefits of secondary storage like large capacity, persistence, and low cost per byte (see Figure 1). 3D XPoint™ Memory Technology, jointly developed by Intel and Micron*, makes all this possible, in addition to providing access latencies close to those of DRAM. PMEM DIMMs will be directly accessible by the CPU, removing intermediate overheads such as the PCIe* bus transfer protocol. Although the world will still need secondary storage to cheaply archive massive amounts of data, PMEM is positioned to be the technology that will allow a large number of applications and systems to unify their data models.

Figure 1. How PMEM technology compares to DRAM and SSDs in terms of performance versus capacity. The figure also shows how PMEM is both byte-addressable and persistent.

An application that wants to persist some data structures on a PMEM device needs to make sure that modifications to that data structure are done atomically (for example, by using transactions) so as to avoid corruption caused by data stored in CPU caches not being flushed on time before power is turned off. However, an application can forgo this part if all it wants is more primary storage capacity. This can come in handy for memory-bound HPC applications, as we will see next.

The System’s Point of View

The first benefit that PMEM will bring to HPC will be a larger primary storage capacity (6 TB per two-socket system). To understand how HPC applications could potentially take advantage of PMEM, first we need to conceptually visualize how this new technology will fit inside the overall memory hierarchy.

Three Logical Architectures

As shown in Figure 2, applications can use three logical architectures when integrating PMEM: DRAM as cache, PMEM as a DRAM extension, and DRAM as a temporary buffer.

Figure 2. Three logical architectural possibilities for using PMEM as extended capacity for memory-bound applications

In the DRAM as cache scenario (see Figure 2a), applications will use PMEM as a new layer inside the memory hierarchy. Applications will allocate memory for their data structures in PMEM, hence using PMEM as primary storage, while using DRAM only as L4 cache. However, with this approach, all data consumed (that is, addressed) by the CPU during regular computation will still be against DRAM. This means that data movement between DRAM and PMEM will need to be handled using some kind of explicit caching supporting code.

In the PMEM as DRAM extension scenario (see Figure 2b), applications will use all the available memory capacity as a single memory pool. Memory can first be allocated in DRAM; if more is needed, allocation then continues on PMEM. With this approach, data consumed by the CPU during regular computation will be either in DRAM or PMEM, causing variability in access latencies depending on what part of the data the CPU accesses.

In the DRAM as temporary buffer scenario (see Figure 2c), applications composed of different computational kernels, each one using different memory usage patterns, could utilize one type of memory or the other depending on the particulars of each kernel. An example of such a kernel is the 3D Fast Fourier Transform (3D-FFT), which transforms data in order to be used with spectral methods. 3D-FFTs require multiple passes over the same data points, hence it is advantageous to compute it always against DRAM. Note that logical architecture (a) is really a subset of (c).
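One concrete way to realize the PMEM as a DRAM extension scenario (b) in C is a fallback allocator. The sketch below is only an illustration, not code from the article; it assumes the open source memkind library and a hypothetical file-system DAX mount point /mnt/pmem. DRAM is tried first through the default kind, and the allocation falls back to a PMEM-backed kind when DRAM is exhausted.

#include <stdio.h>
#include <memkind.h>

/* Hypothetical mount point of a file-system DAX (fsdax) PMEM device. */
#define PMEM_DIR "/mnt/pmem"

int main(void)
{
    memkind_t pmem_kind;

    /* Create a PMEM kind backed by files in PMEM_DIR (0 = limited only by the file system). */
    if (memkind_create_pmem(PMEM_DIR, 0, &pmem_kind) != MEMKIND_SUCCESS) {
        fprintf(stderr, "could not create PMEM kind in %s\n", PMEM_DIR);
        return 1;
    }

    size_t bytes = (size_t)1 << 30;        /* a 1 GiB buffer */
    memkind_t kind = MEMKIND_DEFAULT;      /* try DRAM first */
    void *buf = memkind_malloc(kind, bytes);
    if (buf == NULL) {                     /* DRAM exhausted: fall back to PMEM */
        kind = pmem_kind;
        buf = memkind_malloc(kind, bytes);
    }

    if (buf != NULL) {
        /* ... use buf as ordinary (volatile) memory ... */
        memkind_free(kind, buf);           /* free with the kind that allocated it */
    }

    memkind_destroy_kind(pmem_kind);
    return 0;
}

A production allocator would track the kind per allocation, but the structure above captures the idea: the application sees one memory pool, and only the allocation path decides whether a given object lives in DRAM or PMEM.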

Stencil Applications

Memory-bound, large-scale HPC applications will directly benefit by being able to allocate larger problem sizes. An example of such applications are stencil (that is, nearest neighbor) computations. Stencil applications are used in the implementation of partial differential equation solvers through iterative finite-differences techniques. Solving the 3D heat equation is a typical stencil problem.

H_{t+1}[i,j,k] = a · H_t[i,j,k] + b · (H_t[i−1,j,k] + H_t[i,j−1,k] + H_t[i,j,k−1] + H_t[i+1,j,k] + H_t[i,j+1,k] + H_t[i,j,k+1])

This equation is a stencil representing a single out-of-place (that is, a new value is stored in Ht+1, not Ht) Jacobi iteration executed for each data point (i,j,k) in a 3D grid. Since this access pattern is regular and predictable—and hence data can be smartly pre-fetched to DRAM before it is used—it is ideal for the DRAM as cache architecture. Indeed, this kind of data pre-fetching is known in the literature as “blocking.” For example, with blocking, data is split into core blocks of equal size that fit neatly into L3 cache, in a way that cache misses can be significantly reduced. Likewise, core blocks can subsequently be split into thread blocks for parallelization or even register blocks to take advantage of data level parallelism (that is, vectorization).

Following this logic, we can think of DRAM blocks—an extra layer of blocking to optimize data pre-fetching to DRAM. Figure 3 shows a high-level description of blocking with the new added layer on the far left.
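To make the blocking idea concrete, here is a minimal C sketch (my own illustration, not code from the article) of one out-of-place Jacobi sweep over the interior of an n×n×n grid with a single level of blocking; in the DRAM as cache architecture, the same loop structure is simply repeated with a coarser DRAM-block size for data staged in from PMEM.

#include <stddef.h>

#define IDX(i, j, k, n) ((size_t)(i) * (n) * (n) + (size_t)(j) * (n) + (k))

/* One out-of-place Jacobi iteration: next[i,j,k] is computed from cur[i,j,k]
 * and its six nearest neighbors.  B is the block (tile) edge length, chosen
 * so that one block fits in the target level of the memory hierarchy. */
void jacobi_sweep_blocked(const double *cur, double *next,
                          int n, int B, double a, double b)
{
    for (int ii = 1; ii < n - 1; ii += B)
    for (int jj = 1; jj < n - 1; jj += B)
    for (int kk = 1; kk < n - 1; kk += B)
        for (int i = ii; i < ii + B && i < n - 1; i++)
        for (int j = jj; j < jj + B && j < n - 1; j++)
        for (int k = kk; k < kk + B && k < n - 1; k++)
            next[IDX(i, j, k, n)] =
                a * cur[IDX(i, j, k, n)] +
                b * (cur[IDX(i - 1, j, k, n)] + cur[IDX(i, j - 1, k, n)] +
                     cur[IDX(i, j, k - 1, n)] + cur[IDX(i + 1, j, k, n)] +
                     cur[IDX(i, j + 1, k, n)] + cur[IDX(i, j, k + 1, n)]);
}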

Figure 3. Pre-fetching data for stencil applications using blocking, with an extra layer (far left) added for PMEM. This figure is a modified version of Figure 2 in Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures.

The Middleware’s Point of View

Another way in which HPC applications can take advantage of PMEM is by enhancing libraries and services sitting at the middleware layer, making them “PMEM-Aware.” The idea is to bring the benefits of this new technology to applications, while also avoiding significant coding efforts.

PMEM-Aware Checkpoint/Restart

Checkpoint/Restart (C/R) in HPC can be enhanced by adding a PMEM-aware buffer at the local node level. These checkpoints can then be transferred asynchronously from PMEM to a remote Distributed File System (DFS) or even to intermediate burst buffer servers, without slowing the progress of execution significantly. This use of C/R is known in the community as hierarchical C/R (see Figure 4).

Persistence assures applications that the last checkpoint will be readily available as soon as all processes finish checkpointing to PMEM (even before the remote copy to the DFS completes). This holds, of course, as long as the failure in question does not affect the data saved on the PMEM DIMMs. Persistence can also help reduce the frequency of remote checkpointing to the DFS. The classic Young's formula, Tc = √(2 × C × MTBF), where C is the cost of writing one checkpoint, says that the optimal checkpoint interval Tc grows with the mean time between failures (MTBF), so the checkpoint frequency 1/Tc decreases as the MTBF increases. Because PMEM adds an extra layer of safety to the data, the probability of losing a checkpoint decreases and the effective MTBF increases. Less-frequent remote checkpointing means less overhead overall (transferring huge amounts of data remotely is not cheap).
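To put illustrative numbers on this: if writing one remote checkpoint costs C = 60 seconds and the MTBF is 24 hours, Young's formula gives Tc = √(2 × 60 s × 86,400 s) ≈ 3,220 seconds, or roughly one remote checkpoint every 54 minutes. If a local PMEM buffer effectively doubles the MTBF that the remote tier has to protect against, the interval grows by a factor of √2, to about 76 minutes. The figures are made up, but they show how a modest reliability gain translates directly into fewer expensive remote checkpoints.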

Although other alternatives exist for doing local checkpoints, such as SSDs, PMEM’s unique features will likely make it the key technology in this regard.

Figure 4. PMEM as a first level in hierarchical check-pointing.

Since PMEM’s capacity will be larger than that of DRAM, it will be possible to store more than one checkpoint at a time if needed. This buffering can help improve network utilization by coordinating remote copying with phases of low application network traffic.

MPI-PMEM Extensions

Probably the most well-known middleware in HPC is the Message Passing Interface (MPI). This section describes two extensions that have already been written for MPICH*, the open source MPI implementation from Argonne National Laboratory. You can download, use, and modify them from the linked GitHub* repository. Of course, these extensions aren’t the only ones possible.

The first extension, located under the directory mpi_one_sided_extension, makes MPI one-sided communication PMEM-aware by allowing processes to declare “persistent windows.” These windows, which outlive the execution of the application, can be useful for C/R. Another use case is parameter sweep scenarios, where the same application needs to be run with the same input data but different input parameters. For these cases, the input data can be loaded just once from secondary storage to PMEM and then reused across multiple executions, hence reducing application start-up time.
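For orientation, here is a hedged C sketch of the kind of one-sided code such an extension targets. The MPI calls are standard; the "pmem_path" info key is purely hypothetical and stands in for whatever mechanism the actual extension uses to mark a window as persistent.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Hypothetical hint: the real MPI-PMEM extension defines its own way of
     * naming the PMEM file that backs a persistent window. */
    MPI_Info_set(info, "pmem_path", "/mnt/pmem/window0");

    double *base;
    MPI_Win win;
    MPI_Win_allocate(1024 * sizeof(double), sizeof(double), info,
                     MPI_COMM_WORLD, &base, &win);

    /* ... MPI_Put/MPI_Get epochs as usual; with a persistent window the
     * contents would outlive this run and be reusable by the next one ... */

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}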

The second extension, located under the directory mpiio_extension2, integrates PMEM into MPI-IO using two modes: (1) PMEM_IO_AWARE_FS and (2) PMEM_IO_DISTRIBUTED_CACHE. In (1), all the available PMEM DIMMs attached to all the nodes are combined into a single logical view to form a PMEM DFS. Requests for data not present in the current node are forwarded to the appropriate node using a broker. In (2), the PMEM DIMMs attached to the nodes serve as a huge cache (again, up to 6 TB per two-socket system) for data stored in a remote DFS. Given that PMEM is expected to offer speeds close to those of DRAM, the potential of mode (2) to boost MPI-IO performance cannot be overlooked.

Persistent Workflow Engines

Another class of HPC middleware used in the scientific community is workflow engines (WEs) such as Galaxy* or Swift*. Jobs that run as workflows must divide the work into individual tasks. These tasks run independently from one another (they are data-independent, so they do not share any state, and all input data is passed by value), and each one is usually an autonomous piece of software. You can think of a WE as the glue that sticks together different software pieces, which by themselves would not talk to each other, to create a coherent whole: connecting outputs to inputs, scheduling, allocating needed resources, and so on.

One of the main issues with WEs is that, for the most part, tasks talk to each other via files (and sometimes database engines). A task usually runs to perform a specific analysis, for which it reads its input file(s) and writes its results as output file(s), which other tasks then use as input(s), and so on. Here we can see how PMEM can be leveraged to create a fast buffer for a workflow’s intermediate data instead of relying so heavily on files. Tasks can also be optimized to use specific in-memory data structures directly, instead of having to recreate them from flat files as is usually the case, which can also help simplify code and speed up execution.

The Application’s Point of View

At the application level, the applications themselves are directly responsible for defining which data structures should be persistent and for acting accordingly (for example, by writing to them atomically to avoid potential corruption). Intel is closely collaborating with other key players in the industry through the Storage Networking Industry Association (SNIA), which has developed the Persistent Memory Developer Kit (PMDK) based on the NVM Programming Model (NPM) standard. PMDK is composed of multiple open source libraries and APIs whose goal is to aid programmers in adapting their applications to PMEM. Although its usage is not mandatory, it is nonetheless recommended, especially for newcomers.
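To give a flavor of what PMDK code looks like, here is a minimal libpmemobj sketch in C (the pool path and layout name are illustrative, and this is not code from the article): a value in the pool's root object is updated inside a transaction, so the store either completes or is rolled back if power is lost mid-update.

#include <stdio.h>
#include <libpmemobj.h>

struct my_root {
    double temperature;   /* some persistent application state */
};

POBJ_LAYOUT_BEGIN(example);
POBJ_LAYOUT_ROOT(example, struct my_root);
POBJ_LAYOUT_END(example);

int main(void)
{
    /* Create (or, on later runs, pmemobj_open) a pool on a PMEM-aware file system. */
    PMEMobjpool *pop = pmemobj_create("/mnt/pmem/example.pool",
                                      POBJ_LAYOUT_NAME(example),
                                      PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL) {
        perror("pmemobj_create");
        return 1;
    }

    TOID(struct my_root) root = POBJ_ROOT(pop, struct my_root);

    /* Update atomically: the transaction logs the old value, so a crash
     * in the middle of the update cannot leave the field corrupted. */
    TX_BEGIN(pop) {
        TX_ADD(root);
        D_RW(root)->temperature = 42.0;
    } TX_END

    pmemobj_close(pop);
    return 0;
}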

In the case of HPC applications, especially simulations, the benefit of persistent data structures is unclear. When we think of access latencies, we need to ask: What is the added benefit of persistence that can make it worth having larger access latencies, plus the overhead of transactions? In addition, given the nature of HPC simulations where data evolves over time, is it worth persisting data that will change soon? Apart from check-pointing, it is difficult to think of other obvious benefits that persistence can bring to HPC simulations. If you think otherwise and have a good use case, please contact me.

However, other applications used in conjunction with HPC simulations can benefit from having persistent data structures. An example of this is in situ visualization.

Interactive In Situ Visualization with PMEM

In situ visualization is a technique designed to avoid excessive data movement between the HPC system, where simulations are run, and the visualization system, where graphics are rendered (see Figure 5). Instead of check-pointing data to a file system to be used later as input for visualization, the visualization itself—or part of it, for visualizations are usually expressed as a sequence of data transformations—is done in the HPC system at the same time as data is being generated by the simulation. A visualization library is called at the end of each time step with raw data passed, in most cases, by reference (to avoid data copying as much as possible). The simulation can continue to the next step only when the visualization is done.

Figure 5. (a) Traditional HPC visualization versus (b) in situ. In (a), the simulation performs expensive checkpoints (step 2) to store the raw data for visualization. However, in (b) a large part of the data transformation and rendering, if not all, is performed in the HPC system itself. Transformed and/or rendered data, which is smaller than raw data, is then forwarded to the visualization application or stored for later use.

One of the limitations of this approach is lack of flexibility. Once the simulation advances to the next time step, data from the previous one is usually overwritten in memory, which limits the opportunities to interact with the visualization, such as changing parameter values for coloring, adding or removing filters, slicing, adding extra lights, moving the camera, and so on. In addition, restarting a simulation, which may have been running for days, to change some visualization parameters is not feasible either.

Here is where PMEM can help by allowing the persistence of a window of time steps. This window, in turn, can allow simulation interaction by changing parameters and re-rendering the simulation from the beginning of the window. One can imagine a scenario where users, dissatisfied with the visualization that is currently being generated, may want to explore the use of different visualization options and parameters before continuing with the simulation, but without restarting it. Since the window is persistent, it outlives the simulation. The visualization could, theoretically, be accessed and interacted with long after the simulation is done.

I am working on a prototype to make in situ visualization interactive with PMEM using the library libpmemobj from PMDK. I chose ParaView* as the visualization application, which is well known in the HPC community. I will keep everybody posted on my progress and the lessons learned from this exciting project.

Summary

This article explored the idea of enhancing HPC with PMEM technology. Starting with a general definition of PMEM, the article then described the potential impact to the system, middleware, and application, as well as potential architectural code changes and real-world application examples about each area. All the examples shown in the article are still works in progress and subject to changes over time. I welcome any new ideas and developments from the community. Email me at eduardo.berrocal@intel.com.

About the Author

Eduardo Berrocal joined Intel as a cloud software engineer in July 2017 after receiving his PhD in Computer Science from Illinois Institute of Technology (IIT) in Chicago, Illinois. His doctoral research interests focused on data analytics and fault tolerance for HPC. In the past he worked as a summer intern at Bell Labs (Nokia), as a research aide at Argonne National Laboratory, as a scientific programmer and web developer at the University of Chicago, and as an intern in the CESVIMA laboratory in Spain.

Resources

  1. Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures, Kaushik Datta et al., http://mc.stanford.edu/cgi-bin/images/e/ec/SC08_stencil_autotuning.pdf.
  2. Link to MPI-PMEM Extensions code in GitHub.
  3. The Persistent Memory Developer Kit (PMDK).
  4. The Non-Volatile Memory (NVM) Programming Model (NPM) Standard: https://www.snia.org/tech_activities/standards/curr_standards/npm.
  5. The Open Source, multi-platform data analysis and visualization application ParaView: https://www.paraview.org/.
  6. In-Situ Visualization: State-of-the-art and Some Use Cases, Marzia Rivi et al., CINECA & Scientific Computing Laboratory, Institute of Physics Belgrade, University of Belgrade, http://www.prace-ri.eu/IMG/pdf/In-situ_Visualization_State-of-the-art_and_Some_Use_Cases-2.pdf.
  7. The Message Passing Interface (MPI).
  8. A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers, Kento Sato et al., 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), http://ieeexplore.ieee.org/abstract/document/6846437/.
  9. MPICH: a high performance and widely portable implementation of the Message Passing Interface (MPI) standard, http://www.mpich.org.
  10. Galaxy Workflow Engine Project: https://galaxyproject.org/.
  11. Swift Workflow Tool: http://swift-lang.org/main/.
  12. Optimization of a multilevel checkpoint model with uncertain execution scales, Sheng Di et al., Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14), 2014, https://dl.acm.org/citation.cfm?id=2683692.

Use the GNU Debugger (GDB) to Investigate a Segmentation Fault (SIGSEGV)


Overview

 

The GNU debugger, GDB, can be your best friend in performing source-level debugging during application development. This article does not go through GDB’s basic usage, which is easy to find with a quick search. Instead, it shows how GDB can help in application crash scenarios, using a segmentation fault as a case study. Code snippets are used to show how to apply GDB’s commands, and the Eclipse* IDE integrated into Intel® System Studio 2018 is used to illustrate GDB's usage.

 

Use GDB’s backtrace command to inspect the call stack when a crash happens

 

When an application crashes, you want to discover the line of code where it failed. GDB’s backtrace command gives you the call stack (the function call history) right after the crash occurs in most cases. Consider the following example code, which writes data through an uninitialized string pointer to trigger a segmentation fault: typing “backtrace” in the debugger console window instantly reveals the code line where the segmentation fault occurred. Check the screenshot below.

 

//// code snippets - start
#include <stdio.h>

static char buffer [256];   /* declared but never used as the destination */
static char* string;        /* static pointer, so it is initialized to NULL */

static long do_that() {
 /* gets() writes through the NULL pointer 'string',
  * which triggers the segmentation fault (SIGSEGV). */
 gets(string);
 return 0;
}

static long do_this() {
 do_that();
 return 0;
}

int main()
{
   printf ("input a string: ");
   do_this();
   printf ("\n\n the input string: %s\n", string);
}
//// code snippets - end

 Figure 1. GDB’s backtrace example

 

Sometimes backtrace does not work

backtrace is useful, but it is not always able to reveal the call stack in detail. Consider the following sample code snippet; here, backtrace cannot show helpful information.

////code snippets - start
struct A_arg {
    long arg1;
    long arg2;
    long count1;
    long count2;
    long count3;
    long count4;
};

struct B_arg {
    long arg1;
    long arg2;
};

static long fun_A(void* argB) {
    /* The caller actually passes a struct B_arg, which is much smaller than
     * struct A_arg, so the write below lands outside the passed object. */
    struct A_arg* myarg = (struct A_arg*) argB;
    if (!myarg) return -1;
    myarg->count4 += 1; // out-of-bounds write past B_arg corrupts the stack
    return 0;
}

static long fun_B() {
    struct B_arg arg = { 1, 2 };  /* only two longs live on the stack here */
    return fun_A(&arg);           /* fun_A writes past the end of arg */
}

int main()
{
    return (int) fun_B();
}
////code snippets - end

Check the figure below: backtrace misinterprets the last function call as “0x7ffffffffbca0 in ??()”, which is invalid. This is because the code snippet accidentally updates the wrong memory content, which happens to corrupt the call stack.

 

Figure 2. GDB’s backtrace cannot interpret the call stack

 

Use Intel® System Studio 2018’s Function Call History view to see a clear call stack

 

Instead of interpreting the call stack only when the segmentation fault occurs, you can log the function call history during the run; Intel® System Studio 2018 integrates a Function Call History view to facilitate this debugging experience.

 

To use this feature, make sure you are on a 4th generation Intel® Core™ processor platform or later. The next step is to run the GDB command “record btrace bts” before the crash happens; “record btrace bts” enables the Branch Trace Store (BTS) feature to record the history of function calls. Once the problematic program hits the exception and a signal such as SIGSEGV is raised for the segmentation fault, the Function Call History window immediately shows the functions the program has traversed, giving you full debug information about how the crash happened. Check the Debugger Console window and the Function Call History window in the figure below.

Figure 3. Use Intel System Studio 2018 to review function call history.

 

On Intel platforms newer than 4th generation Intel® Core™ processors, you can type “record btrace pt” to enable Intel® Processor Trace, which uses a compressed log format with lower overhead and more recorded entries than BTS (Branch Trace Store).

 

Intel® System Studio 2018 integrates the Function Call History view into its IDE to display the records from “record btrace bts” and “record btrace pt”. You can also use “record function-call-history” to view the records in plain text in the debugger console. A 90-day free, full license of Intel® System Studio 2018 is available via http://intel.ly/system-studio; register and download on that web page.

 

More debugging tips you might want to know

There are quite a few segmentation faults caused by out-of-bounds pointer or array accesses. You can try Intel® C++ Compiler options such as -check-pointers and -check-pointers-undimensioned to help catch these invalid accesses. Other tools can help too: valgrind detects memory leaks, invalid memory accesses, and other memory issues, while strace lets you review logs of how the program invokes system calls and interacts with system-level components to debug more complicated issues.

 

See also

GDB - The GNU* Project Debugger for Intel® Architecture - https://software.intel.com/en-us/articles/gdb-ia-embedded

DebuggingProgramCrash - https://wiki.ubuntu.com/DebuggingProgramCrash

Intel C++ Compiler’s -check-pointers option - https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-check-pointers-qcheck-pointers

Intel C++ Compiler’s -check-pointers-undimensioned option - https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-check-pointers-undimensioned-qcheck-pointers-undimensioned

 

Intel® and MobileODT* Competition on Kaggle*: 1st Place Winner


A Lithuanian Team Tests the Capabilities of AI to Improve the Precision and Accuracy of Cervical Cancer Screening

Editor's note: This is one in a series of case studies showcasing finalists in the Kaggle* Competition sponsored by Intel® and MobileODT*. The goal of this competition was to use artificial intelligence to improve the precision and accuracy of cervical cancer screening.


First-place winners for the Intel® & MobileODT* Cervical Cancer Screening Kaggle* Competition: (from left) Jonas Bialopetravičius, Ignas Namajūnas, and Darius Barušauskas of Team TEST.

Abstract

More than 1,000 participants representing 800 data scientist teams developed algorithms to accurately identify a woman’s cervix type based on images as part of the Intel and MobileODT* Competition on Kaggle. Such identification can help prevent ineffectual treatments and allow health care providers to offer proper referrals for cases requiring more advanced treatment.

This case study follows the process used by the first-place-winning team, TEST (Towards Empirically Stable Training), to create an algorithm that would improve this life-saving diagnostic procedure.

Kaggle Competitions: Data Scientists Solve Real-world Machine Learning Problems


The goal of Kaggle competitions is to challenge and incentivize data scientists globally to create machine-learning solutions in a wide range of industries and disciplines. In this particular competition – sponsored by Intel and MobileODT, developer of mobile diagnostic tools – more than 1,000 participants representing over 800 data scientist teams each developed algorithms to correctly classify cervix types based on cervical images.

In the screening process for cervical cancer, some patients require further testing while others don't; because this decision is so critical, an algorithm-aided determination can improve the quality and efficiency of cervical cancer screening for these patients. The challenge for each team was to develop the most efficient deep learning model for that purpose.

Team TEST Applies AI Expertise to Cervical Cancer Screening

The winning team consists of these Master and Grandmaster Kaggle competitors all from Lithuania:


Ignas Namajūnas, Mathematics BS and Computer Science MS, has nearly three years of research and development experience. He served as research lead for nine months on a surveillance project.

 


Darius Barušauskas, MSc in Econometrics, has worked for more than six years in machine learning and deep learning applications. He has created more than 30 machine learning and credit-scoring models for companies in the financial, utilities, and telco sectors. Barušauskas reached Grandmaster tier within a year of joining Kaggle.

 


Jonas Bialopetravičius, Software Engineering BS, Computer Science MS, has more than six years of professional experience in computer vision and machine learning. He is currently studying astrophysics, where he applies deep learning methods.

The team’s experience in successfully training object detectors gave it a considerable advantage. Bialopetravičius and Namajūnas won a previous deep learning competition that required similar know-how, which they easily transferred to this project. "We saw this challenge as an opportunity to bring our experience to an important task and potentially improve the quality of cancer diagnostics," said Namajūnas. "As we had a good toolset to do so, it seemed very logical to adapt our technological know-how for this specific task."

Determining the Steps to a Most Efficient Solution

Team TEST not only realized the special importance of this competition – literally saving lives – but also saw it could be approached as an object detection challenge, where they already had achieved success.

Team members divided responsibilities: Barušauskas created cervix bounding boxes to achieve smaller region of interest within an image, set up validation sets for training, and attempted model stacking. Namajūnas examined data, searched for the right augmentations, and tended to training/testing details. Bialopetravičius worked on the general problem-solving pipeline, trained models and experimented with data augmentations. The team did not meet face-to-face during the challenge but communicated via Slack* messages.

Members used the Faster R-CNN detection framework with VGG16 as the feature extractor. Their main challenge was setting up a good validation set, finding the right augmentations for data, and resolving training details to optimize validation scores.

In total, they used six Faster R-CNN models. A separate model was first trained on all available bounding-box annotated data, which then was run on the stg1 testing set to obtain bounding-boxes. The resulting boxes, combined with the stg1 test set labels, were used for training the rest of the models. This could be generalized to new data, if class labels were provided for each image. Although they believed human-annotated bounding boxes would probably deliver the best result, the team concluded it would be more efficient overall to use bounding boxes generated by the models versus not having bounding boxes at all.

Of the five models that were trained on all of the data, one was trained for classification in night-vision images. This model was used only when a testing image was identified to be night-vision (which was easily done since the color distribution made it obvious). For the majority of remaining images, four different models were used, each doing inference on an image nine times (varying the scale and rotation); the output was then averaged.

In addition, some models were run with a modified non-maximum suppression scheme, yielding a total of 54 inferences over the four models. Team TEST combined the output of different models by taking the mean of individual predictions.

Augmentations – Color Proves a Key Insight

Data augmentations played a crucial role in the team’s competitive performance. While examining data, team members noticed that the most discriminative features were related to how much red blood-like structure was visible. This inspired an important strategy: augmenting the contrast of the red channel (i.e., the color of blood and tissue) was particularly helpful.

The augmentations in order of importance were:

  • Augmenting the contrast of the red channel
  • Randomly rotating the images (in the range of −45 to 45 degrees)
  • Turning night vision images to grayscale
  • Blurring the images

Additional data was sampled so that the proportion of original:additional dataset images would be 1:2; had this not been done, the proportions would be closer to 1:4.

Simplified model

One of the models in the ensemble - red color contrast augmentations 0.4 - could be used separately. Team TEST managed to achieve a log loss of 0.79035 with it (compared to the winning submission's 0.76964), so even discarding the rest of the ensemble would have yielded a first-place showing on the leaderboard. This model only needs to be trained once and performs a total of nine inferences. The number of inferences could probably be reduced further without a large drop in log loss.

Team TEST used a customized py-R-FCN (which included Faster R-CNN) code starting from this GitHub* repository.

Training and Inference Methods

The effectiveness of training relied heavily on generating extra data. They trained R-CNN-like detectors to discover the bounding box of the cervix while simultaneously classifying its type; no models were trained on whole images.

Team members found it beneficial to cast the problem as an object detection exercise (they had their own bounding box annotations) since the region of interest was usually quite small. Each model generated inferences on various scales and rotations of the testing images and the predictions were averaged using a simple arithmetic mean.

Training and Prediction Times

One of the models in the ensemble, red color contrast augmentations 0.4, achieved a log loss of 0.79 (the ensemble achieved 0.77), a score good enough to win the competition. This model trained in eight hours and needs 0.7 seconds to generate predictions for a single image. (The ensemble needs around 50 hours of training and seven to 10 seconds for inference.)

Dependencies

Results and Key Findings…and a Plan to Keep Saving Lives

In its post-competition write-up, Team TEST noted: "Our log loss of ~0.77 is equivalent to always giving the correct class around 46% confidence. Better accuracy could be achieved with more data and a more precise labeling."
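That equivalence is easy to verify: multiclass log loss averages −ln(p) over the probability p assigned to the true class, and e^(−0.77) ≈ 0.46, so always giving the correct class about 46% confidence reproduces a loss of roughly 0.77.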

One of their key insights involved the importance of a proper validation scheme. "We noticed that the additional dataset had many similar photos as in the original training set, which itself caused problems if we wanted to use additional data in our models," they wrote. "Therefore, we applied K-means clustering to create a trustworthy validation set. We clustered all the photos into 100 clusters and took 20 random clusters as our validation set. This helped us track if the data augmentations we used in our models were useful or not."

Figure 1. Confusion matrix validation set 1, which is sampled from "original" data: loss is 0.62, accuracy is 73%.

Figure 2. Confusion matrix validation set 2, which is sampled from "additional" data: loss 0.76, accuracy is 68%.

For their achievement in the Kaggle Competition, Team TEST will share a $50,000 first-place prize. Going forward, the members intend to apply the lessons from their Kaggle experience to other real-life challenges: they and two other associates are founding their own startup to apply their deep learning expertise in other life-saving medical technologies.

"Since the competition we have been focusing on radiological tasks – lungs, brains, liver, cardiovascular, and so forth," said Namajūnas. "Hands-on Deep Learning experience has helped us to make quite a few models. We are also about to deploy our first model to a local hospital."

Learn More About Intel Initiatives in AI

Intel commends the AI developers who contributed their time and talent to help improve diagnosis and treatment for this life-threatening disease. Committed to helping scale AI solutions through the developer community, Intel makes AI training and tools broadly accessible through the Intel® AI Academy.

Take part as AI drives the next big wave of computing, delivering solutions that create, use and analyze the massive amount of data that is generated every minute.

Sign up with Intel AI Academy to get the latest updates on competitions and access to tools, optimized frameworks, and training for artificial intelligence, machine learning, and deep learning.

Intel® and MobileODT* Competition on Kaggle*: 4th Place Winner


Kaggle* Master Luis Andre Dutra e Silva Develops Two AI Solutions to Improve the Precision and Accuracy of Cervical Cancer Screening

Editor's note: This is one in a series of case studies showcasing finalists in the Kaggle* Competition sponsored by Intel® and MobileODT*. The goal of this competition was to use artificial intelligence to improve the precision and accuracy of cervical cancer screening.

Intel® and MobileODT*

Abstract

More than 1,000 participants from over 800 data scientist teams developed algorithms to accurately identify a woman's cervix type based on images as part of the Intel and MobileODT* Competition on Kaggle*. Such identification can help prevent ineffectual treatments and allow health care providers to offer proper referrals for cases requiring more advanced treatment.

This case study details the process used by fourth-place winner Luis Andre Dutra e Silva – including his innovative use of Intel® technology-based tools – to develop an algorithm to improve the process of cervical cancer screening. To do so, he developed two solutions and then narrowed them down to one for submission.

Kaggle Competitions: Data Scientists Solve Real-world Problems Using Machine Learning


The goal of Kaggle competitions is to challenge and incentivize data scientists globally to create machine-learning solutions for real-world problems in a wide range of industries and disciplines. In this particular competition – sponsored by Intel and MobileODT, developer of mobile diagnostic tools – more than 1,000 participants from over 800 data scientist teams each developed algorithms to correctly classify cervix types based on cervical images.

In the screening process for cervical cancer, some patients require further testing while others don't. Because this decision is so critical, an algorithm-aided determination can significantly improve the quality and efficiency of cervical cancer screening for these patients. The challenge for each team was to develop the most efficient deep learning model for that purpose.

A Kaggle Master Competitor Rises to the Challenge of Cervical Cancer Screening

A veteran of multiple Kaggle competitions, Luis Andre Dutra e Silva was drawn to this challenge by its noble purpose, "and the possibility to explore new technologies that could be used further in other fields," he said. A federal auditor in the Brazilian Court of Audit, Silva plans to use AI knowledge in multiple applications in his job.

Two Approaches to Code Optimization

As a solo entrant, Silva determined from the beginning to try two different approaches and later verify which was best.

Solution 1 was based on the paper "Supervised Learning of Semantics-Preserving Hash via Deep Convolutional Neural Networks" (deep CNNs) by Huei-Fang Yang, Kevin Lin, and Chu-Song Chen, dated February 14, 2017. It promotes the concept of deriving binary hash codes that represent the most prominent features of an image by training a deep CNN. The network proposed by the authors is called Supervised Semantics Preserving Deep Hashing (SSDH). After training the SSDH network on each type of cervix, the semantic hashes would be the input to a Gradient Boosting Machine model using XGBoost as the classifier. The following image from the paper explains the inner workings of the proposed neural network architecture:


Figure 1. Supervised Semantics Preserving Deep Hashing. The hash function is constructed as a latent layer with K units between the deep layers and the output. (Huei-Fang Yang, Kevin Lin, Chu-Song Chen. Supervised Learning of Semantics-Preserving Hash via Deep Convolutional Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017, 1-15. K. Lin, H.-F. Yang, J.-H. Hsiao, C.-S. Chen. Deep Learning of Binary Hash Codes for Fast Image Retrieval. CVPR Workshop on Deep Learning in Computer Vision, DeepVision, June 2015.)

Solution 2 was based on training a U-Net that would be capable of generating bounding boxes for each of the three types of cervix and, finally, making an ensemble of four classification models based on the automatically generated bounding boxes of the competition's test set. It was based on the article, "U-Net: Convolutional Networks for Biomedical Image Segmentation" by Olaf Ronneberger, Philipp Fischer and Thomas Brox dated May 18, 2015.


Figure 2. Illustration from the article "U-Net: Convolutional Networks for Biomedical Image Segmentation." (University of Freiburg, Germany, 2015)

The Right Choice of Hardware and Software Tools Puts Silva in the Money

Silva adopted an open approach toward choosing different software configurations and hardware devices. But it was his commitment to working with Intel® technology that earned him his fourth-place finish in this Kaggle competition for "Best Use of Intel Tools", an honor carrying a $20,000 prize.

After checking out all available tools optimized for Intel® architecture, he recompiled them on an Intel® Xeon Phi™ coprocessor-based workstation. "Since I have been a hardware enthusiast for a long time now, I have two excellent Intel® workstations at home," he said. Both were built from scratch with Intel® Xeon® processors.

Equipped with the necessary hardware, Silva explored various software alternatives to determine their suitability for accomplishing both solutions.

A Step-by-step Process

Silva's first step was to obtain an SSDH model from GitHub*. The SSDH neural network is represented by this graph:


Figure 3. A representation of the SSDH neural network. Image obtained from the NVIDIA DIGITS open source tool.

His next task was to compile Berkeley Vision and Learning Center (BVLC) Caffe* with the Intel® Math Kernel Library (Intel® MKL) as the Basic Linear Algebra Subprograms (BLAS) library and the NVIDIA Collective Communications Library (NCCL) for GPU intercommunication. He then trained the SSDH network using the four GPUs; training time was approximately nine to 10 hours. Silva created a pycaffe script to extract the semantic hashes from each image of the training set and then used the extracted hashes to train an XGBoost model to learn each type of cervix.

Only after this point did Silva submit his results to the Kaggle competition; they were good enough to be deemed feasible. With the Kaggle competition feedback, he started his second solution.

Silva started by training the U-Net with bounding boxes of each cervix type from the training set. He used forward passes of the trained Keras U-Net to obtain the test set bounding boxes. Using only the training set regions of interest of the cervix types, Silva trained and ensembled four Caffe models to classify the test regions of interest (ROIs) as one of the three types of cervix.

Training the Networks

Training for each of the classification models in the second solution (two GoogleNet and two AlexNet) was done using Caffe with the Intel® Deep Learning SDK. "I chose the Intel DL SDK because it had image augmentation on the fly and it was a good test for the tools I had used so far," he said. "Using the Intel Deep Learning SDK, the training time was comparable to GPU and certainly the Intel® software tools and hardware must have a fundamental role for that performance."

Overcoming the Lack of Medical Training

Silva's greatest obstacle, understandably, was his limited medical background – he was incapable of distinguishing between the three cervix types just by observing them. "If I had that knowledge, I could make some preprocessing in the images in order to make each type more evident to the classifiers," he said. "But since it is not possible to have that knowledge in a short period of time, I trusted my cross-validation algorithm in order to be sure I was on the right path."

Results and Key Findings: What Set This Approach Apart

The first solution required complete knowledge of how to build and install Caffe, because the customized version of that framework is more than one year old. Nevertheless, the SSDH model was considerably efficient at producing semantic hashes that represented each type of image.

The second solution was inspired by real-world medical imaging tools for deep learning, and it demonstrated that the U-Net architecture was the best fit for the problem statement.

After 25,000 iterations, Silva was able to reach a plateau of 81% accuracy with the SSDH model. Other results and findings from both of his models are detailed in the charts below. The pictures were generated by the TensorBoard service, the Intel® Deep Learning Training Tool, and Microsoft Excel*.


Figure 4. Illustration of results and key findings. Graph A shows the accuracy of the SSDH model, which was trained with the full image dataset. After iteration 25000 it plateaus at approximately 81%. Graph B depicts Caffe* SSDH model loss during 10 hours of training. Graph C demonstrates Keras UNet validation dice coefficient scalar. Although it is very unstable, it shows a slight trend toward increasing. Graph D shows Keras UNet training dice coefficient indicating a stable trend of increasing in the training set.



Figure 5. Illustration of results and key findings. Panel A demonstrates Intel® Deep Learning SDK and Caffe* AlexNet model 1 with augmentation and light fine-tuning. Panel B shows Intel® Deep Learning SDK and Caffe* GoogleNet model 2 with augmentation and original weights.

Learn More About Intel Initiatives in AI

Intel commends the AI developers who contributed their time and talent to help improve diagnosis and treatment for this life-threatening disease. Committed to helping scale AI solutions through the developer community, Intel makes AI training and tools broadly accessible through Intel® AI Academy.

Take part as AI drives the next big wave of computing, delivering solutions that create, use and analyze the massive amounts of data that are generated every minute.

Sign up with Intel AI Academy to get the latest tools, optimized frameworks, and training for artificial intelligence, machine learning, and deep learning.

Art’Em – Artistic Style Transfer to Virtual Reality Final Update


Art’Em is an application that uses computer vision to bring artistic style transfer to real-time speeds at VR-compatible resolutions. The program takes in a feed from any source (an OpenCV webcam, the user's screen, an Android phone camera via IP Webcam, etc.) and returns a stylized image.

1. Introduction

Various tools were used to build this application.

This paper has been divided into three sections, each of which explores a different way of accelerating the current process of artistic style transfer.

The first section introduces the concept of XNOR-nets and conducts an in-depth case study of how parallelization can be performed efficiently without any approximations. While the method works, integrating the kernels with trainable deep learning models proved not to be feasible within the time frame of this project.

The second section studies generator networks and what is in some ways ‘one-shot’ image stylization. The method works well at VR compatible resolutions but is largely limited by the fact that each network takes a long time to learn a new style.

The third section studies a technique called Adaptive Instance normalization, which allows the user to not only stylize extremely fast, but also instantly switch to any style of choice.

2. XNOR Networks

XNOR-nets (exclusive-NOR networks) have been explored in depth in the following articles [1] [2]. The image below summarizes how you can replace matrix multiplication with simple operations (XNOR followed by a population count) provided every matrix element is either +1 or -1. This shows great promise in terms of network speedup.
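As a concrete sketch of that replacement (my own illustration, not the project's kernel code), the dot product of two 64-element vectors whose entries are all +1 or −1 can be computed from single packed 64-bit words with one XOR (the complement of XNOR) and one population count:

#include <stdint.h>

/* Dot product of two {+1, -1} vectors of length 64, each packed into one
 * 64-bit word (bit set = +1, bit clear = -1).
 *   matches    = popcount(xnor(a, b)) = 64 - popcount(a ^ b)
 *   mismatches = popcount(a ^ b)
 *   dot        = matches - mismatches = 64 - 2 * popcount(a ^ b)       */
static inline int binary_dot64(uint64_t a, uint64_t b)
{
    return 64 - 2 * __builtin_popcountll(a ^ b);   /* GCC/Clang builtin */
}

A full binary GEMM or convolution simply accumulates binary_dot64 over the packed words of each row/column or kernel/patch pair, which is where the speedup over floating-point multiply-accumulate comes from.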

Upon an in-depth study of binarization, we saw a great loss of image semantics, which can be clearly seen in the image below. This, however, is the result of an extremely destructive method of binarization. If the network is trained with the constraint that every matrix in the network must be a scalar real value multiplied by a matrix containing only +1 and -1, it has been shown[3] that very decent image recognition results can be obtained. The results obtained by AllenAI showed that a trained AlexNet XNOR network can reach a top-1 accuracy of 43.3%.

As shown in my previous technical article, creating an unoptimized XNOR general matrix multiply kernel can give speedups as great as 6 times over the cuBLAS kernel. This is simply because XNOR networks are applicable only to a very specific use case, so it is not too surprising that they are so much faster.

While this is great, the most important part of a convolutional neural network is the convolution. However, the promise of speedup in a convolutional layer isn’t as great as it is for a fully connected layer. This becomes clear when one compares the basic methodology behind the two operations.

[Source[4]]

As you can see, convolution requires us to select submatrices, and multiply them with a kernel. Whereas, in matrix multiplication, an entire row is to be multiplied with an entire column of a matrix.

The packing of bits into a data type is the most time-consuming process in an XNOR-net. So, if we imagine an M×M kernel convolving over an N×N matrix, we have to pack (N − M + 1)² submatrices into a data type and then execute an XNOR followed by a population count to generate the ‘convolved feature’.

For a matrix multiplication, however, if we multiply two N×N matrices together, we only have to pack 2N vectors (the rows and columns, respectively) to implement this function.
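For example, convolving a 4×4 kernel over a 64×64 input requires packing (64 − 4 + 1)² = 3,721 submatrices, whereas multiplying two 64×64 binarized matrices only requires packing 2 × 64 = 128 rows and columns.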

Thus we lose a lot of the expected performance gains. To confirm my suspicion, I implemented a low-precision convolutional network using CUDA (Compute Unified Device Architecture) C programming. The implementation can be found here[5]. To run this code you must have a CUDA-compatible device. Your ability to work with different image dimensions also largely depends on the VRAM available to you. The network has been parallelized by utilizing the available shared memory. It currently functions only for kernels of size 4x4; however, extending that will not be a big issue.

Allocation of the binarized kernels is done before the convolution timer begins, and the benchmarking is performed using the nvprof tool. The algorithm is timed after binarizing the kernels because a saved network would already have its kernels stored in the desired form. A shared array of 256 unsigned ints per block is allocated, along with the variables needed for indexing. The block size is set to (16,16) so that the shared array fits within the shared-memory constraint. The binarization code then loops through the kernel depth, iterating through every channel in the input ‘image’. Within each of these iterations, every thread populates its own element of the array with an unsigned int that packs its binarized submatrix. Once this is complete, a single line of code populates the output array with the result of multiplying the kernel with the corresponding submatrix generated from the input ‘image’. Note that the output array is flattened, but this can easily be handled. One very important thing to be careful about when parallelizing over channels, to make this applicable to frameworks like TensorFlow, is not to exceed the maximum grid and block size limits; exceeding them gives rise to memory allocation and access errors.

This is a very basic implementation, and while the network has been parallelized per convolution, channel convolutions have not been parallelized. There is scope for increasing the speeds mentioned below many-fold, but the same applies to a full precision convolutional operation. Even though the bitwise convolution would perform better with further parallelization, it has been omitted from the code.

While the network does run around 2 times faster than a basic general convolution kernel, the speed-up achieved by my methods does not increase enough to justify the loss in accuracy.

One of the other notable advantages of using a binarized network is that the entire network can be stored in much less space. An entire VGG-19 model can be reduced from ~512 MB to only ~16 MB. This is great for embedded devices that do not have the space to hold such large models, and even for GPUs and CPUs, as the entire model can be loaded at once into the VRAM of less powerful hardware, leaving room for larger datasets.

We must thus move on to a better technique for this use case, one which preserves the accuracy to a large extent and also gives a significant speedup.

3. Generator networks

Adapting the research done in this paper[6], we will study the implementation of a generator stylization network and Instance Normalization.

Essentially, an optimization based model of artistic style transfer will generally give better, more varied results but is extremely time consuming. We will thus train a generator neural network for every style we wish to adapt to. This gives much faster stylization at the expense of inferior quality and diversity compared to generating stylized images by optimization.

There are two steps involved in building a generator network.

Designing a generator network

This is one of the most important parts of this endeavor. A generator network must transfer style well but also be small enough to give good frame rates. If you do not want to train your own network, it is suggested to use this[7] implementation of the generator network. The network consists of three convolutional layers, followed by 5 residual blocks, followed by two transposed convolutions, and ends with a final convolutional layer.
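For illustration, here is a rough PyTorch sketch of such an architecture (the linked implementation[7] is written in TensorFlow; the channel counts and kernel sizes below are illustrative, not necessarily the exact ones used):

import torch.nn as nn

def conv_block(in_c, out_c, kernel, stride):
    return nn.Sequential(nn.Conv2d(in_c, out_c, kernel, stride, kernel // 2),
                         nn.InstanceNorm2d(out_c, affine=True),
                         nn.ReLU())

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(channels, channels, 3, 1, 1),
                                  nn.InstanceNorm2d(channels, affine=True),
                                  nn.ReLU(),
                                  nn.Conv2d(channels, channels, 3, 1, 1),
                                  nn.InstanceNorm2d(channels, affine=True))
    def forward(self, x):
        return x + self.body(x)

def upsample_block(in_c, out_c):
    return nn.Sequential(nn.ConvTranspose2d(in_c, out_c, 3, 2, 1, output_padding=1),
                         nn.InstanceNorm2d(out_c, affine=True),
                         nn.ReLU())

# Three conv layers -> 5 residual blocks -> two transposed convs -> final conv
generator = nn.Sequential(
    conv_block(3, 32, 9, 1), conv_block(32, 64, 3, 2), conv_block(64, 128, 3, 2),
    *[ResidualBlock(128) for _ in range(5)],
    upsample_block(128, 64), upsample_block(64, 32),
    nn.Conv2d(32, 3, 9, 1, 4),
)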

The generator network utilizes a method of normalization known as “Instance Normalization (IN)”. This method is different from batch normalization as it computes the statistics for each batch element independently. The illustration below (Taken from the poster made by Dmitry Ulyanov [8]) demonstrates how Batch Normalization and Instance Normalization differ. IN stands for Instance normalization, and BN stands for Batch Normalization.
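A minimal NumPy sketch of the difference (ignoring the learned scale and shift parameters): instance normalization computes its statistics per batch element and per channel, while batch normalization shares them across the whole batch.

import numpy as np

def instance_norm(x, eps=1e-5):
    # x: (N, C, H, W); mean/variance computed per sample, per channel
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # mean/variance shared across the batch for each channel
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.randn(8, 3, 64, 64)
print(instance_norm(x).shape, batch_norm(x).shape)  # both keep the (N, C, H, W) shape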

Training of generator network

Once the generator network has been set up and the MS-COCO dataset has been saved along with the VGG19 model, we need to implement a loss function.

The loss function includes a style loss, a content loss and a total variation loss. Including the total variation loss helps remove noise from the generated images.

The input image is fed to the generator network; we will call the output of the generator network the generated image. This is then forwarded to a VGG-19 model, along with the content and style targets. The loss value is calculated by comparing the layer activations of the VGG-19 for the generated image, the style target and the content target.
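A simplified NumPy sketch of such a combined loss is given below; the layer shapes and loss weights are illustrative only, not the values used in training.

import numpy as np

def gram(features):
    # features: (C, H*W) activations of one VGG-19 layer
    return features @ features.T / features.shape[1]

def total_variation(img):
    # img: (C, H, W); penalizes differences between neighboring pixels
    return np.abs(np.diff(img, axis=1)).sum() + np.abs(np.diff(img, axis=2)).sum()

def style_transfer_loss(gen_content, target_content, style_pairs, gen_img,
                        content_w=1.0, style_w=5.0, tv_w=1e-4):
    # gen_content/target_content: content-layer activations, shape (C, H*W)
    # style_pairs: list of (generated, style-target) activation pairs, one per style layer
    content_loss = np.mean((gen_content - target_content) ** 2)
    style_loss = sum(np.mean((gram(g) - gram(s)) ** 2) for g, s in style_pairs)
    tv_loss = total_variation(gen_img)
    return content_w * content_loss + style_w * style_loss + tv_w * tv_loss

c = np.random.randn(64, 256)
s = [(np.random.randn(64, 256), np.random.randn(64, 256))]
print(style_transfer_loss(c, c.copy(), s, np.random.randn(3, 64, 64)))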

For this purpose, the Adam optimizer was utilized. However, the L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) optimization algorithm has been shown to give much better results and faster convergence for optimization based style transfer.

The image below depicts the training process.

[Source[9]]

The gif below is indicative of the learning process of such a generator network. The entire learning process was not captured here, but a snapshot was recorded every 50 iterations. The training was performed on the Intel® Nervana™ DevCloud. In only 900 iterations, it learnt a lot about how to stylize an image.

After the training, we can simply take an Image Transform Net and run the entire process on it.

Dataset

The dataset used to train the generator networks was the MS-COCO dataset[10]. The MS-COCO dataset is a large-scale object detection, segmentation, and captioning dataset. We do not need any of the labels or object information; we only require a large number of images to train the network on. These are our content images. As for the style image, a generator network can only learn one style. This will be discussed later when a new technique is introduced, but for now we are limited to one style per trained network.

With the generator network at its default configuration, we achieved speeds of around 15 frames per second at a resolution of 600x540. However, upon training a smaller generator network, we achieved around 17 frames per second at full VR resolution. This metric includes using the ImageGrab feature along with the stylization; ImageGrab by default maxes out at around 30 frames per second. While this result is impressive, pruning the network led to extreme quality degradation. All of this was performed on a modestly powerful graphics processing unit on a laptop, so we can expect at least 25 frames per second at full VR resolution on a desktop grade GTX 1080 Ti along with a powerful Intel processor.

The stylization on the pruned network is not too great. The result below is obtained after 7000 iterations; much better results can be expected if the training is run for longer. However, the pruned model gives very high network speeds.

Further training of this network will be done. The corner artifacts are also undesirable; however, I have not yet been able to tackle them effectively during training.

The GIF below demonstrates the network stylizing at its default configuration. At half VR resolution, the GIF is a real-time depiction of stylization. Similar results can be expected at full VR resolution for the pruned network.

The current GUI of the application is a simple Tkinter interface which allows you to use the ImageGrab feature to stylize your screen at any resolution. Since the concept is under development, keeping the code transparent is very important, so no proper front-end planning has been done; any such planning would have to adapt to a VR interface. The VR platform also becomes important, and the camera feed source needs to be taken into consideration. With the current test-phase interface, 5 preset stylization options are also provided. The models must be downloaded from this link[11]. Anyone downloading the original code can easily train their own models by using this[12] code base. The OpenCV model, which is webcam compatible, has not been shared yet, as there are plans to integrate it with an external camera (phone) via software like IP Webcam for Android.

4. Adapting to multiple styles

One of the most notable problems with the implementation above is that a new model needs to be trained for every new style we would like. On powerful hardware, it takes around 4-6 hours to train one such model. This is clearly not feasible if we want to build a large scale application that lets people use different styles tuned to their liking.

While maintaining the earlier code for fast stylization, I attempted to work with the Adaptive Instance Normalization technique. This is a great methodology to adapt to any style instantly and deliver promising results. You can find the original implementation here[13].

The idea behind the technique of Adaptive instance normalization shown above is as follows:

VGG Encoder

VGG-19 is an ImageNet model. An image recognition model such as the VGG-net has several convolutional layers followed by fully connected layers. The fully connected layers ‘decompose’ the convolutional layer data into any number of categories. We do not need the fully connected layers, as our aim is not to classify images but to extract information about them. Since the VGG-net is a good image recognition model, we can expect that every layer contains some relevant data about the image. These layer outputs can be ‘imagined’ using the image below. You can clearly see that as the depth increases, more abstract information about the input is extracted.

Thus, if we were to pass an image through the VGG19 network without the fully connected layers, we can imagine that the 512 filters in the VGG19 conv5_1 layer would extract a lot of important information about the image.

More specifically, we can utilize this information to encode the ‘style’ information onto the ‘content’ information. This is exactly what Adaptive Instance Normalization does.

AdaIN

The AdaIN module of the network receives the content ‘information’ x, and the style ‘information’ y. Then it aligns the channel-wise mean and variance of x to match those of y. This is not a trainable module, but rather just a transformation of the content information with respect to the style information. You can see the exact relation in the image which represents the entire model above.
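A minimal NumPy sketch of the AdaIN transformation described in the paper[13], with the mean and standard deviation computed per channel over the spatial dimensions (the 512-channel feature shape below is only an example):

import numpy as np

def adain(content, style, eps=1e-5):
    # content, style: (C, H, W) encoder features
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Align the channel-wise mean and variance of the content to those of the style
    return s_std * (content - c_mu) / (c_std + eps) + s_mu

content = np.random.randn(512, 32, 32)   # e.g., deep VGG-19 features
style = np.random.randn(512, 32, 32)
print(adain(content, style).shape)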

You can imagine a feature channel from the style ‘information’ that detects a certain style of brush strokes. This will have a very high activation, and the same will be scaled to the incoming content ‘information’ x. Thus our stylized encoded data is ready. We now need to decode this information to an actual image.

Decoder

The function of the decoder is essentially to decode the output of the AdaIN module. The decoder mirrors the encoder, with the pooling layers replaced by nearest-neighbor up-sampling. Normalization is not used in the decoder because instance normalization normalizes each sample to a single style, whereas batch normalization does that to a batch of samples.

Thus, this network brings the encoded data to the original size with appropriate stylization.

Implementation

While utilizing the online Torch pretrained models was effective, the network could be made much, much faster if we used more compact networks. The dataset consists of the MS-COCO dataset for the content images. For the style image dataset, the default choice was the Kaggle ‘Painter by Numbers’ dataset, which I was unable to download. Thus, for training purposes, I have requested access to the BAM (Behance-Artistic-Media) dataset[15].

One of my first realizations whilst implementing this model was that when the Adaptive Instance Normalization module was disabled, the decoder generated very low contrast images. Since I could not further train the network until I have the art dataset, I decided to use the PIL Image modules to increase the contrast of the decoder output until it looked natural and close to the input image. This is somewhat necessary because, if no transformations are applied to the encoder data, a perfect decoder would simply give the original image back. This post-processing of the image gave much better stylization results, albeit occasionally a bit oversaturated.
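For reference, a contrast boost of this kind is a one-liner with PIL's ImageEnhance module; the file names and enhancement factor below are placeholders.

from PIL import Image, ImageEnhance

# Boost the contrast of the decoder output until it looks natural.
img = Image.open("decoder_output.png")            # placeholder path
img = ImageEnhance.Contrast(img).enhance(1.5)     # factor > 1 increases contrast
img.save("decoder_output_contrast.png")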

There are two very important observations I made after implementing the entire model: while the stylization was average, the color adaptation was very poor. This might be a fault in my own implementation. However, giving extra weight to the per-channel style mean over the content mean gave me much better color adaptation.

With the pre-trained models, I was unable to replicate the quality of results presented in the research paper. However, we achieved around 8 frames per second on a mobile GTX 1070 at VR resolution; much faster results can be expected from a more powerful desktop grade VR capable device. As for the quality, part of the problem could be that large scale features are not fully transformed by the decoder. I hope to train the decoder to take inputs from a shallower layer of the VGG and apply the AdaIN module there. This could be one of the reasons why I never saw any large scale changes in the image; in my own model, very little style was being transferred fully.

Further reducing the size of the network, either by creating a decoder which takes input from the third convolutional module of the VGG network or by using an entirely new ImageNet model, could give much faster stylization speeds, perhaps even on par with the earlier approach of using a generator network.

The image below depicts the results of stylization by this method. We can clearly see that there is not a lot of change in many of the images. There has also been a slight color shift due to the contrast increase. However, I believe this compensates for the decoder until improvements can be made by training.

Summary and future work

The project succeeds at bringing artistic style transfer to real time and integrates it with a camera/OpenCV module. We explored two different methodologies for stylizing an image, both of which give very promising results. This framework can very easily be extended for use with VR. The project also contains a useful performance analysis of XNOR-nets and their feasibility.

There are several areas which can be improved. We are yet to train the smaller generator network, which gives a pleasant stylization speed of 17-20 frames per second on a laptop grade graphics processing unit. Utilizing a smaller encoder-decoder pair will also allow the user to adapt to any style of their choice. We can also expect to see more large scale changes by changing the layer we use for the AdaIN module, which is what we wanted. Some parameters need to be tuned to get better results from Adaptive Instance Normalization. The generator networks give very good results at a respectable frame rate.

Acknowledgements

I would like to thank the Intel Student Ambassador Program for AI which provided me with the necessary training resources on the Intel DevCloud and the technical support to be able to develop this idea. This project was supported by the Intel Early Innovator Grant.

References

[1] https://software.intel.com/en-us/blogs/2017/09/21/art-em-week-2

[2] https://software.intel.com/en-us/blogs/2017/10/02/art-em-artistic-style-transfer-to-virtual-reality-week-4-update

[3] https://github.com/allenai/XNOR-Net

[4] https://cdn-images-1.medium.com/max/800/1*ZCjPUFrB6eHPRi4eyP6aaA.gif

[5] https://github.com/akhauriyash/XNOR-convolution/blob/master/xnorconv.cu

[6] https://arxiv.org/abs/1701.02096

[7] https://github.com/lengstrom/fast-style-transfer/blob/master/src/transform.py

[8] https://dmitryulyanov.github.io/about

[9] https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf 

[10] http://cocodataset.org/#home

[11] https://drive.google.com/drive/folders/0B9jhaT37ydSyRk9UX0wwX3BpMzQ?usp=sharing

[12] https://github.com/lengstrom/fast-style-transfer

[13] https://github.com/xunhuang1995/AdaIN-style

[14] https://github.com/jonrei/tf-AdaIN

[15] https://bam-dataset.org/

Cannot Find "stdint.h" after Upgrade to Visual Studio 2017


Problem Description

When running the Visual Studio 2017 C++ compiler under the Intel(R) C++ Compiler environment, or when building a Visual Studio 2017 solution that contains mixed projects using both the Intel compiler and the Visual Studio C++ compiler, you may encounter:

fatal error C1083: Cannot open include file: '../../vc/include/stdint.h': No such file or directory

Root Cause

In some header files of the Intel C++ compiler, we need to include particular Microsoft VC++ header files by path. With Microsoft Visual Studio 2015 and older, we could use a relative path, like “../vc”. Starting with Microsoft Visual Studio 2017, the include directory name contains the full VC Tools version number.

For example, the Visual Studio 2017 stdint.h is here:

c:/Program files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.10.24930/include/stdint.h

For Visual Studio 2015, it is here:

c:/Program files (x86)/Microsoft Visual Studio 14.0/VC/INCLUDE/stdint.h

Solution

The workaround is to define the __MS_VC_INSTALL_PATH macro on the command line (-D option), e.g.:

-D__MS_VC_INSTALL_PATH="c:/Program files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.10.24930" 

A permanent resolution still relies on Microsoft's support. Please see the issue we registered on Microsoft's forum:

https://visualstudio.uservoice.com/forums/121579-visual-studio-ide/suggestions/30930367-add-a-built-in-precompiled-macro-to-vc-that-poin

If you have encountered this issue, you are encouraged to vote for this idea at the link above.


Intel and MobileODT* Competition on Kaggle*: 2nd Place Winner


Indrayana Rustandi Employs Convolutional Neural Networks Using AI to Improve the Precision and Accuracy of Cervical Cancer Screening

Editor’s note: This is one in a series of case studies showcasing finalists in the Kaggle* Competition sponsored by Intel and MobileODT*. The goal of this competition was to use artificial intelligence to improve the precision and accuracy of cervical cancer screening.

Abstract

More than 1,000 participants from over 800 data scientist teams developed algorithms to accurately identify a woman’s cervix type based on images as part of the Intel and MobileODT* Competition on Kaggle. Such identification can help prevent ineffectual treatments and allow health care providers to offer proper referrals for cases requiring more advanced treatment.

This case study follows the process used by second-place winner Indrayana Rustandi to build a deep learning model improving this life-saving diagnostic procedure. His approach primarily used convolutional neural networks as the basis of his methods.

Kaggle Competitions: Data Scientists Solve Real-world Problems with Machine Learning

The goal of Kaggle competitions is to challenge and incentivize data scientists globally to create machine-learning solutions in a wide range of industries and disciplines. In this particular competition – sponsored by Intel and MobileODT, developer of mobile diagnostic tools – more than 1,000 participants from over 800 data scientist teams each developed algorithms to correctly classify cervix types based on cervical images.

In the screening process for cervical cancer, some patients require further testing while others don't; because this decision is so critical, an algorithm-aided determination can improve the quality and efficiency of cervical cancer screening for these patients. The challenge for each team was to develop the most efficient deep learning model for that purpose.

A Kaggle Competition Veteran Takes on Cervical Cancer Screening

Indrayana Rustandi is a quantitative analyst in option market making at Citigroup providing data analytics services. Prior to entering the financial industry, he earned his Ph.D. in computer science at Carnegie Mellon University working on machine learning methods for brain imaging.

Rustandi began dabbling in Kaggle challenges about a year ago and credits his machine learning experience with success as a competitor. “But until this particular competition, I felt I did not spend sufficient time and focus for any single one,” he said. “I wanted to make sure that I could devote enough proper time to work on the competition and only make a submission when I am confident that I have given my best.”

Choosing an Approach to Code Optimization

The methods used were largely based on convolutional neural networks, in particular DenseNet-161 and ResNet-152 pre-trained on the ImageNet dataset, to which custom classification layers were added.

The main framework for the solution was PyTorch. On a single GPU, one component of the ensemble takes about six hours on average to train with early stopping. Rustandi’s only feature engineering was to apply the cervix segmentation posted in one of the kernels and to combine segmented image-based models with non-segmented models. “Instead, what I found most useful was the incorporation of the additional data, along with extensive manual filtering of the train+additional data to be used for training – especially because a lot of the images are blurry or might not be relevant at all to the task,” he said.

Software and Hardware Resources Brought into Play

Software used by Rustandi included:

  • Python* 3.5+ in an Anaconda* 4.1 installation
  • PyTorch 0.1.12
  • Torchsample 0.1.2
  • TQDM 4.11.2
  • PyCrayon (optional, for TensorBoard* logging)

For hardware, Rustandi used two workstations, both based on Intel technology. “That way I could run four experiments in parallel,” he explained, “one experiment per GPU.”

Machine Learning Model Design for Training and Testing



Figure 1. A 5-layer dense block with a growth rate of k = 4.

Variations of the model used:

  • Original images, one hidden layer with 192 hidden units prior to the classification layer
  • Original images, two hidden layers each with 192 hidden units prior to the classification layer. “I found that 192 gives the best performance,” he commented.
  • Cropped images using cervix segmentation, one hidden layer with 192 hidden units prior to the classification layer
  • Cropped images using cervix segmentation, two hidden layers each with 192 hidden units prior to the classification layer

For each of the above, he trained three models with different random seeds to get three components of the ensemble. In all, there were 24 models in the ensemble. Ensembling was performed simply by averaging the class probabilities output by each ensemble component, derived from a solution to a previous Kaggle competition.
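A minimal sketch of that ensembling step; the shapes and the randomly generated probabilities below are purely illustrative.

import numpy as np

# probs: class probabilities from each ensemble component, shape (n_models, n_images, n_classes)
probs = np.random.dirichlet(np.ones(3), size=(24, 5))

ensemble_probs = probs.mean(axis=0)                  # average over the 24 components
predicted_type = ensemble_probs.argmax(axis=1) + 1   # cervix Type 1, 2, or 3
print(predicted_type)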

Learning the neural network weights

In regard to the learning algorithm, Rustandi used stochastic gradient descent with momentum. He used a learning rate of 1e-3 for the first five epochs, then a learning rate of 1e-4 with decay.

The model was trained for a maximum of 150 epochs with early stopping: “The stopping criterion is to stop training if the current validation error exceeds the best validation error by 0.1,” he explained. “The model with the best validation error is the one chosen to be part of the ensemble.”
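A hypothetical sketch of that training loop is shown below; train_one_epoch is a stand-in for a real epoch of SGD-with-momentum training, and the learning-rate decay is omitted.

import copy, random

def train_one_epoch(lr):
    # Stand-in for one epoch of training; returns the validation error.
    return random.random()

best_val, best_state, state = float("inf"), None, {}
for epoch in range(150):                    # maximum of 150 epochs
    lr = 1e-3 if epoch < 5 else 1e-4        # learning-rate schedule from the text
    val_error = train_one_epoch(lr)
    if val_error < best_val:
        best_val, best_state = val_error, copy.deepcopy(state)   # keep the best model
    elif val_error > best_val + 0.1:        # stop once the error exceeds the best by 0.1
        break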

Rustandi noticed in the validation set that there were some non-overlapping instances that the models misclassified. “When I looked at the saliency maps, I saw that the base models might focus on different areas of the images when making their decision,” he explained, “hence confirming even further the wisdom of including both base models as part of the ensemble.”

A straightforward way to simplify the methods was to reduce the number of components in the ensemble. He could also derive potential improvements when using the patient IDs and making sure that the validation dataset does not contain patients present in the training data. Finally, simpler models could be achieved by using less complex base models, such as versions of DenseNet and ResNet with fewer parameters.

Data Augmentation

Generally, data augmentation is a useful way to artificially increase the size of the dataset through different transformations, he said. This is particularly helpful in training neural networks because they converge to a better model (as measured by out-of-sample performance) when the dataset used to train them is sufficiently large and representative.
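For example, a couple of simple label-preserving transformations in NumPy (purely illustrative):

import numpy as np

def augment(img, rng=np.random):
    # img: (H, W, C); random horizontal flip plus a small random crop
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]
    top = rng.randint(0, img.shape[0] // 10 + 1)
    left = rng.randint(0, img.shape[1] // 10 + 1)
    return img[top:, left:, :]

print(augment(np.zeros((224, 224, 3))).shape)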

Training and Prediction Time

On a single GPU (either GTX 1070 or GTX 1080) with 16 images in a mini-batch, each epoch took three to four minutes. So, for the maximum number of 150 epochs, it would take 450-600 minutes (7.5-10 hours) to train a particular model, although early stopping can shorten the time.

If done on individual instances, generating predictions can take up to one second for each instance. It was determined to be more efficient to generate predictions for multiple instances simultaneously: on a single GPU, a simultaneous prediction on a batch of instances takes roughly the same amount of time as a prediction on a single instance – up to one second.

Results and Key Findings: ‘The Personal Touch’ Sets This Approach Apart

Because competition images resembled ordinary RGB images, Rustandi found that, with some refinement, pretrained ImageNet models could be used to extract informative features for classification quite well. The biggest challenge was the existence of two available datasets: the REGULAR dataset and the ADDITIONAL dataset. The ADDITIONAL dataset in particular had the potential to help train the networks. But to get there, certain problems had to be addressed in the dataset, namely its inconsistent image quality (blurry, duplicate, or irrelevant to the task). Similar problems possibly affected the REGULAR dataset as well, but to a lesser degree.

In the end, Rustandi chose to examine each image in both datasets manually, flagging those determined to be problematic for exclusion from training. “I think this step in particular gave me quite an edge over the other competitors,” he said, “since a majority of them seemed to end up not using the ADDITIONAL dataset at all, and hence, not availing themselves of the useful information present in this dataset.”

During stage 1, he chose not to probe the leaderboard at all, declining to make any submissions until stage 2 of the competition. “Instead, I decided to rely on my own validation,” he said. “Also, the final models did not incorporate any stage 1 test data.”

  

                   Predicted Type 1   Predicted Type 2   Predicted Type 3
Actual Type 1             39                 39                  9
Actual Type 2             12                196                 57
Actual Type 3              1                 32                127

Each row in the confusion matrix marks the true class while each column marks the predicted class, using data from stage 1. Each element in the confusion matrix specifies the number of cases for the class in the corresponding row that gets classified as the class in the corresponding column. The sum of elements in each row is the number of total cases for each class in stage 1 data.

Rustandi’s patience and hands-on attention to detail paid off, earning him second place in the Kaggle competition and a $20,000 prize.

Learn More About Intel Initiatives in AI

Intel commends the AI developers who contributed their time and talent to help improve diagnosis and treatment for this life-threatening disease. Committed to helping scale AI solutions through the developer community, Intel makes AI training and tools broadly accessible through the Intel® AI Academy.

Take part as AI drives the next big wave of computing, delivering solutions that create, use and analyze the massive amount of data that is generated every minute.

Sign up with Intel AI Academy to get the latest updates on competitions and access to tools, optimized frameworks, and training for artificial intelligence, machine learning, and deep learning.

What to do when Nested Parallelism Runs Amuck? Getting Started with Python module for Threading Building Blocks (Intel® TBB) in Less than 30 Minutes!


Introduction and Description of Product

Intel® Threading Building Blocks (Intel® TBB) is a portable, open-source parallel programming library from the parallelism experts at Intel. A Python module for Intel® TBB is included in the Intel® Distribution for Python and provides an out-of-the-box scheduling replacement to address common problems arising from nested parallelism. It handles coordination of both intra- and inter-process concurrency. This article will show you how to launch Python programs using the Python module for Intel® TBB to parallelize math from popular Python modules like Numpy* and Scipy* by way of Intel® Math Kernel Library (Intel® MKL) thread scheduling. Please note that Intel® MKL also comes bundled free with the Intel® Distribution for Python. Intel® TBB is the native threading library for the Intel® Data Analytics Acceleration Library (Intel® DAAL), which is a high-performance analytics package with a fully functional Python API. Furthermore, if working with the full Intel® Distribution for Python package, it is also the native threading underneath Numba*, OpenCV*, and select Scikit-learn* algorithms (which have been accelerated with Intel® DAAL).

 

How to Get Intel® TBB

To install the full Intel® Distribution for Python package, which includes Intel® TBB, use one of the installation guides below:

Anaconda* Package
YUM Repository
APT Repository
Docker* Images

To install from Anaconda cloud:

conda install -c intel tbb

(It will change to ‘tbb4py’ in Q1 of 2018. Article will be updated accordingly)

 

Drop-in Use with Interpreter Call (no other code changes)

Simply drop in Intel® TBB and determine if it is the right solution for your problem statement! 

Performance degradation due to over-subscription can be caused by nested parallel calls, often unbeknownst to the user. These sorts of “mistakes” are easy to make in a scripting environment. Intel® TBB can be turned on easily for out-of-the-box thread scheduling with no code changes. In keeping with the scripting culture of the Python community, this allows for a quick check of Intel® TBB’s performance recovery. If you already have math code written, you can simply launch it with the “-m tbb” interpreter flag, followed by the script name and any required arguments. It’s as easy as this:

python -m tbb script.py args*

NOTE: See the Interpreter Flag Reference Section for full list of available flags.
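For example, a hypothetical script.py that accidentally nests parallelism (an outer thread pool whose workers each call into Intel® MKL-backed NumPy) might look like the sketch below; run it with and without "-m tbb" to compare.

from multiprocessing.pool import ThreadPool
import numpy as np

def work(_):
    a = np.random.random((2000, 2000))
    return np.linalg.norm(a @ a)        # MKL parallelizes this inner call

if __name__ == "__main__":
    with ThreadPool(8) as pool:         # outer level of parallelism
        print(sum(pool.map(work, range(16))))

Launched as "python -m tbb script.py", the same code lets the Intel® TBB scheduler coordinate the outer threads and the MKL threads instead of oversubscribing the cores.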

 

Interpreter Flag Reference

Command Line Usage
python -m tbb [-h] [--ipc] [-a] [--allocator-huge-pages] [-p P] [-b] [-v] [-m] script.py args*
Get Help from Command Line
python -m tbb --help
pydoc tbb
List of the currently available interpreter flags:

  • -h, --help: Show this help message and exit
  • -m: Executes the following as a module (default: False)
  • -a, --allocator: Enable the TBB scalable allocator as a replacement for the standard memory allocator (default: False)
  • --allocator-huge-pages: Enable huge pages for the TBB allocator (implies -a) (default: False)
  • -p P, --max-num-threads P: Initialize TBB with a maximum of P threads per process (default: number of available logical processors on the system)
  • -b, --benchmark: Block TBB initialization until all the threads are created before continuing the script. This is necessary for performance benchmarks that want to exclude TBB initialization from the measurements (default: False)
  • -v, --verbose: Request verbose and version information (default: False)
  • --ipc: Enable inter-process (IPC) coordination between TBB schedulers (default: False)

 

Additional Links

Intel Product Page

Short Introduction Video

SciPy 2017 proceedings

SciPy 2016 Video Presentation

DASK* with Intel® TBB Blog Post

Persistent Memory Programming—Frequently Asked Questions


Introduction

In this FAQ we answer questions people have asked about persistent memory programming. If you don’t find the information you need here, visit the Intel® Developer Zone’s Persistent Memory (PMEM) Programming site. There you’ll find articles, videos, code samples, and links to other resources to support your work in this exciting technology.

About the Persistent Memory Developer Kit

The Persistent Memory Development Kit (PMDK), formerly known as the Non-Volatile Memory Library (NVML), is a growing collection of libraries and tools designed to support development of persistent memory-aware applications. The PMDK project currently supports ten libraries, targeted at various use cases for persistent memory, along with language support for C, C++, Java*, and Python*. It also includes tools like the pmemcheck plug-in for valgrind, and an increasing body of documentation, code examples, tutorials, and blog entries. The libraries are tuned and validated to production quality and are issued with a license that allows their use in both open- and closed-source products. The project continues to grow as new use cases are identified.

Why was NVML renamed the Persistent Memory Developer Kit (PMDK)?

The reason for the name change and how it affects developers is explained in this blog on the pmem.io website: Announcing the Persistent Memory Development Kit. Pmem.io is the official website for the PMDK.

Basic Persistent Memory Concepts

This section contains frequently asked questions about basic persistent memory concepts.

What is persistent memory?

Persistent memory is:

  • Byte-addressable like memory
  • Persistent like storage
  • Cache coherent
  • Load/store accessible—persistent memory is fast enough to be used directly by the CPU
  • Direct access—no paging from a storage device to dynamic random-access memory (DRAM)

What is DAX?

Direct Access (DAX) is the mechanism that enables direct access to files stored in persistent memory without the need to copy the data through the page cache.

DAX removes the extra copy by performing reads and writes directly to the storage device. Without DAX support in a file system, the page cache is generally used to buffer reads and writes to files. It is also used to provide the pages that are mapped into user space by a call to mmap. For more information, please refer to the article Direct Access for files at The Linux Kernel Archives.

What is the persistent memory-aware file system?

The persistent memory file system can detect whether there is DAX support in the kernel. If so, when an application opens a memory mapped file on this file system, it has direct access to the persistent region. EXT4* and XFS* on Linux*, and NTFS* on Windows*, are examples of persistent memory-aware file systems.

In order to get DAX support, the file system must be mounted with the “dax” mount option. For example, on the EXT4 file system you can mount as follows:

mkfs -t ext4 /dev/pmem0
mount -o dax /dev/pmem0 /dev/pmem

What is the difference between file system DAX and device DAX? When would you use one versus the other?

File system DAX is where an application requires file system support for features like checking for file permissions, access control, and so on.

Device DAX is the device-centric equivalent of file system DAX. It allows memory ranges to be allocated and mapped without the need of an intervening file system.

For device DAX, a user’s application can access data directly. Both paths require the support of the operating system or the kernel.

What is the SNIA* NVM Programming Model for persistent memory?

This Storage Networking Industry Association (SNIA) specification defines recommended behavior between various user space and operating system (OS) kernel components supporting non-volatile memory (NVM). This specification does not describe a specific API. Instead, the intent is to enable common NVM behavior to be exposed by multiple operating system-specific interfaces. Some of the techniques used in this model are memory mapped files, DAX, and so on. For more information, refer to the SNIA NVM Programming Model.

How is memory mapping of files different on byte-addressable persistent memory?

Though memory mapping of files is an old technique, it plays an important role in persistent memory programming.

When you memory map a file, you are telling the operating system to map the file into memory and then expose this memory region into the application’s virtual address space.

For an application working with block storage, when a file is memory mapped, this region is treated as byte-addressable storage. What is actually happening behind the scene is page caching. Page caching is where the operating system pauses the application to do I/O, but the underlying storage can only talk in blocks. So, even if a single byte is changed, the entire 4K block is moved to storage, which is not very efficient.

For an application working with persistent memory, when a file is memory mapped, this region is treated as byte-addressable (cache line) storage and page caching is eliminated.

What is atomicity—why is this important when working with persistent memory?

In the context of visibility, atomicity is what other threads can see. In the context of power-fail atomicity, it is the size of the store that cannot be torn by a power failure or other interruption. In x86 CPUs, any store to memory has an atomicity guarantee of only 8 bytes. In a real world application, data updates may consist of chunks larger than 8 bytes. Anything larger than 8 bytes is not power-fail atomic and may result in a torn write.

What is BTT and why do we need it to manage sector atomicity?

The Block Translation Table (BTT) provides atomic sector update semantics for persistent memory devices. It prevents torn writes for applications that rely on sector writes. The BTT manifests itself as a stacked block device, and reserves a portion of the underlying storage for its metadata. It is an indirection table that re-maps all the blocks on the volume, and can be thought of as an extremely simple file system whose sole purpose is to provide atomic sector updates.

I would like to purchase new servers and devices that support persistent memory programming. What are the hardware and software requirements that must be met to support a persistent memory application?

There are platform and software requirements. You’ll need servers based on the future Intel® Xeon® Scalable processor family platform, code-named Cascade Lake. These platforms are targeted to be delivered in 2018. From a software perspective, you’ll need a Linux or Windows* distribution that supports persistent memory file systems like EXT4, XFS (Linux), and NTFS (Windows), and includes persistent memory device drivers.

What are some of the challenges of adapting software for persistent memory?

The main challenges of implementing persistent memory support are:

  • Ensuring data persistence and consistency
  • Ability to detect and handle persistent memory errors

What is the importance of flushing in persistent memory programming?

When an application does a write, the write is not guaranteed to be persistent until it reaches a power-fail protected domain. There are two ways of ensuring that writes reach that domain: either flush (and fence) after writing, or extend the protected domain to include the CPU caches (this is what eADR does). On platforms with eADR there is no need for manual cache line flushing.

What are the options available for flushing CPU caches from user space?

There are three instructions available:

  1. CLFLUSH flushes one cache line at a time. CLFLUSH is a serialized instruction for historical reasons, so if you have to flush a range of persistent memory, looping through it and doing CLFLUSH will mean flushes happen one after another.
  2. CLFLUSHOPT can flush multiple cache lines in parallel. Follow this instruction with SFENCE, since it is weakly ordered. That's the optimization referred to by the OPT in the opcode name. For more details on the instructions, search for the topic CLFLUSH—Flush Cache Line in the document Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4.
  3. CLWB behaves like CLFLUSHOPT, with the caveat that the cache line may remain valid in the cache.

Also, there is an Advanced Configuration and Power Interface (ACPI) property that tells you if cache flushing is automatic. If not, you’ll need to implement it. To support developers working on generic server platforms, new interfaces are being created to check the ACPI property and enable you to skip the flushes, if possible.

Why do we need transactions when working with persistent memory?

Transactions can be used to update large chunks of data. If the execution of a transaction is interrupted, the implementation of the transactional semantics provides assurance to the application that power-failure atomicity of an annotated section of code is guaranteed.

Why was the PCOMMIT instruction removed?

The reason for removing the PCOMMIT instruction is that today's platforms with non-volatile dual in-line memory modules (NVDIMMs) are expected to support asynchronous DRAM refresh (ADR) to flush the contents from a memory controller when there is a power failure. Since the sole purpose of PCOMMIT was to flush the contents from the memory controller this instruction was deprecated, and references to this instruction have been removed from PMDK. See the Deprecating the PCOMMIT Instruction blog for details.

Can Intel® Transactional Synchronization Extensions (Intel® TSX) instructions be used for persistent memory?

As far as the CPU is concerned, persistent memory is just memory, and the CPU can execute any type of instruction on persistent memory. The problem here is atomicity. Intel® TSX is implemented on the cache layer, so any flush of the cache will naturally have to abort the transaction, and if we don't flush until after the transaction succeeds, the failure atomicity and visibility atomicity may be out of sync.

Do PMDK libraries use Intel TSX instructions?

No.

Basic Questions on Persistent Memory Programming with PMDK

Why use PMDK?

PMDK is designed to solve persistent memory challenges and facilitate the adoption of persistent memory programming. Use of PMDK is not a requirement for persistent memory programming, but it offers developers well-tested, production-ready libraries and tools in a comprehensive implementation of the SNIA NVM programming model.

Does PMDK work on non-Intel NVDIMMs?

Yes.

Is the use of PMDK required to access Intel® persistent memory NVDIMMs?

PMDK is not a requirement but a convenience for adopting persistent memory programming. You can either use PMDK libraries as binaries or you can choose to reference the code for the libraries if you are implementing persistent memory access code from scratch.

What is the difference between SPDK and PMDK?

PMDK is designed and optimized for byte-addressable persistent memory. These libraries can be used with NVDIMM-Ns in addition to Intel persistent memory NVDIMMs powered by 3D XPoint™ storage media.

The Storage Performance Development Kit (SPDK) is a set of libraries for writing high-performance storage applications that use block IO.

The difference is that PMDK is focused on persistent memory and SPDK is focused on storage, but the two sets of libraries work fine together if you happen to need them both at the same time.

What language bindings are provided for PMDK?

All the libraries are implemented in C, and we provide custom bindings for libpmemobj in C++, Java* and Python*. At this time, Java and Python bindings are works-in-progress.

I am not using transactions. Is there a library I can use from PMDK to access persistent memory?

Yes. libpmem is a simple library that detects the types of flush instructions supported by the CPU, and uses the best instructions for the platform to create performance-tuned routines for copying ranges of persistent memory.

Is there a library that supports transactions?

Yes. There are three: libpmemobj, libpmemblk, and libpmemlog.

  • libpmemobj provides a transactional object store, providing memory allocation, transactions, and general facilities for persistent memory programming.
  • libpmemlog provides a pmem-resident log file. This is useful for programs like databases that append frequently to a log file.
  • libpmemblk supports arrays of pmem-resident blocks, all the same size, that are atomically updated. For example, a program keeping a cache of fixed-size objects in pmem might find this library useful.

Can I use malloc to allocate persistent memory?

No. PMDK provides an interface to allocate and manage persistent memory.

How are PMDK libraries tested? Are they tested on real NVDIMMs?

These libraries were functionally validated on persistent memory emulated using DRAM. We are in the process of testing with actual hardware.

Is PMDK part of any Linux* or Windows* distributions?

Yes. PMDK libraries, but not tools, are included in Linux distributions from Suse*, Red Hat Enterprise Linux*, and Ubuntu*, and the list may grow in the future.

For Microsoft Windows, PMDK libraries, but not tools are included in Windows Server 2016 and Windows® 10. For details, see the pmem.io blog PMDK for Windows.

To get the complete PMDK, download it from the PMDK GitHub repository.

Does PMDK support ARM64*?

Currently only 64-bit Linux* and Windows* on x86 are supported.

Are there examples of real-world applications using PMDK?

Yes. For example, we have added persistent memory support for Redis*, which enables additional configuration options for managing persistence. In particular, when running Redis in Append Only File mode, all commands can be saved in a persistent memory-resident log file, instead of a plain-text append-only file stored on a conventional hard disk drive. Persistent memory resident log files are implemented in the libpmemlog library.

To learn more about the persistent memory implementation of Redis, including build instructions, visit the Libraries.io GitHub site pmem/redis.

PMDK—libpmem

When should I use libpmem versus libpmemobj?

libpmem provides low-level persistent memory support. If you’ve decided to handle persistent memory allocation and consistency across program interruptions yourself, you will find the functions in libpmem to be useful. Most developers use libpmemobj, which provides a transactional object store, memory allocation, transactions, and general facilities for persistent memory programming. It is implemented using libpmem.

Do 3D XPoint™ storage-based devices like Intel® Optane™ memory module 32 GB PCI Express* M.2 80 mm MEMPEK1W032GAXT support libpmem?

No. PMDK is designed and optimized for byte-addressable persistent memory devices only.

What is the difference between pmem_memcpy_persist and pmem_persist?

The difference is that pmem_persist does not copy anything, but only flushes data to persistence (out of the CPU cache). In other words:

pmem_memcpy_persist(dst, src, len) == memcpy(dst, src, len) + pmem_persist(dst, len)

PMDK—libpmemobj

Why does libpmemobj run slowly on SSDs?

PMDK is designed and optimized for byte-addressable persistent memory, while SSDs are block based. Running libpmemobj on an SSD therefore requires a translation between byte and block addressing, which adds time to each transaction. It also requires moving whole blocks from the SSD to memory and back for reading and flushing writes.

How does an application find objects in the memory mapped file when it restarts after a crash?

Libpmemobj defines memory mapped regions as pools and they are identified by something called a layout. Each pool has a known location called root, and all the data structures are anchored off of root. When an application comes back from a crash it asks for the root object, from which the rest of the data can be retrieved.

What do the following terms mean in the context of libpmemobj?

  • Object Store: Object store refers to treating blobs of persistence as variable-sized objects (as opposed to files or blocks).
  • Memory pool: Memory mapped files are exposed as something called memory pools.
  • Layout: A string of your choice used to identify a pool.

Does libpmemobj support local and remote replication?

Yes, libpmemobj supports both local and remote replication through use of the sync option on the pmempool command or the pmempool_sync() API from the libpmempool(3) library.

Support for Transactions

Is there support for transactions that span multiple memory pools where each pool is of a different type?

There is no support for transactions that span multiple memory pools where each pool is of the same or a different type.

Multithread Support

How are pmem-aware locks handled across crashes? Or, how is concurrency handled in libpmemobj?

libpmemobj keeps track of the generation number that gets increased each time a pmemobj pool is opened. When a pmem-aware lock is acquired, such as a PMEM mutex, the lock is checked against the pool's current generation number to see if this is the first use since the pool was opened. If so, the lock is initialized. So, if you have a thousand locks held and the machine crashes, all those locks are dropped because the generation number is incremented when the pools are open, and it is decremented when the pools are closed. This prevents you from having to somehow find all the locks and iterate through them.

Thread Safety

Are pool management functions thread safe? For example, thread1 creates/opens/closes file1 and thread2 creates/opens/closes file1. Are these actions thread safe?

No. Pool management functions are not thread safe. The reason is the shared global state that we can't put under a lock for runtime performance reasons.

Is pmem_persist thread safe?

pmem_persist's job is to make sure that the passed memory region gets out of CPU caches, nothing more. It doesn't care about what's stored in this region. Store and flush are separate operations. If you want to store and persist atomically, you have to do the locking around both operations yourself.

PMDK—Pools

Pool Creation

Is there a good rule of thumb to determine what percentage of a pmemobj pool is usable or how big the pool should be if I want to allocate N objects of specific size?

libpmemobj uses roughly 4 kilobytes per pool plus 512 kilobytes of static metadata per 16 gigabytes. For example, a 100 gigabyte pool requires 3588 kilobytes of static metadata (7 × 512 KB of static metadata plus 4 KB of pool overhead). Additionally, each memory chunk (256 kilobytes) used for small allocations (<= 2 megabytes) uses 320 bytes of metadata, and each allocated object has a 64-byte header.

I am trying to create a 100+ GB persistent memory pool. How do I use pmempool if I need a large pool?

One way is to ensure that you have persistent memory reserved before you use the pmempool with the command create. For more details, type the command man pmempool-create.

Create a blk pool file of size 110GB

$ pmempool create blk --size=110G pool.blk

Create a log pool file of maximum allowed size

$ pmempool create log -M pool.log

Is there support for multiple pools within a single file?

No. Having multiple pools in a single file is not supported. Our libraries support concatenating multiple files to create a single pool.

Expanding Existing Pool Size

What is a good way to expand a persistent memory pool; for example, when I begin with a single 1 GB mapped file and later the program runs out of persistent memory?

We are often asked the question of whether the pool grows after creation. No, but you can use a holey file to create a huge pool, and then rely on the file system to do everything else. This usage model doesn't seem to satisfy most people as it is not how traditional storage solutions work. For details, see Runtime extensible zones in the PMEM GitHub* repository.

What is a good way to expand a libpmemobj pool; for example, when I begin with a single 1 GB mapped file and later the program runs out of persistent memory?

When using a file on a persistent memory-aware file system, all our libraries rely on file system capability to support sparse files. This means that you just create a file as large as you could possibly want, and the actual storage memory use would be only what is actually allocated.

However, with device DAX, that is no longer an option, and we are planning on implementing a feature that would allow pools to grow automatically in the upcoming release.


Deleting Pools

Is there a way to delete a memory pool via libpmemobj?

No. The pmemobj_close() function closes the memory pool and does not delete the memory pool handle. The object store itself lives on in the file that contains it and may be re-opened at a later time.

You can delete a pool using one of the following options:

  • Deleting the file from the file system that you memory mapped (object pool).
  • Using the pmempool "rm" command.

Persistent Memory over Fabrics

What is the purpose of librpmem and rpmemd?

librpmem and rpmemd implement persistent memory over fabric (PMOF). Persistent memory over fabric enables replication of data remotely between machines with persistent memory.

librpmem is a library in PMDK that runs on the initiator node, and rpmemd is a new remote PMDK daemon that runs on each remote node that data is replicated to. The design makes use of the OpenFabrics Alliance (OFA) libfabric application-level API for the backend Remote Direct Memory Access (RDMA) networking infrastructure.

PMDK—Miscellaneous

Debugging

How do you enable libpmemlog debug logs?

Two versions of libpmemlog are typically available on a development system:

  • The non-debug version, linked using the -lpmemlog option. It is optimized for performance, skips checks that impact performance, and never logs any trace information or performs any run-time assertions.
  • The debug version, found under /usr/lib/PMDK_debug (or /usr/lib64/PMDK_debug), which contains run-time assertions and trace points. To use it, set the environment variable LD_LIBRARY_PATH to /usr/lib/PMDK_debug or /usr/lib64/PMDK_debug, depending on the debug libraries installed on the system.

The trace points in the debug version of the library are enabled using the environment variable PMEMLOG_LOG_LEVEL.

Glossary

ADR (Asynchronous DRAM Refresh)

A platform-level feature where the power supply signals other system components that power-fail is imminent, causing the Write Pending Queues in the memory subsystem to be flushed

NVDIMM

A non-volatile dual in-line memory module. Intel will release an NVDIMM based on 3D XPoint Memory Media toward the end of 2018.

Persistent Domain or Power-fail Protected Domain

When storing to pmem, this is the point along the path taken by the store where the store is considered persistent.

WPQ (Write Pending Queue)

Write pending queues are part of the memory subsystem.

Resources

Distributing with Python*: Quick Guide to Using Intel® MPI (updated for Python 3.6)


 

Introduction

Intel® MPI is an optimized library for standard MPI applications, designed to deliver high-speed and scalable cluster messaging on Intel® based architecture. For Python users, the Intel® MPI runtime is included in Intel® Distribution for Python (IDP) and can be used in Python through mpi4py* - a Python binding package for MPI.

This article will discuss the key benefits of using Intel® MPI over standard MPI libraries, Intel® MPI’s simplicity in integrating with Python, and illustrate a Python usage example using Intel® MPI libraries.

Highlights of Intel® MPI

Optimized MPI application performance

  • Application-specific tuning
  • Automatic tuning
  • Support for Intel® Xeon Phi™ Processor
  • Support for Intel® Omni-Path Architecture Fabric

Lower latency and multi-vendor interoperability

  • Industry leading latency
  • Performance optimized support for the fabric capabilities through OpenFabrics*(OFI)

Faster MPI communication

  • Optimized collectives

Sustainable scalability up to 340K cores

  • Native InfiniBand* interface support allows for lower latencies, higher bandwidth, and reduced memory requirements

More robust MPI applications

  • Seamless interoperability with Intel® Trace Analyzer and Collector

Intel® MPI with Python*(3.6)

In the Intel® Distribution for Python, mpi4py provides Python bindings for the MPI standard and wraps around Intel® MPI, enabling an optimized message passing interface with no changes to how scripts are executed from the command line.

The section below implements a simple master-slave Python routine (mpi-sample.py) that sends partial arrays from all slave nodes to the master node.

As mentioned, Intel® MPI is offered in Intel® Distribution for Python which should be installed prior to using Intel® MPI.

Intel® Distribution for Python is available for free for Windows*, Linux* and macOS*. The recommended install method is through Anaconda’s distribution cloud.

1. Install Intel® Distribution for Python with Anaconda

conda create -n IDP -c intel python=3.6

2.  Activate Intel® Distribution for Python on Linux

source activate IDP

3.  Activate Intel® Distribution for Python on Windows

activate IDP

4.  Install Intel® MPI

conda install -c intel mpi4py  # installs the Python binding package that wraps Intel® MPI

5. Create MPI program (Below is an example of a basic Master-Slave python routine – mpi-sample.py)

from mpi4py import MPI
import numpy as np
import sys

# MPI initialization
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

# Each rank generates its own partial array of the requested shape
numRows = int(sys.argv[1])
numCols = int(sys.argv[2])
seeded = np.random.RandomState(42)
partialArray = seeded.rand(numRows, numCols)

if rank == 0:
    fullArray = [0] * size
    print("Master begins receiving partial arrays....")
    if size > 1:
        for i in range(1, size):
            # Receive (rank, size, name, partial array) from any slave and
            # store the partial array in the sender's slot
            srcRank, srcSize, srcName, partial = MPI.COMM_WORLD.recv(source=MPI.ANY_SOURCE, tag=1)
            fullArray[srcRank] = partial
    print("Master completes receiving all partial arrays")
else:
    # Every slave sends its partial array (with its rank, size, and name) to the master
    MPI.COMM_WORLD.send((rank, size, name, partialArray), dest=0, tag=1)

6. Run on terminal

Linux terminal

mpirun -n 4 python mpi-sample.py 100000 10

Windows terminal

mpiexec -n 4 python mpi-sample.py 100000 10
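
To confirm that mpi4py is actually running on top of the Intel® MPI runtime rather than another MPI installation on the system, one quick check (a minimal sketch; the file name check-mpi.py is arbitrary) is to print the library version string reported by the MPI implementation:

# check-mpi.py: report which MPI library mpi4py was built against
# With Intel® MPI active, the printed string should mention "Intel(R) MPI Library"
from mpi4py import MPI

if MPI.COMM_WORLD.Get_rank() == 0:
    print(MPI.Get_library_version())

Run it the same way as the example above, for example mpirun -n 2 python check-mpi.py on Linux.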

Seamless Edge-to-Cloud IoT Integration Speeds Time to Market


Executive Summary

Organizations that rely on the Internet of Things (IoT) for critical business processes are looking for ways to merge data silos, reduce security risks, and eliminate duplicate infrastructure. A fully integrated edge-to-cloud IoT infrastructure solution can help improve business insights that provide a true competitive advantage. But implementing such a solution can be complex; organizations need a planned approach to help the transition run smoothly.

Intel and Google have worked together to deliver a standards-based approach to help IoT developers, OEMs, independent software vendors (ISVs), and system integrators (SI) develop seamless solutions. With a joint reference architecture built on the Intel® Internet of Things (Intel® IoT) Platform and the Google Cloud Platform* (GCP*), IoT providers can gain the following capabilities and benefits:

  • Seamless data ingestion. With a standards-based reference architecture, data is easier to collect and devices are easier to control.
  • End-to-end security. The architecture is designed to protect device hardware.
  • Easy device onboarding. New devices can be automatically and securely provisioned to the platform.
  • Robust scalability. With Intel and Google technologies, organizations can scale rapidly on demand.
  • Better insights. GCP’s analytics infrastructure with Intel’s analytics-at-the-edge capabilities can provide better insights for faster decision making, quicker time-to-market, and the opportunity to provide new services and solutions.

The Intel® IoT Platform and GCP joint reference architecture provides a comprehensive approach for connecting the device layer to the network layer and into the cloud.

Figure 1. The joint Intel and Google reference architecture makes connecting the
Internet of Things (IoT) from edge-to-cloud easier, with a focus on security at every layer.

Introduction

The Internet of Things (IoT) is speeding data collection from connected devices and sensors, resulting in an explosion of new devices and sensors that are generating massive volumes of data. This data can help organizations make smarter decisions and bring new products and services to market faster. Gartner Research estimates that by 2020, 25 billion enterprise-owned Internet-connected things across the globe stand to generate up to USD 2 trillion in economic benefit.1 This presents tremendous opportunities for IoT solution providers, but developing an edge-to-cloud solution can be complex. 

The technical challenges of IoT implementations often come from multiple IoT solutions dedicated to a variety of use cases within a single organization. These use cases can include monitoring chemical levels in manufacturing processes, occupancy-dependent lighting in offices, and retail security cameras, or monitoring available parking. Multiple implementations also lead to a lack of interoperability between devices and equipment from different manufacturers. Successful IoT solutions require a deep understanding of infrastructure, security, integration, and interoperability from edge to cloud. Although IoT implementations can be complex, organizations and solution providers can eliminate much of the complexity and meet the growing IoT demand with integrated IoT solutions from Intel and Google.

Solution Architecture

The Intel® Internet of Things (Intel® IoT) Platform and the Google Cloud Platform* (GCP*) each provide capabilities and benefits that help IoT developers, OEMs, independent software vendors (ISVs), and system integrators (SIs) develop industry-standard, seamless solutions.

Solution Overview and Benefits

Together, the Intel IoT and GCP joint reference architecture seamlessly transmits data from sensors, actuators, and other endpoint devices to the Google* cloud. A clearly defined, standard reference architecture that details edge, network, and cloud components provides the following:

  • Seamless data ingestion and device control for improved interoperability.
  • Robust security for end-to-end data and device protection.
  • Automated onboarding for simplified deployment of security-enabled devices.
  • Robust scalability with cloud-based infrastructure.
  • Customer insights through GCP’s analytics infrastructure.
  • Data monetization through additional services and applications.

This joint reference architecture discusses:

  • Intel IoT Platform. This illustrates the edge components, hardware security, and processors, as well as device provisioning, monitoring, and control.
  • Google Cloud Platform (GCP). This illustrates the cloud services, including data ingestion, flow, storage, and analytics.

The joint reference architecture is followed by an implementation overview, as well as a logistics and asset management use case example in Appendix A: Logistics and Asset Management Use Case.

Intel Internet of Things (Intel IoT) Platform

The Intel IoT Platform (Figure 2) includes a family of Intel® products. The ecosystem provides a foundation for easily connecting devices and delivering trusted data to the cloud. The benefits include:

  • A broad array of devices. Intel’s ecosystem of original device manufacturers (ODMs) offers a wide range of devices and sensors built on Intel® technology.
  • Security-focused solutions. Intel technology is designed for increased security at every layer, and includes seamless device preconfiguration capabilities.
  • Enhanced registration and management. With Wind River Helix Device Cloud*, device management and updates are seamlessly controlled from a central point in the cloud.

Figure 2. The Intel® IoT Platform connects a wide variety of devices to the cloud, using security-focused hardware and software solutions.

Google Cloud Platform (GCP)

GCP provides a security-enabled, cost-effective, and high-performance infrastructure in the cloud hosted through Google’s globally distributed data centers (Figure 3). Managed services provide access to this infrastructure for an overall solution. The benefits include:

  • Fully managed services. Google manages the setup and maintenance of the overall private infrastructure so customers can focus on building solutions.
  • Integrated development experience. GCP provides a wide range of services for an integrated, end-to-end developer experience.
  • Full control of the environment. Developers have full control of their computing environment, from data ingestion to presentation, through APIs in multiple languages.
  • Broad scale and reach. GCP offers outstanding scale and reach, resulting in a computing and data platform that is uniquely positioned to address the challenges of IoT.

Figure 3. Google Cloud Platform* provides developers with full control of the environment without having to set up and manage the infrastructure.

Solution Architecture Details

The Intel IoT and GCP joint reference architecture (Figure 4) utilizes three primary types of components and solutions: Intel® edge components, such as hardware security and processors; Intel® device and security management, such as device provisioning, monitoring, and control; and GCP cloud services, such as data ingestion, dataflow, storage, and analytics.

Intel IoT Platform Components

Edge components

  • Wind River Linux*. With built-in certifiable security capabilities and portability, Wind River* provides an IoT embedded Linux platform for hardware.
  • Intel® Security Essentials. Hardware root-of-trust capabilities such as secure boot, a trusted execution environment (TEE), and Intel® Enhanced Privacy Identifier (Intel® EPID) provide security to the platform at the hardware level.
  • Intel® processors. Intel® Quark™ system on a chip (SoC) and the Intel® Atom™, Intel® Core™, and Intel® Xeon® processor families provide high performance and scalability.

Device and security management

  • Wind River Helix Device Cloud*. Helix Device Cloud is an IoT portfolio of services and technologies that enable faster time to market; it provides device monitoring, control, software updates, registration, attestation, and security-enabled deployment at scale.
  • Intel® Secure Device Onboarding. Using the privacy-preserving properties of Intel EPID (an IoT identity standard), onboarding protocols, and a rendezvous service, devices can be automatically registered with their owner's GCP account when first powered on.

GCP Components

GCP components may vary depending on implementation and are grouped into five primary functions:

Data ingestion

  • Cloud IoT Core*. Cloud IoT Core is a fully managed service that allows you to easily and securely connect, manage, and ingest data from millions of globally dispersed devices. Cloud IoT Core, in combination with other services on Google Cloud Platform, provides a complete solution for collecting, processing, analyzing, and visualizing IoT data in real time to support improved operational efficiency.
  • Cloud Pub/Sub*. Cloud Pub/Sub provides a fully managed, real-time messaging service that allows developers to send and receive messages between independent applications.
  • Cloud Stackdriver Monitoring*. Cloud Monitoring provides visibility into the performance, uptime, and overall health of cloud applications.
  • Cloud Stackdriver Logging*. Cloud Logging allows developers to store, search, analyze, and monitor log data and events, as well as to send alerts.

Pipelines

  • Cloud Dataflow*. Cloud Dataflow is a unified programming model that provides managed services for developing and executing a wide range of data processing patterns including extract, transform, load, and batch and continuous computation. Cloud Dataflow frees developers from operational tasks, such as resource management and performance optimization.

Storage

  • Cloud Storage*. GCP provides an object store solution for excellent IoT performance and price.
  • Cloud Datastore*. Cloud Datastore is a NoSQL database that is ideally suited for mobile and web endpoints.
  • Cloud Bigtable*. Cloud Bigtable is designed for workloads that require higher speed and lower latency, such as analytics. 

Analytics

  • Cloud Dataflow*. Dataflow provides programming primitives, such as powerful windowing and correctness controls, that can be applied across both batch- and stream-based data sources.
  • BigQuery*. BigQuery is a fully managed, petabyte-scale, low-cost enterprise data warehouse for analytics.
  • Cloud Dataproc*. Cloud Dataproc is a managed service for Apache Spark* and Apache Hadoop* that provides open source data tools for batch processing, querying, streaming, and machine learning.
  • Cloud Datalab*. Cloud Datalab is an interactive tool for exploring, analyzing, and visualizing data with a single click.

Application and presentation

  • App Engine*. App Engine is a platform-as-a-service (PaaS) solution used to develop applications without concern for the underlying infrastructure.
  • Container Engine*. Container Engine is a managed Kubernetes* solution that provides industry-specific solutions, such as fleet management.
  • Compute Engine*. Compute Engine is an infrastructure-as-a-service (IaaS) product that offers VMs on a variety of guest operating systems.

Figure 4. The Intel® IoT Platform and GCP* joint reference architecture details the connections for seamless device onboarding and ownership privacy.

Implementation Overview

The process of connecting devices, integrating data, and managing software upgrades follows these steps (Figure 4):

Onboarding Devices

1. During manufacturing, the silicon provider embeds Intel EPID credentials in a TEE of the processor. The ODM uses an open source toolkit from Intel to create a globally unique identifier and assign a URL for the Intel® Secure Device Onboard (Intel® SDO) service, an automated onboarding service from which the device gets its new owner information. It then generates an ownership proxy that is used to cryptographically verify ownership of the device by GCP.

2. Upon purchase, along with the purchase receipt, an ownership proxy for the device is generated. The owner imports the ownership proxy into GCP, which then signals to Intel SDO.

3. When the device is powered on for the first time, it contacts the Intel SDO service, which redirects it to the IP address provided by its new designated GCP owner.

4. The GCP trust broker and gateway verify the device through its Intel EPID signature and ownership proxy, and then register the device for management with GCP and the Wind River Helix Device Cloud.

5. The Wind River Helix Device Cloud distributes the device certificate provided by GCP and configures the pub/sub topic subscriptions on the gateway.

6. The GCP IoT software development kit (SDK) on the gateway authenticates to GCP using the device certificate and establishes a data path to GCP.

Collecting and Integrating Data

7. Business applications on the gateway acquire data from connected sensors through a number of supported protocols, such as Z-Wave*, ZigBee*, and Bluetooth® technology.

8. The GCP IoT SDK on the gateway transmits sensor data to GCP through MQTT and HTTP messaging protocols (a minimal publishing sketch follows this list).

9. Data messages are routed, processed, stored, and made available for enterprise integration.
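
For illustration only, the sketch below shows how a backend service might publish a gateway sensor reading to a Cloud Pub/Sub* topic with the google-cloud-pubsub Python client. The project and topic names are hypothetical, and in this reference architecture the gateway itself sends data through the GCP IoT SDK and Cloud IoT Core rather than calling Pub/Sub directly.

# Minimal Cloud Pub/Sub publishing sketch (hypothetical project and topic names)
# Requires the google-cloud-pubsub package and application default credentials
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-iot-project", "sensor-readings")

reading = {"device_id": "gateway-01", "temperature_c": 4.2, "humidity_pct": 61}
future = publisher.publish(topic_path, data=json.dumps(reading).encode("utf-8"))
print("Published message id:", future.result())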

Managing Devices and Software Updates

10. Application software managers push updates to the Wind River Helix Device Cloud using APIs.

11. The Wind River Helix Device Cloud prepares signed RPM packages and pushes them to the gateway.

12. The management agent on the gateway of the Intel IoT Platform upgrades the software.

Summary

Intel and Google’s end-to-end joint reference architecture for IoT offers a robust, security-enabled, yet simplified solution that gives IoT developers the tools and services to create high-performance solutions. With security-enabled, scalable interoperability, the Intel IoT and GCP joint reference architecture can provide the building blocks for any IoT application in any industry.

The joint reference architecture is reusable, preconfigured, and prevalidated. It can securely connect devices and deliver trusted data with interoperable hardware and software from the edge to the cloud. Each layer is designed with a focus on security and scalable hardware built on Intel technology, and optimized for performance across workloads.

Find the solution that is right for your organization. Contact your Intel representative or visit intel.com/securedeviceonboard.

Appendix A: Logistics and Asset Management Use Case

Having visibility to where shipments are at any given time is a significant pain point for supply chain businesses. Market research shows that approximately USD 60 billion worth of cargo is stolen during transit each year.2 Additionally, roughly one third of the food produced in the world for human consumption every year gets lost or wasted.3 The ability to trace the journey of a package, such as high-value or perishable goods, in real time can transform how companies manage, track, report, and secure products through logistics (shown in Figure A1). Table A1 illustrates an IoT solution using the Intel® IoT and Google Cloud Platform* (GCP*) joint reference architecture.

Figure A1. The Intel® IoT Platform and GCP* joint reference architecture provides visibility into the location of goods while in transit, helping transportation businesses reduce lost cargo.

 

Table A1. Technology Components for the IoT Shipment Visibility Use Case

Component

Description

Smart Sensors

Multiple battery-operated smart sensors used within a shipment communicate information (temperature, humidity, shock, tilt, fall, pressure, light, proximity) using IEEE 802.15.4 radio to the IoT gateway.

IoT Gateway using Intel® IoT Gateway Technology

Fixed or mobile battery-operated gateways running the Wind River Linux* OS are located on the shipping container, trucks, or pallets. 

Wind River Helix Device Cloud*

SaaS-based device management software remotely manages the fixed and mobile IoT gateways.

Intel® Secure Device Onboarding

Cloud-based preconfigured software securely onboards fixed and mobile IoT gateways.

Google Cloud Platform*

Cloud IaaS and PaaS components (e.g., Cloud Pub/Sub*, Cloud Dataflow*, Cloud Storage*, Firebase*, and App Engine*) ingest, process, and analyze data received from the smart sensors through the IoT gateways, using the Pub/Sub messaging protocol.

 

1   gartner.com/smarterwithgartner/the-internet-of-things-and-the-enterprise

2   aic.gov.au/media_library/publications/tandi_pdf/tandi214.pdf

3   fao.org/save-food/resources/keyfindings/en 

     All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

     Cost reduction scenarios described are intended as examples of how a given Intel- based product, in the specified circumstances and configurations, may affect future costs and provide cost savings.  Circumstances will vary. Intel does not guarantee any costs or cost reduction.

     Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at intel.com

     No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

     Bluetooth is a trademark owned by its proprietor and used by Intel Corporation under license.

     ©Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Atom, Core, Quark, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*  Other names and brands may be claimed as the property of others.

     

Intel and MobileODT* Competition on Kaggle*: 3rd Place Winner


Team GRXJ Seeks to Make a Difference, Using AI to Improve the Precision and Accuracy of Cervical Cancer Screening

Editor’s note: This is one in a series of case studies showcasing finalists in the Kaggle* Competition sponsored by Intel and MobileODT*. The goal of this competition was to use artificial intelligence to improve the precision and accuracy of cervical cancer screening.

Abstract

More than 1,000 participants representing 800 data scientist teams developed algorithms to accurately identify a woman’s cervix type based on images as part of the Intel and MobileODT* Competition on Kaggle*. Such identification can help prevent ineffectual treatments. And, it allows health care providers to offer proper referrals for cases requiring more advanced treatment.

This case study follows the process used by the third-place winning team, GRXJ. They pooled their respective skill sets to create an algorithm that would improve this life-saving diagnostic procedure.

Kaggle Competitions: Data Scientists Solve Real-world Problems Using Machine Learning

The goal of Kaggle competitions is to challenge and incentivize data scientists globally to create machine-learning solutions for real-world problems in a wide range of industries and disciplines. In this particular competition – sponsored by Intel and MobileODT, developer of mobile diagnostic tools – more than 1,000 participants representing 800 data scientist teams each developed algorithms to correctly classify cervix types based on cervical images.

In the screening process for cervical cancer, some patients require further testing while others don't; because this decision is so critical, an algorithm-aided determination can significantly improve the quality and efficiency of cervical cancer screening for these patients. The challenge for each team was to develop the most efficient deep learning model for that purpose.

Team Members Inspired by Kaggle Competition’s Meaningful Purpose

The members of Team GRXJ were drawn to participate by the prospect of helping to save lives. And, they saw that they could further their knowledge of cervical cancer while also acquiring new Deep Learning techniques. The team is named for the first initials of its four members:

  • Gilberto Titericz Jr., a San Francisco-based data scientist at Airbnb.
  • Russ Wolfinger, Director of Scientific Discovery and Genomics at SAS Institute in Cary, N.C.
  • Xulei Yang, a research scientist at IHPC, A*STAR, Singapore.
  • Joseph Chui of Philadelphia, a robotics engineer and software developer.

At the start of the competition, GRXJ team members entered as individual competitors. After several individual submissions, Wolfinger and Yang – former teammates in a previous Kaggle competition – agreed to collaborate. They then invited Titericz to join, to leverage his considerable machine learning ensembling skills. Finally, they added Chui to increase the diversity of their ensembled models.

“Team-up is a good strategy to win Kaggle competitions,” commented Titericz. “Choosing team members with different solutions helps improve the model performance when ensembling.” Collectively, the teammates’ time spent on the project ranged from 40 hours over the course of two months, to several hours a day nearing the end of the competition.

Choosing an Approach to Code Optimization

In the early stages, team members each developed and optimized their own single models independently with minimal code dependence:

Russ Wolfinger started by probing the leaderboard, so he had the public test set labels to train on a few weeks before the model submission deadline. For most models, U-Net or Faster R-CNN was used for object detection. Crops were created, bagged image classification was applied with ResNet-50 and VGG16 backbones, and hill climbing was then used for ensembling, based on a two-fold cross-validation with the public train and test sets as the splits. Wolfinger used additional training data in only one of his models and included one custom model from Chui, plus one more based on U-Net from an additional contributor.

Xulei Yang divided the task into two stages:

  • Stage 1: cervical detection by using yolo models on full images. This entailed generating label txt files, training various yolo models, cropping cervical ROIs, re-training yolo models, post-filtering ROIs, and generating Stage 1 output which consisted of 4760 rois in additional set, 1780 rois in training set, and 512 rois in tst_stg1 set. All cropped rois are resized to 224x224 for further processing.
  • Stage 2: cervical classification by Resnet-50 on cropped cervical rois images. These steps included preprocessing using various data augmentation techniques, selection of CNNs for training, two sets of models, and final results.

Joseph Chui used a convolutional neural network in his training model. Since the competition didn't judge how fast the code runs on a specific platform, little was done to fine-tune time performance. Because it still took hours to train the model, care was taken when setting the values of the hyperparameters. In his model, Chui traded for a faster training time by reducing the amount of augmented data and increasing the learning rate. Owing to a limited amount of memory on the GPU, the mini-batch size was constrained to a smaller number. Each picture in the model was automatically cropped to a square-like ROI before training or predicting. He fine-tuned the pre-trained models VGG16, VGG19, Xception, and Inception-v3. Data augmentation x8 was used for training, which included stage 1's training and labeled test data. Keras on TensorFlow* was used as the programming interface to Python*. Multiple pre-trained models in Keras were fine-tuned, and the predictions from each model were averaged with equal weights before being ensembled with the other models in the team.
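
The team's training code was not published with this case study, but a minimal sketch of the fine-tuning pattern Chui describes (a pre-trained VGG16 base, a new classification head for the three cervix types, 224x224 inputs) might look like the following in Keras. The layer sizes and hyperparameters here are illustrative assumptions, not the team's actual settings.

# Illustrative Keras fine-tuning sketch (not the team's actual code)
from keras.applications import VGG16
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Load ImageNet weights without the original classifier head
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the convolutional base so only the new head trains at first
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)         # assumed head size
outputs = Dense(3, activation="softmax")(x)  # three cervix types

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(...) would then run on the augmented, cropped ROI images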

Gilberto Titericz Jr. developed two Keras VGG-16 models by fine-tuning pre-trained weights. He then used a five-fold cross-validation strategy to generate out-of-fold (OOF) predictions (on the training set) and averaged test predictions (on the test set) as csv files, and finally ensembled the csv files by using the Nelder-Mead optimization approach to find the best weights. His ensemble method was also used for the final model selection, based on performance, and for ensembling the results.
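
As a rough illustration of that last step, the sketch below uses SciPy's Nelder-Mead optimizer to search for blending weights that minimize log loss on out-of-fold predictions. The variable names and the simple weighted arithmetic blend are assumptions made for clarity, not the team's implementation.

# Illustrative Nelder-Mead search for ensemble weights over out-of-fold predictions
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

def blended_log_loss(weights, oof_preds, y_true):
    # oof_preds: list of (n_samples, n_classes) probability arrays, one per model (assumed)
    weights = np.abs(weights)
    weights /= weights.sum()                  # normalize to a convex combination
    blend = sum(w * p for w, p in zip(weights, oof_preds))
    return log_loss(y_true, blend)

def find_weights(oof_preds, y_true):
    x0 = np.ones(len(oof_preds)) / len(oof_preds)   # start from uniform weights
    result = minimize(blended_log_loss, x0, args=(oof_preds, y_true), method="Nelder-Mead")
    best = np.abs(result.x)
    return best / best.sum()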

Summary: Bringing It All Together as a Team

Their independent approach allowed each of them to improve their models within constrained schedules. Before the final stage, each single model of the solution was trained and predicted on two pre-agreed sets of labeled data. The prediction results of individual models were evaluated using hill climbing algorithms to yield optimal blending weights. These same blending weights were used to average the final predictions.

The team's ultimate solution is based on predictions from six separate models – four from Wolfinger, and one each from Chui and Yang – that were blended to achieve the final predictions. To train the deep learning models, they used the following algorithms: TensorFlow Faster R-CNN, U-Net, YOLO, and Keras fine-tuned VGG16, VGG19, Xception, ResNet-50, and Inception v3.

Six models must be trained, from three folders:

  • Models 1-4: r1, r2, r3, and r4. Faster R-CNN for region detection, cropping, and image classification. The most important features are the central regions of the images. Tools used included TensorFlow, MXNet, and Keras. It took about one day to train all models.
  • Model 5: x1. YOLO v2 models are used for cervical object detection. The detected ROI images are then cropped and resized, and finally used for image classification based on fine-tuned ResNet-50 models. Tools include Darknet and Keras.
  • Model 6: j1. ROI extraction using OpenCV, combining the prediction results using fine-tuned Keras pre-trained models (VGG16, VGG19, Inception V3, and Xception).

The blending strategy chosen to mix the predictions is a weighted geometric average according to this formula: P = exp( ( log(r1) + 3*log(r2) + log(r3) + log(r4) + 3*log(x1) + log(j1) ) / 10 )
The weights are chosen based on a hill climbing algorithm.
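
Expressed in code, the same weighted geometric average is a few lines of NumPy; this sketch assumes each of r1, r2, r3, r4, x1, and j1 is an array of class probabilities loaded from the corresponding submission file.

# Weighted geometric average of the six model predictions, using the weights above
import numpy as np

def geometric_blend(r1, r2, r3, r4, x1, j1, eps=1e-15):
    preds = [np.clip(p, eps, 1.0) for p in (r1, r2, r3, r4, x1, j1)]  # avoid log(0)
    weights = [1, 3, 1, 1, 3, 1]                                       # total weight = 10
    log_sum = sum(w * np.log(p) for w, p in zip(weights, preds))
    return np.exp(log_sum / sum(weights))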

Data Augmentation

According to Chui, data augmentation x8 was used to help in training the data, which included stage 1’s training and labeled test data. No data in “additional data” 7z files was used.

Training and Prediction Time

Convolutional neural networks were used in Chui's training model. Fine-tuning of the pre-trained models VGG16, VGG19, Xception, and Inception-v3 was performed. The prediction results from each model were averaged with equal weights before being ensembled with the team's other models.

Simple Features and Methods

The team found that it is feasible to include or exclude pre-trained models. Instead of using four pre-trained models, one can cut the training and prediction time in half by excluding two of the models. It is also possible to improve performance by including other fine-tuned pre-trained models.

Results and Key Findings

Upon further investigation of their separate models, as well as their combinations, team members found that:

  • Single models like their j1 or x1 can achieve Private leaderboard (LB) around 0.82, only slightly worse than their final ensemble score in Private LB around 0.81.
  • A blend of r1^0.15 * j1^0.45 * x1^0.40 can achieve around 0.78 in Private LB, next to the best score (#1) 0.76 in Private LB.

This indicates a more advanced stacking method would achieve even better results in the competition.

Titericz noted that a simple geometric average choosing uniform weights for all models would perform better in private LB (0.80417 vs 0.80830) than the weights chosen by the hill climbing algorithm. “It makes sense since a simple geometric average is more robust against overfitting,” he said.

GRXJ made little use of additional data but did use bounding box annotations that were provided on the forum. They probed the leaderboard early in the competition and so had the test set labels. This freed the team to explore different cross-validation schemes and not be in a hurry to re-train all models during the final week. “We were able to avoid chasing leaderboard rank the whole time and focus on good fitting models,” said Wolfinger.

Learn More About Intel Initiatives in AI

Intel commends the AI developers who contributed their time and talent to help improve diagnosis and treatment for this life-threatening disease. Committed to helping scale AI solutions through the developer community, Intel makes AI training and tools broadly accessible through the Intel® AI Academy.

Take part as AI drives the next big wave of computing, delivering solutions that create, use and analyze the massive amounts of data that are generated every minute.

Sign up with Intel AI Academy to get the latest tools, optimized frameworks, and training for artificial intelligence, machine learning, and deep learning.

Meet Team GRXJ

The four members of team GRXJ, named for each of their first initials, began this Kaggle challenge as individual competitors before pooling their respective strengths. They are:

Gilberto Titericz Jr., a San Francisco-based data scientist at Airbnb, holds a bachelor's degree in Electronics Engineering and an MSc in Electrical Engineering.

Russ Wolfinger, who has a PhD in Statistics, works as Director of Scientific Discovery and Genomics at SAS Institute in Cary, N.C.

Xulei Yang is a research scientist in IHPC, A*STAR, Singapore. He holds a PhD in Electrical and Electronics Engineering, and is an IEEE senior member. His current research focus is on deep learning for biomedical image analysis.

Joseph Chui of Philadelphia, a robotics engineer and software developer for 15 years, is focused on developing applications using GPUs and 3D graphics.

Lone Echo* Shatters Gravity Limitations for VR Success


It's still early in the virtual reality (VR) gold rush, but game makers around the world are gaining traction. From independents and top studios alike, VR titles are entering the marketplace with more quality, better stories, and far fewer of the side effects that plagued early titles. According to TechRadar, HTC's Viveport* subscription service now offers over 150 VR titles. In October 2017, the Oculus* Head of Content, Jason Rubin, told RoadtoVR.com that nine titles have made more than USD 1 million in the Oculus store.

Clearly, the momentum is building. New hardware running at 90 frames per second (FPS) gets much of the credit, as HTC Vive*, Oculus Rift*, and Razer* Open Source Virtual Reality (OSVR) all pump up the power. Still, to make a great VR title, developers must adhere to some basic guidelines to make their games immersive and fun.

Figure 1. Ready at Dawn Studios* explores the interaction possible between an AI-powered robot and Captain Olivia Rhodes as they mine the rings of Saturn.

One stunning example is Lone Echo* from Ready at Dawn* Studios in Irvine, California. It's a space-based narrative adventure game featuring exploration and puzzle solving. Players assume the role of Jack, a service android helping to mine the rings of Saturn in the year 2126. Working with Captain Olivia Rhodes, the player's goal is to use technical ingenuity and futuristic tools to overcome a variety of challenges and obstacles while investigating a mysterious spatial anomaly.

Lone Echo won the Game Critics Award for Best VR Game at the Electronic Entertainment Expo 2017 (E3 2017). Reviewers gave it high marks for storytelling and its superior VR components such as locomotion, UI, and interaction. Clearly, the game is a good example of what developers should do with their projects. While the major VR headset makers offer their own best practices for VR game development, Intel has created a document based on the latest lab sessions, testing, and UX theory to encapsulate what it takes to make a great VR game. This case study describes how Lone Echo succeeded by following Intel's guidelines before the document was published.

Ready at Dawn Gets to Work

The Ready at Dawn team first envisioned a VR title in 2014. Founded in 2003 and composed of former members of Naughty Dog* and Blizzard Entertainment*, they had already produced Daxter* and the God of War* series, and were wrapping up The Order: 1886*. They weren't initially sold on VR, but they saw possibilities after meeting with Sony* and Oculus. "We didn't want to simply start making a VR game," according to Ru Weerasuriya, a Ready at Dawn co-founder. In a 2017 interview for RoadtoVR.com, he recalled, "We wanted to determine what about VR we needed to do differently. And that's how it started: We created a movement model to figure out how to break certain boundaries in VR."

Figure 2. The Lone Echo* game creators purposefully chose to make the player's view from the service android perspective rather than from the human captain.

The team used its own Ready at Dawn (RAD) engine, which could hit the 90-FPS threshold and adapted well to VR. Adamant about putting the player into the role of the service android, rather than the human, they experimented continually with a movement system that offered better locomotion in zero gravity than any other game.

The result makes the feeling of complete weightlessness in zero gravity a vital part of the game. Players no longer depend on a complex arrangement of keys, buttons, clicks, and toggles to move through their environment. Instead, they can grasp, push off, glide, and climb, navigating around obstacles and solving puzzles.

Figure 3. Experiencing zero gravity is one of the novelties of Lone Echo*.

Follow the Guidelines

Before driving the effort to create the Intel Guidelines for Immersive Virtual Reality Experiences, Susan Michalak studied anthropology at the University of Washington and then earned an advanced degree in human interface design. She consulted or worked at Apple*, Tektronix*, and Hewlett-Packard*, then settled at Intel in 2011 as a UX strategist focused on customer innovation in the Software and Services Group (SSG).

When asked to create VR heuristics, Michalak engaged Thug, LLC, a Portland, Oregon agency specializing in user research and experience design. "We wanted to understand what makes VR really immersive or enjoyable," Michalak said. "So, we tested a variety of games with a number of people, and we asked a lot of questions." The test subjects played the games and then completed exhaustive surveys that revealed their reactions.

After some analysis, the SSG team and Thug's researchers found several key aspects that correlated strongly with a sense of immersion and a sense of enjoyment. They also conducted a thorough literature review of existing data in the space, pulling in research from Microsoft*, Oculus, and HTC, among others. Based on this research, the team developed a set of guidelines to help developers create truly immersive experiences. A detailed review reveals that Lone Echo excels in three key areas of the guidelines:

  • Provides a safe physical foundation
  • Enhances basic realism
  • Goes beyond novelty

When players slide on the headset for the first time, they find an eerily familiar VR setting that's often an amazingly accurate representation of the real world. Lone Echo can transport players to a distant planet, challenge them to a task that would often be too daunting in reality, and teach them something unimaginable. Rarely do players find such a riveting experience that encompasses all three, but Lone Echo defies the typical shooter game storyline, adding depth and immersive potential.

Figure 4. The captain's quarters feature workout gear to prevent muscle atrophy.

Physical Foundation Means Safety First

While one of the most important things to provide players in VR is physical safety, this game also enforces a concept of social safety.

Physical safety. Although the game requires whirling upper body motions and long-distance travel, players don't need a large playing area to enjoy the experience. The game comes to them, so there are no special requirements for staying safe. Players can ensure stability by wearing comfortable, flat shoes or no shoes at all.

Social safety. Captain Olivia Rhodes is a welcome change from the pattern of overly sexualized game characters. She's an intelligent, courageous space explorer, single-handedly running a space station on the rings of Saturn. Captain Rhodes is therefore a good role model, rather than merely eye candy.

A key factor in implementing social safety is discouraging negative behavior toward Captain Rhodes. While listening and responding to her questions through the arm-mounted chat tablet, players often find that her arm is the nearest thing to grab onto for stability or to keep from floating away. Captain Rhodes will respectfully brush a player's hand away if they grip her for too long, as if to indicate that she finds the grasping gesture rude. This encourages respect for all characters, whether they are human or computer-controlled.

VR sickness prevention. Cybersickness is commonly recognized as the feeling of motion sickness or general queasiness caused by shortcomings in a VR experience's immersive qualities. In the early days of VR, this was a major deterrent. In fact, cybersickness affected 80 percent of VR users when the technology was first released.

Since then, technology has improved, and Intel has established several criteria for game developers to follow to significantly reduce or avoid sickness for VR participants. Good system specs are key. Carreira, founder and owner of InstaGeeked.com, says that, "To prevent motion sickness, players need to be able to look around [at] 90 FPS with no breaks, no crashes, and no latency, and that's it."

Joey Davidson at TechnoBuffalo.com reported from the Montreal International Game Summit 2015. He spotted a slide in a presentation from Vernon Harmon, the PlayStation* senior technical account manager at Sony Computer Entertainment America (SCEA), stating that, "Going fast is not optional" when it comes to VR. Harmon explained that "any dropped frames can cause discomfort, not just disruption of presence." This is more than just breaking immersion with choppy frame rates. This is about making players sick and uncomfortable, something that could easily hamstring VR.

Lone Echo does a noteworthy job of preventing cybersickness for its players. Here is how they did it, according to the Intel Guidelines for Immersive Virtual Reality Experiences:

Accurate viewpoint. The viewpoint of the character must be accurate and never veer off in a direction that is unnatural or unprovoked. The controllers in Jack's hands enable players to pivot without exposing them to a spinning motion. Instead, the world fades and brightens when players are in the new position, neatly avoiding motion sickness.

Figure 5. The Kronos II* bridge controls mining operations from remote dig sites to industrial processing plants.

Acceleration. In Lone Echo, acceleration is performed using one of two methods or both at the same time. On each of the player's wrists is a small jetpack that is initiated through a button on each hand's controller. These jetpacks are reminiscent of an underwater propulsion tool that deep-sea divers use to navigate deeper into the ocean. They are slow, but act as a very useful rudder in changing direction while gliding through zero gravity.

The second way to accelerate and move around the space station is to simply grab and push off of various floating objects, walls, Captain Rhodes, or anything else within reach. The game has no dead objects, so players can grab on to any beam or door handle that is most convenient. This makes the game a more immersive experience because players don't have to wonder if they can use an object.

When moving around the first few times, keeping control of direction is difficult. It is a challenging skill to master, but once learned it is the most fun way to move around. If players must stop, the game offers two options. The first is to push both jetpack toggle switches down, bringing players to a soft yet effective halt. The second is to grab on to a nearby sturdy object, which can result in the player's legs realistically swinging limply around them.

The wrist jetpacks help players gain control and tactically move around obstacles. When pushing off too fast for the first time, it is common for players to physically stumble as the character slams into a space station wall. When moving about in this manner, players must keep their feet in a wide stance, because their minds are convinced that they are truly floating in zero gravity.

Figure 6. Acceleration takes time to conquer, but after a while, players truly feel as though they are moving through space.

Believable Immersion is Crucial

For a game to provide the most believable immersion into the virtual world, it must remain coherent and consistent. Lone Echo succeeds at this in several ways.

Believable zero gravity. Nothing feels abnormal or unbelievable, which the Intel VR guidelines emphasize as one of the most important factors in making the VR experience more immersive.

Clear glass lines the walls, giving players a glimpse into the ominous, star-speckled environment. The glass material is easily identifiable by its glare, subconsciously letting players know they are safely inside an airtight space station. The detail of the ship's interior structure has a bare-bones feel. When speaking with Captain Rhodes, she maintains excellent eye contact, speaks in a smooth British accent, and has an eerily real face. Players may stop to get a closer look at the magnificent details of her rosy, flushed cheeks, blonde hair casually pushed back, and the slight fatigue that can be read in her eyes.

Graphical integrity. When accelerating or moving around the station, several speeds can be used, according to the method chosen to move from point A to point B. Whether players use wrist jetpacks or climb slowly around, the frame rate is consistent and believable.

To help prevent cybersickness and avoid distraction by unimportant details, the Intel VR guidelines advise that the sky be full of detail while the floor is muted and calm. Lone Echo follows this guideline particularly well, with its strategic placement of lighting and darker, less detailed background and flooring. The impeccable and ominous detail in the skyline sticks with players as a harbinger of challenges to come.


Figure 7. Even the welding torch in Lone Echo has a unique sound, adding to the immersive experience.


Realistic sound. The Intel Guidelines for Immersive Virtual Reality Experiences state that realistic sound is crucial in making a game more immersive. In Lone Echo, the wrist jetpacks make a distinct noise, similar to a welding torch, which indicates that the jets kicked in. The game also adheres to the guidelines by making it obvious where sounds are coming from.

Taking into consideration the noise difference between voice and the impact of an object, vibrations are paired with the sound of grabbing a box or running into a wall. When a new hatch is opened, the rattling of the station can be felt through small vibrations in the controllers and an accompanying sound. This combination of engaged senses makes for a more convincing and real-life VR experience. All the while, a haunted house-like soundtrack plays quietly in the background, setting a chilling and futuristic scene.

Be Creative — Go Beyond Novelty

VR players are looking for new experiences that take advantage of VR, not just ported games from a 2D world into 3D. Lone Echo succeeds with several interesting additions.

Challenging plot line. As the Intel VR guidelines suggest, novelty is crucial for appeal and interest in a game, but it can only take the game so far before the player loses their initial zeal. Lone Echo uses impressive graphics and a unique map to build a plot line that keeps players hooked. The tasks are thought-provoking, forcing players to explore the corners of the current room, and they require a good handle on the skills learned in various tutorials along the way.

For example, an early challenge initially seems like a simple task: gather loose cargo and return it to its designated storage hold. But upon finding and grabbing the first cargo box, players realize they then have only one hand to help navigate back to the hold, forcing them to innovate with newly acquired skills. Small challenges like this help players sharpen their gliding and grasping skills, ultimately making them feel more comfortable in the zero-gravity terrain. While chipping away at a list of tasks, an underlying fear exists of something from the galaxy threatening the safety of the space station, keeping players dedicated and needing to learn about the mission's looming fate.

Figure 8. Lone Echo offers soothing colors and beautiful graphics, enabling extensive engagement.

Tablets and tutorials. Players can view assigned tasks on a holographic tablet that is easily accessible from their wrist. The motion to open the list is similar to the swipe-to-open motion on a smartphone touch pad. A familiar vibration occurs when pressing the buttons, which makes it easy to use with the robotic dendrites.

For each new skill or tool introduced, the player completes a simulation test, which is essentially a tutorial. The game proceeds smoothly, with slight escalations in required skills so the player gains confidence and expertise as they conquer each challenge.

Extensive engagement. When reaching out to objects in the space station, players can engage with nearly everything. From cargo boxes, corners of structural beam supports, fellow robots, and latches on doors, almost everything has a surface that players can grab and push off from. This is crucial in grounding within the space station while floating around. With practice, players learn to grab the best spot on a box, for example, and push it at the proper angle. The push-offs become less calculated and more second nature.

Conclusion

When played on a system with the right specs—a fast graphics processing unit, but more importantly, a powerful CPU such as an offering from the new Intel® Core™ X-series Processor family—players can take full advantage of the video, audio, and artificial intelligence for a fresh, new experience. By sticking to emerging best practices, Lone Echo has set a high bar for VR games to follow.

As Michalak says, "Ready at Dawn just happens to be really good at what they do, and their work helped us understand what makes a truly great VR game. But for developers who don't have that background or that strength, the Intel Guidelines for Immersive Virtual Reality Experiences can be very useful."

When a game concept is rare, impossible, dangerous, or expensive, it typically makes for a riveting experience, and Lone Echo offers a taste of all of the above. Best of all, players can explore living in zero gravity without having to endure freeze-dried food and that long flight to Saturn.


Working in VR to Create VR Content


It's been a few years since Virtual Reality (VR)* re-entered the public consciousness and reignited the matrix-like fantasies of the 90s and early 2000s. There are now dozens of devices for experiencing VR, and VR platforms have seen their first wave of content released into the wild. However, in spite of all the progress made on the consumer side, the developer side of VR has remained mostly unchanged. Some new SDKs have been released, but the process for creating content for VR is still exactly the same. Though this problem isn't discussed much, it has a huge effect on the quality and quantity of content that is created for the medium.

In this article, I discuss some of the challenges involved in creating content for VR from a developer’s perspective and look at some ways that our team is changing the process we use so that we can create content more efficiently, more effectively, and more collaboratively.

Flat Tools for 3D Content

For many years now, we’ve been using digital media to tell stories, play games, train ourselves, and communicate. Industries have been built and methodologies have been defined to support the creation of applications that occupy nearly every part of our lives. However VR represents a different paradigm for human-computer interaction, and much of the existing knowledge and infrastructure around creating digital media content doesn’t apply when creating content for VR.

The root of this change stems from the addition of the third spatial dimension that VR enables. We’ve been digital flatlanders and have now been given eyes to see and work within a space that was previously inaccessible.


From The Fourth Dimension: A Guided Tour of the Higher Universes by Rudy Rucker

But we’re still using the tools that were designed for creating flat media, and it turns out that creating 3D media is quite different. Scale transforms from the screen to the head-mounted display, and detail (or lack of) is much more apparent, accessible, and dynamic. Psychology is also a factor, because VR content evokes feelings of personal space and a sense of presence. Interactions with the world happen in three dimensions instead of two. Even visual language is completely different, and things like the “rule of thirds” and simple composition get turned on their heads.


Smoothing out sharp, pointy roots for our music game Shapesong. In VR getting near these felt pretty uncomfortable.

Here are some of the common pain points we’ve encountered when creating VR content:

  • Objects appear larger in VR than they do when designing on a flat screen. This means that we have to do a lot of jumping into VR and then back to the screen to get things feeling right.
  • Details that are not usually visible become very apparent in VR. For example, users in VR can get inches away from objects, revealing low-resolution textures or lack of depth. They can also look at an object, including its insides, from angles that are completely overlooked on a screen.
  • Objects with hard edges seem not only unnatural, but dangerous.
  • User interfaces require a completely different design than those for flat media. Photoshop* or other image editing software is much less effective as a result.
  • Interactions with VR content and environments require different design, which has yet to be discovered or well defined.
  • VR devices isolate the user from the world. This means that when reviewing features or content for VR with a team, you cannot just look over someone’s shoulder to see how things look.

While any of these pain points are immediately apparent the moment you see them in VR, it’s easy to overlook them when designing on a screen. All of this amounts to a “translation” process that has to happen for VR designers, and this translation takes time, effort, and resources. In lieu of VR content creation tools, our goal is to reduce or eliminate the flat-to-3D translation so that we can work, as we have been doing with flat media, in the same space our content will be consumed in.

Our Process

As described above, the biggest challenge in creating VR content is working around the flat-to-3D translation that must happen, since content creation tools are designed for flat interfaces. Our approach tries to eliminate any translation by shifting, wherever possible, our design process from traditional flat tools to 3D tools. Because no 3D tools have yet been released that are designed specifically for creating professional, production-ready content, we’ve had to get creative with what is currently available.

The following describes the current, high-level process we use.


Our VR content design process: paper -> VR prototyping/mockup -> polish/technical setup

Starting on “paper” is still the fastest way to create an initial concept. This approach is familiar to everyone, can be shared easily, and is supported by a plethora of tools. For us, paper designing means sketching onto notepads, writing descriptions for look and feel, gathering reference, and blocking out initial concepts. This is how we do ideation.


A concept sheet for the Tree Harp instrument in Shapesong. We created the concept here first before bringing it into Tilt Brush* to block out size and layout.

Because tools designed for creating production-ready 3D content do not yet exist, for final authoring we must still rely on their flat counterparts. Though some tools can get us close, they’re still in nascent stages and they leave a lot to be desired. Models created in VR tend to be a bit rough and unoptimized, while tools for other parts of the content creation process (rigging, animating, UV mapping, texturing, and so on) simply don’t exist.

So while we aren’t yet designing content entirely in VR, we are doing a lot of prototyping, mockups, and blocking with VR, which has helped to eliminate the flat-to-3D translation process.

The three tools we’re using are Tilt Brush*, Blocks*, and Medium*. Each of these tools has different strengths and weaknesses that lend themselves to creating certain types of content or design. Below, Jarrett Rhodes and Brandon Austin, two of our team members, share their personal experiences using these tools.

Tilt Brush* by Google

Jarrett Rhodes, Digital Artist


A Tilt Brush* sketch of the Tree Harp instrument from Shapesong. With this, we were able to quickly create designs for a number of layouts to play test in VR.

Using Tilt Brush is a great way to explore and conceptualize flow lines for interactive set pieces for VR. Working with lines makes it act like a sketchbook for getting down ideas and visualizing the placement of elements in a virtual space, which can then be exported as a guide to help you lay out your assets when creating them in a traditional digital workspace.

Benefits

With Tilt Brush, it’s easy to start throwing down flow lines to get a feel for the interactive space intended for player use when it comes to set pieces. Also, working with lines makes it easy to add instructional lines and notes that describe how a concept is intended to function when another user enters this scene. This ability is great for communication, because it can be difficult to explain specific intent when someone else is in the scene with a headset, which creates a barrier between you and them.

When it comes to concepting a scene as opposed to an object, dynamic brushes allow you to add particle effects that can help push the overall feel of the scene further. Dynamic brushes also react to any music you have playing in the background, which helps make the scene feel more alive as it pulsates to the music.

Drawbacks

Because you primarily work with lines, it can be difficult to create volume. I found that creating a wireframe first and then painting around in a paper-mache fashion with a wider-textured brush can get the job done, but it is a bit more time consuming. Also, when creating lines, you cannot erase parts of a line if you overdraw it or want to clean it up. You have to undo the entire line and then try to re-create it on the next or subsequent passes, which is time consuming.

Although you have many brush types and colors to choose from, exporting a scene with a lot of linetype variety can muddy up the scene with a plethora of shaders when you import it into a traditional digital media program such as Maya*. However, usually the intention is not to clean up the mesh, but to use it as a guide to build around to ensure interactive items translate better when used in VR.

Finally, if you don’t already have an idea of what you want to create before jumping in, it can be difficult to get started. Currently, you cannot import any reference images to bypass this issue, which is a sizable disadvantage. Although you can import scenes from the social network, it is unlikely that you’ll find precisely what you’re looking for.

Takeaways

Although it can be difficult to create a full-bodied asset from just line work within VR, especially if you don’t know in advance what you want to create, using Tilt Brush to create concepts within a virtual space has its advantages. It’s one thing for a 3D asset to look good from a 2D screen, but once a player can interact with it on a 1:1 scale, concepting out a rough sketch for more accurate proportions is quite useful.

Blocks* by Google

Jarrett Rhodes, Digital Artist

Google Blocks allows you to easily create simple geometric volumes to create low-poly assets in VR and block out rough concepts. Working with simple shapes is a more minimalist approach, which forces the user to think more about the core design and structure to create an asset rather than focus on the minute details. Even if you’ve never before worked with primitive shapes in any other 3D modeling program, Blocks is quite easy to learn.

Benefits

Not only are you working with primitive geometrical shapes, you are also using just six simple tools: shape, stroke, paint, modify, grab, and erase. Before creating a shape or stroke you can select how many faces or sides it has, essentially choosing shapes from pyramids and triangles to near-spheres and octagons. You can also move each shape you make individually, or even select a few and move them as a group. Being able to separate and edit an individual shape makes it easier to perfect a single object, and having a three-axis grid to snap to allows you to organize and edit objects more linearly.

Blocks is a great VR design tool to use if you’re making low-poly assets, because of its foundation of primitive low-poly shapes. However, you aren’t restricted to making only primitive shapes. With the modify tool you can add extra edges to round out your designs, or use the stroke tool to make curving geometric shapes such as tubes. The modify tool also lets you move, rotate, scale, and extrude faces, vertices, and edges, which follows a workflow similar to box modeling. Because everything is low-poly, any creation you export will probably require minimal cleanup in another 3D modeling program.

Also, Blocks is free!

Drawbacks

If you want to make something more detailed to compare to current-gen game models, Blocks might not be capable enough for you. While it can still be useful as a concepting tool to use as a reference for a higher poly model, other VR design tools, such as Medium (see the next section), cater more to that workflow.

Another minor drawback is that you can’t quickly select the tool you want to use. You have to select the one you want by grabbing it from the menu or palette rather than being able to swipe the trackpad as you can with some other VR tools. And even though using the grid can help you do more precise editing, without the grid, modifying shapes can be a bit difficult and give awkward results when you want a linear modification.

Takeaways

If you are a low-poly artist, using Blocks is ideal. It follows a similar workflow of working with primitives, and being able to bring your creations into other programs with minimal cleanup means you can spend more time creating assets. Regardless of your skill set, it’s easy to learn, and though selecting tools might be a little clunky, for a free VR creation tool you can’t go wrong.

Medium* by Oculus

Brandon Austin, 3D Artist


A Medium* sculpt of a virtual avatar we worked on. We were able to take the final version of this and make only minor modifications in Maya before using it in one of our apps.

Oculus Medium is a fast and intuitive way to generate high-quality 3D sculpts from within a VR environment. The use of a natural interaction system that seeks to mimic real-world sculpting eliminates the barrier of clunky mouse and keyboard commands and 2D abstraction. The typical workflow begins with spraying clumps of clay into a basic form, and then carving away fine details until finally applying a rough paint job and materials.

Benefits

Medium lets you create basic shapes and forms in a matter of minutes. The gestural nature of holding virtual sculpting tools and moving them in 1:1 3D space means assets will have a more natural look, and creating complex shapes such as spirals or curves becomes trivial. This ability also leads to models having more of a “sketchy” feel as the strokes pick up minute hand movements.

The seamless scaling of the workspace by pulling apart/pushing hands together allows for fine-detail work as well as an overview of the whole mesh. There is also the advantage of being able to walk around and view the mesh naturally in a 3D environment.

The built-in materials and paint tools allow for rapid iteration of different colors and textures without bringing the model into a separate tool. You can use multiple resolution levels to block out a basic mesh, and then up-sample it to add detail. Saving “stamps” at a high resolution allows you to easily fill an object with reusable samples like leaves, scales, or even skin pores.

Medium offers a variety of sharing tools for exporting videos, images, and recordings that other Medium users can play back. This is a great feature for portfolios or dev logs when accompanied by Medium’s fantastic user feedback and sound effects.

Drawbacks

When using Medium, it’s important to realize that having to work within a VR headset for an extended period of time is more stressful on the eyes and arms than sitting at a traditional modeling workstation PC. Because of this increased strain, frequent breaks are absolutely required for most users.

The screen resolution on most VR headsets leads to a more blurry appearance than on a traditional screen. Resolution has a big impact on the visibility of small details and picking out minor errors with the sculpt.

Reference image importing is supported, but setting up orthographic references for front, side, and top has proven difficult. This problem is slightly alleviated by importing low poly base meshes from another program as a “stamp.”

It’s easy to accidentally rotate or scale the workspace, which can throw off the mirror tool or the pottery-style turntable. Separating the mesh into layers helps segment the piece and minimizes the impact of user errors.

Finally, Medium uses a proprietary file format, but can export as FBX in multiple resolution levels. However, the exported FBX will likely still require cleanup in a separate modeling program. I have found that using the retopology tools in Zbrush* worked the best, since it is already tailored to working with large vertex count files, but Maya should suffice for most models.

Takeaways

Medium is a great tool for throwing a rough idea into a 3D space and seamlessly iterating on it. Since hard-surface modeling can be a challenge, mechanical models are still easier to create in traditional tools like Maya or 3ds Max*. For artists used to working with physical media or working in ZBrush, this could be the next step in speeding up the creative process and experimenting with a unique workflow.

Looking Forward

Working in VR is an inevitable outcome for many professions and trades. The power of the medium to give us complete control over our environments and the things inside them, as well as communicate with each other over great distances naturally, will be too great to ignore once the medium has matured. We’re creating digital domains for ourselves in which we will be gods.

But we’re not quite there yet. We still need to create the tools that will help us build the worlds we will inhabit. Without them, we have to think creatively about the best ways to leverage what we have now, so that we can build what we’ll use tomorrow.

People like to say that you need to “get your VR legs” before you can really enjoy VR properly. I think that sentiment also reflects our current state and that of VR itself. As virtual explorers we are evolving and just emerging from a lifetime of ocean dwelling with our flat media. There is a whole new digital, 3D world to discover, and we’ve only just begun to adjust to walking in it. The possibilities for this frontier are profound and completely open, and they lie just beyond the horizon. All we have to do is pick a direction and start walking.

Boost Your C++ Applications with Persistent Memory – A Simple grep Example

Download Sample Code

Introduction

In this article, I show how to transform a simple C++ program (in this case, a simplified version of the famous UNIX command-line utility grep) in order to take advantage of persistent memory (PMEM). The article starts with a description of what the volatile version of the grep program does, with a detailed look at the code. I then discuss how you can improve grep by adding persistent caching of search results. Caching improves grep by (1) adding fault tolerance (FT) capabilities and (2) speeding up queries for already-seen search patterns. The article goes on to describe the persistent version of grep using the C++ bindings of libpmemobj, which is a core library of the Persistent Memory Development Kit (PMDK) collection. Finally, parallelism (at file granularity) is added using threads and PMEM-aware synchronization.

Volatile Grep

Description

If you are familiar with any UNIX*-like operating system, such as GNU/Linux*, you are probably also familiar with the command-line utility grep (which stands for globally search a regular expression and print). In essence, grep takes two arguments (the rest are options): a pattern, taking the form of a regular expression, and some input file(s) (including standard input). The goal of grep is to scan the input, line by line, and then output those lines matching the provided pattern. You can learn more by reading the grep manual page (type man grep on the terminal or view the Linux man page for grep online).

For my simplified version of grep, only the two aforementioned arguments are used (pattern and input), where input should be either a single file or a directory. If a directory is provided, its contents are scanned to look for input files (subdirectories are always scanned recursively). To see how this works in practice, let’s run it using its own source code as input and “int” as pattern.

The code can be downloaded from GitHub*. To compile the code from the root of the pmdk-examples repository, type make simple-grep. The libpmemobj library must be installed on your system, as well as a C++ compiler. For the sake of compatibility with the Windows* operating system, the code does not make any calls to Linux-specific functions. Instead, the Boost C++ library collection is used (basically to handle filesystem input/output). If you use Linux, Boost C++ is probably already nicely packaged for your favorite distribution. For example, in Ubuntu* 16.04, you can install these libraries by doing:

$ sudo apt-get install libboost-all-dev

If the program compiles correctly, we can run it like this:

$ ./grep int grep.cpp
FILE = grep.cpp
44: int
54:     int ret = 0;
77: int
100: int
115: int
135: int
136: main (int argc, char *argv[])

As you can see, grep finds seven lines with the word “int” in them (lines 44, 54, 77, 100, 115, 135, and 136). As a sanity check, we can run the same query using the system-provided grep:

$ grep int -n grep.cpp
44:int
54:     int ret = 0;
77:int
100:int
115:int
135:int
136:main (int argc, char *argv[])

So far, we have the desired output. The following listing shows the code (note: line numbers on the snippets above do not match the following listing because code formatting differs from the original source file):

#include <boost/filesystem.hpp>
#include <boost/foreach.hpp>
#include <fstream>
#include <iostream>
#include <regex>
#include <string.h>
#include <string>
#include <vector>

using namespace std;
using namespace boost::filesystem;
/* auxiliary functions */
int
process_reg_file (const char *pattern, const char *filename)
{
        ifstream fd (filename);
        string line;
        string patternstr ("(.*)(");
        patternstr += string (pattern) + string (")(.*)");
        regex exp (patternstr);

        int ret = 0;
        if (fd.is_open ()) {
                size_t linenum = 0;
                bool first_line = true;
                while (getline (fd, line)) {
                        ++linenum;
                        if (regex_match (line, exp)) {
                                if (first_line) {
                                        cout << "FILE = "<< string (filename);
                                        cout << endl << flush;
                                        first_line = false;
                                }
                                cout << linenum << ": "<< line << endl;
                                cout << flush;
                        }
                }
        } else {
                cout << "unable to open file " + string (filename) << endl;
                ret = -1;
        }
        return ret;
}

int
process_directory_recursive (const char *dirname, vector<string> &files)
{
        path dir_path (dirname);
        directory_iterator it (dir_path), eod;

        BOOST_FOREACH (path const &pa, make_pair (it, eod)) {
                /* full path name */
                string fpname = pa.string ();
                if (is_regular_file (pa)) {
                        files.push_back (fpname);
                } else if (is_directory (pa) && pa.filename () != "."&& pa.filename () != "..") {
                        if (process_directory_recursive (fpname.c_str (), files)< 0)
                                return -1;
                }
        }
        return 0;
}

int
process_directory (const char *pattern, const char *dirname)
{
        vector<string> files;
        if (process_directory_recursive (dirname, files) < 0)
                return -1;
        for (vector<string>::iterator it = files.begin (); it != files.end ();
             ++it) {
                if (process_reg_file (pattern, it->c_str ()) < 0)
                        cout << "problems processing file "<< *it << endl;
        }
        return 0;
}

int
process_input (const char *pattern, const char *input)
{
        /* check input type */
        path pa (input);
        if (is_regular_file (pa))
                return process_reg_file (pattern, input);
        else if (is_directory (pa))
                return process_directory (pattern, input);
        else {
                cout << string (input);
                cout << " is not a valid input"<< endl;
        }
        return -1;
}

/* MAIN */
int
main (int argc, char *argv[])
{
        /* reading params */
        if (argc < 3) {
                cout << "USE "<< string (argv[0]) << " pattern input ";
                cout << endl << flush;
                return 1;
        }
        return process_input (argv[1], argv[2]);
}

I know the code is long but, trust me, it is not difficult to follow. All I do here is check whether the input is a file or a directory in process_input(). In the case of the former, the file is directly processed in process_reg_file(). Otherwise, the directory is scanned for files in process_directory_recursive(), and then all scanned files are processed one by one in process_directory() by calling process_reg_file() on each one. When processing a file, each line is checked to see whether it matches the pattern or not. If it does, the line is printed to standard output.

Persistent Caching

Now that we have a working grep, let’s see how we can improve it. The first thing we see is that grep does not keep any state at all. Once the input is analyzed and the output generated, the program simply ends. Let’s say, for instance, that we are interested in scanning a large directory (with hundreds of thousands of files) for a particular pattern of interest, every week. And let’s say that the files in this directory could, potentially, change over time (although not likely all at the same time), or new files could be added. If we use the classic grep for this task, we would potentially be scanning some of the files over and over, wasting precious CPU cycles. This limitation can be overcome with the addition of a cache: If it turns out that a file has already been scanned for a particular pattern (and its contents have not changed since it was last scanned), grep can simply return the cached results instead of re-scanning the file.

This cache can be implemented in multiple ways. One way, for example, is to create a specific database (DB) to store the results of each scanned file and pattern (adding also a timestamp to detect file modifications). Although this surely works, a solution not involving the need to install and run a DB engine would be preferable, not to mention the need to perform DB queries every time files are analyzed (which may involve network and input/output overhead). Another way to do it would be to store this cache as a regular file (or files), loading it into volatile memory at the beginning, and updating it either at the end of the execution or every time a new file is analyzed. Although this seems to be a better approach, it still forces us to create two data models, one for volatile RAM, and another for secondary persistent storage (files), and write code to translate back and forth between the two. It would be nice to avoid this extra coding effort.

Persistent Grep

Design considerations

Making any code PMEM-aware using libpmemobj always involves, as a first step, designing the types of data objects that will be persisted. The first type that needs to be defined is that of the root object. This object is mandatory and used to anchor all the other objects created in the PMEM pool (think of a pool as a file inside a PMEM device). For my grep sample, the following persistent data structure is used:


Figure 1. Data structure for PMEM-aware grep.

Cache data is organized by creating a linked list of patterns hanging from the root class. Every time a new pattern is searched, a new object of class pattern is created. If the pattern currently searched for has been searched before, no object creation is necessary (the pattern string is stored in patternstr). From the class pattern we hang a linked list of files scanned. The file is composed of a name (which, in this case, is the same as the file system path), modification time (used to check whether the file has been modified), and a vector of lines matching the pattern. We only create new objects of class file for files not scanned before.

The first things to notice here are the special classes p<> (for basic types) and persistent_ptr<> (for pointers to complex types). These classes are used to tell the library to pay attention to those memory regions during transactions (changes done to those objects are logged and rolled back in the event of a failure). Also, due to the nature of virtual memory, persistent_ptr<> should always be used for pointers residing in PMEM. When a pool is opened by a process and mapped to its virtual memory address space, the location of the pool could be different from previous locations used by the same process (or by other processes accessing the same pool). In the case of PMDK, persistent pointers are implemented as fat pointers; that is, they consist of a pool ID (used to look up the current pool virtual address in a translation table) plus an offset (from the beginning of the pool). For more information about pointers in PMDK you can read Type safety macros in libpmemobj, and also C++ bindings for libpmemobj (part 2) – persistent smart pointer.
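
To make the fat-pointer idea more concrete, here is a conceptual sketch only (the struct name and fields are illustrative, not PMDK’s actual definition) of what such a pointer carries and how it is resolved:

#include <cstdint>

/* Conceptual sketch: a persistent "fat" pointer stores no raw virtual
 * address; it records which pool the object lives in and the object's
 * offset inside that pool. */
struct fat_pointer_sketch {
        uint64_t pool_id; /* looked up in a runtime translation table */
        uint64_t offset;  /* bytes from the beginning of the pool */
};

/* Dereferencing conceptually means:
 *   object_address = current_base_address_of (pool_id) + offset
 * so the pointer remains valid even if the pool is mapped at a
 * different virtual address the next time it is opened. */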

You may wonder why, then, the vector (std::vector) of lines is not declared as a persistent pointer. The reason is that we do not need to. The object representing the vector, lines, does not change once it is created (during construction of an object of class file), and hence there is no need to keep track of the object during transactions. Still, the vector itself does allocate (and delete) objects internally. For this reason, we cannot rely on the default allocator used by std::vector (which only knows about volatile memory, and allocates all objects in the heap); we need to pass a customized one, provided by libpmemobj, that knows about PMEM. This allocator is pmem::obj::allocator<line>. Once we have declared the vector that way, we can use it as we would in any normal volatile code. In fact, you can use any of the standard container classes this way.

Code Modifications

Now, let’s jump to the code. In order to avoid repetition, only new code is listed (the full code is available in pmemgrep/pmemgrep.cpp). I start with definitions (new headers, macros, namespaces, global variables, and classes):

...
#include <libpmemobj++/allocator.hpp>
#include <libpmemobj++/make_persistent.hpp>
#include <libpmemobj++/make_persistent_array.hpp>
#include <libpmemobj++/persistent_ptr.hpp>
#include <libpmemobj++/transaction.hpp>
...
#define POOLSIZE ((size_t) (1024 * 1024 * 256)) /* 256 MB */
...
using namespace pmem;
using namespace pmem::obj;

/* globals */
class root;
pool<root> pop;

/* persistent data structures */
struct line {
	persistent_ptr<char[]> linestr;
	p<size_t> linenum;
};

class file
{
	private:

	persistent_ptr<file> next;
	persistent_ptr<char[]> name;
	p<time_t> mtime;
	vector<line, pmem::obj::allocator<line>> lines;

	public:

	file (const char *filename)
	{
		name = make_persistent<char[]> (strlen (filename) + 1);
		strcpy (name.get (), filename);
		mtime = 0;
	}

	char * get_name (void) { return name.get (); }

	size_t get_nlines (void) { return lines.size (); /* nlines; */ }

	struct line * get_line (size_t index) { return &(lines[index]); }

	persistent_ptr<file> get_next (void) { return next; }

	void set_next (persistent_ptr<file> n) { next = n; }

	time_t	get_mtime (void) { return mtime; }

	void set_mtime (time_t mt) { mtime = mt; }

	void
	create_new_line (string linestr, size_t linenum)
	{
		transaction::exec_tx (pop, [&] {
			struct line new_line;
			/* creating new line */
			new_line.linestr
			= make_persistent<char[]> (linestr.length () + 1);
			strcpy (new_line.linestr.get (), linestr.c_str ());
			new_line.linenum = linenum;
			lines.insert (lines.cbegin (), new_line);
		});
	}

	int
	process_pattern (const char *str)
	{
		ifstream fd (name.get ());
		string line;
		string patternstr ("(.*)(");
		patternstr += string (str) + string (")(.*)");
		regex exp (patternstr);
		int ret = 0;
		transaction::exec_tx (
		pop, [&] { /* don't leave a file processed halfway through */
		      if (fd.is_open ()) {
			      size_t linenum = 0;
			      while (getline (fd, line)) {
				      ++linenum;
				      if (regex_match (line, exp))
					      /* adding this line... */
					      create_new_line (line, linenum);
			      }
		      } else {
			      cout<< "unable to open file " + string (name.get ())<< endl;
			      ret = -1;
		      }
		});
		return ret;
	}

	void remove_lines () { lines.clear (); }
};

class pattern
{
	private:

	persistent_ptr<pattern> next;
	persistent_ptr<char[]> patternstr;
	persistent_ptr<file> files;
	p<size_t> nfiles;

	public:

	pattern (const char *str)
	{
		patternstr = make_persistent<char[]> (strlen (str) + 1);
		strcpy (patternstr.get (), str);
		files = nullptr;
		nfiles = 0;
	}

	file *
	get_file (size_t index)
	{
		persistent_ptr<file> ptr = files;
		size_t i = 0;
		while (i < index && ptr != nullptr) {
			ptr = ptr->get_next ();
			i++;
		}
		return ptr.get ();
	}

	persistent_ptr<pattern> get_next (void)	{ return next; }

	void set_next (persistent_ptr<pattern> n) { next = n; }

	char * get_str (void) { return patternstr.get (); }

	file *
       find_file (const char *filename) {
		persistent_ptr<file> ptr = files;
		while (ptr != nullptr) {
			if (strcmp (filename, ptr->get_name ()) == 0)
				return ptr.get ();
			ptr = ptr->get_next ();
		}
		return nullptr;
	}

	file *
       create_new_file (const char *filename) {
		file *new_file;
		transaction::exec_tx (pop, [&] {
			/* allocating new files head */
			persistent_ptr<file> new_files
			= make_persistent<file> (filename);
			/* making the new allocation the actual head */
			new_files->set_next (files);
			files = new_files;
			nfiles = nfiles + 1;
			new_file = files.get ();
		});
		return new_file;
	}

	void
	print (void)
	{
		cout << "PATTERN = "<< patternstr.get () << endl;
		cout << "\tpattern present in "<< nfiles;
		cout << " files"<< endl;
		for (size_t i = 0; i < nfiles; i++) {
			file *f = get_file (i);
			cout << "###############"<< endl;
			cout << "FILE = "<< f->get_name () << endl;
			cout << "###############"<< endl;
			cout << "*** pattern present in "<< f->get_nlines ();
			cout << " lines ***"<< endl;
			for (size_t j = f->get_nlines (); j > 0; j--) {
				cout << f->get_line (j - 1)->linenum << ": ";
				cout<< string (f->get_line (j - 1)->linestr.get ());
				cout << endl;
			}
		}
	}
};

class root
{
	private:

	p<size_t> npatterns;
	persistent_ptr<pattern> patterns;

	public:

	pattern *
	get_pattern (size_t index)
	{
		persistent_ptr<pattern> ptr = patterns;
		size_t i = 0;
		while (i < index && ptr != nullptr) {
			ptr = ptr->get_next ();
			i++;
		}
		return ptr.get ();
	}

	pattern *
	find_pattern (const char *patternstr)
	{
		persistent_ptr<pattern> ptr = patterns;
		while (ptr != nullptr) {
			if (strcmp (patternstr, ptr->get_str ()) == 0)
				return ptr.get ();
			ptr = ptr->get_next ();
		}
		return nullptr;
	}

	pattern *
	create_new_pattern (const char *patternstr)
	{
		pattern *new_pattern;
		transaction::exec_tx (pop, [&] {
			/* allocating new patterns head */
			persistent_ptr<pattern> new_patterns
			= make_persistent<pattern> (patternstr);
			/* making the new allocation the actual head */
			new_patterns->set_next (patterns);
			patterns = new_patterns;
			npatterns = npatterns + 1;
			new_pattern = patterns.get ();
		});
		return new_pattern;
	}

	void
	print_patterns (void)
	{
		cout << npatterns << " PATTERNS PROCESSED"<< endl;
		for (size_t i = 0; i < npatterns; i++)
			cout << string (get_pattern (i)->get_str ()) << endl;
	}
};
...

Shown here is the C++ code for the diagram in Figure 1. You can also see the headers for libpmemobj, a macro (POOLSIZE) defining the size of the pool, and a global variable (pop) to store an open pool (you can think of pop as a special file descriptor). Notice how all modifications to the data structure (in root::create_new_pattern(), pattern::create_new_file(), and file::create_new_line()) are protected using transactions. In the C++ bindings of libpmemobj, transactions are conveniently implemented using lambda functions (you need a compiler compatible with at least C++11 to use lambdas). If you do not like lambdas for some reason, the C++ bindings also offer other ways to run transactions (see the transactions article in the Resources section).

Notice also how all the memory allocation is done through make_persistent<>() instead of the regular malloc() or the C++ `new` construct.
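
One piece not shown in the listing above is how the pool referenced by pop gets created or opened in the first place, which happens in main(). The following is a minimal sketch of that step, not the sample’s exact code; it reuses the global pop, the root class, and the POOLSIZE macro from the listing, and the layout string "pmemgrep" is an assumed placeholder:

#include <boost/filesystem.hpp>
#include <libpmemobj++/persistent_ptr.hpp>
#include <libpmemobj++/pool.hpp>
#include <sys/stat.h>

/* Minimal sketch (not the sample's exact code): create the pool on the
 * first run, reopen it afterwards, and fetch the root object that
 * anchors all other persistent objects. */
void
open_or_create_pool (const char *path)
{
        /* "pmemgrep" is an assumed layout name, not taken from the sample */
        const char *layout = "pmemgrep";

        if (!boost::filesystem::exists (path))
                pop = pool<root>::create (path, layout, POOLSIZE, S_IRWXU);
        else
                pop = pool<root>::open (path, layout);

        /* the root object anchors every other persistent object */
        persistent_ptr<root> proot = pop.get_root ();
        (void)proot; /* main() would use proot to find or create patterns */
}

The pool file is sized once at creation time (POOLSIZE in this sample); subsequent runs simply reopen it.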

The functionality of the old process_reg_file() is moved to the method file::process_pattern(). The new process_reg_file() implements the logic to check whether the current file has already been scanned for the pattern (if the file exists under the current pattern and it has not been modified since last time):

int
process_reg_file (pattern *p, const char *filename, const time_t mtime)
{
        file *f = p->find_file (filename);
        if (f != nullptr && difftime (mtime, f->get_mtime ()) == 0) /* already scanned and unmodified */
                return 0;
        if (f == nullptr) /* file does not exist */
                f = p->create_new_file (filename);
        else /* file exists but it has an old timestamp (modification) */
                f->remove_lines ();
        if (f->process_pattern (p->get_str ()) < 0) {
                cout << "problems processing file "<< filename << endl;
                return -1;
        }
        f->set_mtime (mtime);
        return 0;
}

The only change to the other functions is the addition of the modification time. For example, process_directory_recursive() now returns a vector of tuple<string, time_t> (instead of just vector<string>):

int
process_directory_recursive (const char *dirname,
                             vector<tuple<string, time_t>> &files)
{
        path dir_path (dirname);
        directory_iterator it (dir_path), eod;
        BOOST_FOREACH (path const &pa, make_pair (it, eod)) {
                /* full path name */
                string fpname = pa.string ();
                if (is_regular_file (pa)) {
                        files.push_back (
                        tuple<string, time_t> (fpname, last_write_time (pa)));
                } else if (is_directory (pa) && pa.filename () != "."&& pa.filename () != "..") {
                        if (process_directory_recursive (fpname.c_str (), files)< 0)
                                return -1;
                }
        }
        return 0;
}

Sample Run

Let’s run this code with two patterns: “int” and “void”. This assumes that a PMEM device (real or emulated using RAM) is mounted at /mnt/mem:

$ ./pmemgrep /mnt/mem/grep.pool int pmemgrep.cpp
$ ./pmemgrep /mnt/mem/grep.pool void pmemgrep.cpp
$

If we run the program without parameters, we get the cached patterns:

$ ./pmemgrep /mnt/mem/grep.pool
2 PATTERNS PROCESSED
void
int

When passing a pattern, we get the actual cached results:

$ ./pmemgrep /mnt/mem/grep.pool void
PATTERN = void
        1 file(s) scanned
###############
FILE = pmemgrep.cpp
###############
*** pattern present in 15 lines ***
80:     get_name (void)
86:     get_nlines (void)
98:     get_next (void)
103:    void
110:    get_mtime (void)
115:    void
121:    void
170:    void
207:    get_next (void)
212:    void
219:    get_str (void)
254:    void
255:    print (void)
326:    void
327:    print_patterns (void)
$
$ ./pmemgrep /mnt/mem/grep.pool int
PATTERN = int
        1 file(s) scanned
###############
FILE = pmemgrep.cpp
###############
*** pattern present in 14 lines ***
137:    int
147:            int ret = 0;
255:    print (void)
327:    print_patterns (void)
337: int
356: int
381: int
395: int
416: int
417: main (int argc, char *argv[])
436:    if (argc == 2) /* No pattern is provided. Print stored patterns and exit
438:            proot->print_patterns ();
444:            if (argc == 3) /* No input is provided. Print data and exit */
445:                    p->print ();
$

Of course, we can keep adding files to existing patterns:

$ ./pmemgrep /mnt/mem/grep.pool void Makefile
$ ./pmemgrep /mnt/mem/grep.pool void
PATTERN = void
        2 file(s) scanned
###############
FILE = Makefile
###############
*** pattern present in 0 lines ***
###############
FILE = pmemgrep.cpp
###############
*** pattern present in 15 lines ***
80:     get_name (void)
86:     get_nlines (void)
98:     get_next (void)
103:    void
110:    get_mtime (void)
115:    void
121:    void
170:    void
207:    get_next (void)
212:    void
219:    get_str (void)
254:    void
255:    print (void)
326:    void
327:    print_patterns (void)

Parallel Persistent Grep

Now that we have come this far, it would be a pity not to add multithreading support too; especially so, given the small amount of extra code required (the full code is available in pmemgrep_thx/pmemgrep.cpp).

The first thing we need to do is add the appropriate headers for threads and for the persistent mutex (more on this later):

...
#include <libpmemobj++/mutex.hpp>
...
#include <thread>

A new global variable is added to hold the number of threads, and the program now accepts a command-line option to set it (-nt=number_of_threads). If -nt is not explicitly set, one thread is used by default:

int num_threads = 1;
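
The option parsing itself is not shown here; a minimal sketch (not the sample’s exact code) of how a -nt=N argument might be read into the num_threads global looks like this:

#include <cstdlib>
#include <cstring>

/* Minimal sketch (not the sample's exact parsing code): scan argv for an
 * optional -nt=N option and keep the single-thread default otherwise. */
static void
parse_num_threads (int argc, char *argv[])
{
        for (int i = 1; i < argc; i++) {
                if (strncmp (argv[i], "-nt=", 4) == 0) {
                        num_threads = atoi (argv[i] + 4);
                        if (num_threads < 1)
                                num_threads = 1; /* fall back to the default */
                }
        }
}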

Next, a persistent mutex is added to the pattern class. This mutex is used to synchronize writes to the linked list of files (parallelism is done at the file granularity):

class pattern
{
        private:

        persistent_ptr<pattern> next;
        persistent_ptr<char[]> patternstr;
        persistent_ptr<file> files;
        p<size_t> nfiles;
        pmem::obj::mutex pmutex;
        ...

You may be wondering why the pmem::obj version of mutex is needed (why not use the C++ standard one). The reason is that the mutex is stored in PMEM, and libpmemobj needs to be able to reset it in the event of a crash. If not recovered correctly, a corrupted mutex could create a permanent deadlock; for more information, you can read the article about synchronization with libpmemobj.

Although storing mutexes in PMEM is useful when we want to associate them with particular persisted data objects, it is not mandatory in all situations. In fact, in the case of this example, a single standard mutex variable—residing in volatile memory—would have sufficed (since all threads work on only one pattern at a time). The reason why I am using a persistent mutex is to showcase its existence.

Persistent or not, once we have the mutex we can synchronize writes in pattern::create_new_file() by simply passing it to transaction::exec_tx() as its last parameter:

transaction::exec_tx (pop,
                             [&] { /* LOCKED TRANSACTION */
                                   /* allocating new files head */
                                   persistent_ptr<file> new_files
                                   = make_persistent<file> (filename);
                                   /* making the new allocation the
                                    * actual head */
                                   new_files->set_next (files);
                                   files = new_files;
                                   nfiles = nfiles + 1;
                                   new_file = files.get ();
                             },
                             pmutex); /* END LOCKED TRANSACTION */

The last step is to adapt process_directory() to create and join the threads. A new function—process_directory_thread()—is created for the thread logic (which divides work by thread ID):

void
process_directory_thread (int id, pattern *p,
                          const vector<tuple<string, time_t>> &files)
{
        size_t files_len = files.size ();
        size_t start = id * (files_len / num_threads);
        size_t end = start + (files_len / num_threads);
        if (id == num_threads - 1)
                end = files_len;
        for (size_t i = start; i < end; i++)
                process_reg_file (p, get<0> (files[i]).c_str (),
                                  get<1> (files[i]));
}

int
process_directory (pattern *p, const char *dirname)
{
        vector<tuple<string, time_t>> files;
        if (process_directory_recursive (dirname, files) < 0)
                return -1;
        /* start threads to split the work */
        thread threads[num_threads];
        for (int i = 0; i < num_threads; i++)
                threads[i] = thread (process_directory_thread, i, p, files);
        /* join threads */
        for (int i = 0; i < num_threads; i++)
                threads[i].join ();
        return 0;
}

Summary

In this article, I have shown how to transform a simple C++ program—in this case a simplified version of the famous UNIX command-line utility, grep—in order to take advantage of PMEM. I started the article with a description of what the volatile version of the grep program does with a detailed look at the code.

After that, the program is improved by adding a PMEM cache using the C++ bindings of libpmemobj, a core library of PMDK. To conclude, parallelism (at file granularity) is added using threads and PMEM-aware synchronization.

About the Author

Eduardo Berrocal joined Intel as a Cloud Software Engineer in July 2017 after receiving his PhD in Computer Science from the Illinois Institute of Technology (IIT) in Chicago, Illinois. His doctoral research interests were focused on (but not limited to) data analytics and fault tolerance for high-performance computing. In the past he worked as a summer intern at Bell Labs (Nokia), as a research aide at Argonne National Laboratory, as a scientific programmer and web developer at the University of Chicago, and as an intern in the CESVIMA laboratory in Spain.

Resources

  1. The Persistent Memory Development Kit (PMDK), http://pmem.io/pmdk/.
  2. Manual page for the grep command, https://linux.die.net/man/1/grep.
  3. The Boost C++ Library collection, http://www.boost.org/.
  4. Type safety macros in libpmemobj, http://pmem.io/2015/06/11/type-safety-macros.html.
  5. C++ bindings for libpmemobj (part 2) – persistent smart pointer, http://pmem.io/2016/01/12/cpp-03.html.
  6. C++ bindings for libpmemobj (part 6) – transactions, http://pmem.io/2016/05/25/cpp-07.html.
  7. How to emulate Persistent Memory, http://pmem.io/2016/02/22/pm-emulation.html.
  8. Link to sample code in GitHub.

Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark*

In recent years, the scale of the datasets and models used in deep learning has increased dramatically. Although larger datasets and models can improve accuracy in many artificial intelligence (AI) applications, they often take much longer to train on a single machine. However, unlike in the Big Data area, where distributed processing has long been the norm, distributing training across large clusters is still not very common with today’s popular deep learning (DL) frameworks, because it is often hard to gain access to a large graphics processing unit (GPU) cluster and because those frameworks lack convenient facilities for distributed training. By leveraging the cluster distribution capabilities of Apache Spark*, BigDL successfully performs very large-scale distributed training and inference.

In this article, we demonstrate a parameter server (PS) style of parameter synchronization (using peer-to-peer allreduce) in BigDL to reduce communication overhead, combined with coarse-grained scheduling; together these provide significant speedups for large-scale distributed deep learning training.

What is BigDL

BigDL (https://github.com/intel-analytics/BigDL) is a distributed deep learning library for Apache Spark developed by Intel and contributed to the open source community for the purposes of uniting big data processing and deep learning. The goal of BigDL is to help make deep learning more accessible to the Big Data community, by allowing them to continue the use of familiar tools and infrastructure to build deep learning applications.

As shown in Figure 1, BigDL is implemented as a library on top of Spark, so that users can write their deep learning applications as standard Apache Spark programs. As a result, it can be seamlessly integrated with other libraries on top of Apache Spark (for example, Apache Spark SQL and DataFrames, Apache Spark ML Pipelines, Apache Spark Streaming, Structured Streaming, and so on), and can directly run on existing Apache Spark or Hadoop* clusters.


Figure 1. BigDL implementation.

Communications in BigDL

In Apache Spark MLlib, a number of machine learning algorithms are based on using synchronous mini-batch stochastic gradient descent (SGD). To aggregate parameters, these algorithms use the reduce or treeAggregate methods in Spark, as shown in Figure 2.

In this process, the time spent at the driver linearly increases with the number of nodes. This is both due to the CPU and network bandwidth limitations of the driver. The CPU cost arises from merging partial results, while the network cost incurred is a result of transferring one copy of the model from each of the tasks (or partitions). Thus, the centralized driver becomes a bottleneck when there are a large number of nodes in the cluster.


Figure 2. Parameter synchronization in Apache Spark MLlib.


Figure 3. Parameter synchronization in BigDL.

Figure 3 shows how parameter managers inside BigDL implement a PS architecture (through the AllReduce operation) for synchronous mini-batch SGD. After each task computes its gradients, instead of sending gradients back to the driver, gradients from all the partitions within a single worker are aggregated locally. Then each node will have one gradient: This ensures that data transferred on each node will not increase if we increase the number of partitions in a node. After that, the aggregated gradient on each node is sliced into chunks, and these chunks are exchanged between all the nodes in the cluster. Each node is responsible for a specific chunk, which in essence implements a Parameter Server architecture in BigDL for parameter synchronization. Each node retrieves gradients for the slice of the model that this node is responsible for from all the other nodes and aggregates them in multiple threads. After the pair-wise exchange completes, each node has its own portion of aggregated gradients and uses this to update its own portion of weights. Then the exchange happens again for synchronizing the updated weights. At the end of this procedure, each node will have a copy of the updated weights.
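
BigDL itself is implemented in Scala on top of Apache Spark, but the slicing arithmetic behind this exchange is easy to illustrate with a short, language-neutral sketch (the function below is illustrative only and is not BigDL code): each of the N nodes owns one contiguous chunk of the aggregated gradient, collects that chunk from all of its peers, and updates the corresponding slice of the weights.

#include <cstddef>
#include <utility>

/* Illustrative sketch only (not BigDL's implementation): compute which
 * slice of a length-n gradient vector a given node is responsible for
 * aggregating and updating. */
std::pair<std::size_t, std::size_t>
owned_chunk (std::size_t n, std::size_t num_nodes, std::size_t node_id)
{
        std::size_t base = n / num_nodes;
        std::size_t start = node_id * base;
        /* the last node also picks up the remainder */
        std::size_t end = (node_id == num_nodes - 1) ? n : start + base;
        return {start, end}; /* node node_id owns gradient[start, end) */
}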

As the parameters are stored in the Apache Spark BlockManager, each task can get the latest weights from it. As all nodes in the cluster play the same role and the driver is not involved in the communication, there is no bottleneck in the system. In addition, as the cluster grows, the amount of data transferred on each node remains the same, which keeps the time spent in parameter aggregation from growing and enables BigDL to achieve near-linear scaling. Figure 4 shows that for Inception v1, the throughput of 16 nodes is ~1.92X that of 8 nodes, while for ResNet* it is ~1.88X. These results show that BigDL achieves near-linear scale-out performance.

However, we find that increasing the number of partitions still leads to an increase in training time. Our profiling showed this increase was because of the significant scheduling overhead present in Apache Spark for low-latency applications. Figure 5 shows the scheduling overheads as a fraction of average compute time for Inception v1 training as we increase the number of partitions. We see that with partition numbers greater than 300, Apache Spark overhead takes up more than 10 percent of the average compute time and thus slows down the training process. To work around this issue, BigDL runs a single task (working on a single partition) on each worker, and each task in turn runs multiple threads in the deep learning training.


Figure 4. BigDL scaling behavior.


Figure 5. Apache Spark overheads as a fraction of average compute time for Inception v1 training.

What is Drizzle

Drizzle is a research project at the RISELab to study low-latency execution for streaming and machine learning workloads. Currently, Apache Spark uses a bulk synchronous parallel (BSP) computation model, and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overhead, and results in decreased throughput and increased latency. We observed that for many low-latency workloads, the same operations are executed repeatedly; for example, processing different batches in streaming or iterative model training in machine learning. Based on this observation, we find that we can improve performance by reducing the number of times the scheduler is invoked, amortizing its cost across many iterations.

In Drizzle, we introduce group scheduling, where multiple iterations (or a group) of computations are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch. One key challenge here is in launching tasks before their input dependencies have been computed. We solve this using prescheduling in Drizzle, where we proactively queue tasks to be run on worker machines, and rely on workers to trigger tasks when their input dependencies are met.
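
The effect of group scheduling on scheduler pressure can be illustrated with a small, self-contained sketch (illustrative only, not Drizzle’s implementation, which lives inside Spark): the scheduler is invoked once per group of iterations instead of once per iteration.

#include <cstddef>
#include <iostream>

/* Illustrative sketch only: count scheduler invocations for a given
 * number of iterations when scheduling decisions are made once per group. */
static std::size_t
scheduler_invocations (std::size_t iterations, std::size_t group_size)
{
        /* ceiling division: one scheduling decision covers group_size iterations */
        return (iterations + group_size - 1) / group_size;
}

int
main ()
{
        std::size_t iterations = 1000;
        std::cout << "per-iteration (BSP-style): "
                  << scheduler_invocations (iterations, 1) << " invocations\n";
        std::cout << "group size of 10:          "
                  << scheduler_invocations (iterations, 10) << " invocations\n";
        return 0;
}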

How Drizzle Works with BigDL

In order to exploit the scheduling benefits provided by Drizzle, we modified the implementation of the distributed optimization algorithm in BigDL. The main changes we made include refactoring the multiple stages of computation (like gradient calculation, gradient aggregation, and so on) to be part of a single DAG (Directed Acyclic Graph) of stages submitted to the scheduler. This refactoring enables Drizzle to execute all the stages of computation without interacting with the centralized driver for control plane operations. Thus, when used in conjunction with the above-described parameter manager, we can execute BigDL iterations without any centralized bottleneck in the control plane and data plane.

Performance

The UC Berkeley RISELab executed performance benchmarks to measure the benefits of using BigDL with Drizzle. These benchmarks were run using Inception v1 on ImageNet and Visual Geometry Group (VGG) on Cifar-10, on Amazon EC2* clusters built from r4-family instances with four cores per machine. BigDL is configured to use one partition per core.


Figure 6. Drizzle with VGG on Cifar-10.


Figure 7. Drizzle with Inception v1 on ImageNet.

Figure 6 shows that for VGG on 32 nodes, there is a 15 percent improvement when using Drizzle with group size 20. For Inception v1 on 64 nodes, (Figure 7), there is consistent performance improvement when increasing group size in Drizzle, and there is 10 percent improvement with a group size of 10. These improvements map directly to the scheduling overheads that we observed without Drizzle in Figure 5.

Summary

This article demonstrated how BigDL performs parameter aggregation and how Drizzle reduces Apache Spark scheduling overhead. To get started with BigDL and Drizzle, try out the BigDL parameter manager with Drizzle on its GitHub* page; and to learn more about our work, please check out the BigDL GitHub page and the Drizzle GitHub page.

Increasing Efficiency and Uptime with Predictive Maintenance

In many manufacturing plants today, monitoring is a highly manual process. FOURDOTONE Teknoloji analyzes data from sensors to enable manufacturers to respond immediately to problems, and predict when machines are likely to fail.

Executive Summary

Downtime can be expensive, and in a tightly coupled manufacturing line a problem with one machine can have an impact on the entire factory. For many factories, avoiding downtime is a matter of luck rather than science: machine inspections are infrequent, and only capture what’s visible to the eye.

4.1 Industrial Internet of Things Platform (4.1 IIoTP) from FOURDOTONE Teknoloji enables manufacturers to be more responsive and proactive in their maintenance, so they can aim to minimize downtime. Data is gathered from the machines and analyzed in the factory, enabling an immediate response to emergencies or imminent problems. In the cloud, machine-learning algorithms are used to analyze the combined data from all of the machines, so that future maintenance requirements can be predicted. That enables the manufacturer to plan its maintenance to avoid downtime, and optimize its maintenance costs.

The technology provides a foundation for continuous improvement, and enables manufacturers to cut the risk of unplanned downtime.

FOURDOTONE Teknoloji was founded in 2014 in Turkey and specializes in Industry 4.0 projects. The company works on hardware independent projects including condition monitoring, big data process optimization, predictive maintenance and the digital factory. The company serves enterprises in Turkey, Central and Eastern Europe, and the Middle East.

Business Challenge: Avoiding Downtime in Industry

For manufacturing plants, downtime can have a huge impact on the business. A fault in a single machine could halt the production line. For plants that operate around the clock, that time can never be recovered. An unexpected drop in output can result in the business disappointing customers who are depending on its deliveries. It can have a direct impact on revenue, with orders lost and product unavailable to sell.

In many manufacturing businesses, unplanned downtime is hard to avoid. Maintenance remains reactive. Companies are unable to monitor and analyze their plant in real time, so they don’t know that there is anything wrong until a machine grinds to a halt. Without any reliable data on the past, they are unable to make any predictions on when machines are likely to fail.

Efforts to manage the plant are labor intensive, and prone to missing important signals. People might go from machine to machine, checking with the naked eye for any anomalies and collecting data with clipboards. The manual observation and the often infrequent checks make it difficult to detect potential problems except by luck. If a check isn’t carried out in the right place at the right time, it’s not going to find a problem that might already be affecting performance, and might ultimately result in an outage.

The Machinery Monitoring Challenge

One organization facing these challenges is the Scattolini Turkey plant. It manufactures floats and tippers for commercial vehicles. The company is headquartered in Italy, and produces more than 200 types of equipment from its plant in Valeggio sul Mincio, and its seven other sites worldwide. Its plant in Turkey manufactures parts for vans.

Uptime is critical for its operations and its profitability. The company wants to transform from reactive maintenance to predictive maintenance: ensuring it can intervene before there is any downtime. A single day’s outage can cost as much as 35,000 EUR, including the cost of repair.

Its existing regime is based on manual inspection with monthly vibration measurements and reactive maintenance. The company needs a way to:

  • Gather data from its plant of over 30 machines, without requiring a visit to them. The machines include cranes, fans, and pumps;
  • Monitor the levels of liquid chemical ingredient pools;
  • Identify any problems as and when they occur, enabling an immediate response to minimize downtime;
  • Model likely future failures, so the maintenance team can carry out any repairs or replacements before there is an outage.

Solution Value: Enabling Predictive Maintenance

4.1 IIoTP gathers data wirelessly from the shop floor and analyzes it. In the case of the Scattolini Turkey plant, the data gathered includes axial vibration, surface temperature of motors, pressure of hydraulic and pneumatic systems, liquid levels in tanks and pools, and the status of the main energy line. The solution enables the plant to have access to more information, and on a more timely basis, than was previously possible. As a result, the company has a clearer insight into the status of its plant and its maintenance requirements. The data is analyzed in two stages: first, if there is an anomaly in the incoming data, an alert is raised immediately. An SMS message or email is sent to the predefined user group. In the event that there is no response, or there is a safety issue, the software can be configured to automatically shut down the machine.

The second stage of analysis takes place in the cloud. Combining the data from all the machines, 4.1 IIoTP can predict likely future outages and maintenance requirements. This approach uses machine-learning techniques to compare what is known about past failures, with current data about the plant and its equipment. By replacing parts before a likely failure, Scattolini can avoid unplanned outage.

The team at the Scattolini Turkey plant can use a cross-platform portal on computers, phones or tablets to monitor the state of the plant in real-time.

By reducing the amount of manual monitoring and transforming the factory to become proactive with its maintenance, Scattolini estimates that it has reduced its costs for maintenance operations by 15 percent.

Solution Architecture: Predictive Maintenance

To enable predictive maintenance, 4.1 IIoTP provides a mechanism for collecting data from the shop floor, analyzing it immediately for anything requiring a prompt response, and carrying out in-depth analysis in the cloud for predictive maintenance.

The machines to be monitored are fitted with battery or DC-powered wireless industrial sensors, manufactured for precision and operation in harsh environments.

4.1 IIoTP uses Intel® IoT Gateway Technology in the Dell Edge* Gateway 5000 to collect data from the sensors. Both wired and wireless connections are supported. The rugged and fanless gateway device is based on the Intel® Atom™ processor, which provides intelligence for devices at the edge of the network. Compute power at the edge enables fast decision making which can be critical at sites such as heavy industrial plants, fast moving production lines and chemical substance storage facilities.

Supervisory control and data acquisition (SCADA) industrial control systems generally show current data. 4.1 IIoTP adds the ability to view historical data, and to automatically detect anomalies and threshold violations at the edge. Alerts can be raised by email or SMS, and 4.1 IIoTP can also intervene directly, changing the configuration of the machine or powering it down. This capability is provided by libraries and frameworks that enable 4.1 IIoTP to control the programmable logic controllers (PLCs). Leading PLCs are supported, including those from Siemens, Omron and Mitsubishi.

Additionally, data is sent to the cloud with 256-bit encryption. 4G/GPRS mobile broadband communication technologies are used to send data to the cloud, because they are more stable than Wi-Fi in industrial environments. In the cloud, data from all the gateways can be collected in one place and analyzed with machine-learning algorithms. This analysis can be used to predict likely machine failures, and to identify other opportunities for efficiency and quality gains. New rules can be created through machine learning for the analysis at the edge, to enable continuous improvement.


Figure 1. Using machine learning and edge analysis, 4.1 IIoTP enables an immediate response to urgent issues, and an in-depth analysis in the cloud to support continuous improvement

The cloud infrastructure is built on Amazon Web Services (AWS*). Amazon Kinesis Streams* are designed to continuously capture large volumes of data, and this service is used by 4.1 IIoTP to collect the data from the monitored devices. That data is also added to Amazon Simple Storage Service* (Amazon S3*), where it serves as a data lake, bringing together data from different types of monitored devices. Amazon Elastic MapReduce* (Amazon EMR*) is used to set up Spark* clusters, and map S3 objects to the EMR file system, so that batch analyses can be performed using the data in the S3 buckets. The predictive models are run in EMR, and data can also be consumed from Kinesis streams to enable real-time analysis on sensor data. The database, API and web servers are also hosted on AWS, using Amazon Elastic Compute Cloud* (Amazon EC2*). Users can remotely monitor the platform using a visual interface, on their choice of Internet-enabled device.

Conclusion

Manufacturers can use 4.1 IIoTP together with sensors fitted to the machines, to get an insight into the current and future health of their factory equipment. Analysis at the edge enables a prompt response in the case of an emergency, power outage or technical fault. Machine-learning algorithms in the cloud can analyze all the data generated by all the machines over time to refine the rules for raising alerts, and provide insight into the optimal time to maintain or replace a machine. This intelligence enables the manufacturer to avoid unplanned downtime, reduce the labor costs associated with monitoring machines manually, and optimize the cost of parts and maintenance. In turn, this enables manufacturers to increase the reliability and predictability of their manufacturing infrastructure.

Find the solution that’s right for your organization. Contact your Intel representative or visit www-ssl.intel.com/content/www/us/en/industrial-automation/overview.html

Intel Solutions Architects are technology experts who work with the world’s largest and most successful companies to design business solutions that solve pressing business challenges. These solutions are based on real-world experience gathered from customers who have successfully tested, piloted, and/or deployed these solutions in specific business use cases. Solutions architects and technology experts for this solution brief are listed on the front cover.

Flexible New IoT Platform Empowers Enterprise Applications

Infiswift’s IoT platform, powered by Intel® technology, enables scalable and secure connections that deliver real-time, actionable insights.

It’s no secret that the Internet of Things (IoT) is creating a seismic shift in how businesses think, act, and approach the future. Yet despite the promise of game-changing technology in practically every industry—and the real evidence of measurable gains—IoT developers continue to battle inherent challenges such as intermittent connectivity, low power availability, and the struggle to efficiently capture intelligence at the edge.

With years of experience building large-scale IoT implementations in the energy space, the team at California-based infiswift* has thought long and hard about how to tackle such obstacles. That’s why the company developed its own innovative, ultra-lightweight IoT platform, designed specifically to help customers seamlessly connect physical products to each other and to the cloud. The new platform, which is ideal for intermittently connected and power-deficient environments that require real-time operation, provides easy-to-use dashboards and prebuilt, flexible functionality including rule definition, device templates, and more.

Powered by Intel® technology, infiswift’s IoT platform provides world-class security and scalability, a robust development environment, and analytics for custom implementations. This allows infiswift’s customers and partners to develop, deploy, and scale IoT solutions to enterprise standards.

The Infiswift IoT Platform

Infiswift is a powerful, enterprise-grade IoT platform for connecting and managing your most important devices and cloud services. End points—be they physical devices, cloud databases, or applications—connect to each other using infiswift’s unique architecture, enabling ultra-secure, two-way communication that can be scaled to millions or billions of devices with near-zero latency.

Rethinking the Enterprise-Grade IoT Platform

In the IoT world, many innovators focus on what one might call the “big four”: real-time performance, scalability, security, and flexibility (see Figure 1). Years of experience and research proved to infiswift’s team that a mere aggregation of off-the-shelf products could not deliver the true real-time communication at scale that many enterprise customers seek. That’s why the company took a more comprehensive approach to building its IoT platform, using unique architectural concepts for next-level connectivity.

At its core, infiswift’s IoT platform is based on a broker that efficiently routes and manages communications between end points. The platform was designed with flexibility in mind and can support a variety of development environments—any cloud service and any device that can host MQTT-based client code can be connected. Plus, infiswift is dedicated to mitigating interoperability challenges that can often impair communication between stacks and hinder a customer’s ability to seamlessly integrate legacy and new technology. The overall result is an efficient broker and client platform with a footprint light enough to operate sensors on the edge via Intel® Curie™ modules with the Intel® Quark™ microcontroller D1000.
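The brief does not document infiswift’s client API, but because any device that can host MQTT-based client code can connect, a generic MQTT publisher gives a feel for what an end point looks like. The sketch below uses the open-source paho-mqtt library (1.x API); the broker address, topic, and credentials are placeholders, not infiswift endpoints.

import json
import ssl

import paho.mqtt.client as mqtt  # paho-mqtt 1.x API

BROKER_HOST = "broker.example.com"   # placeholder broker address
BROKER_PORT = 8883                   # MQTT over TLS
TOPIC = "plant/inverter-07/telemetry"

client = mqtt.Client(client_id="inverter-07")
client.tls_set(cert_reqs=ssl.CERT_REQUIRED)              # encrypt the session
client.username_pw_set("device-user", "device-secret")   # placeholder credentials

client.connect(BROKER_HOST, BROKER_PORT, keepalive=60)
client.loop_start()  # run the network loop in a background thread

# Publish a small JSON payload; QoS 1 asks the broker to acknowledge delivery.
payload = json.dumps({"dc_power_w": 4875, "panel_temp_c": 41.2})
client.publish(TOPIC, payload, qos=1)

client.loop_stop()
client.disconnect()

The same publish/subscribe pattern also applies to lightweight C clients on microcontroller-class devices; the Python client shown here is better suited to gateway-class hardware.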

Whether sold off the shelf, via a license agreement, or as a more customized solution to enterprise partners, infiswift’s software is powered by trusted Intel technology. And moving forward, infiswift—already committed to incorporating state-of-the-art security in its solutions—is excited about additional opportunities to leverage Intel’s leadership in IoT security. Pushing the envelope of innovation is integral to infiswift’s mission. Indeed, in an industry noted for transformational advances, the company is proud to have numerous patent applications in progress.


Figure 1: A look at the importance to enterprise of the “big four” factors for IoT solutions

Energizing the Solar Industry

In the solar arena, owners, operators, energy regulators, and financiers all require some level of monitoring and control of their power plants. Companies in the industry have been connecting and managing their assets for years, but solutions to date have been costly, slow to integrate, and inflexible. For example, SCADA systems deployed in solar do not typically allow for real-time, two-way communication and control, or for integrated data for centralized management and analytics, because the cost of doing so is impractical. Additional challenges when connecting devices on solar PV plants include harsh environments, intermittent connectivity, and low power availability for some sensors.

With decades of experience implementing connectivity solutions for some of the largest global solar developers, infiswift understands how to successfully tackle these challenges. Infiswift’s IoT platform, which can be integrated into existing systems or new plants, is designed to enable faster decision-making based on more accurate information from a wider set of data sources, such as grid pricing, weather, and ground movement. Asset management, energy forecasting, and maintenance planning all become more reliable when the underlying data is more granular, higher resolution, and centralized, supporting data science and automated decision-making across the system.

A typical implementation includes distributed intelligence at the edge, as well as centralized control and analytics that can be customized to the specific PV system’s needs. With the flexibility to integrate a wide range of data sources at low cost (thanks to advances in wireless hardware), the infiswift platform provides a strong foundation for building a cutting-edge performance monitoring and management solution. Stakeholders ranging from owners to operators to field technicians each benefit from different types and amounts of information, and can access custom web and mobile dashboards to visualize important data, analyses, and predictions.
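To make “distributed intelligence at the edge” concrete, here is a minimal, hypothetical rule an edge gateway might evaluate locally before anything reaches the cloud: compare each string’s measured output with what the current irradiance predicts, and flag large shortfalls. The thresholds, field names, and constants are illustrative assumptions, not infiswift functionality.

# Hypothetical edge rule: flag PV strings producing well below the output
# expected for the measured irradiance. All constants are illustrative.
EXPECTED_EFFICIENCY = 0.18      # nominal module efficiency
STRING_AREA_M2 = 60.0           # aggregate module area per string
SHORTFALL_THRESHOLD = 0.75      # alert if output falls below 75% of expected

def check_string(string_id, measured_power_w, irradiance_w_m2):
    """Return an alert dict if the string is underperforming, else None."""
    expected_power_w = irradiance_w_m2 * STRING_AREA_M2 * EXPECTED_EFFICIENCY
    if expected_power_w <= 0:
        return None  # night-time or irradiance-sensor fault; nothing to compare
    ratio = measured_power_w / expected_power_w
    if ratio < SHORTFALL_THRESHOLD:
        return {
            "string": string_id,
            "measured_w": measured_power_w,
            "expected_w": round(expected_power_w, 1),
            "performance_ratio": round(ratio, 2),
        }
    return None

alert = check_string("string-12", measured_power_w=5200, irradiance_w_m2=850)
if alert:
    # In a real deployment this would be published to the platform broker
    # (for example over MQTT) so that it appears on the operators' dashboards.
    print("ALERT:", alert)

An alert raised this way reaches operators immediately, while the raw readings can still flow to centralized analytics for the forecasting and maintenance planning described above.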

This type of IoT-based connectivity for solar plants is the next cost-effective advance in operations and management, and it will help many solar developers minimize their levelized cost of energy (LCOE).
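For reference, and to show why a lower LCOE is the goal, the metric is conventionally defined as discounted lifetime cost divided by discounted lifetime energy (fuel costs are omitted here, since PV has none):

\mathrm{LCOE} = \frac{\sum_{t=1}^{N} \frac{I_t + M_t}{(1 + r)^t}}{\sum_{t=1}^{N} \frac{E_t}{(1 + r)^t}}

where I_t is the investment expenditure in year t, M_t is the operations and maintenance cost, E_t is the energy generated, r is the discount rate, and N is the system lifetime. Better monitoring lowers M_t and raises E_t, which is how the connectivity described above drives LCOE down.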


Figure 2: Solar PV plant topology with diverse and flexible data sources

Conclusion

Armed with its unique approach to enterprise-grade IoT solutions, and backed by powerful Intel technology, the team at infiswift is successfully moving forward with flexible implementations that deliver scalable, secure, real-time insights for new and existing partners in solar energy and beyond.

Learn More

Infiswift is a general member of the Intel® IoT Solutions Alliance. From modular components to market-ready systems, Intel and the 400+ global member companies of the Alliance provide scalable, interoperable solutions that accelerate deployment of intelligent devices and end-to-end analytics. Close collaboration with Intel and each other enables Alliance members to innovate with the latest IoT technologies, helping developers deliver first-in-market solutions.

For more information about infiswift, please visit infiswift.com.

For more information about Intel® IoT technology and the Intel IoT Solutions Alliance, please visit intel.com/iot.
